Dynamic Programming For Partially Observable Stochastic Games
Background

As background, we review the POSG model and two algorithms that we generalize to create a dynamic programming algorithm for POSGs: dynamic programming for POMDPs and elimination of dominated strategies in solving normal form games.

Partially observable stochastic games

A partially observable stochastic game (POSG) is a tuple ⟨I, S, {b^0}, {A_i}, {O_i}, P, {R_i}⟩, where:

• I is a finite set of agents (or controllers) indexed 1, . . . , n
• S is a finite set of states
• b^0 ∈ ∆(S) represents the initial state distribution
• A_i is a finite set of actions available to agent i, and A⃗ = ×_{i∈I} A_i is the set of joint actions (i.e., action profiles), where a⃗ = ⟨a_1, . . . , a_n⟩ denotes a joint action
• O_i is a finite set of observations for agent i, and O⃗ = ×_{i∈I} O_i is the set of joint observations, where o⃗ = ⟨o_1, . . . , o_n⟩ denotes a joint observation
• P is a set of Markovian state transition and observation probabilities, where P(s′, o⃗ | s, a⃗) denotes the probability that taking joint action a⃗ in state s results in a transition to state s′ and joint observation o⃗
• R_i : S × A⃗ → ℝ is a reward function for agent i

A game unfolds over a finite or infinite sequence of stages, where the number of stages is called the horizon of the game. In this paper, we consider finite-horizon POSGs; some of the challenges involved in solving the infinite-horizon case are discussed at the end of the paper. At each stage, all agents simultaneously select an action and receive a reward and observation. The objective, for each agent, is to maximize the expected sum of rewards it receives during the game.

Whether agents compete or cooperate in seeking reward depends on their reward functions. The case in which the agents share the same reward function has been called a decentralized partially observable Markov decision process (DEC-POMDP) (Bernstein et al. 2002).

Dynamic programming for POMDPs

A POSG with a single agent corresponds to a POMDP. We briefly review an exact dynamic programming algorithm for POMDPs that provides a foundation for our exact dynamic programming algorithm for POSGs. We use the same notation for POMDPs as for POSGs, but omit the subscript that indexes an agent.

The first step in solving a POMDP by dynamic programming (DP) is to convert it into a completely observable MDP with a state set B = ∆(S) that consists of all possible beliefs about the current state. Let b^{a,o} denote the belief state that results from belief state b, after action a and observation o. The DP operator can be written in the form

    V^{t+1}(b) = max_{a∈A} Σ_{s∈S} b(s) [ R(s,a) + Σ_{o∈O} P(o|s,a) V^t(b^{a,o}) ],        (1)

where P(o|s,a) = Σ_{s′∈S} P(s′, o|s,a), and the updated value function is computed for all belief states b ∈ B. Exact DP algorithms for POMDPs rely on Smallwood and Sondik's (1973) proof that the DP operator preserves the piecewise linearity and convexity of the value function. This means that the value function can be represented exactly by a finite set of |S|-dimensional value vectors, denoted V = {v_1, v_2, . . . , v_k}, where

    V(b) = max_{1≤j≤k} Σ_{s∈S} b(s) v_j(s).        (2)

As elucidated by Kaelbling et al. (1998), each value vector corresponds to a complete conditional plan that specifies an action for every sequence of observations. Adopting the terminology of game theory, we often refer to a complete conditional plan as a strategy. We use this interchangeably with policy tree, because a conditional plan for a finite-horizon POMDP can be viewed as a tree.

The DP operator of Equation (1) computes an updated value function, but can also be interpreted as computing an updated set of policy trees. In fact, the simplest algorithm for computing the DP update has two steps, which are described below.

The DP operator is given a set Q^t of depth-t policy trees and a corresponding set V^t of value vectors representing the horizon-t value function, and it computes Q^{t+1} and V^{t+1}. In the first step, a set of depth-(t+1) policy trees, Q^{t+1}, is created by generating every possible depth-(t+1) policy tree that makes a transition, after an action and observation, to the root node of some depth-t policy tree in Q^t. This operation will hereafter be called an exhaustive backup. Note that |Q^{t+1}| = |A| |Q^t|^{|O|}. For each policy tree q_j ∈ Q^{t+1}, it is straightforward to compute a corresponding value vector, v_j ∈ V^{t+1}.

The second step is to eliminate policy trees that need not be followed by a decision maker that is maximizing expected value. This is accomplished by eliminating (i.e., pruning) any policy tree when this can be done without decreasing the value of any belief state. Formally, a policy tree q_j ∈ Q^{t+1} with corresponding value vector v_j ∈ V^{t+1} is considered dominated if for all b ∈ B there exists a v_k ∈ V^{t+1} \ v_j such that b · v_k ≥ b · v_j. This test for dominance is performed using linear programming. When q_j is removed from the set Q^{t+1}, its corresponding value vector v_j is also removed from V^{t+1}.

The dual of this linear program can also be used as a test for dominance. In this case, a policy tree q_j with corresponding value vector v_j is dominated when there is a probability distribution p over the other policy trees such that

    Σ_{k≠j} p(k) v_k(s) ≥ v_j(s),  ∀s ∈ S.        (3)

This alternative, and equivalent, test for dominance plays a role in iterated strategy elimination, as we will see in the next section, and was recently applied in the context of POMDPs (Poupart & Boutilier 2004).
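To make the dual test of Equation (3) concrete, the following is a minimal Python sketch of pruning a set of value vectors with an off-the-shelf LP solver. It is our own illustration, not code from the paper; the function names and the use of SciPy are assumptions.

```python
# Hedged sketch of the Equation (3) dominance test: value vector j is pruned
# if some distribution p over the other vectors matches or beats it in every state.
import numpy as np
from scipy.optimize import linprog

def is_dominated(j, vectors):
    """True if vectors[j] passes the Equation (3) test against the other vectors."""
    v_j = vectors[j]
    others = [v for k, v in enumerate(vectors) if k != j]
    if not others:
        return False
    M = np.column_stack(others)           # |S| x K matrix, one column per other vector
    K = M.shape[1]
    # Feasibility LP: find p >= 0 with sum(p) = 1 and M @ p >= v_j componentwise.
    res = linprog(c=np.zeros(K),
                  A_ub=-M, b_ub=-v_j,     # -M p <= -v_j  is the same as  M p >= v_j
                  A_eq=np.ones((1, K)), b_eq=[1.0],
                  bounds=[(0, None)] * K,
                  method="highs")
    return res.success

def prune(vectors):
    """Remove dominated vectors (and, implicitly, their policy trees) until none remain."""
    vecs = list(vectors)
    removed = True
    while removed:
        removed = False
        for j in range(len(vecs)):
            if is_dominated(j, vecs):
                del vecs[j]
                removed = True
                break
    return vecs

# Example: prune([np.array([1., 0.]), np.array([0., 1.]), np.array([.4, .4])])
# drops the third vector, since mixing the first two with p = (0.5, 0.5) dominates it.
```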
Iterated elimination of dominated strategies

Techniques for eliminating dominated strategies in solving a POMDP are very closely related to techniques for eliminating dominated strategies in solving games in normal form. A game in normal form is a tuple G = {I, {D_i}, {V_i}}, where I is a finite set of agents, D_i is a finite set of strategies available to agent i, and V_i : D⃗ → ℝ is the value (or payoff) function for agent i. Unlike a stochastic game, there are no states or state transitions in this model.

Every strategy d_i ∈ D_i is a pure strategy. Let δ_i ∈ ∆(D_i) denote a mixed strategy, that is, a probability distribution over the pure strategies available to agent i, where δ_i(d_i) denotes the probability assigned to strategy d_i ∈ D_i. Let d_{−i} denote a profile of pure strategies for the other agents (i.e., all the agents except agent i), and let δ_{−i} denote a profile of mixed strategies for the other agents. Since agents select strategies simultaneously, δ_{−i} can also represent agent i's belief about the other agents' likely strategies. If we define V_i(d_i, δ_{−i}) = Σ_{d_{−i}} δ_{−i}(d_{−i}) V_i(d_i, d_{−i}), then

    B_i(δ_{−i}) = { d_i ∈ D_i | V_i(d_i, δ_{−i}) ≥ V_i(d′_i, δ_{−i})  ∀d′_i ∈ D_i }        (4)

denotes the best response function of agent i, which is the set of strategies for agent i that maximize the value of some belief about the strategies of the other agents. Any strategy that is not a best response to some belief can be deleted.

A dominated strategy d_i is identified by using linear programming. The linear program identifies a probability distribution σ_i over the other strategies such that

    V_i(σ_i, d_{−i}) > V_i(d_i, d_{−i}),  ∀d_{−i} ∈ D_{−i}.        (5)

This test for dominance is very similar to the test for dominance used to prune strategies in solving a POMDP. It differs in using strict inequality, which is called strict dominance. Game theorists also use weak dominance to prune strategies. A strategy d_i is weakly dominated if V_i(σ_i, d_{−i}) ≥ V_i(d_i, d_{−i}) for all d_{−i} ∈ D_{−i}, and V_i(σ_i, d_{−i}) > V_i(d_i, d_{−i}) for some d_{−i} ∈ D_{−i}. The test for dominance which does not require any strict inequality is sometimes called very weak dominance, and corresponds exactly to the test for dominance in POMDPs, as given in Equation (3). Because a strategy that is very weakly dominated but not weakly dominated must be payoff equivalent to a strategy that very weakly dominates it, eliminating very weakly dominated strategies may have the same effect as eliminating weakly dominated strategies in the reduced normal form representation of a game, where the reduced normal form representation is created by combining any set of payoff-equivalent strategies into a single strategy.

There are a couple other interesting differences between the tests for dominance in Equations (3) and (5). First, there is a difference in beliefs. In normal-form games, beliefs are about the strategies of other agents, whereas in POMDPs, beliefs are about the underlying state. Second, elimination of dominated strategies is iterative when there are multiple agents. When one agent eliminates its dominated strategies, this can affect the best-response function of other agents (assuming common knowledge of rationality). After all agents take a turn in eliminating their dominated strategies, they can consider eliminating additional strategies that may only have been best responses to strategies of other agents that have since been eliminated. The procedure of alternating between agents until no agent can eliminate another strategy is called iterated elimination of dominated strategies.

In solving normal-form games, iterated elimination of dominated strategies is a somewhat weak solution concept, in that it does not (usually) identify a specific strategy for an agent to play, but rather a set of possible strategies. To select a specific strategy requires additional reasoning, and introduces the concept of a Nash equilibrium, which is a profile of strategies (possibly mixed) such that δ_i ∈ B_i(δ_{−i}) for all agents i. Since there are often multiple equilibria, the problem of equilibrium selection is important. (It has a more straightforward solution for cooperative games than for general-sum games.) But in this paper, we focus on the issue of elimination of dominated strategies.

Dynamic programming for POSGs

In the rest of the paper, we develop a dynamic programming algorithm for POSGs that is a synthesis of dynamic programming for POMDPs and iterated elimination of dominated strategies in normal-form games. We begin by introducing the concept of a normal-form game with hidden state, which provides a way of relating the POSG and normal-form representations of a game. We describe a method for eliminating dominated strategies in such games, and then show how to generalize this method in order to develop a dynamic programming algorithm for finite-horizon POSGs.

Normal-form games with hidden state

Consider a game that takes the form of a tuple G = {I, S, {D_i}, {V_i}}, where I is a finite set of agents, S is a finite set of states, D_i is a finite set of strategies available to agent i, and V_i : S × D⃗ → ℝ is the value (or payoff) function for agent i. This definition resembles the definition of a POSG in that the payoff received by each agent is a function of the state of the game, as well as the joint strategies of all agents. But it resembles a normal-form game in that there is no state-transition model. In place of one-step actions and rewards, the payoff function specifies the value of a strategy, which is a complete conditional plan.

In a normal-form game with hidden state, we define an agent's belief in a way that synthesizes the definition of belief for POMDPs (a distribution over possible states) and the definition of belief in iterated elimination of dominated strategies (a distribution over the possible strategies of the other agents). For each agent i, a belief is defined as a distribution over S × D_{−i}, where the distribution is denoted b_i. The value of a belief of agent i is defined as

    V_i(b_i) = max_{d_i∈D_i} Σ_{s∈S, d_{−i}∈D_{−i}} b_i(s, d_{−i}) V_i(s, d_i, d_{−i}).

A strategy d_i for agent i is very weakly dominated if eliminating it does not decrease the value of any belief. The test for very weak dominance is a linear program that determines whether there is a mixed strategy σ_i ∈ ∆(D_i \ d_i) such that

    V_i(s, σ_i, d_{−i}) ≥ V_i(s, d_i, d_{−i}),  ∀s ∈ S, ∀d_{−i} ∈ D_{−i}.        (6)
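The test in Equation (6) has the same linear-programming form as the POMDP test sketched earlier, except that the constraints are indexed by pairs (s, d_{−i}) rather than by states alone. The sketch below is again our own illustration, not code from the paper; the payoff-array layout and names are assumptions.

```python
# Hedged sketch of belief value and the Equation (6) test for agent i in a
# two-agent normal-form game with hidden state.  V_i[d_i, s, d_other] is
# assumed to hold V_i(s, d_i, d_-i).
import numpy as np
from scipy.optimize import linprog

def belief_value(V_i, b_i):
    """Value of belief b_i[s, d_other]: the best strategy's expected payoff."""
    return max(float(np.sum(b_i * V_i[d])) for d in range(V_i.shape[0]))

def very_weakly_dominated(V_i, d):
    """True if strategy d can be removed without lowering the value of any belief."""
    target = V_i[d].reshape(-1)                        # constraints over S x D_-i
    others = np.delete(V_i, d, axis=0).reshape(V_i.shape[0] - 1, -1)
    if len(others) == 0:
        return False
    # Find sigma >= 0 with sum(sigma) = 1 and, for every (s, d_-i),
    # sum_k sigma_k V_i(s, d_k, d_-i) >= V_i(s, d, d_-i).
    res = linprog(c=np.zeros(len(others)),
                  A_ub=-others.T, b_ub=-target,
                  A_eq=np.ones((1, len(others))), b_eq=[1.0],
                  bounds=[(0, None)] * len(others),
                  method="highs")
    return res.success
```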
These generalizations of the key concepts of belief, value of belief, and dominance play a central role in our development of a DP algorithm for POSGs in the rest of this paper.

In our definition of a normal form game with hidden state, we do not include an initial state probability distribution. As a result, each strategy profile is associated with an |S|-dimensional vector that can be used to compute the value of this strategy profile for any state probability distribution. This differs from a standard normal form game in which each strategy profile is associated with a scalar value. By assuming an initial state probability distribution, we could convert our representation to a standard normal form game in which each strategy profile has a scalar value. But our representation is more in keeping with the approach taken by the DP algorithm for POMDPs, and lends itself more easily to development of a DP algorithm for POSGs. The initial state probability distribution given in the definition of a POMDP is not used by the DP algorithm for POMDPs; it is only used to select a policy after the algorithm finishes. The same holds in the DP algorithm for POSGs we develop. Like the POMDP algorithm, it computes a solution for all possible initial state probability distributions.

Normal form of finite-horizon POSGs

Disregarding the initial state probability distribution, a finite-horizon POSG can be converted to a normal-form game with hidden state. When the horizon of a POSG is one, the two representations of the game are identical, since a strategy corresponds to a single action, and the payoff functions for the normal-form game correspond to the reward functions of the POSG. When the horizon of a POSG is greater than one, the POSG representation of the game can be converted to a normal form representation with hidden state, by a recursive construction. Given the sets of strategies and the value (or payoff) functions for a horizon-t game, the sets of strategies and value functions for the horizon-(t+1) game are constructed by exhaustive backup, as in the case of POMDPs. When a horizon-t POSG is represented in normal form with hidden state, the strategy sets include all depth-t policy trees, and the value function is piecewise linear and convex; each strategy profile is associated with an |S|-vector that represents the expected t-step cumulative reward achieved for each potential start state (and so any start state distribution) by following this joint strategy.

If a finite-horizon POSG is represented this way, iterated elimination of dominated strategies can be used in solving the game, after the horizon-t normal form game is constructed. The problem is that this representation can be much larger than the original representation of a POSG. In fact, the size of the strategy set for each agent i is greater than |A_i|^{|O_i|^{t−1}}, which is doubly exponential in the horizon t. Because of the large sizes of the strategy sets, it is usually not feasible to work directly with this representation. The dynamic programming algorithm we develop partially alleviates this problem by performing iterated elimination of dominated strategies at each stage in the construction of the normal form representation, rather than waiting until the construction is finished.

Multi-agent dynamic programming operator

The key step of our algorithm is a multi-agent dynamic programming operator that generalizes the DP operator for POMDPs. As for POMDPs, the operator has two steps. The first is a backup step that creates new policy trees and vectors. The second is a pruning step.

In the backup step, the DP operator is given a set of depth-t policy trees Q_i^t for each agent i, and corresponding sets of value vectors V_i^t of dimension |S × Q_{−i}^t|.¹ Based on the action, transition, observation, and reward model of the POSG, it performs an exhaustive backup on each of the sets of trees, to form Q_i^{t+1} for each agent i. It also recursively computes the value vectors in V_i^{t+1} for each agent i. Note that this step corresponds to recursively creating a normal form with hidden state representation of a horizon-(t+1) POSG, given a normal form with hidden state representation of the horizon-t POSG.

    ¹ The value function V_i^t of agent i can be represented as a set V_i^t of value vectors of dimension |S × Q_{−i}^t|, with one for each strategy in Q_i^t, or as a set of value vectors of dimension |S|, with one for each strategy profile in Q_i^t × Q_{−i}^t. The two representations are equivalent. The latter is more useful in terms of implementation, since it means the size of vectors does not change during iterated elimination of dominated strategies; only the number of vectors changes. (Using this representation, multiple |S|-vectors must be deleted for each strategy deleted.) The former representation is more useful in explaining the algorithm, since it entails a one-to-one correspondence between strategies and value vectors, and so we adopt it in this section.
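To make the backup step concrete, here is a minimal Python sketch of the exhaustive backup for a single agent (our own illustration and tree representation, not the authors' implementation). It also shows why the per-agent set grows as |A_i| |Q_i^t|^{|O_i|}.

```python
# Hedged sketch of the exhaustive-backup step for one agent.  A policy tree is
# represented as a tuple (action, {observation: subtree}); the representation
# and names are assumptions made for illustration.
from itertools import product

def exhaustive_backup(Q_t, actions, observations):
    """All depth-(t+1) trees: a root action plus one depth-t subtree from Q_t
    per observation, so len(result) == len(actions) * len(Q_t) ** len(observations)."""
    Q_next = []
    for a in actions:
        for subtrees in product(Q_t, repeat=len(observations)):
            Q_next.append((a, dict(zip(observations, subtrees))))
    return Q_next

# Tiny check with 2 actions and 2 observations: 2 depth-1 trees become
# 2 * 2**2 = 8 depth-2 trees.
if __name__ == "__main__":
    A, O = ["a0", "a1"], ["o0", "o1"]
    Q1 = [(a, {}) for a in A]
    Q2 = exhaustive_backup(Q1, A, O)
    assert len(Q2) == len(A) * len(Q1) ** len(O) == 8
```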
The second step of the multi-agent DP operator consists of pruning dominated policy trees. As in the single-agent case, an agent i policy tree can be pruned if its removal does not decrease the value of any belief for agent i. As with normal form games, removal of a policy tree reduces the dimensionality of the other agents' belief space, and it can be repeated until no more policy trees can be pruned from any agent's set. (Note that different agent orderings may lead to different sets of policy trees and value vectors. The question of order dependence in eliminating dominated strategies has been extensively studied in game theory, and we do not consider it here.) Pseudocode for the multi-agent DP operator is given in Table 1.

Input: Sets of depth-t policy trees Q_i^t and corresponding value vectors V_i^t for each agent i.
1. Perform exhaustive backups to get Q_i^{t+1} for each i.
2. Recursively compute V_i^{t+1} for each i.
3. Repeat until no more pruning is possible:
   (a) Choose an agent i, and find a policy tree q_j ∈ Q_i^{t+1} for which the following condition is satisfied:
       ∀b ∈ ∆(S × Q_{−i}^{t+1}), ∃v_k ∈ V_i^{t+1} \ v_j s.t. b · v_k ≥ b · v_j.
   (b) Q_i^{t+1} ← Q_i^{t+1} \ q_j.
   (c) V_i^{t+1} ← V_i^{t+1} \ v_j.
Output: Sets of depth-(t+1) policy trees Q_i^{t+1} and corresponding value vectors V_i^{t+1} for each agent i.

Table 1: The multi-agent dynamic programming operator.

The validity of the pruning step follows from a version of the optimality principle of dynamic programming, which we prove for a single iteration of the multi-agent DP operator. By induction, it follows for any number of iterations.

Theorem 1  Consider a set Q_i^t of depth-t policy trees for agent i, and consider the set Q_i^{t+1} of depth-(t+1) policy trees created by exhaustive backup, in the first step of the multi-agent DP operator. If any policy tree q_j ∈ Q_i^t is very weakly dominated, then any policy tree q′ ∈ Q_i^{t+1} that contains q_j as a subtree is also very weakly dominated.

Proof: Consider a very weakly dominated policy tree q_j ∈ Q_i^t. According to the dual formulation of the test for dominance, there exists a distribution p over policy trees in Q_i^t \ q_j such that Σ_{k≠j} p(k) v_k(s, q_{−i}) ≥ v_j(s, q_{−i}) for all s ∈ S and q_{−i} ∈ Q_{−i}^t. (Recall that v_j ∈ V_i^t is the value vector corresponding to policy tree q_j.) Now consider any policy tree q′ ∈ Q_i^{t+1} with q_j as a subtree. We can replace instances of q_j in q′ with the distribution p to get a behavioral strategy, which is a stochastic policy tree. From the test for dominance, it follows that the value of this behavioral strategy is at least as high as that of q′, for any distribution over states and strategies of the other agents. Since any behavioral strategy can be represented by a distribution over pure strategies, it follows that q′ is very weakly dominated. □

Thus, pruning very weakly dominated strategies from the sets Q_i^t before using the dynamic programming operator is equivalent to performing the dynamic programming operator without first pruning Q_i^t. The advantage of first pruning very weakly dominated strategies from the sets Q_i^t is that it improves the efficiency of dynamic programming by reducing the initial size of the sets Q_i^{t+1} generated by exhaustive backup.

It is possible to define a multi-agent DP operator that prunes strongly dominated strategies. However, sometimes a strategy that is not strongly dominated will have a strongly dominated subtree. This is referred to as an incredible threat in the literature. Thus it is an open question whether we can define a multi-agent DP operator that prunes only strongly dominated strategies. In this paper, we focus on pruning very weakly dominated strategies. As already noted, this is identical to the form of pruning used for POMDPs.

There is an important difference between this algorithm and the dynamic programming operator for single-agent POMDPs, in terms of implementation. In the single-agent case, only the value vectors need to be kept in memory. At execution time, an optimal action can be extracted from the value function using one-step lookahead, at each time step. We do not currently have a way of doing this when there are multiple agents. In the multi-agent case, instead of selecting an action at each time step, each agent must select a policy tree (i.e., a complete strategy) at the beginning of the game. Thus, the policy tree sets must also be remembered. Of course, some memory savings is possible by realizing that the policy trees for an agent share subtrees.

Solving finite-horizon POSGs

As we have described, any finite-horizon POSG can be given a normal form representation. The process of computing the normal form representation is recursive. Given the definition of a POSG, we successively compute normal form games with hidden state for horizons one, two, and so on, up to horizon T. Instead of computing all possible strategies for each horizon, we have defined a multi-agent dynamic programming operator that performs iterated elimination of very weakly dominated strategies at each stage. This improves the efficiency of the algorithm because if a policy tree is pruned by the multi-agent DP operator at one stage, every policy tree containing it as a subtree is effectively eliminated, in the sense that it will not be created at a later stage.

We now show that performing iterated elimination of very weakly dominated strategies at each stage in the construction of the normal form game is equivalent to waiting until the final stage to perform iterated elimination of very weakly dominated strategies.

Theorem 2  Dynamic programming applied to a finite-horizon POSG corresponds to iterated elimination of very weakly dominated strategies in the normal form of the POSG.

Proof: Let T be the horizon of the POSG. If the initial state distribution of the POSG is not fixed, then the POSG can be thought of as a normal form game with hidden state. Theorem 1 implies that each time a policy tree is pruned by the DP algorithm, every strategy containing it as a subtree is very weakly dominated in this game. And if a strategy is very weakly dominated when the initial state distribution is not fixed, then it is certainly very weakly dominated for a fixed initial state distribution. Thus, the DP algorithm can be viewed as iteratively eliminating very weakly dominated strategies in the POSG. □

In the case of cooperative games, also known as DEC-POMDPs, removing very weakly dominated strategies preserves at least one optimal strategy profile. Thus, the multi-agent DP operator can be used to solve finite-horizon DEC-POMDPs optimally. When the DP algorithm reaches step T, we can simply extract the highest-valued strategy profile for the start state distribution.

Corollary 1  Dynamic programming applied to a finite-horizon DEC-POMDP yields an optimal strategy profile.

For general-sum POSGs, the DP algorithm converts the POSG to a normal form representation with reduced sets of strategies in which there are no very weakly dominated strategies. Although selecting an equilibrium presents a challenging problem in the general-sum case, standard techniques for selecting an equilibrium in a normal form game can be used.
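For the cooperative case, the final extraction step can be as simple as the following sketch (ours, not the authors' code), which assumes the implementation-oriented representation of footnote 1: one |S|-dimensional value vector per joint strategy profile, with made-up names and numbers.

```python
# Hedged sketch of selecting the best joint policy-tree profile for a
# DEC-POMDP once the DP algorithm reaches the horizon.  The data layout
# (a dict from joint profiles to |S|-vectors) and the toy numbers are assumptions.
import numpy as np

def best_joint_profile(b0, profile_values):
    """Return the (profile, value vector) pair maximizing b0 . v."""
    return max(profile_values.items(), key=lambda kv: float(np.dot(b0, kv[1])))

if __name__ == "__main__":
    b0 = np.array([0.5, 0.5])                        # start state distribution
    values = {("q1a", "q2a"): np.array([3.0, 1.0]),  # hypothetical joint profiles
              ("q1b", "q2b"): np.array([2.0, 2.5])}
    profile, v = best_joint_profile(b0, values)
    print(profile, float(np.dot(b0, v)))             # ('q1b', 'q2b') 2.25
```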
Example

We ran initial tests on a cooperative game involving control of a multi-access broadcast channel (Ooi & Wornell 1996). In this problem, nodes need to broadcast messages to each other over a channel, but only one node may broadcast at a time, otherwise a collision occurs. The nodes share the common goal of maximizing the throughput of the channel.

The process proceeds in discrete time steps. At the start of each time step, each node decides whether or not to send a message. The nodes receive a reward of 1 when a message is successfully broadcast and a reward of 0 otherwise. At the end of the time step, each node receives a noisy observation of whether or not a message got through.

The message buffer for each agent has space for only one message. If a node is unable to broadcast a message, the message remains in the buffer for the next time step. If a node i is able to send its message, the probability that its buffer will fill up on the next step is p_i. Our problem has two nodes, with p_1 = 0.9 and p_2 = 0.1. There are 4 states, 2 actions per agent, and 2 observations per agent.

We compared our DP algorithm with a brute-force algorithm, which also builds sets of policy trees, but never prunes any of them. On a machine with 2 gigabytes of memory, the brute-force algorithm was able to complete iteration 3 before running out of memory, while the DP algorithm was able to complete iteration 4. At the end of iteration 4, the number of policy trees for the DP algorithm was less than 1% of the number that would have been produced by the brute-force algorithm, had it been able to complete the iteration. This result, shown in Table 2, indicates that the multi-agent DP operator can prune a significant number of trees.

    Horizon    Brute force         Dynamic programming
       1       (2, 2)              (2, 2)
       2       (8, 8)              (6, 6)
       3       (128, 128)          (20, 20)
       4       (32768, 32768)      (300, 300)

Table 2: Performance of both algorithms on the multi-access broadcast channel problem. Each cell displays the number of policy trees produced for each agent. The brute-force algorithm could not compute iteration 4; the numbers shown in that cell reflect how many policy trees it would need to create for each agent.

However, even with pruning, the number of policy trees grows quickly with the horizon. At the end of the fourth iteration, each agent has 300 policy trees that are not dominated. Because the piecewise linear and convex value function consists of one |S|-vector for each pair of policy trees from the two agents, the representation of the value function requires 300² |S|-vectors. In the fifth iteration, an exhaustive backup would create a value function that consists of 2·300⁴ |S|-vectors, or more than 16 billion |S|-vectors, before beginning the process of pruning. This illustrates how the algorithm can run out of memory. In the next section, we discuss possible ways to avoid the explosion in size of the value function.

Figure 1 shows a pair of depth-4 policy trees constructed by the DP algorithm. In the case where the message buffers both start out full, this pair is optimal, yielding a total reward of 3.89.

[Figure 1: A pair of policy trees, one for Agent 1 and one for Agent 2, that is optimal for the horizon-4 problem when both message buffers start out full. Node labels: s = send message, d = don't send message; branch labels: c = collision, n = no collision.]
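The counts above follow directly from the backup recurrence |Q^{t+1}| = |A| |Q^t|^{|O|} with two actions and two observations per agent; the short script below (our own check, not from the paper) reproduces the brute-force column of Table 2.

```python
# Reproduce the unpruned policy-tree counts of Table 2 from the backup
# recurrence |Q^{t+1}| = |A| * |Q^t| ** |O| (2 actions, 2 observations per agent).
NUM_ACTIONS, NUM_OBS = 2, 2

count = NUM_ACTIONS                                    # depth-1 trees per agent
for horizon in range(1, 5):
    print(f"horizon {horizon}: {count} trees per agent without pruning")
    count = NUM_ACTIONS * count ** NUM_OBS
# Prints 2, 8, 128, 32768 -- the brute-force column of Table 2.

# With pruning, each agent keeps 300 trees at horizon 4, so the value function
# stores one |S|-vector per pair of trees:
print(300 ** 2, "value vectors at horizon 4")          # 90000
# and one more exhaustive backup yields 2 * 300**2 = 180000 trees per agent
# before any pruning begins.
print(NUM_ACTIONS * 300 ** NUM_OBS, "trees per agent after one more backup")
```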
Future work

Development of an exact dynamic programming approach to solving POSGs suggests several avenues for future research, and we briefly describe some possibilities.

Improving efficiency

A major scalability bottleneck is the fact that the number of policy trees grows rapidly with the horizon and can quickly consume a large amount of memory. There are several possible ways to address this. One technique that provides computational leverage in solving POMDPs is to prune policy trees incrementally, so that an exhaustive backup never has to be done (Cassandra, Littman, & Zhang 1997). Whether this can be extended to the multi-agent case is an open problem. Other techniques seem easier to extend. More aggressive pruning, such as pruning strategies that are almost very weakly dominated, can reduce the number of policy trees in exchange for bounded sub-optimality (Feng & Hansen 2001). The number of policy trees may be reduced by allowing stochastic policies, as in (Poupart & Boutilier 2004). Work on compactly represented POMDPs and value functions may be extended to the multi-agent case (Hansen & Feng 2000).

In addition, there exist POMDP algorithms that leverage a known start state distribution for greater efficiency. These algorithms perform a forward search from the start state and are able to avoid unreachable belief states. Whether some kind of forward search can be done in the multi-agent case is an important open problem.

Extension to infinite-horizon POSGs

It should be possible to extend our dynamic programming algorithm to infinite-horizon, discounted POSGs, and we are currently exploring this. In the infinite-horizon case, the multi-agent DP operator is applied to infinite trees. A finite set of infinite trees can be represented by a finite-state controller, and policy iteration algorithms for single-agent POMDPs have been developed based on this representation (Hansen 1998; Poupart & Boutilier 2004). We believe that they can be extended to develop a policy iteration algorithm for infinite-horizon POSGs. Because our definition of belief depends on explicit representation of a policy as a policy tree or finite-state controller, it is not obvious that a value iteration algorithm for infinite-horizon POSGs is possible.

Conclusion

We have presented an algorithm for solving POSGs that generalizes both dynamic programming for POMDPs and iterated elimination of dominated strategies for normal form games. It is the first exact algorithm for general POSGs, and we have shown that it can be used to find optimal solutions for cooperative POSGs. Although currently limited to solving very small problems, its development helps to clarify the relationship between POMDPs and game-theoretic models. There are many avenues for future research, in both making the algorithm more time and space efficient and extending it beyond finite-horizon POSGs.

Acknowledgments  We thank the anonymous reviewers for helpful comments. This work was supported in part by the National Science Foundation under grants IIS-0219606 and IIS-9984952, by NASA under cooperative agreement NCC 2-1311, and by the Air Force Office of Scientific Research under grant F49620-03-1-0090. Daniel Bernstein was supported by a NASA GSRP Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the NSF, NASA or AFOSR.

References

Becker, R.; Zilberstein, S.; Lesser, V.; and Goldman, C. V. 2003. Transition-independent decentralized Markov decision processes. In Proceedings of the 2nd International Conference on Autonomous Agents and Multi-agent Systems, 41–48.

Bernstein, D.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4):819–840.

Boutilier, C. 1999. Sequential optimality and coordination in multiagent systems. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 478–485.

Brafman, R., and Tennenholtz, M. 2002. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.

Cassandra, A.; Littman, M. L.; and Zhang, N. L. 1997. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence, 54–61.

Feng, Z., and Hansen, E. 2001. Approximate planning for factored POMDPs. In Proceedings of the 6th European Conference on Planning.

Filar, J., and Vrieze, K. 1997. Competitive Markov Decision Processes. Springer-Verlag.

Hansen, E., and Feng, Z. 2000. Dynamic programming for POMDPs using a factored state representation. In Proceedings of the 5th International Conference on Artificial Intelligence Planning and Scheduling, 130–139.

Hansen, E. 1998. Solving POMDPs by searching in policy space. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98), 211–219.

Hsu, K., and Marcus, S. I. 1982. Decentralized control of finite state Markov processes. IEEE Transactions on Automatic Control AC-27(2):426–431.

Hu, J., and Wellman, M. 2003. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research 4:1039–1069.

Kaelbling, L.; Littman, M.; and Cassandra, A. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134.

Kearns, M.; Mansour, Y.; and Singh, S. 2000. Fast planning in stochastic games. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), 309–316.

Koller, D., and Pfeffer, A. 1997. Representations and solutions for game-theoretic problems. Artificial Intelligence 94(1):167–215.

Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing, 750–759.

Kuhn, H. 1953. Extensive games and the problem of information. In Kuhn, H., and Tucker, A., eds., Contributions to the Theory of Games II. Princeton University Press. 193–216.

Littman, M. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, 157–163.

Nair, R.; Pynadath, D.; Yokoo, M.; Tambe, M.; and Marsella, S. 2003. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 705–711.

Ooi, J. M., and Wornell, G. W. 1996. Decentralized control of a multiple access broadcast channel: Performance bounds. In Proceedings of the 35th Conference on Decision and Control, 293–298.

Peshkin, L.; Kim, K.-E.; Meuleau, N.; and Kaelbling, L. P. 2000. Learning to cooperate via policy search. In Proceedings of the 16th International Conference on Uncertainty in Artificial Intelligence, 489–496.

Poupart, P., and Boutilier, C. 2004. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference. MIT Press.

Shapley, L. 1953. Stochastic games. Proceedings of the National Academy of Sciences of the United States of America 39:1095–1100.

Smallwood, R., and Sondik, E. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071–1088.