Superhuman AI for multiplayer poker
Noam Brown and Tuomas Sandholm
In recent years there have been great strides in artificial intelligence (AI), with games often serving as
challenge problems, benchmarks, and milestones for progress. Poker has served for decades as such a
challenge problem. Past successes in such benchmarks, including poker, have been limited to two-player
games. However, poker in particular is traditionally played with more than two players. Multiplayer games
present fundamental additional issues beyond those in two-player games, and multiplayer poker is a
recognized AI milestone. In this paper we present Pluribus, an AI that we show is stronger than top human
professionals in six-player no-limit Texas hold’em poker, the most popular form of poker played by humans.
Poker has served as a challenge problem for the fields of artificial intelligence (AI) and game theory for decades (1). In fact, the foundational papers on game theory used poker to illustrate their concepts (2, 3). The reason for this choice is simple: no other popular recreational game captures the challenges of hidden information as effectively and as elegantly as poker. Although poker has been useful as a benchmark for new AI and game-theoretic techniques, the challenge of hidden information in strategic settings is not limited to recreational games. The equilibrium concepts of von Neumann and Nash have been applied to many real-world challenges such as auctions, cybersecurity, and pricing.

The past two decades have witnessed rapid progress in the ability of AI systems to play increasingly complex forms of poker (4–6). However, all prior breakthroughs have been limited to settings involving only two players. Developing a superhuman AI for multiplayer poker was the widely recognized main remaining milestone. In this paper we describe Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold’em poker, the most commonly played poker format in the world.

Theoretical and practical challenges of multiplayer games

AI systems have reached superhuman performance in games such as checkers (7), chess (8), two-player limit poker (4), Go (9), and two-player no-limit poker (6). All of these involve only two players and are zero-sum games (meaning that whatever one player wins, the other player loses). Every one of those superhuman AI systems was generated by attempting to approximate a Nash equilibrium strategy rather than by, for example, trying to detect and exploit weaknesses in the opponent. A Nash equilibrium is a list of strategies, one for each player, in which no player can improve by deviating to a different strategy. Nash equilibria have been proven to exist in all finite games, and many infinite games, though finding an equilibrium may be difficult.

Two-player zero-sum games are a special class of games in which Nash equilibria also have an extremely useful additional property: any player who chooses to use a Nash equilibrium is guaranteed to not lose in expectation no matter what the opponent does (as long as one side does not have an intrinsic advantage under the game rules, or the players alternate sides). In other words, a Nash equilibrium strategy is unbeatable in two-player zero-sum games that satisfy the above criteria. For this reason, to “solve” a two-player zero-sum game means to find an exact Nash equilibrium. For example, the Nash equilibrium strategy for Rock-Paper-Scissors is to randomly pick Rock, Paper, or Scissors with equal probability. Against such a strategy, the best that an opponent can do in expectation is tie (10). In this simple case, playing the Nash equilibrium also guarantees that the player will not win in expectation. However, in more complex games even determining how to tie against a Nash equilibrium may be difficult; if the opponent ever chooses suboptimal actions, then playing the Nash equilibrium will indeed result in victory in expectation.
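As a concrete check of the Rock-Paper-Scissors claim above, the short Python sketch below (our illustration, not part of Pluribus or the paper's code) computes the expected payoff of each pure strategy against the uniform equilibrium mixture; every action earns zero in expectation, so the best an opponent can do is tie.

```python
# Illustrative sketch (not from the paper): expected payoff for Player 1 of each
# pure strategy against the uniform Nash equilibrium mixture in Rock-Paper-Scissors.

ACTIONS = ["Rock", "Paper", "Scissors"]

def payoff(a1, a2):
    """Payoff to the first player: +1 win, 0 tie, -1 loss."""
    beats = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}
    return 0 if a1 == a2 else (1 if beats[a1] == a2 else -1)

equilibrium = {a: 1 / 3 for a in ACTIONS}  # the opponent mixes uniformly

for a1 in ACTIONS:
    value = sum(equilibrium[a2] * payoff(a1, a2) for a2 in ACTIONS)
    print(a1, value)  # each pure strategy earns 0 in expectation: the best is a tie
```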
In principle, playing the Nash equilibrium can be combined with opponent exploitation by initially playing the equilibrium strategy and then over time shifting to a strategy that exploits the opponent’s observed weaknesses (for example, by switching to always playing Paper against an opponent that always plays Rock) (11). However, except in certain restricted ways (12), shifting to an exploitative non-equilibrium strategy opens oneself up to exploitation because the opponent could also change strategies at any moment. Additionally, existing techniques for opponent exploitation require too many samples to be competitive with human ability outside of small games. Pluribus plays a fixed strategy that does not adapt to the observed tendencies of the opponents.
Although a Nash equilibrium strategy is guaranteed to exist in any finite game, efficient algorithms for finding one are only proven to exist for special classes of games, among which two-player zero-sum games are the most prominent. No polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have sweeping surprising implications in computational complexity theory (13, 14). Finding a Nash equilibrium in zero-sum games with three or more players is at least as hard (because a dummy player can be added to the two-player game to make it a three-player zero-sum game). Even approximating a Nash equilibrium is hard (except in special cases) in theory (15), and in games with more than two players, even the best complete algorithm can only address games with a handful of possible strategies per player (16). Moreo-

strategies in a wider class of strategic settings.

Description of Pluribus

The core of Pluribus’s strategy was computed via self play, in which the AI plays against copies of itself, without any data of human or prior AI play used as input. The AI starts from scratch by playing randomly, and gradually improves as it determines which actions, and which probability distribution over those actions, lead to better outcomes against earlier versions of its strategy. Forms of self play have previously been used to generate powerful AIs in two-player zero-sum games such as backgammon (18), Go (9, 19), Dota 2 (20), StarCraft 2 (21), and two-player poker (4–6), though the precise algorithms that were used have varied widely.
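The following toy sketch illustrates the self-play idea described above on Rock-Paper-Scissors. It is a deliberately simple fictitious-play-style loop written for this discussion, not Pluribus's actual algorithm (Pluribus uses a form of Monte Carlo counterfactual regret minimization, discussed below): each iteration the agent best-responds to the average of its earlier strategies, so play that does better against earlier versions is reinforced.

```python
# Toy self-play loop on Rock-Paper-Scissors (fictitious-play style), written for
# illustration only; Pluribus itself uses a form of Monte Carlo counterfactual
# regret minimization rather than this procedure.

ACTIONS = ["Rock", "Paper", "Scissors"]

def payoff(a1, a2):
    beats = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}
    return 0 if a1 == a2 else (1 if beats[a1] == a2 else -1)

counts = {a: 1 for a in ACTIONS}  # how often past versions of the strategy chose each action

for _ in range(10_000):
    total = sum(counts.values())
    earlier_versions = {a: counts[a] / total for a in ACTIONS}  # average of earlier strategies
    # Pick the action that does best against earlier versions of the strategy.
    best = max(ACTIONS, key=lambda a1: sum(earlier_versions[a2] * payoff(a1, a2) for a2 in ACTIONS))
    counts[best] += 1

total = sum(counts.values())
print({a: round(counts[a] / total, 3) for a in ACTIONS})  # approaches the uniform equilibrium
```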
Although it is easy to construct toy games with more than two players
time to such “off-tree” actions.

The other form of abstraction we use in Pluribus is information abstraction, in which decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and revealed board cards) are bucketed together and treated identically (26–28). For example, a ten-high straight and a nine-high straight are distinct hands, but are nevertheless strategically similar. Pluribus may bucket these hands together and treat them identically, thereby reducing the number of distinct situations for which it needs to determine a strategy. Information abstraction drastically reduces the complexity of the game, but may wash away subtle differences that are important for superhuman performance. Therefore, during actual play against humans, Pluribus uses information abstraction only to reason about situations on
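The sketch below illustrates the kind of bucketing just described; the bucketing function, hand-strength numbers, and table layout are hypothetical simplifications for this discussion, not Pluribus's actual information-abstraction algorithm.

```python
# Hypothetical sketch of information abstraction (not Pluribus's actual
# abstraction algorithm): strategically similar situations share a bucket,
# and one strategy entry is stored per bucket instead of per distinct hand.

def bucket(hand_strength: float, num_buckets: int = 10) -> int:
    """Map an estimated hand strength in [0, 1] to one of num_buckets buckets."""
    return min(int(hand_strength * num_buckets), num_buckets - 1)

# Illustrative strength values (not computed from real equities): a ten-high
# straight and a nine-high straight are distinct hands but land in the same
# bucket, so they are treated identically.
ten_high_straight = bucket(0.86)
nine_high_straight = bucket(0.84)
assert ten_high_straight == nine_high_straight

strategy_table = {}  # (bucket, betting_history) -> action probabilities
```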
received for choosing an action versus what the traverser actually achieved (in expectation) on the iteration is added to the counterfactual regret for the action. Counterfactual regret represents how much the traverser regrets having not chosen that action in previous iterations. At the end of the iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability.
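The following sketch shows the regret update and the "higher regret, higher probability" strategy update just described, in simplified regret-matching form; the action names and values are illustrative, and this is not the full Monte Carlo CFR implementation used by Pluribus.

```python
# Sketch of the update described above, in simplified regret-matching form
# (not the full Monte Carlo CFR implementation used by Pluribus).

def regret_matching(regrets):
    """Choose actions with higher positive counterfactual regret with higher probability."""
    positive = {a: max(r, 0.0) for a, r in regrets.items()}
    total = sum(positive.values())
    if total == 0:
        return {a: 1.0 / len(regrets) for a in regrets}  # uniform if nothing is regretted
    return {a: p / total for a, p in positive.items()}

def update(regrets, action_values):
    """action_values: what the traverser would have received for each action this iteration."""
    strategy = regret_matching(regrets)
    # What the traverser actually achieved in expectation under its current strategy.
    achieved = sum(strategy[a] * action_values[a] for a in action_values)
    # Add the difference to the counterfactual regret for each action.
    for a in action_values:
        regrets[a] += action_values[a] - achieved
    return regret_matching(regrets)  # next iteration's strategy

# Illustrative action names and values (hypothetical, not from Pluribus).
regrets = {"fold": 0.0, "call": 0.0, "raise": 0.0}
print(update(regrets, {"fold": -1.0, "call": 0.2, "raise": 1.0}))
```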
For two-player zero-sum games, CFR guarantees that the average strategy played over all iterations converges to a Nash equilibrium, but convergence to a Nash equilibrium is not guaranteed outside of two-player zero-sum games. Nevertheless, CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. This, in turn, guarantees in the limit that the average performance of CFR on each iteration that was played matches the
We set the size of the blueprint strategy abstraction to allow Pluribus to run during live play on a machine with no more than 128 GB of memory while storing a compressed form of the blueprint strategy in memory.

Depth-limited search in imperfect-information games

The blueprint strategy for the entire game is necessarily coarse-grained owing to the size and complexity of no-limit Texas hold’em. Pluribus only plays according to this blueprint strategy in the first betting round (of four), where the number of decision points is small enough that the blueprint strategy can afford to not use information abstraction and have a lot of actions in the action abstraction. After the first round (and even in the first round if an opponent chooses a bet size that is sufficiently different from the sizes in the blueprint action

adjust to a strategy of always playing Paper. In that case, the value of always playing Rock would actually be −1.

This example illustrates that in imperfect-information subgames (the part of the game in which search is being conducted) (40), leaf nodes do not have fixed values. Instead, their values depend on the strategy that the searcher chooses in the subgame (that is, the probabilities that the searcher assigns to his actions in the subgame). In principle, this could be addressed by having the value of a subgame leaf node be a function of the searcher’s strategy in the subgame, but this is impractical in large games. One alternative is to make the value of a leaf node conditional only on the belief distribution of both players at that point in the game. This was used to generate the two-player poker AI DeepStack (5). However, this option is extremely expensive because it requires one to
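To make the point about non-fixed leaf values concrete, the sketch below (our illustration, not the paper's implementation) evaluates a searcher who commits to always playing Rock: against an opponent frozen at the uniform equilibrium the leaf value is 0, but against an opponent who is allowed to adjust it is −1, matching the example above.

```python
# Illustrative sketch (not the paper's implementation): the value of committing
# to "always Rock" in the subgame depends on what the opponent is assumed to do
# at the leaves.

ACTIONS = ["Rock", "Paper", "Scissors"]

def payoff(a1, a2):
    beats = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}
    return 0 if a1 == a2 else (1 if beats[a1] == a2 else -1)

searcher = {"Rock": 1.0, "Paper": 0.0, "Scissors": 0.0}  # always play Rock

# Assumption (a): the opponent keeps playing the uniform equilibrium at the leaves.
value_vs_fixed_equilibrium = sum(
    searcher[a1] * (1 / 3) * payoff(a1, a2) for a1 in ACTIONS for a2 in ACTIONS
)

# Assumption (b): the opponent may adjust to the searcher's committed strategy.
value_vs_adjusting_opponent = min(
    sum(searcher[a1] * payoff(a1, a2) for a1 in ACTIONS) for a2 in ACTIONS
)

print(value_vs_fixed_equilibrium)   # 0.0, as in the search depicted in Fig. 3
print(value_vs_adjusting_opponent)  # -1, the opponent shifts to always playing Paper
```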
holding the best possible hand, then the opponents would know to always fold in response.

To cope with this, Pluribus keeps track of the probability it would have reached the current situation with each possible hand according to its strategy. Regardless of which hand Pluribus is actually holding, it will first calculate how it would act with every possible hand, being careful to balance its strategy across all the hands so as to remain unpredictable to the opponent. Once this balanced strategy across all hands is computed, Pluribus then executes an action for the hand it is actually holding. The structure of a depth-limited imperfect-information subgame as used in Pluribus is shown in Fig. 4.
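The sketch below illustrates the bookkeeping described above; the hand labels, probabilities, and action set are hypothetical, and the code only shows the range tracking idea, not Pluribus's implementation.

```python
# Hypothetical sketch of the bookkeeping described above (hand labels,
# probabilities, and actions are illustrative, not Pluribus's implementation).

import random

# Probability of having reached this situation with each possible hand,
# according to the strategy played so far (the "range" over hands).
reach_prob = {"AA": 0.9, "KQs": 0.6, "72o": 0.05}

# A balanced strategy is first computed for every possible hand,
# not only for the hand actually held.
strategy = {
    "AA":  {"raise": 0.7, "call": 0.3},
    "KQs": {"raise": 0.4, "call": 0.6},
    "72o": {"raise": 0.1, "call": 0.9},
}

actual_hand = "KQs"
action = random.choices(
    list(strategy[actual_hand]), weights=list(strategy[actual_hand].values())
)[0]

# The range is then updated for every hand by the probability that hand would
# have taken the chosen action, keeping later decisions consistent and unpredictable.
reach_prob = {h: p * strategy[h][action] for h, p in reach_prob.items()}
print(action, reach_prob)
```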
Pluribus used one of two different forms of CFR to compute a strategy in the subgame depending on the size of the subgame and the part of the game. If the subgame is rela-

Merson, Nicholas Petrangelo, Sean Ruane, Trevor Savage, and Jacob Toole. In this experiment, 10,000 hands of poker were played over 12 days. Each day, five volunteers from the pool of professionals were selected to participate based on availability. The participants were not told who else was participating in the experiment. Instead, each participant was assigned an alias that remained constant throughout the experiment. The alias of each player in each game was known, so that players could track the tendencies of each player throughout the experiment. $50,000 was divided among the human participants based on their performance to incentivize them to play their best. Each player was guaranteed a minimum of $0.40 per hand for participating, but this could increase to as much as $1.60 per hand based on performance. After applying AIVAT, Pluribus won an average of 48
conventional human wisdom that limping (calling the “big blind” rather than folding or raising) is suboptimal for any player except the “small blind” player who already has half the big blind in the pot by the rules, and thus has to invest only half as much as the other players to call. While Pluribus initially experimented with limping when computing its blueprint strategy offline through self play, it gradually discarded this action from its strategy as self play continued. However, Pluribus disagrees with the folk wisdom that “donk betting” (starting a round by betting when one ended the previous betting round with a call) is a mistake; Pluribus does this far more often than professional humans do.

Conclusions

Forms of self play combined with forms of search have led to

References and notes

10. Recently, in the real-time strategy games Dota 2 (20) and StarCraft 2 (21), AIs have beaten top humans, but as humans have gained more experience against the AIs, humans have learned to beat them. This may be because for those two-player zero-sum games, the AIs were generated by techniques not guaranteed to converge to a Nash equilibrium, so they do not have the unbeatability property that Nash equilibrium strategies have in two-player zero-sum games. (Dota 2 involves two teams of five players each. However, because the players on the same team have the same objective and are not limited in their communication, the game is two-player zero-sum from an AI and game-theoretic perspective.)
11. S. Ganzfried, T. Sandholm, in International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS) (2011), pp. 533–540.
12. S. Ganzfried, T. Sandholm, ACM Trans. Econ. Comp. (TEAC) 3, 8 (2015). Best of EC-12 special issue.
13. C. Daskalakis, P. W. Goldberg, C. H. Papadimitriou, The complexity of computing a Nash equilibrium. SIAM J. Comput. 39, 195–259 (2009). doi:10.1137/070699652
14. X. Chen, X. Deng, S.-H. Teng, Settling the complexity of computing two-player Nash equilibria. J. Assoc. Comput. Mach. 56, 14 (2009). doi:10.1145/1516512.1516516
Conference on Autonomous Agents and Multiagent Systems (AAMAS) (2012), pp. 837–846.
37. R. Gibson, M. Lanctot, N. Burch, D. Szafron, M. Bowling, in AAAI Conference on
Artificial Intelligence (AAAI) (2012), pp. 1355–1361.
38. N. Brown, T. Sandholm, AAAI Conference on Artificial Intelligence (AAAI) (2019).
39. S. Ganzfried, T. Sandholm, in International Joint Conference on Artificial
Intelligence (IJCAI) (2013), pp. 120–128.
40. Here we use the term subgame the way it is usually used in AI. In game theory that
word is used differently by requiring a subgame to start with a node where the
player whose turn it is to move has no uncertainty about state—in particular, no
uncertainty about the opponents’ private information.
41. N. Brown, T. Sandholm, B. Amos, in Neural Information Processing Systems
(NeurIPS) (2018), pp. 7663–7674.
42. M. Johanson, K. Waugh, M. Bowling, M. Zinkevich, in International Joint Conference
on Artificial Intelligence (IJCAI) (2011), pp. 258–265.
43. E. P. DeBenedictis, Rebooting Computers as Learning Machines. Computer 49,
84–87 (2016). doi:10.1109/MC.2016.156
44. N. Burch, M. Schmid, M. Moravcik, D. Morrill, M. Bowling, in AAAI Conference on
Artificial Intelligence (AAAI) (2018), pp. 949–956.
Fig. 1. An example of the equilibrium selection problem.
Fig. 2. A game tree traversal via Monte Carlo CFR. In this figure player P1 is traversing the game tree. Left: A
game is simulated until an outcome is reached. Middle: For each P1 decision point encountered in the simulation
in the Left figure, P1 explores each other action that P1 could have taken and plays out a simulation to the end of
the game. P1 then updates its strategy to pick actions with higher payoff with higher probability. Right: P1 explores
each other action that P1 could have taken at every new decision point encountered in the Middle figure, and P1
updates its strategy at those hypothetical decision points. This process repeats until no new P1 decision points
are encountered, which in this case is after three steps but in general may be more. Our implementation of MCCFR
(described in the supplementary material) is equivalent but traverses the game tree in a depth-first manner. (The
percentages in the figure are for illustration purposes only and may not correspond to actual percentages that the
algorithm would compute.)
Fig. 3. Perfect-information game search in Rock-Paper-Scissors.
Top: A sequential representation of Rock-Paper-Scissors in which Player
1 acts first but does not reveal her action to Player 2, who acts second.
The dashed lines between the Player 2 nodes signify that Player 2 does
not know which of those nodes he is in. The terminal values are shown
only for Player 1. Bottom: A depiction of the depth-limited subgame if
Player 1 conducts search (with a depth of one) using the same approach
as is used in perfect-information games. The approach assumes that
after each action Player 2 will play according to the Nash equilibrium
strategy of choosing Rock, Paper, and Scissors with probability 1/3 each.
This results in a value of zero for Player 1 regardless of her strategy.
Fig. 4. Real-time search in Pluribus. The subgame shows just two players for simplicity.
Fig. 5. Performance of Pluribus in the 5 humans + 1 AI experiment.
Top: The lines show the win rate (solid line) plus or minus the standard
error (dashed lines). Bottom: The lines show the cumulative number of
chips won (solid line) plus or minus the standard error (dashed lines). The
relatively steady performance of Pluribus over the course of the 10,000-
hand experiment suggests the humans were unable to find exploitable
weaknesses in the bot.