
RESEARCH ARTICLES

Cite as: N. Brown and T. Sandholm, Science 10.1126/science.aay2400 (2019).

Superhuman AI for multiplayer poker


Noam Brown1,2* and Tuomas Sandholm1,3,4,5*
1Computer Science Department, Carnegie Mellon University Pittsburgh, PA 15213, USA. 2Facebook AI Research New York, NY 10003, USA. 3Strategic Machine, Inc.
Pittsburgh, PA 15213, USA. 4Strategy Robot, Inc. Pittsburgh, PA 15213, USA. 5Optimized Markets, Inc. Pittsburgh, PA 15213, USA.
*Corresponding author. E-mail: noamb@[Link] (N.B.); sandholm@[Link] (T.S.)

In recent years there have been great strides in artificial intelligence (AI), with games often serving as challenge problems, benchmarks, and milestones for progress. Poker has served for decades as such a challenge problem. Past successes in such benchmarks, including poker, have been limited to two-player games. However, poker in particular is traditionally played with more than two players. Multiplayer games present fundamental additional issues beyond those in two-player games, and multiplayer poker is a recognized AI milestone. In this paper we present Pluribus, an AI that we show is stronger than top human professionals in six-player no-limit Texas hold’em poker, the most popular form of poker played by humans.

Poker has served as a challenge problem for the fields of artificial intelligence (AI) and game theory for decades (1). In fact, the foundational papers on game theory used poker to illustrate their concepts (2, 3). The reason for this choice is simple: no other popular recreational game captures the challenges of hidden information as effectively and as elegantly as poker. Although poker has been useful as a benchmark for new AI and game-theoretic techniques, the challenge of hidden information in strategic settings is not limited to recreational games. The equilibrium concepts of von Neumann and Nash have been applied to many real-world challenges such as auctions, cybersecurity, and pricing.

The past two decades have witnessed rapid progress in the ability of AI systems to play increasingly complex forms of poker (4–6). However, all prior breakthroughs have been limited to settings involving only two players. Developing a superhuman AI for multiplayer poker was the widely recognized main remaining milestone. In this paper we describe Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold’em poker, the most commonly played poker format in the world.

Theoretical and practical challenges of multiplayer games

AI systems have reached superhuman performance in games such as checkers (7), chess (8), two-player limit poker (4), Go (9), and two-player no-limit poker (6). All of these involve only two players and are zero-sum games (meaning that whatever one player wins, the other player loses). Every one of those superhuman AI systems was generated by attempting to approximate a Nash equilibrium strategy rather than by, for example, trying to detect and exploit weaknesses in the opponent. A Nash equilibrium is a list of strategies, one for each player, in which no player can improve by deviating to a different strategy. Nash equilibria have been proven to exist in all finite games, and many infinite games, though finding an equilibrium may be difficult.

Two-player zero-sum games are a special class of games in which Nash equilibria also have an extremely useful additional property: any player who chooses to use a Nash equilibrium is guaranteed to not lose in expectation no matter what the opponent does (as long as one side does not have an intrinsic advantage under the game rules, or the players alternate sides). In other words, a Nash equilibrium strategy is unbeatable in two-player zero-sum games that satisfy the above criteria. For this reason, to “solve” a two-player zero-sum game means to find an exact Nash equilibrium. For example, the Nash equilibrium strategy for Rock-Paper-Scissors is to randomly pick Rock, Paper, or Scissors with equal probability. Against such a strategy, the best that an opponent can do in expectation is tie (10). In this simple case, playing the Nash equilibrium also guarantees that the player will not win in expectation. However, in more complex games even determining how to tie against a Nash equilibrium may be difficult; if the opponent ever chooses suboptimal actions, then playing the Nash equilibrium will indeed result in victory in expectation.

In principle, playing the Nash equilibrium can be combined with opponent exploitation by initially playing the equilibrium strategy and then over time shifting to a strategy that exploits the opponent’s observed weaknesses (for example, by switching to always playing Paper against an opponent that always plays Rock) (11). However, except in certain restricted ways (12), shifting to an exploitative non-equilibrium strategy opens oneself up to exploitation because the opponent could also change strategies at any moment. Additionally, existing techniques for opponent exploitation require too many samples to be competitive with human ability outside of small games. Pluribus plays a fixed strategy that does not adapt to the observed tendencies of the opponents.
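The unbeatability property in the Rock-Paper-Scissors example can be checked directly. The sketch below is our own illustration (not part of the paper): it evaluates a few Player 1 strategies against the uniform equilibrium and confirms that each ties in expectation.

```python
# Illustration (ours): expected payoff of an arbitrary strategy against the
# uniform Nash equilibrium in Rock-Paper-Scissors.
# Payoff matrix from the row player's perspective; rows/cols = Rock, Paper, Scissors.
PAYOFF = [
    [0, -1, 1],   # Rock  vs Rock, Paper, Scissors
    [1, 0, -1],   # Paper
    [-1, 1, 0],   # Scissors
]

def expected_value(strategy, opponent):
    """Expected payoff of `strategy` (row player) against `opponent` (column player)."""
    return sum(strategy[i] * opponent[j] * PAYOFF[i][j]
               for i in range(3) for j in range(3))

nash = [1 / 3, 1 / 3, 1 / 3]
print(expected_value([1.0, 0.0, 0.0], nash))  # always Rock -> 0.0
print(expected_value([0.2, 0.5, 0.3], nash))  # any mix    -> 0.0 (ties in expectation)
```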
Although a Nash equilibrium strategy is guaranteed to exist in any finite game, efficient algorithms for finding one are only proven to exist for special classes of games, among which two-player zero-sum games are the most prominent. No polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have sweeping surprising implications in computational complexity theory (13, 14). Finding a Nash equilibrium in zero-sum games with three or more players is at least as hard (because a dummy player can be added to the two-player game to make it a three-player zero-sum game). Even approximating a Nash equilibrium is hard (except in special cases) in theory (15), and in games with more than two players, even the best complete algorithm can only address games with a handful of possible strategies per player (16). Moreover, even if a Nash equilibrium could be computed efficiently in a game with more than two players, it is not clear that playing such an equilibrium strategy would be wise. If each player in such a game independently computes and plays a Nash equilibrium, the list of strategies that they play (one strategy per player) may not be a Nash equilibrium and players might have an incentive to deviate to a different strategy. One example of this is the Lemonade Stand Game (17), illustrated in Fig. 1, in which each player simultaneously picks a point on a ring and wants to be as far away as possible from any other player. The Nash equilibrium is for all players to be spaced uniformly along the ring, but there are infinitely many ways this can be accomplished and therefore infinitely many Nash equilibria. If each player independently computes one of those equilibria, the joint strategy is unlikely to result in all players being spaced uniformly along the ring. Two-player zero-sum games are a special case where even if the players independently compute and select Nash equilibria, the list of strategies is still a Nash equilibrium.

The shortcomings of Nash equilibria outside of two-player zero-sum games, and the failure of any other game-theoretic solution concept to convincingly overcome them, have raised the question of what the right goal should even be in such games. In the case of six-player poker, we take the viewpoint that our goal should not be a specific game-theoretic solution concept, but rather to create an AI that empirically consistently defeats human opponents, including elite human professionals.

The algorithms we used to construct Pluribus, discussed in the next two sections, are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, we observe that Pluribus plays a strong strategy in multiplayer poker that is capable of consistently defeating elite human professionals. This shows that even though the techniques do not have known strong theoretical guarantees on performance outside of the two-player zero-sum setting, they are nevertheless capable of producing superhuman strategies in a wider class of strategic settings.

Description of Pluribus

The core of Pluribus’s strategy was computed via self play, in which the AI plays against copies of itself, without any data of human or prior AI play used as input. The AI starts from scratch by playing randomly, and gradually improves as it determines which actions, and which probability distribution over those actions, lead to better outcomes against earlier versions of its strategy. Forms of self play have previously been used to generate powerful AIs in two-player zero-sum games such as backgammon (18), Go (9, 19), Dota 2 (20), StarCraft 2 (21), and two-player poker (4–6), though the precise algorithms that were used have varied widely. Although it is easy to construct toy games with more than two players in which commonly-used self-play algorithms fail to converge to a meaningful solution (22), in practice self play has nevertheless been shown to do reasonably well in some games with more than two players (23).

Pluribus’s self play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. Then during actual play against opponents, Pluribus improves upon the blueprint strategy by searching for a better strategy in real time for the situations it finds itself in during the game. In subsections below, we discuss both of those phases in detail, but first we discuss abstraction, forms of which are used in both phases to make them scalable.

Abstraction for large imperfect-information games

There are far too many decision points in no-limit Texas hold’em to reason about individually. To reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction (24, 25). After abstraction, the bucketed decision points are treated as identical. We use two kinds of abstraction in Pluribus: action abstraction and information abstraction.

Action abstraction reduces the number of different actions the AI needs to consider. No-limit Texas hold’em normally allows any whole-dollar bet between $100 and $10,000. However, in practice there is little difference between betting $200 and betting $201. To reduce the complexity of forming a strategy, Pluribus only considers a few different bet sizes at any given decision point. The exact number of bets it considers varies between one and 14 depending on the situation. Although Pluribus can limit itself to only betting one of a few different sizes between $100 and $10,000, when actually playing no-limit poker, the opponents are not constrained to those few options. What happens if an opponent bets $150 while Pluribus has only been trained to consider bets of $100 or $200? Generally, Pluribus will rely on its search algorithm, described in a later section, to compute a response in real time to such “off-tree” actions.
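As a rough illustration of what an action abstraction looks like in code, the sketch below restricts the agent to a handful of bet sizes. It is our own example, not Pluribus’s abstraction: the pot fractions and the helper name are hypothetical, since the paper states only that between one and 14 sizes are considered per decision point.

```python
# Hedged sketch (ours) of action abstraction: instead of reasoning about
# every whole-dollar bet between min_bet and max_bet, the agent reasons
# about a small menu of sizes. The specific pot fractions are hypothetical.
def abstract_bet_sizes(pot, min_bet=100, max_bet=10_000, fractions=(0.5, 1.0, 2.0)):
    """Return the small menu of bet sizes the agent considers at this decision point."""
    sizes = {max_bet}  # the all-in bet is always kept as an option
    for f in fractions:
        bet = int(round(pot * f))
        if min_bet <= bet <= max_bet:
            sizes.add(bet)
    return sorted(sizes)

print(abstract_bet_sizes(pot=600))  # e.g. [300, 600, 1200, 10000]
```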
The other form of abstraction we use in Pluribus is information abstraction, in which decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and revealed board cards) are bucketed together and treated identically (26–28). For example, a ten-high straight and a nine-high straight are distinct hands, but are nevertheless strategically similar. Pluribus may bucket these hands together and treat them identically, thereby reducing the number of distinct situations for which it needs to determine a strategy. Information abstraction drastically reduces the complexity of the game, but may wash away subtle differences that are important for superhuman performance. Therefore, during actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round it is actually in. Information abstraction is also applied during offline self play.

Self play via improved Monte Carlo counterfactual regret minimization

The blueprint strategy in Pluribus was computed using a variant of counterfactual regret minimization (CFR) (29). CFR is an iterative self-play algorithm in which the AI starts by playing completely at random but gradually improves by learning to beat earlier versions of itself. Every competitive Texas hold’em AI for at least the past six years has computed its strategy using some variant of CFR (4–6, 23, 28, 30–34). We use a form of Monte Carlo CFR (MCCFR) that samples actions in the game tree rather than traversing the entire game tree on each iteration (33, 35–37).

On each iteration of the algorithm, MCCFR designates one player as the traverser whose current strategy is updated on the iteration. At the start of the iteration, MCCFR simulates a hand of poker based on the current strategy of all players (which is initially completely random). Once the simulated hand is completed, the AI reviews each decision that was made by the traverser and investigates how much better or worse it would have done by choosing the other available actions instead. Next, the AI reviews each hypothetical decision that would have been made following those other available actions and investigates how much better it would have done by choosing the other available actions, and so on. This traversal of the game tree is illustrated in Fig. 2. Exploring other hypothetical outcomes is possible because the AI knows each player’s strategy for the iteration, and can therefore simulate what would have happened had some other action been chosen instead. This counterfactual reasoning is one of the features that distinguishes CFR from other self-play algorithms that have been deployed in domains such as Go (9), Dota 2 (20), and StarCraft 2 (21).

The difference between what the traverser would have received for choosing an action versus what the traverser actually achieved (in expectation) on the iteration is added to the counterfactual regret for the action. Counterfactual regret represents how much the traverser regrets having not chosen that action in previous iterations. At the end of the iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability.

For two-player zero-sum games, CFR guarantees that the average strategy played over all iterations converges to a Nash equilibrium, but convergence to a Nash equilibrium is not guaranteed outside of two-player zero-sum games. Nevertheless, CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. This, in turn, guarantees in the limit that the average performance of CFR on each iteration that was played matches the average performance of the best single fixed strategy in hindsight. CFR is also proven to eliminate iteratively strictly dominated actions in all finite games (23).

Because the difference between counterfactual value and expected value is added to counterfactual regret rather than replacing it, the first iteration in which the agent played completely randomly (which is typically a very bad strategy) still influences the counterfactual regrets, and therefore the strategy that is played, for iterations far into the future. In the vanilla form of CFR, the influence of this first iteration decays at a rate of $\frac{1}{T}$, where $T$ is the number of iterations played. In order to more quickly decay the influence of these early “bad” iterations, Pluribus uses a recent form of CFR called Linear CFR (38) in early iterations. (We stop the discounting after that because the time cost of doing the multiplications with the discount factor is not worth the benefit later on.) Linear CFR assigns a weight of $T$ to the regret contributions of iteration $T$. Therefore, the influence of the first iteration decays at a rate of $\frac{1}{\sum_{t=1}^{T} t} = \frac{2}{T(T+1)}$. This leads to the strategy improving significantly more quickly in practice while still maintaining a near-identical worst-case bound on total regret. To speed up the blueprint strategy computation even further, actions with extremely negative regret are not explored in 95% of iterations.

The blueprint strategy for Pluribus was computed in 8 days on a 64-core server for a total of 12,400 CPU core hours. It required less than 512 GB of memory. At current cloud computing spot instance rates, this would cost about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games, which used large numbers of servers and/or farms of GPUs. More memory and computation would enable a finer-grained blueprint that would lead to better performance, but would also result in Pluribus using more memory or being slower during real-time search.
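The following is a minimal sketch, written by us rather than taken from Pluribus’s implementation, of the regret-matching update at a single decision point together with Linear CFR’s weighting of iteration t. The full algorithm applies this update at every traverser decision point visited during a Monte Carlo traversal of the game tree.

```python
# Minimal sketch (ours, not Pluribus's code) of regret matching at a single
# decision point, with Linear CFR's weighting of iteration t.
class DecisionPoint:
    def __init__(self, num_actions):
        self.regret_sum = [0.0] * num_actions
        self.strategy_sum = [0.0] * num_actions

    def current_strategy(self):
        # Regret matching: play each action in proportion to its positive regret.
        positive = [max(r, 0.0) for r in self.regret_sum]
        total = sum(positive)
        n = len(positive)
        return [p / total for p in positive] if total > 0 else [1.0 / n] * n

    def update(self, action_values, t):
        """action_values[a]: counterfactual value of action a on this iteration.
        t: iteration number; weighting contributions by t is Linear CFR's discount."""
        strategy = self.current_strategy()
        node_value = sum(p * v for p, v in zip(strategy, action_values))
        for a, v in enumerate(action_values):
            self.regret_sum[a] += t * (v - node_value)   # weighted counterfactual regret
            self.strategy_sum[a] += t * strategy[a]      # weighted average strategy

# Example: probability mass quickly shifts to the action with positive regret.
node = DecisionPoint(num_actions=3)
for t in range(1, 6):
    node.update(action_values=[1.0, 0.0, -1.0], t=t)
print(node.current_strategy())  # concentrates on action 0
```

With the weight t, the contribution of iteration 1 to the accumulated regret after T iterations shrinks as 2/(T(T+1)), matching the decay rate given above.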
We set the size of the blueprint strategy abstraction to allow Pluribus to run during live play on a machine with no more than 128 GB of memory while storing a compressed form of the blueprint strategy in memory.

Depth-limited search in imperfect-information games

The blueprint strategy for the entire game is necessarily coarse-grained owing to the size and complexity of no-limit Texas hold’em. Pluribus only plays according to this blueprint strategy in the first betting round (of four), where the number of decision points is small enough that the blueprint strategy can afford to not use information abstraction and have a lot of actions in the action abstraction. After the first round (and even in the first round if an opponent chooses a bet size that is sufficiently different from the sizes in the blueprint action abstraction) Pluribus instead conducts real-time search to determine a better, finer-grained strategy for the current situation it is in. For opponent bets on the first round that are slightly off the tree, Pluribus rounds the bet to a nearby on-tree size (using the pseudoharmonic mapping (39)) and proceeds to play according to the blueprint as if the opponent had used the latter bet size.

Real-time search has been necessary for achieving superhuman performance in many perfect-information games, including backgammon (18), chess (8), and Go (9, 19). For example, when determining their next move, chess AIs commonly look some number of moves ahead until a leaf node is reached at the depth limit of the algorithm’s lookahead. An evaluation function then estimates the value of the board configuration at the leaf node if both players were to play a Nash equilibrium from that point forward. In principle, if an AI could accurately calculate the value of every leaf node (e.g., win, draw, or loss), this algorithm would choose the optimal next move.

However, search as has been done in perfect-information games is fundamentally broken when applied to imperfect-information games. For example, consider a sequential form of Rock-Paper-Scissors, illustrated in Fig. 3, in which Player 1 acts first but does not reveal her action to Player 2, followed by Player 2 acting. If Player 1 were to conduct search that looks just one move ahead, every one of her actions would appear to lead to a leaf node with zero value. After all, if Player 2 plays the Nash equilibrium strategy of choosing each action with probability $\frac{1}{3}$, the value to Player 1 of choosing Rock is zero, as is the value of choosing Scissors. So Player 1’s search algorithm could choose to always play Rock because, given the values of the leaf nodes, this appears to be equally good as any other strategy.

Indeed, if Player 2’s strategy were fixed to always playing the Nash equilibrium, always playing Rock would be an optimal Player 1 strategy. However, in reality Player 2 could adjust to a strategy of always playing Paper. In that case, the value of always playing Rock would actually be −1.

This example illustrates that in imperfect-information subgames (the part of the game in which search is being conducted) (40), leaf nodes do not have fixed values. Instead, their values depend on the strategy that the searcher chooses in the subgame (that is, the probabilities that the searcher assigns to his actions in the subgame). In principle, this could be addressed by having the value of a subgame leaf node be a function of the searcher’s strategy in the subgame, but this is impractical in large games. One alternative is to make the value of a leaf node conditional only on the belief distribution of both players at that point in the game. This was used to generate the two-player poker AI DeepStack (5). However, this option is extremely expensive because it requires one to solve huge numbers of subgames that are conditional on beliefs. It becomes even more expensive as the amount of hidden information or the number of players grows. The two-player poker AI Libratus sidestepped this issue by only doing real-time search when the remaining game was short enough that the depth limit would extend to the end of the game (6). However, as the number of players grows, always solving to the end of the game also becomes computationally prohibitive.

Pluribus instead uses a modified form of an approach that we recently designed—previously only for two-player zero-sum games (41)—in which the searcher explicitly considers that any or all players may shift to different strategies beyond the leaf nodes of a subgame. Specifically, rather than assuming all players play according to a single fixed strategy beyond the leaf nodes (which results in the leaf nodes having a single fixed value) we instead assume that each player may choose between k different strategies, specialized to each player, to play for the remainder of the game when a leaf node is reached. In the experiments in this paper, k = 4. One of the four continuation strategies we use in the experiments is the precomputed blueprint strategy, another is a modified form of the blueprint strategy in which the strategy is biased toward folding, another is the blueprint strategy biased toward calling, and the final option is the blueprint strategy biased toward raising. This technique results in the searcher finding a strategy that is more balanced because choosing an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).

Another major challenge of search in imperfect-information games is that a player’s optimal strategy for a particular situation depends on what the player’s strategy is for every situation the player could be in from the perspective of her opponents. For example, suppose the player is holding the best possible hand. Betting in this situation could be a good action. But if the player bets in this situation only when holding the best possible hand, then the opponents would know to always fold in response.
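Returning to the depth-limited leaf nodes described above, the sketch below illustrates how such a leaf can be expanded so that each player still in the hand picks one of the k = 4 continuation strategies, with each joint pick scored by rolling out the rest of the hand. This is our own paraphrase of the construction shown in Fig. 4; the helpers rollout and players_still_in_hand are assumed, not part of any released code.

```python
from itertools import product

# Hedged sketch (ours, not the paper's code) of expanding a depth-limit leaf:
# each remaining player chooses one of k continuation strategies for the rest
# of the game, without observing the others' choices, and each joint choice is
# scored by rolling out the remainder of the hand under those strategies.
CONTINUATIONS = ("blueprint", "fold_biased", "call_biased", "raise_biased")  # k = 4

def expand_leaf(leaf_state, rollout):
    """Return the extra decision stage that replaces a depth-limit leaf.

    terminal_values maps each joint choice of continuation strategies to the
    value vector (one value per player) estimated by `rollout`. The subgame
    solver then treats these choices as ordinary actions when computing the
    searcher's strategy.
    """
    players = leaf_state.players_still_in_hand()
    terminal_values = {}
    for profile in product(CONTINUATIONS, repeat=len(players)):
        assignment = dict(zip(players, profile))
        terminal_values[profile] = rollout(leaf_state, assignment)
    return players, terminal_values
```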
To cope with this, Pluribus keeps track of the probability it would have reached the current situation with each possible hand according to its strategy. Regardless of which hand Pluribus is actually holding, it will first calculate how it would act with every possible hand, being careful to balance its strategy across all the hands so as to remain unpredictable to the opponent. Once this balanced strategy across all hands is computed, Pluribus then executes an action for the hand it is actually holding. The structure of a depth-limited imperfect-information subgame as used in Pluribus is shown in Fig. 4.

Pluribus used one of two different forms of CFR to compute a strategy in the subgame depending on the size of the subgame and the part of the game. If the subgame is relatively large or it is early in the game, then Monte Carlo Linear CFR is used just as it was for the blueprint strategy computation. Otherwise, Pluribus uses an optimized vector-based form of Linear CFR (38) that samples only chance events (such as board cards) (42).

When playing, Pluribus runs on two Intel Haswell E5-2695 v3 CPUs and uses less than 128 GB of memory. For comparison, AlphaGo used 1,920 CPUs and 280 GPUs for real-time search in its 2016 matches against top Go professional Lee Sedol (43), Deep Blue used 480 custom-designed chips in its 1997 matches against top chess professional Garry Kasparov (8), and Libratus used 100 CPUs in its 2017 matches against top professionals in two-player poker (6). The amount of time Pluribus takes to conduct search on a single subgame varies between 1 s and 33 s depending on the particular situation. On average, Pluribus plays at a rate of 20 s per hand when playing against copies of itself in six-player poker. This is roughly twice as fast as professional humans tend to play.

Experimental evaluation

We evaluated Pluribus against elite human professionals in two formats: five human professionals playing with one copy of Pluribus (5H+1AI), and one human professional playing with five copies of Pluribus (1H+5AI). Each human participant has won more than $1 million playing poker professionally. Performance was measured using the standard metric in this field of AI, milli big blinds per game (mbb/game). This measures how many big blinds (the initial money the second player must put into the pot) were won on average per thousand hands of poker. In all experiments, we used the variance-reduction technique AIVAT (44) to reduce the luck factor in the game (45) and measured statistical significance at the 95% confidence level using a one-tailed t test to determine whether Pluribus is profitable.

The human participants in the 5H+1AI experiment were Jimmy Chou, Seth Davies, Michael Gagliano, Anthony Gregg, Dong Kim, Jason Les, Linus Loeliger, Daniel McAulay, Greg Merson, Nicholas Petrangelo, Sean Ruane, Trevor Savage, and Jacob Toole. In this experiment, 10,000 hands of poker were played over 12 days. Each day, five volunteers from the pool of professionals were selected to participate based on availability. The participants were not told who else was participating in the experiment. Instead, each participant was assigned an alias that remained constant throughout the experiment. The alias of each player in each game was known, so that players could track the tendencies of each player throughout the experiment. $50,000 was divided among the human participants based on their performance to incentivize them to play their best. Each player was guaranteed a minimum of $0.40 per hand for participating, but this could increase to as much as $1.60 per hand based on performance.

After applying AIVAT, Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game). This is considered a very high win rate in six-player no-limit Texas hold’em poker, especially against a collection of elite professionals, and implies that Pluribus is stronger than the human opponents. Pluribus was determined to be profitable with a p-value of 0.028. The performance of Pluribus over the course of the experiment is shown in Fig. 5. Due to the extremely high variance in no-limit poker and the impossibility of applying AIVAT to human players, the win rate of individual human participants could not be determined with statistical significance.

The human participants in the 1H+5AI experiment were Chris “Jesus” Ferguson and Darren Elias. Each of the two humans separately played 5,000 hands of poker against five copies of Pluribus. Pluribus does not adapt its strategy to its opponents and does not know the identity of its opponents, so the copies of Pluribus could not intentionally collude against the human player. To incentivize strong play, we offered each human $2,000 for participation and an additional $2,000 if he performed better against the AI than the other human player did. The players did not know who the other participant was and were not told how the other human was performing during the experiment. For the 10,000 hands played, Pluribus beat the humans by an average of 32 mbb/game (with a standard error of 15 mbb/game). Pluribus was determined to be profitable with a p-value of 0.014. (Darren Elias was behind Pluribus by 40 mbb/game with a standard error of 22 mbb/game and a p-value of 0.033, and Chris Ferguson was behind Pluribus by 25 mbb/game with a standard error of 20 mbb/game and a p-value of 0.107. Ferguson’s lower loss rate may be a consequence of variance, skill, and/or the fact that he used a more conservative strategy that was biased toward folding in unfamiliar difficult situations.)

Because Pluribus’s strategy was determined entirely from self-play without any human data, it also provides an outside perspective on what optimal play should look like in multiplayer no-limit Texas hold’em.
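For reference, the mbb/game figures quoted above can be reproduced from raw chip totals as in the small sketch below; it is our own illustration, and the $100 big blind is only an example stake.

```python
# Illustration (ours) of the evaluation metric. One milli big blind per game
# (mbb/game) is one thousandth of a big blind won per hand, so 48 mbb/game
# means winning 48 big blinds per 1,000 hands on average.
def mbb_per_game(total_chips_won, big_blind, hands_played):
    """Average winnings in milli big blinds per hand ("game")."""
    big_blinds_won = total_chips_won / big_blind
    return 1000 * big_blinds_won / hands_played

# Example: winning 4,800 big blinds' worth of chips over 100,000 hands is 48 mbb/game.
print(mbb_per_game(total_chips_won=480_000, big_blind=100, hands_played=100_000))  # 48.0
```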
Pluribus confirms the conventional human wisdom that limping (calling the “big blind” rather than folding or raising) is suboptimal for any player except the “small blind” player who already has half the big blind in the pot by the rules, and thus has to invest only half as much as the other players to call. While Pluribus initially experimented with limping when computing its blueprint strategy offline through self play, it gradually discarded this action from its strategy as self play continued. However, Pluribus disagrees with the folk wisdom that “donk betting” (starting a round by betting when one ended the previous betting round with a call) is a mistake; Pluribus does this far more often than professional humans do.

Conclusions

Forms of self play combined with forms of search have led to a number of high-profile successes in perfect-information two-player zero-sum games. However, most real-world strategic interactions involve hidden information and more than two players. This makes the problem very different and significantly more difficult both theoretically and practically. Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker. In this paper we described Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold’em poker, the most commonly played poker format in the world. Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect-information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.

REFERENCES AND NOTES
1. D. Billings, A. Davidson, J. Schaeffer, D. Szafron, The challenge of poker. Artif. Intell. 134, 201–240 (2002). doi:10.1016/S0004-3702(01)00130-8
2. J. von Neumann, Zur Theorie der Gesellschaftsspiele. Math. Ann. 100, 295–320 (1928). doi:10.1007/BF01448847
3. J. Nash, Non-Cooperative Games. Ann. Math. 54, 286 (1951). doi:10.2307/1969529
4. M. Bowling, N. Burch, M. Johanson, O. Tammelin, Computer science. Heads-up limit hold’em poker is solved. Science 347, 145–149 (2015). doi:10.1126/science.1259433 Medline
5. M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017). doi:10.1126/science.aam6960 Medline
6. N. Brown, T. Sandholm, Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 418–424 (2018). doi:10.1126/science.aao1733 Medline
7. J. Schaeffer, One Jump Ahead: Challenging Human Supremacy in Checkers (Springer-Verlag, New York, 1997).
8. M. Campbell, A. J. Hoane Jr., F.-H. Hsu, Deep Blue. Artif. Intell. 134, 57–83 (2002). doi:10.1016/S0004-3702(01)00129-1
9. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). doi:10.1038/nature16961 Medline
10. Recently, in the real-time strategy games Dota 2 (20) and StarCraft 2 (21), AIs have beaten top humans, but as humans have gained more experience against the AIs, humans have learned to beat them. This may be because for those two-player zero-sum games, the AIs were generated by techniques not guaranteed to converge to a Nash equilibrium, so they do not have the unbeatability property that Nash equilibrium strategies have in two-player zero-sum games. (Dota 2 involves two teams of five players each. However, because the players on the same team have the same objective and are not limited in their communication, the game is two-player zero-sum from an AI and game-theoretic perspective.)
11. S. Ganzfried, T. Sandholm, in International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS) (2011), pp. 533–540.
12. S. Ganzfried, T. Sandholm, ACM Trans. Econ. Comp. (TEAC) 3, 8 (2015). Best of EC-12 special issue.
13. C. Daskalakis, P. W. Goldberg, C. H. Papadimitriou, The Complexity of Computing a Nash Equilibrium. SIAM J. Comput. 39, 195–259 (2009). doi:10.1137/070699652
14. X. Chen, X. Deng, S.-H. Teng, Settling the complexity of computing two-player Nash equilibria. J. Assoc. Comput. Mach. 56, 14 (2009). doi:10.1145/1516512.1516516
15. A. Rubinstein, Inapproximability of Nash Equilibrium. SIAM J. Comput. 47, 917–959 (2018). doi:10.1137/15M1039274
16. K. Berg, T. Sandholm, AAAI Conference on Artificial Intelligence (AAAI) (2017).
17. M. A. Zinkevich, M. Bowling, M. Wunder, The lemonade stand game competition: Solving unsolvable puzzles. ACM SIGecom Exchanges 10, 35–38 (2011). doi:10.1145/1978721.1978730
18. G. Tesauro, Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995). doi:10.1145/203330.203343
19. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, D. Hassabis, Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). doi:10.1038/nature24270 Medline
20. OpenAI, OpenAI Five, [Link] (2018).
21. O. Vinyals et al., AlphaStar: Mastering the Real-Time Strategy Game StarCraft II, [Link] (2019).
22. L. S. Shapley, Advances in Game Theory, M. Drescher, L. S. Shapley, A. W. Tucker, Eds. (Princeton Univ. Press, 1964).
23. R. Gibson, Regret minimization in games and the development of champion multiplayer computer poker-playing agents, Ph.D. thesis, University of Alberta (2014).
24. T. Sandholm, AAAI Conference on Artificial Intelligence (AAAI) (2015), pp. 4127–4131. Senior Member Track.
25. T. Sandholm, Computer science. Solving imperfect-information games. Science 347, 122–123 (2015). doi:10.1126/science.aaa4614 Medline
26. M. Johanson, N. Burch, R. Valenzano, M. Bowling, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (2013), pp. 271–278.
27. S. Ganzfried, T. Sandholm, in AAAI Conference on Artificial Intelligence (AAAI) (2014), pp. 682–690.
28. N. Brown, S. Ganzfried, T. Sandholm, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (2015), pp. 7–15.
29. M. Zinkevich, M. Johanson, M. H. Bowling, C. Piccione, in Neural Information Processing Systems (NeurIPS) (2007), pp. 1729–1736.
30. E. G. Jackson, AAAI Workshop on Computer Poker and Imperfect Information (2013).
31. M. B. Johanson, Robust strategies and counter-strategies: from superhuman to optimal play, Ph.D. thesis, University of Alberta (2016).
32. E. G. Jackson, AAAI Workshop on Computer Poker and Imperfect Information (2016).
33. N. Brown, T. Sandholm, in International Joint Conference on Artificial Intelligence (IJCAI) (2016), pp. 4238–4239.
34. E. G. Jackson, AAAI Workshop on Computer Poker and Imperfect Information Games (2017).
35. M. Lanctot, K. Waugh, M. Zinkevich, M. Bowling, in Neural Information Processing Systems (NeurIPS) (2009), pp. 1078–1086.
36. M. Johanson, N. Bard, M. Lanctot, R. Gibson, M. Bowling, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (2012), pp. 837–846.
37. R. Gibson, M. Lanctot, N. Burch, D. Szafron, M. Bowling, in AAAI Conference on Artificial Intelligence (AAAI) (2012), pp. 1355–1361.
38. N. Brown, T. Sandholm, AAAI Conference on Artificial Intelligence (AAAI) (2019).
39. S. Ganzfried, T. Sandholm, in International Joint Conference on Artificial Intelligence (IJCAI) (2013), pp. 120–128.
40. Here we use the term subgame the way it is usually used in AI. In game theory that word is used differently, by requiring a subgame to start with a node where the player whose turn it is to move has no uncertainty about state—in particular, no uncertainty about the opponents’ private information.
41. N. Brown, T. Sandholm, B. Amos, in Neural Information Processing Systems (NeurIPS) (2018), pp. 7663–7674.
42. M. Johanson, K. Waugh, M. Bowling, M. Zinkevich, in International Joint Conference on Artificial Intelligence (IJCAI) (2011), pp. 258–265.
43. E. P. DeBenedictis, Rebooting Computers as Learning Machines. Computer 49, 84–87 (2016). doi:10.1109/MC.2016.156
44. N. Burch, M. Schmid, M. Moravcik, D. Morill, M. Bowling, in AAAI Conference on Artificial Intelligence (AAAI) (2018), pp. 949–956.
45. Due to the presence of AIVAT and because the players did not know each others’ scores during the experiment, there was no incentive for the players to play a risk-averse or risk-seeking strategy in order to outperform the other human.
46. A. Gilpin, T. Sandholm, Lossless abstraction of imperfect information games. J. Assoc. Comput. Mach. 54, 25 (2007). doi:10.1145/1284320.1284324
47. K. Waugh, AAAI Workshop on Computer Poker and Imperfect Information (2013).
48. A. Gilpin, T. Sandholm, T. B. Sørensen, in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2007), pp. 50–57.
49. S. Ganzfried, T. Sandholm, in International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS) (2015), pp. 37–45.
50. N. Burch, M. Johanson, M. Bowling, in AAAI Conference on Artificial Intelligence (AAAI) (2014), pp. 602–608.
51. M. Moravcik, M. Schmid, K. Ha, M. Hladik, S. Gaukrodger, in AAAI Conference on Artificial Intelligence (AAAI) (2016), pp. 572–578.
52. N. Brown, T. Sandholm, in Neural Information Processing Systems (NeurIPS) (2017), pp. 689–699.
ACKNOWLEDGMENTS
We thank Pratik Ringshia for building a GUI and thank Jai Chintagunta, Ben Clayman,
Alex Du, Carl Gao, Sam Gross, Thomas Liao, Christian Kroer, Joe Langas, Adam
Lerer, Vivek Raj, and Steve Wu for playing against Pluribus as early testing.
Funding: This material is based on Carnegie Mellon University research
supported by the National Science Foundation under grants IIS-1718457, IIS-
1617590, IIS-1901403, and CCF-1733556, and the ARO under award W911NF-17-
1-0082, as well as XSEDE computing resources provided by the Pittsburgh
Supercomputing Center. Facebook funded the player payments. Author
contributions: N.B. and T.S. designed the algorithms. N.B. wrote the code. N.B.
and T.S. designed the experiments and wrote the paper. Competing interests:
The authors have ownership interest in Strategic Machine, Inc. and Strategy
Robot, Inc. which have exclusively licensed prior game-solving code from Prof.
Sandholm’s Carnegie Mellon University laboratory, which constitutes the bulk of
the code in Pluribus. Data and materials availability: The data presented in this
paper are shown in the main text and supplementary materials. Because poker is
played commercially, the risk associated with releasing the code outweighs the
benefits. To aid reproducibility, we have included the pseudocode for the major
components of our program in the supplementary materials.
SUPPLEMENTARY MATERIALS
[Link]/cgi/content/full/science.aay2400/DC1
Supplementary Text
Table S1
References (46–52)
Data File S1

31 May 2019; accepted 2 July 2019


Published online 11 July 2019

Fig. 1. An example of the equilibrium selection problem. In the Lemonade Stand Game, players simultaneously choose a point on a ring and want to be as far away as possible from any other player. In every Nash equilibrium, players are spaced uniformly around the ring. There are infinitely many such Nash equilibria. However, if each player independently chooses one Nash equilibrium to play, their joint strategy is unlikely to be a Nash equilibrium. Left: An illustration of three different Nash equilibria in this game, distinguished by three different colors. Right: Each player independently chooses one Nash equilibrium. Their joint strategy is not a Nash equilibrium.
Fig. 2. A game tree traversal via Monte Carlo CFR. In this figure player P1 is traversing the game tree. Left: A game is simulated until an outcome is reached. Middle: For each P1 decision point encountered in the simulation in the Left figure, P1 explores each other action that P1 could have taken and plays out a simulation to the end of the game. P1 then updates its strategy to pick actions with higher payoff with higher probability. Right: P1 explores each other action that P1 could have taken at every new decision point encountered in the Middle figure, and P1 updates its strategy at those hypothetical decision points. This process repeats until no new P1 decision points are encountered, which in this case is after three steps but in general may be more. Our implementation of MCCFR (described in the supplementary material) is equivalent but traverses the game tree in a depth-first manner. (The percentages in the figure are for illustration purposes only and may not correspond to actual percentages that the algorithm would compute.)
Fig. 3. Perfect-information game search in Rock-Paper-Scissors. Top: A sequential representation of Rock-Paper-Scissors in which Player 1 acts first but does not reveal her action to Player 2, who acts second. The dashed lines between the Player 2 nodes signify that Player 2 does not know which of those nodes he is in. The terminal values are shown only for Player 1. Bottom: A depiction of the depth-limited subgame if Player 1 conducts search (with a depth of one) using the same approach as is used in perfect-information games. The approach assumes that after each action Player 2 will play according to the Nash equilibrium strategy of choosing Rock, Paper, and Scissors with probability $\frac{1}{3}$ each. This results in a value of zero for Player 1 regardless of her strategy.
Fig. 4. Real-time search in Pluribus. The subgame shows just two players for simplicity. A dashed line between nodes indicates that the player to act does not know which of the two nodes she is in. Left: The original imperfect-information subgame. Right: The transformed subgame that is searched in real time to determine a player’s strategy. An initial chance node reaches each root node according to the normalized probability that the node is reached in the previously-computed strategy profile (or according to the blueprint strategy profile if this is the first time in the hand that real-time search is conducted). The leaf nodes are replaced by a sequence of new nodes in which each player still in the hand chooses among k actions, with no player first observing what another player chooses. For simplicity, k = 2 in the figure. In Pluribus, k = 4. Each action in that sequence corresponds to a selection of a continuation strategy for that player for the remainder of the game. This effectively leads to a terminal node (whose value is estimated by rolling out the remainder of the game according to the list of continuation strategies the players chose).
Fig. 5. Performance of Pluribus in the 5 humans + 1 AI experiment. Top: The lines show the win rate (solid line) plus or minus the standard error (dashed lines). Bottom: The lines show the cumulative number of chips won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment suggests the humans were unable to find exploitable weaknesses in the bot.