Control Strategies For Physically Simulated Characters Performing Two Player Competitive Sports
Fig. 1. Characters performing two-player competitive sports such as boxing (left) and fencing (right) using learned control strategies.
In two-player competitive sports, such as boxing and fencing, athletes often demonstrate efficient and tactical movements during a competition. In this paper, we develop a learning framework that generates control policies for physically simulated athletes who have many degrees-of-freedom. Our framework uses a two-step approach, learning basic skills and learning bout-level strategies, with deep reinforcement learning, which is inspired by the way that people learn to play competitive sports. We develop a policy model based on an encoder-decoder structure that incorporates an autoregressive latent variable and a mixture-of-experts decoder. To show the effectiveness of our framework, we implemented two competitive sports, boxing and fencing, and demonstrate control policies learned by our framework that can generate both tactical and natural-looking behaviors. We also evaluate the control policies with comparisons to other learning configurations and with ablation studies.

CCS Concepts: • Computing methodologies → Animation; Physical simulation; Reinforcement learning; Neural networks.

Additional Key Words and Phrases: Character Animation, Physics-based Simulation and Control, Reinforcement Learning, Deep Learning, Neural Network, Multi-agent

ACM Reference Format:
Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2021. Control Strategies for Physically Simulated Characters Performing Two-player Competitive Sports. ACM Trans. Graph. 40, 4, Article 1 (August 2021), 11 pages. https://doi.org/10.1145/3450626.3459761

Authors' addresses: Jungdam Won, Facebook AI Research; Deepak Gopinath, Facebook AI Research; Jessica Hodgins, Facebook AI Research.

1 INTRODUCTION
Many competitive sports involve long periods of routine play interspersed with occasional dramatic demonstrations of agility. Those strategic moments are often what determine the outcome of the competition, and spectators wait and cheer for those highlights. But both the routine play and the scoring moments are hard to reproduce automatically in animated characters because of the complexities involved in the interactions between the competing athletes. If we had the ability to create virtual athletes who could automatically perform all the movements of their sports and assemble them to develop a winning strategy, that functionality would open up many new applications in computer games, commercial films, and sports broadcasting.

Creating animated scenes with multiple people is challenging because it requires not only that each individual behave in a natural way but also that their interactions with each other are synchronized in both the temporal and spatial domains to appear natural. The denser the interactions are, the more challenging the problem is, as there is no time to "reset" between interactions. Using physically simulated characters simplifies one part of the problem because low-level physical interactions such as collisions are automatically generated through simulation. However, coordinating the different skills, such as jabs and punches or thrusts and parries, or the bout-level strategies of countering and pressure-fighting, has not been studied in depth because of the computational complexity of learning the series of skills that comprise a full competition. A key challenge in using simulated characters for competitive sports is that we need to learn both the basic skills and the bout-level strategies so that they work properly in concert.

In recent years, deep reinforcement learning techniques have shown promising results in creating controllers or control policies for physically simulated humanoids for common behaviors such as
locomotion and manipulation as well as more esoteric behaviors such as bicycle riding and gymnastics. Most of these behaviors involve only a single character, and behaviors that require interactions among characters have not been studied in depth.

In this paper, we explore techniques for training control systems for two-player competitive sports that involve physical interaction. We develop a framework that generates control policies for this scenario, where the humanoids have many degrees-of-freedom and are actuated by joint torques. Our framework is inspired by the way that people learn how to play competitive sports. For most sports, people first learn the basic skills without an opponent and then learn how to combine and refine those skills by competing against an opponent. We mimic these two processes, learning basic skills and learning bout-level strategies, with deep reinforcement learning. We develop a policy model based on an encoder-decoder structure that incorporates an autoregressive latent variable and a mixture-of-experts decoder. To show the effectiveness of our framework, we implemented two competitive sports, boxing and fencing, and demonstrate control policies learned by our framework that can generate both responsive and natural-looking behaviors for the players. To evaluate our framework, we compare with other plausible design choices and use ablation studies to understand the contribution of individual components.

The contributions of this paper are as follows:

• Novel Results. We demonstrate successful control policies that generate both responsive and natural-looking behaviors in competitive settings for high degree-of-freedom physically simulated humanoids. This problem has not previously been studied in depth.
• Policy Model and Learning Procedure. Our policy model is designed for efficient transfer learning, which enables us to use only a few motion clips captured with a single actor. For example, it generates plausible competitive policies for boxing by using just minutes of a boxer practicing alone.
• Baseline for Future Research. We develop new competitive environments, boxing and fencing, where two physically simulated players can fight either with their fists or with swords to win the game. The environments were selected because of the need for physical interactions. To support future researchers, we plan to share our environments and learned policies.

2 RELATED WORK
There is a long history in computer animation of developing control systems for physically simulated characters. We review the papers that are most closely related to our work because they handle multiple characters or physically simulated characters. We also review several studies of reinforcement learning techniques that are similar to the ones we employ and focus on their applications to character animation and multi-agent problems.

2.1 Multi-character Animation
Many researchers have proposed techniques for creating realistic multi-character animation. Most of the existing techniques are kinematics-based, and physics is not considered in generating the animation.

A popular approach is patch-based generation. The key idea is to build short sequences of motion containing interactions which can then be glued together to create longer sequences with intermittent interactions. Because the patches are usually pre-built with motion capture data involving real interactions, we can ensure that the generated scenes will always have plausible interactions. This approach was first introduced in [Lee et al. 2006], and the idea was then extended to two-character interactions [Shum et al. 2008] and crowd scenes [Yersin et al. 2009]. Several methods have been developed to edit the details of the scene [Ho et al. 2010; Kim et al. 2009] and to make bigger and more complex scenes [Henry et al. 2014; Hyun et al. 2013; Kim et al. 2014; Kwon et al. 2008; Won et al. 2014] using the patches as building blocks.

There have also been approaches that directly synthesize multi-character animation without using pre-defined motion patches, instead imposing soft or hard constraints on the interactions. For example, given a motion graph constructed from boxing motions captured from a single person, a kinematic controller that approaches and hits the opponent can be pre-computed by dynamic programming, where the controller is represented by a transition table [Lee and Lee 2004]. Liu et al. [2006] created multi-character motions by solving a space-time optimization problem with physical and user-specified constraints. Shum and his colleagues proposed an online method based on game tree expansion and min-max tree search, where competitive and collaborative objectives exist simultaneously [Shum et al. [n.d.], 2007; Shum et al. 2012]. Kwon et al. [2008] tackled a similar problem by learning a motion transition model based on a dynamic Bayesian network on top of coupled motion transition graphs. Wampler et al. [2010] demonstrated a system that can generate feints or misdirection moves in two-player adversarial games based on game theory, where stochastic and simultaneous decisions allow such nuanced strategies based on unpredictability. Our method belongs to this class of approaches because we directly synthesize multi-character animation by using only individually captured motions (with no opponent present). However, we incorporate dynamic simulation to ensure the naturalness of impacts and other physical effects.

2.2 Physics-based Character Animation
Physics-based character animation involves characters that are modeled as interconnected rigid bodies with mass and moment of inertia and controlled by joint torques or muscle models. The enforcement of the laws of physics prevents motions that are physically unrealistic but does not prevent unnatural motions such as those in which response times are too fast or joints are too strong. Many different approaches have been proposed to supplement the constraints of physics with additional control, learning, or optimization to produce natural motion for humanoid characters. Although physics-based methods have shown promising results for individual characters performing a wide variety of behaviors, there exist only a few studies for multi-character animations. Zordan et al. [2002; 2005] proposed motion capture-driven simulated characters that can react to external perturbations, where two-player interactions such as fighting,
boxing, table-tennis, and fencing were shown as examples. Recently, deep reinforcement learning has shown groundbreaking results for physically simulated characters performing behaviors such as locomotion [Berseth et al. 2018; Peng et al. 2017; Yu et al. 2018], imitation [Bergamin et al. 2019; Chentanez et al. 2018; Lee et al. 2019; Merel et al. 2017; Peng et al. 2018; Wang et al. 2017; Won et al. 2020; Won and Lee 2019], and other skills [Clegg et al. 2018; Liu and Hodgins 2017, 2018; Xie et al. 2020]. However, the number of studies involving multiple characters is still limited. Park and colleagues [2019] showed an example of chicken hopping where two players hop on one leg and bump into each other to knock their opponent over. Although the paper demonstrated physically plausible multi-character animations, the characters in the scenes do not have high-level strategies (intelligence) for fighting. Haworth and colleagues [2020] demonstrated a crowd simulation method for physically simulated humanoids based on multi-agent reinforcement learning with a hierarchical policy that is composed of navigation, foot-step planning, and bipedal walking skills. Unlike the previous approaches, this approach learns control policies that can adapt to interactions among multiple simulated humanoids. However, only limited behaviors such as navigation and collision-avoidance were demonstrated. Our work focuses on generating more complex and subtle interactions automatically for competitive environments, which is one of the challenging problems in multi-character animation.

2.3 Multi-agent Reinforcement Learning
We solve a multi-agent reinforcement learning (RL) problem that learns optimal policies for multiple agents, where the agents interact with each other to achieve either a competitive or a cooperative goal. This research topic has a long history in machine learning, and we refer readers to [Busoniu et al. 2008; Hu and Wellman 1998; Nguyen et al. 2020] for a general introduction and details.

The multi-agent problem is much more challenging than the single-agent problem because of the difficulty of learning with a non-stationary environment (a moving target). All agents update their policies simultaneously, so models of the policies of other agents can become invalid after just one learning iteration. In multi-agent RL, there are two types of problems: cooperative and competitive. In the cooperative problem, there exists a common goal that the agents need to achieve together, so communication channels among the agents are commonly included in the policy model [Claus and Boutilier 1998; Lowe et al. 2017; OroojlooyJadid and Hajinezhad 2020]. In the competitive problem, each agent has their own goal that is incompatible with those of the other agents, so communication channels are usually ignored and each agent performs autonomously with only observations and models of their opponent's behavior [Bai and Jin 2020; Baker et al. 2019; Bansal et al. 2018; Seldin and Slivkins 2014].

Recently, Lowe et al. [2017] proposed an algorithm called MADDPG that solves the multi-agent RL problem as if it were a single-agent RL problem by using a shared Q-function across all the agents. Bansal et al. [2018] showed control policies for physically simulated agents that emerged from a competitive environment setting. Baker et al. [2019] demonstrated the evolution of policies in a hide-and-seek game, where the policies were learned with simple competitive rewards in a fully automatic manner. Although the last two results showed the potential of multi-agent RL in generating multi-character motions, there exists room for improvement: the characters were very simple, the motions were not human-like, or careful development of policies by curriculum learning was required.

3 OVERVIEW
Our framework takes as input a set of motion data that includes the basic skills of a two-player competitive sport and generates control policies for two physically simulated players. The control policies allow the players to perform a sequence of basic skills with the right movements and timing to win the match. Figure 2 illustrates the overview of the framework. First, we collect a few motion clips that include the basic skills of the sport performed without an opponent. A single imitation policy is then learned for the motions by using single-agent deep reinforcement learning. Finally, we transfer the imitation policy to competitive policies for the players, where each player enhances their own policy by multi-agent deep reinforcement learning with competitive rewards. To effectively transfer from the imitation policy to the competitive policy, we use a new policy model composed of a task encoder and a motor decoder; the details are explained in Section 4.

3.1 Environments
We created two competitive sport environments, boxing and fencing, as our examples, where two players compete with each other in an attempt to win the match (Figure 1). We developed the environments based on publicly available resources from [Won et al. 2020]. In the boxing environment, we enlarged both hands so that the size is
Fig. 3. The boxing and fencing characters. The body parts where points are
scored if they are hit are marked in green.
pool of $N$ experts where each expert has only the body state as input. The output of the motor decoder is a set of outputs from the experts $\mathbf{e}_t = (\mathbf{e}_t^1, \mathbf{e}_t^2, \cdots, \mathbf{e}_t^N)$. The task encoder receives the whole observed state as input and then generates expert weights $\boldsymbol{\omega}_t = (\omega_t^1, \omega_t^2, \cdots, \omega_t^N)$. The output of the task encoder is used to update the latent expert weights $\hat{\boldsymbol{\omega}}_t$ in an autoregressive manner, $\hat{\boldsymbol{\omega}}_t = (1-\alpha)\boldsymbol{\omega}_t + \alpha\hat{\boldsymbol{\omega}}_{t-1}$, where $\alpha$ controls the smoothness of the weight change. The mean action $\boldsymbol{\mu}_t$ is computed by the weighted sum $\boldsymbol{\mu}_t = \sum_{i=1}^{N} \hat{\omega}_t^i \mathbf{e}_t^i$, and we sample an action $\mathbf{a}_t$ stochastically from a Gaussian distribution whose mean and covariance are $\boldsymbol{\mu}_t$ and $\Sigma$, respectively, with a constant diagonal matrix used for the covariance.

When transferring the imitation policy to the competitive policy, only the motor decoder of the imitation policy is reused. The motor decoder and a new task encoder, whose input dimension matches the competitive sport environment, constitute a new policy for each player as illustrated in Figure 2.
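To make the structure above concrete, the following is a minimal sketch, in PyTorch-style Python, of how a task encoder, autoregressive expert weights, and a mixture-of-experts motor decoder could produce a Gaussian action. The layer sizes, the default number of experts, and names such as MoEPolicy and task_encoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoEPolicy(nn.Module):
    """Sketch of the encoder-decoder policy: the task encoder maps the whole
    observed state to expert weights, the weights are blended autoregressively,
    and the blended weights mix the outputs of N experts (each seeing only the
    body state) into the mean of a Gaussian action distribution."""

    def __init__(self, obs_dim, body_dim, act_dim, num_experts=8, alpha=0.8):
        super().__init__()
        self.alpha = alpha  # controls smoothness of the expert-weight change
        self.task_encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, num_experts), nn.Softmax(dim=-1))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(body_dim, 256), nn.ReLU(),
                          nn.Linear(256, act_dim))
            for _ in range(num_experts)])
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # constant diagonal covariance
        self.prev_weights = None  # latent expert weights; reset at episode start

    def forward(self, obs, body_state):
        w = self.task_encoder(obs)                                   # omega_t
        if self.prev_weights is None:
            self.prev_weights = w.detach()
        w_hat = (1.0 - self.alpha) * w + self.alpha * self.prev_weights
        self.prev_weights = w_hat.detach()                           # fed back at the next step
        expert_out = torch.stack([e(body_state) for e in self.experts], dim=-2)
        mu = (w_hat.unsqueeze(-1) * expert_out).sum(dim=-2)          # weighted sum of expert outputs
        action = torch.distributions.Normal(mu, self.log_std.exp()).sample()
        return action, w_hat
```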
4.2 Pre-training: Imitation Policy
To learn a policy that can imitate basic boxing skills, we first used a set of boxing motion clips from the CMU Motion Capture Database [CMU 2002]. The data used is approximately 90 s long, consisting of four motion clips of 10-30 s each. We also used their mirrored versions, so the total length is approximately 3 minutes. The actor in the motion capture data performed several skills in an unstructured way. We use this data without any extra post-processing such as cutting it into individual skills or adding manual labels such as phase. Instead, we learn a single imitation policy by using deep reinforcement learning (RL) with imitation rewards. The body state is $\mathbf{b}_t = (\mathbf{p}, \mathbf{q}, \mathbf{v}, \mathbf{w})$, where $\mathbf{p}$, $\mathbf{q}$, $\mathbf{v}$, and $\mathbf{w}$ are the positions, orientations, linear velocities, and angular velocities of all joints, respectively, represented with respect to the current facing transformation of the simulated player.

The task-specific state $\mathbf{g}_t = (\mathbf{b}^{ref}_{t+1}, \mathbf{b}^{ref}_{t+2}, \cdots)$ includes a sequence of future body states extracted from the current reference motion. In the examples presented here, we use two samples, 0.05 and 0.15 s in the future. We use the multiplicative imitation reward function and also perform early termination based on the reward, in a similar fashion to [Won et al. 2020].

$r = r_{\text{match}} + w_{\text{close}} r_{\text{close}} + w_{\text{facing}} r_{\text{facing}} - w_{\text{energy}} r_{\text{energy}} - \sum_i w^i_{\text{penalty}} r^i_{\text{penalty}}$   (1)

where $r_{\text{match}}$ generates signals related to the match. In our boxing example, it measures how much the player damaged the opponent and how much the player was damaged by the opponent in the last step. We measure damage by the magnitude of force,

$r_{\text{match}} = \|f_{pl \rightarrow op}\| - \|f_{op \rightarrow pl}\|$   (2)

where $f_{pl \rightarrow op}$ is the contact normal force that the player applied to the opponent and $f_{op \rightarrow pl}$ is the force applied by the opponent to the player. When measuring damage, we only consider collisions between the gloves and a subset of the upper body: the head, the chest, and the belly (Figure 3). For numerical stability, we clip the magnitude of the force at 200 N when computing $\|f_{pl \rightarrow op}\|$ and $\|f_{op \rightarrow pl}\|$. The close and facing terms increase the probability of interaction while learning policies: $r_{\text{close}} = \exp(-3d^2)$ encourages the players to be close, where $d$ is the closest distance between the player's gloves and the opponent's upper body, and $r_{\text{facing}}$ encourages the players to face each other,

$r_{\text{facing}} = \exp(-5\|\bar{\mathbf{v}} \cdot \hat{\mathbf{v}} - 1\|), \qquad \bar{\mathbf{v}} = (\hat{\mathbf{p}}_{op} - \hat{\mathbf{p}}_{pl}) / \|\hat{\mathbf{p}}_{op} - \hat{\mathbf{p}}_{pl}\|$   (3)

where $\hat{\mathbf{v}}$ is the facing direction and $\hat{\mathbf{p}}_{op}$, $\hat{\mathbf{p}}_{pl}$ are the facing positions of the opponent and the player, respectively (i.e., the root joint positions projected onto the ground). We also add an energy minimization term to make the players move in an energy-efficient manner,

$r_{\text{energy}} = \kappa_{\text{dist}} \sum_j \|\mathbf{a}_j\|^2, \qquad \kappa_{\text{dist}} = 1 - \exp(-50\,\|\max(0, l-1)\|^2)$   (4)

where $\mathbf{a}_j$ is the angular acceleration of the $j$-th joint and $l$ is the horizontal distance between the two players. The energy term has an effect only when the players are more than 1 m apart. This term approximates the behavior of boxers, as they tend to conserve energy by only throwing punches when their opponent is close enough that the punches might be effective. Without this term, the boxers would throw punches indiscriminately as the punches have no cost.
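As an illustration of how the dense reward terms of Eqs. (1)-(4) fit together, here is a sketch in plain Python/NumPy. The default weight values and the input names (e.g., f_pl_to_op, d_closest) are placeholders; the actual weights come from Table 1, which is not reproduced here, and the penalty terms described next are omitted.

```python
import numpy as np

def boxing_reward(f_pl_to_op, f_op_to_pl, d_closest, facing_dir, p_pl, p_op,
                  joint_ang_acc, horizontal_dist,
                  w_close=0.1, w_facing=0.01, w_energy=1e-5):
    """Sketch of the dense boxing reward (Eqs. 1-4) without the penalty terms.
    p_pl and p_op are the players' root positions projected onto the ground."""
    # Eq. (2): damage dealt minus damage received; force magnitudes clipped at 200 N
    r_match = min(np.linalg.norm(f_pl_to_op), 200.0) \
            - min(np.linalg.norm(f_op_to_pl), 200.0)

    # Closeness term: d is the closest glove-to-upper-body distance
    r_close = np.exp(-3.0 * d_closest ** 2)

    # Eq. (3): encourage facing the opponent; v_bar points from player to opponent
    v_bar = (p_op - p_pl) / np.linalg.norm(p_op - p_pl)
    r_facing = np.exp(-5.0 * abs(np.dot(v_bar, facing_dir) - 1.0))

    # Eq. (4): energy term, active only when the players are more than 1 m apart
    kappa_dist = 1.0 - np.exp(-50.0 * max(0.0, horizontal_dist - 1.0) ** 2)
    r_energy = kappa_dist * np.sum(np.square(joint_ang_acc))

    # Eq. (1) without the penalty sum
    return r_match + w_close * r_close + w_facing * r_facing - w_energy * r_energy
```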
Finally, we add extra penalty terms to prevent the players from entering undesirable or unrecoverable situations that hinder sample efficiency. Each of these penalty terms is a binary signal:

$r^i_{\text{penalty}} = \begin{cases} 1 & \text{if the condition is satisfied} \\ 0 & \text{otherwise} \end{cases}$   (5)

We also terminate the current episode if one of the conditions is activated. In our boxing example, we use two conditions: fall and stuck. The fall condition checks whether any body part of the player except for the feet touches the ground. This penalty is helpful for learning to maintain balance in the early phase of learning (before tactics emerge) and also for recovering from blows by the opponent. The stuck condition was inspired by real boxing matches, where a judge interrupts and restarts the match if the two players are holding each other in their arms or one player is trapped on the ropes and can only defend themselves. The judge restarts the match because there are almost no meaningful actions that can occur when the players are in these configurations. We detect the first situation when the distance between the players is less than 0.4 m for more than 5 s, and the second situation when one of the players is located near the ropes for more than 5 s.
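The fall and stuck conditions can be checked with straightforward bookkeeping. The sketch below follows only the thresholds in the text (0.4 m and 5 s between players, 5 s near the ropes); the simulator-facing attribute names (body_parts, touches_ground, near_ropes, distance_to) are hypothetical placeholders.

```python
def check_boxing_termination(player, opponent, close_timer, rope_timer, dt):
    """Return (terminate, penalties, close_timer, rope_timer) for the binary
    'fall' and 'stuck' penalty conditions of Eq. (5)."""
    penalties = {"fall": 0, "stuck": 0}

    # fall: any body part other than the feet touches the ground
    if any(part.touches_ground for part in player.body_parts
           if part.name not in ("left_foot", "right_foot")):
        penalties["fall"] = 1

    # stuck (1): the players stay closer than 0.4 m for more than 5 s
    close_timer = close_timer + dt if player.distance_to(opponent) < 0.4 else 0.0
    # stuck (2): the player stays near the ropes for more than 5 s
    rope_timer = rope_timer + dt if player.near_ropes else 0.0
    if close_timer > 5.0 or rope_timer > 5.0:
        penalties["stuck"] = 1

    terminate = bool(penalties["fall"] or penalties["stuck"])
    return terminate, penalties, close_timer, rope_timer
```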
The observed state of the player is composed of the body state and the task-specific state, where the body state is the same as the one used for learning the imitation policy and the task-specific state includes the current status of the match:

$\mathbf{g}_t = (\mathbf{p}_{\text{arena}}, \mathbf{d}_{\text{arena}}, \mathbf{p}_{\text{op}}, \mathbf{v}_{\text{op}}, \mathbf{p}_{\text{glove}}, \mathbf{v}_{\text{glove}}, \mathbf{p}_{\text{target}}, \mathbf{v}_{\text{target}})$   (6)

where p_arena and d_arena are the relative position and the facing direction of the player in the arena; p_op and v_op are the relative positions and velocities of the opponent's body parts, represented in a coordinate system aligned with the player's facing direction; p_glove and v_glove are the relative positions and velocities of the opponent's gloves with respect to the player's upper body parts; and p_target and v_target are the relative positions and velocities of the opponent's upper body parts with respect to the player's gloves (Figure 5). The dimension of the entire observed state is 623 in total.
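A minimal sketch of assembling the observed state of Eq. (6) by concatenation; the attribute names are hypothetical placeholders for quantities the environment would expose, and only the grouping of terms follows the text.

```python
import numpy as np

def build_observation(player, opponent):
    """Sketch of the observed state: the body state plus the task-specific
    match state g_t of Eq. (6), concatenated into one vector."""
    g_t = np.concatenate([
        player.position_in_arena,                           # p_arena
        player.facing_direction_in_arena,                   # d_arena
        opponent.body_positions.ravel(),                    # p_op (player's facing frame)
        opponent.body_velocities.ravel(),                   # v_op
        opponent.glove_pos_rel_player_upper_body.ravel(),   # p_glove
        opponent.glove_vel_rel_player_upper_body.ravel(),   # v_glove
        opponent.upper_body_pos_rel_player_gloves.ravel(),  # p_target
        opponent.upper_body_vel_rel_player_gloves.ravel(),  # v_target
    ])
    return np.concatenate([player.body_state, g_t])         # 623-D in total
```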
4.4 Learning
When transferring a pre-trained policy with an encoder and a decoder to a new task, there are two popular approaches. One is to learn only the encoder while the decoder remains fixed (Enc-only); the other is to update the entire structure in an end-to-end manner (Enc-Dec-e2e). There are pros and cons to both approaches. Both approaches show similar results if the new task is the same as or similar to the previous task (pre-training), whereas Enc-Dec-e2e usually shows faster and better adaptation and convergence if the new task is quite different from the original task. For example, in the boxing environment, we pre-train our motor decoder in the single-agent setting by imitating motion clips that do not have any interaction with an opponent. The competitive policy, however, needs to learn abilities such as withstanding an attack or punching the opponent. Enc-only is not sufficient to learn sophisticated tactics for the competitive environment unless we use motion capture data that covers almost all possible scenarios in pre-training. Enc-only is more robust to the forgetting problem (where a pre-trained policy easily forgets its original abilities for the original task and adapts only to the new task) because the motor decoder does not change at all. The forgetting problem does not matter if the performance on the new task is the only concern; however, it can be problematic for generating human-like motions for the player because a large deviation from the original motor decoder trained on motion capture data can easily lead to a situation where the motor decoder is no longer generating natural motions. To tackle this challenge, we use Enc-only in the early stage of transfer learning, then alternate between Enc-Dec-e2e and Enc-only. This learning procedure enables the players to learn meaningful tactics while preserving the style of the motions existing in the original motion capture data. In our experiments, we start with Enc-only for 300-500 iterations, then alternate between the two every 50 iterations.
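The alternation between Enc-only and Enc-Dec-e2e amounts to toggling which parameters receive gradients. The sketch below uses PyTorch-style freezing; the module names mirror the earlier policy sketch, and a warm-up of 400 iterations is an arbitrary choice within the 300-500 range mentioned above.

```python
def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_transfer_iteration(policy, iteration, warmup_iters=400, period=50):
    """Enc-only for the first warm-up iterations, then alternate between
    Enc-Dec-e2e and Enc-only every `period` iterations."""
    set_trainable(policy.task_encoder, True)        # the new task encoder always trains
    if iteration < warmup_iters:
        enc_only = True
    else:
        enc_only = ((iteration - warmup_iters) // period) % 2 == 1
    set_trainable(policy.experts, not enc_only)     # motor decoder frozen in Enc-only phases
    return "Enc-only" if enc_only else "Enc-Dec-e2e"
```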
We use the decentralized distributed proximal policy optimization (DD-PPO) algorithm [Wijmans et al. 2020] to learn the imitation and competitive policies. As the name implies, DD-PPO is an extended version of the proximal policy optimization (PPO) algorithm that runs on multiple machines (distributed), requires no master machine (decentralized), and performs synchronized updates of the policy parameters. In the original distributed PPO setup, training tuples generated on each machine are collected by a master machine, which computes gradients for all the training tuples, updates the current policy with those gradients, and broadcasts the updated policy parameters over the network. In DD-PPO, each machine computes the gradients for the training tuples that it generated, and only the gradients are shared over the network to compute the average gradients. This process reduces the communication and the overhead of computing gradients when the number of training tuples per iteration is large.
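To make the contrast with the master-worker setup concrete, the following sketch shows the decentralized gradient-averaging step with torch.distributed. It is a simplified illustration of the synchronization pattern, not the RLlib DD-PPO implementation used in the paper, and it assumes the process group has already been initialized.

```python
import torch.distributed as dist

def ddppo_gradient_step(policy, optimizer, local_loss):
    """Each worker backpropagates only through its own rollouts; the gradients
    (not the rollouts) are then averaged across workers with an all-reduce,
    so every worker applies an identical update and no master is needed."""
    optimizer.zero_grad()
    local_loss.backward()
    world_size = float(dist.get_world_size())
    for p in policy.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over workers
            p.grad /= world_size                           # average them
    optimizer.step()
```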
4.5 Details for Fencing
Building control policies for the fencing environment requires minimal changes in the state and rewards. For the state, the terms related to the gloves, p_glove and v_glove, are replaced by the state of the sword attached to the right hand of the fencer. We also include a trigger vector $(l_{pl}, l_{op})$ that represents the history of touch: $l_{pl}$ becomes 1 if the player touches the opponent and does not change until the end of the match, otherwise its value is zero, and $l_{op}$ is the opposite. The reward function for fencing is

$r = r_{\text{match}} + w_{\text{close}} r_{\text{close}} - \sum_i w^i_{\text{penalty}} r^i_{\text{penalty}}$   (7)

where $r_{\text{close}}$ is the same as in the boxing environment, and $r_{\text{match}}$ gives 1 to the winner, -1 to the loser, and 0 when the match ends in a draw. In contrast to the boxers, who can collect $r_{\text{match}}$ continuously until the end of the match, the fencers receive a non-zero $r_{\text{match}}$ only once, when the match terminates (i.e., when the winner and the loser are decided). We use two penalty terms, fall and out-of-arena. The fall term is the same as in boxing, and out-of-arena is activated when the fencer goes out of the arena because, unlike the boxing environment, the fencing environment has no physical boundary. We use the same learning algorithm without changing any hyper-parameters.
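The fencing match reward is sparse: it is granted once, when the match ends. A small sketch follows, assuming the winner is decided from the touch triggers l_pl and l_op described above; how a draw is declared is an assumption here, not stated in the text.

```python
def fencing_match_reward(l_pl, l_op, match_over):
    """Sketch of the sparse fencing r_match of Eq. (7): +1 for the winner,
    -1 for the loser, 0 for a draw, given only once when the match ends."""
    if not match_over:
        return 0.0
    if l_pl == 1 and l_op == 0:
        return 1.0    # the player touched the opponent: winner
    if l_op == 1 and l_pl == 0:
        return -1.0   # the opponent touched the player: loser
    return 0.0        # draw: both or neither touched (assumed)
```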
5 RESULTS
Table 1 includes all parameters used for the physics simulation, the DD-PPO algorithm, and the reward functions. We used the PyTorch [2019] implementation in RLlib [2018] for the DD-PPO algorithm.
Fig. 7. Other learning procedures. Enc-Dec-e2e generates unnatural postures, especially for the red boxer, and all experts are activated equally all the time, so the trajectory of the expert weights (top row) looks like a point, which implies that the learned experts have deviated far from the original experts. On the other hand, Enc-only generates postures similar to those included in the input motions; however, the boxers can only stand in place, which implies that high-level tactics were not learned successfully.
Fig. 11. An example of a fight between two boxers. The boxers move around and get closer (top-row), keep each other in check by using light jabs (middle-row),
and the red boxer is knocked down by the blue boxer (bottom-row).
Fig. 12. Three fencing match examples. The blue fencer wins (top-row), a draw (middle-row), and the red fencer wins (bottom-row).
characters in a physically plausible manner in various applications such as video games and virtual reality.

REFERENCES
Yu Bai and Chi Jin. 2020. Provable Self-Play Algorithms for Competitive Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 551–560. http://proceedings.mlr.press/v119/bai20a.html
Bowen Baker, Ingmar Kanitscheider, Todor M. Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. 2019. Emergent Tool Use From Multi-Agent Autocurricula. CoRR (2019). arXiv:1909.07528
Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. 2018. Emergent Complexity via Multi-Agent Competition. arXiv:1710.03748
Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: Data-driven Responsive Control of Physics-based Characters. ACM Trans. Graph. 38, 6, Article 206 (2019). http://doi.acm.org/10.1145/3355089.3356536
Glen Berseth, Cheng Xie, Paul Cernek, and Michiel van de Panne. 2018. Progressive Reinforcement Learning with Distillation for Multi-Skilled Motion Control. CoRR abs/1802.04765 (2018).
L. Busoniu, R. Babuska, and B. De Schutter. 2008. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 2 (2008), 156–172. https://doi.org/10.1109/TSMCC.2007.913919
Nuttapong Chentanez, Matthias Müller, Miles Macklin, Viktor Makoviychuk, and Stefan Jeschke. 2018. Physics-based motion capture imitation with deep reinforcement learning. In Motion, Interaction and Games, MIG 2018. ACM, 1:1–1:10. https://doi.org/10.1145/3274247.3274506
Caroline Claus and Craig Boutilier. 1998. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (AAAI '98/IAAI '98). 746–752.
Alexander Clegg, Wenhao Yu, Jie Tan, C. Karen Liu, and Greg Turk. 2018. Learning to Dress: Synthesizing Human Dressing Motion via Deep Reinforcement Learning. ACM Trans. Graph. 37, 6, Article 179 (2018). http://doi.acm.org/10.1145/3272127.3275048
CMU. 2002. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/.
Brandon Haworth, Glen Berseth, Seonghyeon Moon, Petros Faloutsos, and Mubbasir Kapadia. 2020. Deep Integration of Physical Humanoid Control and Crowd Navigation. In Motion, Interaction and Games (MIG '20). Article 15. https://doi.org/10.1145/3424636.3426894
Joseph Henry, Hubert P. H. Shum, and Taku Komura. 2014. Interactive Formation Control in Complex Environments. IEEE Transactions on Visualization and Computer Graphics 20, 2 (2014), 211–222. https://doi.org/10.1109/TVCG.2013.116
Edmond S. L. Ho, Taku Komura, and Chiew-Lan Tai. 2010. Spatial Relationship Preserving Character Motion Adaptation. ACM Trans. Graph. 29, 4, Article 33 (2010). https://doi.org/10.1145/1778765.1778770
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned Neural Networks for Character Control. ACM Trans. Graph. 36, 4, Article 42 (2017). http://doi.acm.org/10.1145/3072959.3073663
Junling Hu and Michael P. Wellman. 1998. Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98). 242–250.
K. Hyun, M. Kim, Y. Hwang, and J. Lee. 2013. Tiling Motion Patches. IEEE Transactions on Visualization and Computer Graphics 19, 11 (2013), 1923–1934. https://doi.org/10.1109/TVCG.2013.80
Jongmin Kim, Yeongho Seol, Taesoo Kwon, and Jehee Lee. 2014. Interactive Manipulation of Large-Scale Crowd Animation. ACM Trans. Graph. 33, 4, Article 83 (2014). https://doi.org/10.1145/2601097.2601170
Manmyung Kim, Kyunglyul Hyun, Jongmin Kim, and Jehee Lee. 2009. Synchronized Multi-Character Motion Editing. ACM Trans. Graph. 28, 3, Article 79 (2009). https://doi.org/10.1145/1531326.1531385
T. Kwon, Y. Cho, S. I. Park, and S. Y. Shin. 2008. Two-Character Motion Analysis and Synthesis. IEEE Transactions on Visualization and Computer Graphics 14, 3 (2008), 707–720. https://doi.org/10.1109/TVCG.2008.22
Taesoo Kwon, Kang Hoon Lee, Jehee Lee, and Shigeo Takahashi. 2008. Group Motion Editing. ACM Trans. Graph. 27, 3 (2008), 1–8. https://doi.org/10.1145/1360612.1360679
Jehee Lee and Kang Hoon Lee. 2004. Precomputing Avatar Behavior from Human Motion Data. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '04). 79–87. https://doi.org/10.1145/1028523.1028535
Kang Hoon Lee, Myung Geol Choi, and Jehee Lee. 2006. Motion Patches: Building Blocks for Virtual Environments Annotated with Motion Data. ACM Trans. Graph. 25, 3 (2006), 898–906. https://doi.org/10.1145/1141911.1141972
Seunghwan Lee, Moonseok Park, Kyoungmin Lee, and Jehee Lee. 2019. Scalable Muscle-actuated Human Simulation and Control. ACM Trans. Graph. 38, 4, Article 73 (2019). http://doi.acm.org/10.1145/3306346.3322972
Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. 2018. RLlib: Abstractions for Distributed Reinforcement Learning. arXiv:1712.09381
Michael L. Littman. 1994. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the Eleventh International Conference on Machine Learning (ICML '94). 157–163.
C. Karen Liu, Aaron Hertzmann, and Zoran Popović. 2006. Composition of Complex Optimal Multi-Character Motions. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '06). 215–222.
Libin Liu and Jessica Hodgins. 2017. Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning. ACM Trans. Graph. 36, 3, Article 42a (2017). http://doi.acm.org/10.1145/3083723
Libin Liu and Jessica Hodgins. 2018. Learning Basketball Dribbling Skills Using Trajectory Optimization and Deep Reinforcement Learning. ACM Trans. Graph. 37, 4, Article 142 (2018). http://doi.acm.org/10.1145/3197517.3201315
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). 6382–6393.
Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. 2017. Learning human behaviors from motion capture by adversarial imitation. CoRR abs/1707.02201 (2017).
T. T. Nguyen, N. D. Nguyen, and S. Nahavandi. 2020. Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE Transactions on Cybernetics 50, 9 (2020), 3826–3839. https://doi.org/10.1109/TCYB.2020.2977374
Afshin OroojlooyJadid and Davood Hajinezhad. 2020. A Review of Cooperative Multi-Agent Deep Reinforcement Learning. arXiv:1908.03963
Soohwan Park, Hoseok Ryu, Seyoung Lee, Sunmin Lee, and Jehee Lee. 2019. Learning Predict-and-simulate Policies from Unorganized Human Motion Data. ACM Trans. Graph. 38, 6, Article 205 (2019). http://doi.acm.org/10.1145/3355089.3356501
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. 8024–8035.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-guided Deep Reinforcement Learning of Physics-based Character Skills. ACM Trans. Graph. 37, 4, Article 143 (2018). http://doi.acm.org/10.1145/3197517.3201311
Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. 2017. DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Trans. Graph. 36, 4, Article 41 (2017). http://doi.acm.org/10.1145/3072959.3073602
Yevgeny Seldin and Aleksandrs Slivkins. 2014. One Practical Algorithm for Both Stochastic and Adversarial Bandits. In Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 32). 1287–1295. http://proceedings.mlr.press/v32/seldinb14.html
Hubert P. H. Shum, Taku Komura, Masashi Shiraishi, and Shuntaro Yamazaki. 2008. Interaction Patches for Multi-Character Animation. ACM Trans. Graph. 27, 5 (2008). https://doi.org/10.1145/1409060.1409067
Hubert P. H. Shum, Taku Komura, and Shuntaro Yamazaki. [n.d.]. Simulating Interactions of Avatars in High Dimensional State Space. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games (I3D '08). 131–138. https://doi.org/10.1145/1342250.1342271
Hubert P. H. Shum, Taku Komura, and Shuntaro Yamazaki. 2007. Simulating Competitive Interactions Using Singly Captured Motions. In Proceedings of the 2007 ACM Symposium on Virtual Reality Software and Technology (VRST '07). 65–72. https://doi.org/10.1145/1315184.1315194
H. P. H. Shum, T. Komura, and S. Yamazaki. 2012. Simulating Multiple Character Interactions with Collaborative and Adversarial Goals. IEEE Transactions on Visualization and Computer Graphics 18, 5 (2012), 741–752. https://doi.org/10.1109/TVCG.2010.257
Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Trans. Graph. 38, 6 (2019), 209:1–209:14. https://doi.org/10.1145/3355089.3356505
Jie Tan, C. Karen Liu, and Greg Turk. 2011. Stable Proportional-Derivative Controllers. IEEE Computer Graphics and Applications 31, 4 (2011), 34–44. https://doi.org/10.1109/MCG.2011.30
Kevin Wampler, Erik Andersen, Evan Herbst, Yongjoon Lee, and Zoran Popović. 2010. Character Animation in Two-Player Adversarial Games. ACM Trans. Graph. 29, 3, Article 26 (2010). https://doi.org/10.1145/1805964.1805970
Ziyu Wang, Josh Merel, Scott Reed, Greg Wayne, Nando de Freitas, and Nicolas Heess. 2017. Robust Imitation of Diverse Behaviors. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). http://dl.acm.org/citation.cfm?id=3295222.3295284
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. 2020. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2020. A Scalable Approach to Control Diverse Behaviors for Physically Simulated Characters. ACM Trans. Graph. 39, 4, Article 33 (2020). https://doi.org/10.1145/3386569.3392381
Jungdam Won and Jehee Lee. 2019. Learning Body Shape Variation in Physics-based Characters. ACM Trans. Graph. 38, 6, Article 207 (2019). http://doi.acm.org/10.1145/3355089.3356499
Jungdam Won, Kyungho Lee, Carol O'Sullivan, Jessica K. Hodgins, and Jehee Lee. 2014. Generating and Ranking Diverse Multi-Character Interactions. ACM Trans. Graph. 33, 6, Article 219 (2014). https://doi.org/10.1145/2661229.2661271
Zhaoming Xie, Hung Yu Ling, Nam Hee Kim, and Michiel van de Panne. 2020. ALLSTEPS: Curriculum-driven Learning of Stepping Stone Skills. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation.
Barbara Yersin, Jonathan Maïm, Julien Pettré, and Daniel Thalmann. 2009. Crowd Patches: Populating Large-Scale Virtual Environments for Real-Time Applications. In Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games (I3D '09). 207–214. https://doi.org/10.1145/1507149.1507184
Wenhao Yu, Greg Turk, and C. Karen Liu. 2018. Learning Symmetric and Low-energy Locomotion. ACM Trans. Graph. 37, 4, Article 144 (2018). http://doi.acm.org/10.1145/3197517.3201397
Victor Brian Zordan and Jessica K. Hodgins. 2002. Motion Capture-Driven Simulations That Hit and React. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '02). 89–96. https://doi.org/10.1145/545261.545276
Victor Brian Zordan, Anna Majkowska, Bill Chiu, and Matthew Fast. 2005. Dynamic Response for Motion Capture Animation. ACM Trans. Graph. 24, 3 (2005), 697–701. https://doi.org/10.1145/1073204.1073249