Scene Transformer: A Unified Architecture For Predicting Multiple Agent Trajectories
1 INTRODUCTION
Motion planning in a dense real-world urban environment is a mission-critical problem for deploying
autonomous driving technology. Autonomous driving is traditionally considered too difficult for a
single end-to-end learned system (Thrun et al., 2006). Thus, researchers have opted to split the task
into sequential sub-tasks (Zeng et al., 2019): (i) perception, (ii) motion prediction, and (iii) planning.
Perception is the task of detecting and tracking objects in the scene from sensors such as LiDARs
and cameras. Motion prediction involves predicting the future actions of other agents in the scene.
Finally, planning involves creating a motion plan that navigates through dynamic environments.
Dividing the larger problem into sub-tasks achieves optimal performance when each sub-task is truly
independent. However, such a strategy breaks down when the assumption of independence does not
hold. For instance, the sub-tasks of motion prediction and planning are not truly independent—the
autonomous vehicle’s actions may significantly impact the behaviors of other agents. Similarly, the
behaviors of other agents may dramatically change what is a good plan. The goal of this work is
to take a step in the direction of unifying motion prediction and planning by developing a model
that can exploit varying forms of conditioning information, such as the AV’s goal, and produce joint
consistent predictions about the future for all agents simultaneously.
While the motion prediction task has traditionally been formulated around per-agent independent
predictions, recent datasets (Ettinger et al., 2021; Zhan et al., 2019) have introduced interaction
prediction tasks that enable us to study joint future prediction (Figure 1). These interaction prediction
tasks require models to predict the joint futures of multiple agents: models are expected to produce
future predictions for all agents such that the agents' futures are consistent¹ with one another.

¹ Marginal agent predictions may conflict with each other (have overlaps), while consistent joint predictions
should have predictions where agents respect each other's behaviors (avoid overlaps) within the same future.
A naive approach to producing joint futures is to consider the exponential number of combinations of
marginal agent predictions. Many of the combinations are not consistent, especially when agents
have overlapping trajectories. We present a unified model that naturally captures the interactions
between agents, and can be trained as a joint model to produce scene-level consistent predictions
across all agents (Figure 1, right). Our model uses a scene-centric representation for all agents (Lee
et al., 2017; Hong et al., 2019; Casas et al., 2020a; Salzmann et al., 2020) to allow scaling to large
numbers of agents in dense environments. We employ a simple variant of self-attention (Vaswani
et al., 2017) in which the attention mechanism is efficiently factorized across the agent and time axes. The
resulting architecture simply alternates attention between dimensions representing time and agents
across the scene, resulting in a computationally-efficient, uniform, and scalable architecture.
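As an illustration of this factorization, the following minimal NumPy sketch alternates single-head attention over the time axis and then over the agent axis of an [A, T, D] scene tensor. Learned projections, multiple heads, and normalization are omitted, and all names are illustrative rather than the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the second-to-last axis.
    # x: [..., N, D]; queries, keys, and values are the inputs themselves
    # (learned projections omitted for brevity).
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # [..., N, N]
    return softmax(scores, axis=-1) @ x                # [..., N, D]

def factorized_block(x):
    # x: [A, T, D] scene tensor (agents x time x features).
    # Attention over time: each agent attends across its own timesteps.
    x = x + self_attention(x)
    # Attention over agents: each timestep attends across agents.
    x_t = np.swapaxes(x, 0, 1)                         # [T, A, D]
    x_t = x_t + self_attention(x_t)
    return np.swapaxes(x_t, 0, 1)                      # back to [A, T, D]

A, T, D = 8, 10, 16
out = factorized_block(np.random.randn(A, T, D))
print(out.shape)  # (8, 10, 16)
```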
We find that the resulting model, termed Scene Transformer, achieves superior performance on both
independent (marginal) and interactive (joint) prediction benchmarks.
Moreover, we demonstrate a novel formulation of the task using a masked sequence model, inspired by
recent advances in language modeling (Brown et al., 2020; Devlin et al., 2019), to allow conditioning
of the model on the autonomous vehicle (AV) goal state or full trajectory. In this reformulation, a
single model can naturally perform tasks such as motion prediction, conditional motion prediction,
and goal-conditioned prediction simply by changing which data is visible at inference time.
We hope that our unified architecture and flexible problem formulation opens up new research
directions for further combining motion prediction and planning. In summary, our key contributions
in this work are:
• A novel, scene-centric approach that allows us to gracefully switch training the model to
produce either marginal (independent) or joint agent predictions in a single feed-forward
pass. Our model achieves state-of-the-art results on both marginal and joint prediction tasks on both
the Argoverse and the Waymo Open Motion Dataset.
• A permutation equivariant Transformer-based architecture factored over agents, time, and
road graph elements that exploits the inherent symmetries of the problem. The resulting
architecture is efficient and integrates the world state in a unified way.
• A masked sequence modeling approach that enables us to condition on hypothetical agent
futures at inference time, enabling conditional motion prediction or goal-conditioned prediction.
2 RELATED WORK
Motion prediction architectures. Motion prediction models have flourished in recent years, due
to the rise in interest in self-driving applications and the release of related datasets and benchmarks
(Kesten et al., 2019; Chang et al., 2019; Caesar et al., 2020; Ettinger et al., 2021). Successful
models must take into account the history of agent motion, and the elements of the road graph (e.g.,
lanes, stop lines, traffic light dynamic state). Furthermore, such models must learn the relationships
between these agents in the context of the road graph environment.
One class of models draws heavily upon the computer vision literature, rendering inputs as a multi-
channel rasterized top-down image (Cui et al., 2019; Chai et al., 2019; Lee et al., 2017; Hong et al.,
2019; Casas et al., 2020a; Salzmann et al., 2020; Zhao et al., 2019). In this approach, relationships
between scene elements are captured via convolutional deep architectures. However, the localized
structure of the receptive field makes capturing spatially-distant interactions challenging. A popular
alternative is to use an entity-centric approach. With this approach, agent state history is typically
encoded via sequence modeling techniques like RNNs (Mercat et al., 2020; Khandelwal et al., 2020;
Lee et al., 2017; Alahi et al., 2016; Rhinehart et al., 2019) or temporal convolutions (Liang et al.,
2020). Road elements are approximated with basic primitives (e.g. piecewise-linear segments)
which encode pose information and semantic type. Modeling relationships between entities is often
presented as an information aggregation process, and models employ pooling (Zhao et al., 2020; Gao
et al., 2020; Lee et al., 2017; Alahi et al., 2016; Gupta et al., 2018), soft-attention (Mercat et al., 2020;
Zhao et al., 2020; Salzmann et al., 2020) as well as graph neural networks (Casas et al., 2020a; Liang
et al., 2020; Khandelwal et al., 2020).
Like our proposed method, several recent models use Transformers (Vaswani et al., 2017), composed
of multi-headed attention layers. Transformers are a popular state-of-the-art choice for sequence
modeling in natural language processing (Brown et al., 2020; Devlin et al., 2019), and have recently
shown promise in core computer vision tasks such as detection (Bello et al., 2019; Carion et al., 2020;
Srinivas et al., 2021), tracking (Hung et al., 2020) and classification (Ramachandran et al., 2019;
Vaswani et al., 2021; Dosovitskiy et al., 2021; Bello, 2021; Bello et al., 2019). For motion modeling,
recent work has employed variations of self-attention and Transformers for modeling different axes:
temporal trajectory encoding and decoding (Yu et al., 2020; Giuliari et al., 2020; Yuan et al., 2021),
encoding relationships between agents (Li et al., 2020; Park et al., 2020; Yuan et al., 2021; Yu et al.,
2020; Mercat et al., 2020; Bhat et al., 2020), and encoding relationships with road elements. When
applying self-attention over multiple axes, past work used independent self-attention for each axis (Yu
et al., 2020), or flattened two axes together into one joint self-attention layer (Yuan et al., 2021) – by
comparison, our method proposes axis-factored attention to model relationships between time steps,
agents, and road graph elements in a unified way.
Scene-centric versus agent-centric representations. Another key design choice is the frame of
reference in which the representation is encoded. Some models do a majority of modeling in a
global, scene-level coordinate frame, such as work that employs a rasterized top-down image (Cui
et al., 2019; Chai et al., 2019; Lee et al., 2017; Hong et al., 2019; Casas et al., 2020a; Salzmann
et al., 2020). This can lead to a more efficient model due to a single shared representation of world
state in a common coordinate frame, but comes with the potential sacrifice of pose-invariance. On
the other hand, models that reason in the agent-coordinate frame (Mercat et al., 2020; Zhao et al.,
2020; Khandelwal et al., 2020) are intrinsically pose-invariant, but scale linearly with the number
of agents, or quadratically with the number of pairwise interactions between agents. Many works
employ a mix of a top-down raster representation for road representation fused with a per-agent
representations (Rhinehart et al., 2019; Tang & Salakhutdinov, 2019; Lee et al., 2017). Similar to our
own work, LaneGCN (Liang et al., 2020) is agent-centric yet representations are in a global frame – to
the best of our knowledge, this is the only other work to do so. This enables efficient reasoning while
capturing arbitrarily distant interactions and high-fidelity state representations without rasterization.
Representing multi-agent futures. A common way to represent agent futures is via a weighted set
of trajectories per agent (Alahi et al., 2016; Biktairov et al., 2020; Buhet et al., 2020; Casas et al.,
2020a; Chai et al., 2019; Cui et al., 2019; Gao et al., 2020; Hong et al., 2019; Lee et al., 2017;
Marchetti et al., 2020; Mercat et al., 2020; Salzmann et al., 2020; Zhao et al., 2020; Chandra et al.,
2018). This representation is encouraged by benchmarks which primarily focus on per-agent distance
error metrics (Caesar et al., 2020; Chang et al., 2019; Zhan et al., 2019). We argue in this work that
modeling joint futures in a multi-agent environment (Figure 1, right) is an important concept that
has been minimally explored in prior work. Some prior work considers a factorized pairwise joint
distribution, where a subset of agent futures are conditioned on other agents: informally, modeling
P(X) and P(Y|X) for agents X and Y (Khandelwal et al., 2020; Tolstaya et al., 2021; Salzmann
et al., 2020). To generalize joint prediction to arbitrary multi-agent settings, other work (Tang
& Salakhutdinov, 2019; Rhinehart et al., 2019; Casas et al., 2020b; Suo et al., 2021; Yeh et al.,
2019) iteratively rolls out samples per-agent, where each agent is conditioned on previously sampled
trajectory steps. In contrast, our model directly decodes a set of k distinct joint futures with associated
likelihoods.

Figure 2: Single model architecture for multiple motion prediction tasks. Left: Different masking
strategies define distinct tasks. The left column represents the current time and the top row represents
the autonomous vehicle (AV). A single model can be trained for motion prediction, conditional
motion prediction, and goal-directed prediction by matching the masking strategy to each prediction
task. Right: Attention-based encoder-decoder architecture for joint scene modeling. The architecture
employs factored attention along the time and agent axes to exploit the dependencies in the data, and
cross-attention to inject side information.
3 METHODS
The Scene Transformer model has three stages: (i) Embed the agents and the road graph into a high
dimensional space, (ii) Employ an attention-based network to encode the interactions between agents
and the road graph, (iii) Decode multiple futures using an attention-based network. The model takes
as input a feature for every agent at every time step, and also predicts an output for every agent
at every time step. We employ an associated mask, in which every agent time step carries an
indicator of 1 (hidden) or 0 (visible) that specifies whether the input feature is hidden (i.e., removed)
from the model. This approach mirrors that of masked language models such as BERT
(Devlin et al., 2019). The approach is flexible, enabling us to simultaneously train a single model for
motion prediction (MP) (Cui et al., 2019; Chai et al., 2019; Lee et al., 2017; Hong et al., 2019; Casas
et al., 2020a; Salzmann et al., 2020; Liang et al., 2020; Khandelwal et al., 2020),
conditional motion prediction (CMP) (Khandelwal et al., 2020; Tolstaya et al., 2021; Salzmann et al.,
2020) and goal-conditioned prediction (GCP) (Deo & Trivedi, 2020) simply by changing what data
is shown to the model (Figure 2, left). We summarize the key contributions below, and reserve details
for the Appendix.
Multi-task representation. The key representation in the model is a 3-dimensional tensor of A
agents with D feature dimensions across T time steps. At every layer within the architecture, we aim
to maintain a representation of shape [A, T, D], or when decoding, [F, A, T, D] across F potential
futures. Each task (MP, CMP, GCP) can be formulated as a query with a specific masking strategy by
setting the indicator mask to 0, thus providing that data to the model (Figure 2, left). The goal of the
model is to impute the features for each shaded region corresponding to subsets of time and agents in
the scenario that are masked.
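To make the masking concrete, here is a small sketch of how the [A, T] hidden mask (1 = hidden, 0 = visible) could be constructed for the three tasks. The current-time index, the AV row, and the sizes are assumptions for illustration only, not the dataset's conventions.

```python
import numpy as np

A, T = 8, 91          # agents x timesteps (WOMD-like sizes, illustrative)
T_CUR = 10            # index of the current timestep (assumed)
AV = 0                # row of the autonomous vehicle (assumed)

def mp_mask():
    """Motion prediction: history visible, all futures hidden (1 = hidden)."""
    m = np.zeros((A, T), dtype=np.int32)
    m[:, T_CUR + 1:] = 1
    return m

def cmp_mask(conditioned_agent):
    """Conditional motion prediction: additionally reveal one agent's full trajectory."""
    m = mp_mask()
    m[conditioned_agent, :] = 0
    return m

def gcp_mask():
    """Goal-conditioned prediction: additionally reveal the AV's final (goal) state."""
    m = mp_mask()
    m[AV, -1] = 0
    return m
```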
We use a scene-centric embedding in which an agent of interest's position serves as the origin²,
and encode all roadgraph elements and agents with respect to it. This is in contrast to approaches that use
an agent-centric representation, where the representations are computed separately for each agent,
treating each agent in turn as the origin.
² For WOMD, we center the scene with respect to the autonomous vehicle (AV). For Argoverse, we center
the scene with respect to the agent that needs to be predicted. Both are centered around what would be the last
visible time step in a motion prediction setup for all tasks.
In detail, we first generate a feature for every agent time step if that time step is visible. Second, we
generate a set of features for the static road graph (road elements that are static in space and time), learning one
feature vector per polyline (with signs being polylines of length 1) using a PointNet (Qi et al., 2017).
Last, we generate a set of features for the dynamic road graph, which contains road elements that are static in
space but dynamic in time (e.g., traffic lights), again with one feature vector per object. All three categories
have xyz position information, which we preprocess to center and rotate around the agent of interest
and then encode with sinusoidal position embeddings (Vaswani et al., 2017).
Our model also needs to predict a probability score for each future (in the joint model) or trajectory
(in the marginal model). In order to do so, we need a feature representation that summarizes the scene
and each agent. After the first set of factorized self-attention layers, we compute the mean of the
agent features tensor across the agent and time dimension separately, and add these as an additional
artificial agent and time making our internal representation [A + 1, T + 1, D] (Figure 2, left panel).
This artificial agent and time step propagates through the network, and provides the model with extra
capacity for representing each agent that is not tied to any timestep. At the final layer, we slice out
the artificial agent and time step to obtain summary features for each agent (the additional time per
agent), and for the scene (the ‘corner’ feature that is both additional time and agent). This feature is
then processed by a 2-layer MLP producing a single logit value that we use with a softmax classifier
for a permutation-equivariant estimate of the probability of each future.
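A schematic NumPy version of this summary mechanism (means appended as an artificial agent and an artificial time step, then sliced out at the end) might look as follows; it omits the attention layers in between and uses illustrative names.

```python
import numpy as np

def add_summary_slots(x):
    # x: [A, T, D] agent features. Append the mean over agents as an extra "agent"
    # and the mean over time as an extra "time step", giving [A+1, T+1, D].
    agent_mean = x.mean(axis=0, keepdims=True)            # [1, T, D]
    x = np.concatenate([x, agent_mean], axis=0)           # [A+1, T, D]
    time_mean = x.mean(axis=1, keepdims=True)             # [A+1, 1, D]
    return np.concatenate([x, time_mean], axis=1)         # [A+1, T+1, D]

def slice_summaries(x):
    # After the final layer, recover per-agent summaries (the extra time column)
    # and the scene summary (the "corner" feature).
    per_agent = x[:-1, -1, :]   # [A, D]
    scene = x[-1, -1, :]        # [D]
    return per_agent, scene
```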
The output of our model is a tensor of shape [F, A, T, 7] representing the location and heading of each
agent at the given time step. Because the model uses a scene-centric representation for the locations
through positional embeddings, the model is able to predict all agents simultaneously in a single
feed-forward pass. This design also makes it possible to have a straightforward switch between joint
future predictions and marginal future predictions.
To perform joint future prediction, we treat each future (indexed by the first dimension) as a coherent
future across all agents. Thus, we aggregate the displacement loss across all agents³ and time steps to
build a loss tensor of shape [F ]. We only back-propagate the loss through the individual future
that most closely matches the ground-truth in terms of displacement loss (Gupta et al., 2018; Yeh
et al., 2019). For marginal future predictions, each agent is treated independently. After computing
the displacement loss of shape [F, A], we do not aggregate across agents. Instead, we select the
future with minimum loss for each agent separately, and back-propagate the error correspondingly
(Appendix, Figure 8).
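The difference between the two loss formulations can be summarized in a few lines of NumPy. This is a sketch of the selection logic only; the full loss also includes heading and uncertainty terms, and restricts the set of agents as noted in the footnote.

```python
import numpy as np

def displacement_loss(pred_xy, gt_xy):
    # pred_xy: [F, A, T, 2] predicted positions; gt_xy: [A, T, 2].
    return np.linalg.norm(pred_xy - gt_xy[None], axis=-1).mean(axis=-1)  # [F, A]

def joint_loss(pred_xy, gt_xy):
    # Aggregate over agents first; only the single future that best matches the
    # whole scene contributes (and would receive gradient during training).
    per_future = displacement_loss(pred_xy, gt_xy).mean(axis=1)  # [F]
    return per_future[np.argmin(per_future)]

def marginal_loss(pred_xy, gt_xy):
    # Select the best future independently for each agent.
    per_future_agent = displacement_loss(pred_xy, gt_xy)         # [F, A]
    return per_future_agent.min(axis=0).mean()
```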
Evaluation metrics for motion prediction. We evaluate the quality of k weighted trajectory
hypotheses using the standard evaluation metrics: minADE, minFDE, miss rate, and mAP. Each
evaluation metric attempts to measure how close the top k trajectories are to the ground-truth observation.
A simple and common distance-based metric is to measure the L2 norm between a given trajectory
and the ground truth (Alahi et al., 2016; Pellegrini et al., 2009). minADE reports the L2 norm of the
trajectory with the minimal distance. minFDE likewise reports the L2 norm of the trajectory with the
smallest distance only evaluated at the final location of the trajectory. We additionally report the miss
rate (MR) and mean average precision (mAP) to capture how well a model predicts all of the future
trajectories of agents probabilistically (Yeh et al., 2019; Chang et al., 2019; Ettinger et al., 2021). For
joint future evaluation settings, we measure the scene-level equivalents (minSADE, minSFDE, and
SMR) that evaluate the prediction of the best single consistent future (Casas et al., 2020b).
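For reference, a minimal sketch of the per-agent and scene-level variants of these distance metrics (k hypotheses, ignoring probability weighting, miss rate, and mAP):

```python
import numpy as np

def min_ade_fde(pred, gt):
    # pred: [K, T, 2] trajectory hypotheses for one agent; gt: [T, 2].
    dist = np.linalg.norm(pred - gt[None], axis=-1)   # [K, T]
    return dist.mean(axis=1).min(), dist[:, -1].min()

def min_sade_sfde(pred, gt):
    # Scene-level variants: pred [K, A, T, 2], gt [A, T, 2]; the same joint
    # future k must be used for every agent before taking the minimum.
    dist = np.linalg.norm(pred - gt[None], axis=-1)   # [K, A, T]
    sade = dist.mean(axis=(1, 2))                     # [K]
    sfde = dist[:, :, -1].mean(axis=1)                # [K]
    return sade.min(), sfde.min()
```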
4 RESULTS
We evaluate the Scene Transformer on motion prediction tasks from the Argoverse dataset (Chang
et al., 2019) and Waymo Open Motion Dataset (WOMD) (Ettinger et al., 2021). The Argoverse dataset
consists of 324,000 run segments (each 5 seconds in length) from 290 km of roadway containing 11.7
million agent tracks in total, and focuses on single agent (marginal) motion prediction. The WOMD
dataset consists of 104,000 run segments (each 20 seconds in length) from 1,750 km of roadway
containing 7.64 million unique agent tracks. Importantly, the WOMD has two tasks, each with their
own set of evaluation metrics: a marginal motion prediction challenge that evaluates the quality of
the motion predictions independently for each agent (up to 8 per scene), and a joint motion prediction
challenge that evaluates the quality of a model’s joint predictions of exactly 2 agents per scene. We
train each model on a Cloud TPU (Jouppi et al., 2017). See the Appendix for all training details.
First, in Section 4.1, we focus on the marginal prediction task and show that Scene Transformer
achieves competitive results on both Argoverse (Chang et al., 2019) and the WOMD (Ettinger et al.,
2021). In Section 4.2, we focus on the joint prediction task and train the Scene Transformer with
our joint loss formulation (see Section 3.4). We show that with a single switch of the loss formula,
we can achieve superior joint motion prediction performance. In Section 4.3 we discuss factorized
versus non-factorized attention. Finally, in Section 4.4 we show how our masked sequence model
formulation allows us to train a multi-task model capable of motion prediction, conditional motion
prediction, and goal-conditioned prediction. In Appendix B.1 we discuss the trade-off between
marginal and joint models.
³ Each dataset identifies a subset of agents to be predicted. We only include this subset in our loss calculation.
For Argoverse, this is 1 agent; for WOMD this is 2-8 agents.
Table 2: Marginal predictive performance on Waymo Open Motion Dataset motion prediction.
Results presented on the standard splits of the validation and test datasets (Ettinger et al., 2021)
evaluated with traditional marginal metrics for t = 8 seconds. minADE, minFDE reported for k = 6
predictions (Alahi et al., 2016; Pellegrini et al., 2009); Miss Rate (MR) (Chang et al., 2019) within 2
meters of the target. See Appendix for t = 3 or 5 seconds. We include the challenge winner results in
this table (Waymo, 2021).
We first evaluate the performance of Scene Transformer trained and evaluated as a traditional marginal,
per-agent motion prediction model. This is analogous to the problem illustrated in Figure 1 (left).
For all results until Section 4.4, we use a masking strategy that provides the model with all agents as
input, but with their futures hidden. We also mask out future traffic light information.
Argoverse. We evaluate on the popular Argoverse (Chang et al., 2019) benchmark to demonstrate the
efficacy of our architecture. During training and evaluation, the model is only required to predict the
future of the single agent of interest. Our best Argoverse model uses D = 512 feature dimensions and
label smoothing for trajectory classification. Our model achieves state-of-the-art results compared to
published prior work⁴ in terms of minADE and minFDE (Table 1).

⁴ We exclude comparisons to public leaderboard entries that have not been published, since their details are
not available, but note that our results are competitive on the leaderboard as of the submission date.
Waymo Open Motion Dataset (WOMD). We next evaluate the performance of our model with
D = 256 on the recently released WOMD (Ettinger et al., 2021) for the marginal motion prediction
task. This task is a standard motion prediction task where up to 8 agents per scene are selected to
have their top 6 motion predictions evaluated independently. Our model trained with the marginal
loss achieves state-of-the-art results on the minADE, minFDE, and miss rate metrics (Table 2).
To evaluate the effectiveness of Scene Transformer when trained with a joint loss formulation (Section
3.4), we evaluate our model on the Interaction Prediction challenge in WOMD (Ettinger et al., 2021).
This task measures the performance of the model at predicting the joint future trajectories of two
interacting agents (Figure 1, right), and employs joint variants of the common minADE, minFDE,
and Miss Rate (MR) metrics, denoted minSADE, minSFDE, and SMR. Note that the “S” indicates
“scene-level” metrics. These metrics aim to measure the quality and consistency of the two agents’
joint prediction: for example, the joint variant of Miss Rate (SMR) only records a “hit” if both
interacting agents’ predicted trajectories are within the threshold of their respective ground truths.

Table 3: Joint predictive performance on Waymo Open Motion Dataset motion prediction.
Results presented on the interactive splits of the validation and test datasets (Ettinger et al., 2021)
evaluated with scene-level joint metrics for t = 8 seconds. minSADE, minSFDE reported for k = 6
predictions (Alahi et al., 2016; Pellegrini et al., 2009); Miss Rate (MR) (Chang et al., 2019) within 2
meters of the target. “S” indicates a scene-level joint metric. See the Appendix for t = 3 or 5 seconds.
We include the challenge winner results in this table (Waymo, 2021).
We find that for the Interaction Prediction challenge, our joint model's predictions easily
outperform the WOMD-provided baseline as well as a marginal version of the model converted into
joint predictions⁵ (Table 3). This shows that, beyond the strength of our overall
architecture and approach, explicitly training a model as a joint model significantly improves
performance on joint metrics. A notable observation is that even though the Interaction Prediction
task only requires predicting the joint trajectories of two agents, our method is fully general and
predicts joint consistent futures of all agents.
Factorized self-attention confers two benefits to our model: (a) it is more efficient since the attention
is over a smaller set, and (b) it provides an implicit identity to each agent during the attention across
time. We ran an experiment where we replaced each axis-factorized attention layer (each pair of time
and agent factorized layer) with a non-axis factored attention layer. This increased the computational
cost of the model and performed worse on the Argoverse validation dataset: the factorized version
achieved a minADE of 0.609, compared to 0.639 for the non-factorized version.
Our model is formulated as a masked sequence model (Devlin et al., 2019), where at training and
inference time we specify which agent timesteps to mask from the model. This formulation allows
us to select which information about any agent at any timestep to supply to the model, and measure
how the model exploits or responds to this additional information. We can express several motion
prediction tasks at inference time in mask space (Figure 2), in effect providing a multi-task model.
We can use this unique capability to query the model for counterfactuals: what would the model
predict given a subset of agents' full trajectories (conditional motion prediction), or given a subset of
agents' final goals (goal-conditioned motion prediction)? This feature could be particularly useful
for autonomous vehicle planning to predict what various future rollouts of the scene would look like
given a desired goal for the autonomous vehicle (Figure 3).
⁵ Note that the output of a joint model can be directly used in a marginal evaluation. However, converting the
output of a marginal model into a joint evaluation is nuanced because there is no association of futures across
agents (Figure 1). We employ a simple heuristic to convert the outputs of a marginal model for joint evaluation:
we take the top 6 pairs of trajectories from the combination of both agent’s trajectories for 36 total pairs, and
retain the top 6 pairs with the highest product of probabilities.
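A sketch of this pairing heuristic, under the assumption that each pair is scored by the product of the two marginal probabilities as described above; names and signatures are illustrative.

```python
import numpy as np

def marginal_to_joint(traj_a, p_a, traj_b, p_b, k=6):
    # traj_a, traj_b: [K, T, 2] per-agent trajectory sets; p_a, p_b: [K] probabilities.
    # Form all K*K pairs and keep the k with the highest product of probabilities.
    K = len(p_a)
    pairs = [(i, j) for i in range(K) for j in range(K)]
    pairs.sort(key=lambda ij: p_a[ij[0]] * p_b[ij[1]], reverse=True)
    top = pairs[:k]
    joint = [(traj_a[i], traj_b[j]) for i, j in top]
    scores = np.array([p_a[i] * p_b[j] for i, j in top])
    return joint, scores
```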
In the previous results, we used a prefix-mask during training and inference, which shows the first
few steps and predicts the remaining timesteps for all agents. For this experiment, we employ
two masking strategies to test on the interactive split of the WOMD, namely "conditional motion
prediction" (CMP) where we show one of the two interacting agents’ full trajectory, and "goal-
conditioned motion prediction" (GCP) where we show the autonomous vehicle’s desired goal state
(Figure 2). We train a model using each strategy (including MP masking) 1/3 of the time, and
evaluate the model on all tasks. We find that across the 44k evaluation segments, the multi-task
model matches the performance of our MP-only trained joint model on both joint (Table 3) and
marginal metrics (see also Appendix, Table 11); multi-task training does not significantly degrade
standard motion prediction performance. We qualitatively examine the performance of the multi-task
model in a GCP setting, and observe that the joint motion predictions for the AV and non-AV agents
flexibly adapt to selected goal points for the AV (Figure 3).
We note that dataset average metrics like min(S)ADE do not capture forecasting subtleties (Ivanovic
& Pavone, 2021) that are critical for measuring progress in the field. Unfortunately, the Waymo
Open Motion Dataset Interactive Prediction set does not contain the pre-defined metadata necessary to
evaluate specific interesting slices using different models. In Appendix C we provide some initial
analysis of slices that we could compute on the dataset, which illustrates how joint and marginal model
predictions behave differently as the scene a) scales to more agents in complicated scenes, and b)
differs in the average speed of agents. We also show that joint prediction models demonstrate joint
consistent futures via lower inter-prediction overlap compared to marginal models.
5 DISCUSSION
We propose a unified architecture for autonomous driving that is able to model the complex interac-
tions of agents in the environment. Our approach enables a single model to perform motion prediction,
conditional motion prediction, and goal-conditioned prediction. In the field of autonomous driving,
elaborations of this problem formulation may result in learning models for planning systems that
quantitatively improve upon existing systems (Buehler et al., 2009; Montemerlo et al., 2008; Ziegler
et al., 2014; Zeng et al., 2019; Liu et al., 2021a). Likewise, such modeling efforts may be used to
directly address the problem of identifying interacting agents in an environment, and potentially provide an
important task for identifying causal relationships (Arjovsky et al., 2019; Schölkopf et al., 2021).
REFERENCES
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew
Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath
Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike
Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent
Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from
tensorflow.org.
Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio
Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 961–971, 2016.
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization.
arXiv preprint arXiv:1907.02893, 2019.
Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention. In International
Conference on Learning Representations (ICLR), 2021.
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented
convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp. 3286–3295, 2019.
Manoj Bhat, Jonathan Francis, and Jean Oh. Trajformer: Trajectory prediction with local self-attentive
contexts for autonomous driving. arXiv preprint arXiv:2011.14910, 2020.
Yuriy Biktairov, Maxim Stebelev, Irina Rudenko, Oleh Shliazhko, and Boris Yangel. Prank: motion
prediction based on ranking. In Advances in neural information processing systems, 2020.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. In Advances in Neural Information Processing Systems, 2020.
Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The DARPA urban challenge: autonomous
vehicles in city traffic, volume 56. springer, 2009.
Thibault Buhet, Emilie Wirbel, and Xavier Perrotton. Plop: Probabilistic polynomial objects trajectory
planning for autonomous driving. In Conference on Robot Learning, 2020.
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush
Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for
autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11621–11631, 2020.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey
Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer
Vision, pp. 213–229. Springer, 2020.
Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Urtasun. Spagnn: Spatially-aware graph
neural networks for relational behavior forecasting from sensor data. In 2020 IEEE International
Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020, pp.
9491–9497. IEEE, 2020a.
Sergio Casas, Cole Gulino, Simon Suo, Katie Luo, Renjie Liao, and Raquel Urtasun. Implicit
latent variable model for scene-consistent motion forecasting. In Proceedings of the European
Conference on Computer Vision (ECCV). Springer, 2020b.
Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple
probabilistic anchor trajectory hypotheses for behavior prediction. In Conference on Robot
Learning, 2019.
Rohan Chandra, Uttaran Bhattacharya, Aniket Bera, and Dinesh Manocha. Traphic: Trajectory
prediction in dense and heterogeneous traffic using weighted interactions. CoRR, abs/1812.04767,
2018. URL http://arxiv.org/abs/1812.04767.
Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett,
De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting
with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 8748–8757, 2019.
Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo
Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous
driving using deep convolutional networks. In 2019 International Conference on Robotics and
Automation (ICRA), pp. 2090–2096. IEEE, 2019.
Nachiket Deo and Mohan M. Trivedi. Trajectory forecasts in unknown environments conditioned
on grid-based plans. CoRR, abs/2001.00735, 2020. URL http://arxiv.org/abs/2001.
00735.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Conference of the North American
Chapter of the Association for Computational Linguistics, 2019.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image
is worth 16x16 words: Transformers for image recognition at scale. In International Conference
on Learning Representations (ICLR), 2021.
Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning
Chai, Ben Sapp, Charles Qi, Yin Zhou, et al. Large scale interactive motion forecasting for
autonomous driving: The waymo open motion dataset. arXiv preprint arXiv:2104.10133, 2021.
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid.
VectorNet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde.
HOME: heatmap output for future motion estimation. CoRR, abs/2105.10968, 2021. URL
https://arxiv.org/abs/2105.10968.
Francesco Giuliari, Irtiza Hasan, Marco Cristani, and Fabio Galasso. Transformer networks for
trajectory forecasting. In International Conference on Pattern Recognition, 2020.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and
statistics, pp. 249–256. JMLR Workshop and Conference Proceedings, 2010.
Junru Gu, Chen Sun, and Hang Zhao. Densetnt: End-to-end trajectory prediction from dense goal
sets. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially
acceptable trajectories with generative adversarial networks. In CVPR, 2018.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidi-
mensional transformers. arXiv preprint arXiv:1912.12180, 2019.
Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a
convolutional model of semantic interactions. In CVPR, 2019.
Zhiyu Huang, Xiaoyu Mo, and Chen Lv. Recoat: A deep learning framework with attention
mechanism for multi-modal motion prediction. In Workshop on Autonomous Driving, CVPR, 2021.
Wei-Chih Hung, Henrik Kretzschmar, Tsung-Yi Lin, Yuning Chai, Ruichi Yu, Ming-Hsuan Yang,
and Drago Anguelov. Soda: Multi-object tracking with soft data association. arXiv preprint
arXiv:2008.07725, 2020.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International conference on machine learning, pp. 448–456.
PMLR, 2015.
Boris Ivanovic and Marco Pavone. Rethinking trajectory forecasting evaluation. arXiv preprint
arXiv:2107.10297, 2021.
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa,
Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of
a tensor processing unit. In Proceedings of the 44th annual international symposium on computer
architecture, pp. 1–12, 2017.
Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, and Deva Ramanan. What-if
motion prediction for autonomous driving. arXiv preprint arXiv:2008.10587, 2020.
Stepan Konev, Kirill Brodt, and Artsiom Sanakoyeu. Motioncnn: A strong baseline for motion
prediction in autonomous driving. In Workshop on Autonomous Driving, CVPR, 2021.
Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan
Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345,
2017.
Lingyun Luke Li, Bin Yang, Ming Liang, Wenyuan Zeng, Mengye Ren, Sean Segal, and Raquel Urta-
sun. End-to-end contextual perception and prediction with interaction transformer. In IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), 2020.
Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning
lane graph representations for motion forecasting. In European Conference on Computer Vision
(ECCV), 2020.
Jerry Liu, Wenyuan Zeng, Raquel Urtasun, and Ersin Yumer. Deep structured reactive planning.
arXiv preprint arXiv:2101.06832, 2021a.
Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. Multimodal motion
prediction with stacked transformers, 2021b.
Francesco Marchetti, Federico Becattini, Lorenzo Seidenari, and Alberto Del Bimbo. Mantra:
Memory augmented networks for multiple trajectory prediction. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 7143–7152, 2020.
Jean Mercat, Thomas Gilles, Nicole Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo
Gil. Multi-head attention for joint multi-modal vehicle motion forecasting. In IEEE International
Conference on Robotics and Automation, 2020.
Gregory P. Meyer and Niranjan Thakurdesai. Learning an uncertainty-aware object detector for
autonomous driving. In IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp. 10521–10527, 2020.
Xiaoyu Mo, Zhiyu Huang, and Chen Lv. Multi-modal interactive agent trajectory prediction using
heterogeneous edge-enhanced graph attention network. In Workshop on Autonomous Driving,
CVPR, 2021.
Michael Montemerlo, Jan Becker, Suhrid Bhat, Hendrik Dahlkamp, Dmitri Dolgov, Scott Ettinger,
Dirk Haehnel, Tim Hilden, Gabe Hoffmann, Burkhard Huhnke, et al. Junior: The stanford entry in
the urban challenge. Journal of field Robotics, 25(9):569–597, 2008.
Seong Hyeon Park, Gyubok Lee, Jimin Seo, Manoj Bhat, Minseok Kang, Jonathan Francis, Ashwin
Jadhav, Paul Pu Liang, and Louis-Philippe Morency. Diverse and admissible trajectory forecasting
through multimodal context understanding. In European Conference on Computer Vision, pp.
282–298. Springer, 2020.
Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. You’ll never walk alone:
Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference
on Computer Vision, pp. 261–268. IEEE, 2009.
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 652–660, 2017.
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens.
Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems,
volume 32, 2019.
Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. Precog: Prediction con-
ditioned on goals in visual multi-agent settings. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 2821–2830, 2019.
Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-
feasible trajectory forecasting with heterogeneous data. arXiv preprint arXiv:2001.03093, 2020.
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner,
Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the
IEEE, 2021.
Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani.
Bottleneck transformers for visual recognition. In CVPR, 2021.
Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. Trafficsim: Learning to simulate
realistic multi-agent behaviors. In Conference on Computer Vision and Pattern Recognition
(CVPR), 2021.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking
the inception architecture for computer vision. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2818–2826, 2016.
Charlie Tang and Russ R Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel,
Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al. Stanley: The robot that won
the darpa grand challenge. Journal of field Robotics, 23(9):661–692, 2006.
Ekaterina Tolstaya, Reza Mahjourian, Carlton Downey, Balakrishnan Vadarajan, Benjamin Sapp,
and Dragomir Anguelov. Identifying driver interactions via conditional behavior prediction. 2021
IEEE International Conference on Robotics and Automation (ICRA), 2021.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon
Shlens. Scaling local self-attention for parameter efficient visual backbones. In CVPR, 2021.
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen.
Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In European Conference on
Computer Vision, pp. 108–126. Springer, 2020.
Waymo. Waymo open dataset challenge 2021 winners. https://waymo.com/open/
challenges#winners, 2021. Accessed: 2021-10-01.
Maosheng Ye, Tongyi Cao, and Qifeng Chen. Tpcn: Temporal point cloud networks for motion
forecasting. arXiv preprint arXiv:2103.03067, 2021.
Raymond A Yeh, Alexander G Schwing, Jonathan Huang, and Kevin Murphy. Diverse generation for
multi-agent sports games. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 4610–4619, 2019.
Cunjun Yu, Xiao Ma, Jiawei Ren, Haiyu Zhao, and Shuai Yi. Spatio-temporal graph transformer
networks for pedestrian trajectory prediction. In European Conference on Computer Vision, pp.
507–523. Springer, 2020.
Ye Yuan, Xinshuo Weng, Yanglan Ou, and Kris Kitani. Agentformer: Agent-aware transformers for
socio-temporal multi-agent forecasting. arXiv preprint arXiv:2103.14023, 2021.
Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun.
End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 8660–8669, 2019.
Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Küm-
merle, Hendrik Königshof, Christoph Stiller, Arnaud de La Fortelle, and Masayoshi Tomizuka.
INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in
Interactive Driving Scenarios with Semantic Maps. arXiv:1910.03088 [cs, eess], 2019.
Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen,
Yi Shen, Yuning Chai, Cordelia Schmid, et al. Tnt: Target-driven trajectory prediction. arXiv
preprint arXiv:2008.08294, 2020.
Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang,
and Ying Nian Wu. Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, pp.
12126–12134, 2019.
Julius Ziegler, Philipp Bender, Thao Dang, and Christoph Stiller. Trajectory planning for bertha—a
local, continuous method. In 2014 IEEE intelligent vehicles symposium proceedings, pp. 450–457.
IEEE, 2014.
Appendix
Overview. Table 5 provides additional details about the Scene Transformer architecture and training.
The bulk of the computation and parameters reside in the Transformer layers. Table 6 lists all of
the operations and learned parameters in our implementation of a Transformer. MLPs are employed
to embed the original data (e.g., layers A, B, C), build F futures from the encoding (layer T), and
predict the final position, headings and uncertainties of all agents (layers Z1 and Z2 ). The resulting
attention-based architecture using D = 256 feature dimensions contains 15,296,136 parameters.
Nearly all settings used for both WOMD and Argoverse are identical, except that for Argoverse we
use D = 512 and label smoothing to get our best marginal results. The resulting model is trained on
a TPU custom hardware accelerator (Jouppi et al., 2017) and converges in about 3 days of training.
Decoding multiple futures. In order to allow the decoder to output F distinct futures, we perform
the following operations. The decoder receives as input a tensor of shape [A, T, D] corresponding to
A agents across T time steps and D feature dimensions. The following series of simple operations
restructures the representation to predict F futures. First, the representation is tiled F times to
generate [F, A, T, D]. We append a one-hot encoding to the final dimension, where a 1 indicates
the identity of each future, resulting in a tensor of shape [F, A, T, D + F]. The one-hot
encoding allows the network to learn an embedding for each of the F futures. For computational
simplicity, the resulting representation is propagated through a small MLP to return a
tensor of shape [F, A, T, D].
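A NumPy sketch of the tiling and one-hot future encoding described above (the trailing MLP is omitted; names are illustrative):

```python
import numpy as np

def expand_futures(x, F):
    # x: [A, T, D] encoder output. Tile F times and append a one-hot future id;
    # in the real model a small MLP then maps back to D features per future.
    A, T, D = x.shape
    tiled = np.broadcast_to(x, (F, A, T, D))                     # [F, A, T, D]
    one_hot = np.broadcast_to(np.eye(F)[:, None, None, :],
                              (F, A, T, F))                      # [F, A, T, F]
    return np.concatenate([tiled, one_hot], axis=-1)             # [F, A, T, D+F]
```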
Padding and Hidden Masks. Padding and hidden masks are important to get right in the
implementation of such a model. In particular, we need to ensure that the masks do not convey additional future
information to the model (e.g., if the model knows which timesteps an agent is visible or occluded
based on the padding in the data, it may take advantage of this and not be able to generalize). We
use the concept of padding to indicate the positions where the input is absent or provided. This is
distinct from the hidden mask of shape [A, T ] that is used for task specification. The hidden mask is
used to query the model to inform it on which locations to predict, while the padding tells us which
locations have inputs and ground-truth to compute a loss over. All padded positions are set to be
hidden during preprocessing, regardless of the masking scheme, so the model tries to predict their
values. All layers in our implementation are padding aware, including normalization layers (like
batch normalization used in our encoding MLPs) and attention operations. Our attention operations
set the attention weights to 0 for padded elements. If after providing our task specific hidden mask,
no non-padded (valid) time steps exist for an agent, the whole agent is set to be padded. This prevents
agent slots with no valid data from being used in the model.
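As an example of padding-aware attention, the sketch below masks padded keys before the softmax and zeroes the output rows of padded queries; this is one common way to implement "attention weights set to 0 for padded elements", and details of the real implementation may differ.

```python
import numpy as np

def padding_aware_attention_weights(scores, is_padded):
    # scores: [N, N] raw attention logits; is_padded: [N] bool per element.
    # Padded keys are excluded from the softmax; padded queries produce zeros.
    scores = np.where(is_padded[None, :], -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return np.where(is_padded[:, None], 0.0, w)
```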
Predicting uncertainties. For each predicted variate such as position or heading, we additionally
predict a corresponding uncertainty. In early experiments we found that predicting a paired uncertainty
improves the predictive performance of the original variate and provides a useful signal for interpreting
the fidelity of the predicted value. We employ a loss function that predicts a parameterization of the
uncertainty corresponding to a Laplace distribution (Meyer & Thakurdesai, 2020).
Loss in lateral and longitudinal coordinates. We use a Laplace distribution parameterization for
the loss, with a diagonal covariance such that each axis is independent. To enable the model to learn
meaningful uncertainties, we rotate the scene per box prediction such that each prediction’s associated
ground-truth is axis aligned. This formulation results in uncertainty estimates that correspond to the
agent’s lateral and longitudinal coordinates.
Heading representation. Our output format is 7-dimensional, where one of the dimensions, heading,
is represented in radians. To supervise the loss on the heading dimension, we employ the now-standard
method of “wrapping” the angle difference between the predicted and ground-truth heading for each
agent; see Figure 4.
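A one-line sketch of the angle wrapping (headings assumed to be in radians):

```python
import numpy as np

def wrapped_heading_error(pred, gt):
    # Wrap the difference into [-pi, pi) so predictions on either side of the
    # +/-pi boundary incur a small error rather than one close to 2*pi.
    d = pred - gt
    return (d + np.pi) % (2.0 * np.pi) - np.pi
```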
Marginal predictions from a joint model. The model can also easily be adapted to make marginal
predictions for each agent. To produce per-trajectory scores for each agent's predictions, we attach an
artificial extra time step to the end of the [A, T, D] agent feature tensor, giving a tensor of shape
[A, T + 1, D]. This additional feature provides the model with extra capacity for representing each
agent that is not tied to any timestep, and can be used to predict which of the k trajectories most
closely matches the ground truth for that agent.

Table 4: WOMD validation set performance with and without data augmentation during training.
With Augmentation?    Vehicle minADE @8s    Ped minADE @8s    Cyclist minADE @8s
No                    1.30                  0.69              1.29
Yes                   1.18                  0.59              1.15
Data augmentation. We use two methods for data augmentation to aid generalization and combat
overfitting on both datasets. First, we use agent dropout, where we artificially remove non-predicted
agents with some probability. We found a probability of 0.1 worked well for both datasets. We also
found it beneficial to randomly rotate the entire scene between [−π/2, π/2] after centering on the agent of
interest. On Argoverse, this agent of interest is the agent the task designates to be predicted, whereas
on the WOMD we always center around the autonomous vehicle (AV), even though for some tasks
it is not one of the predicted agents. Lastly, each Argoverse scene contains many agents that the
model is not required to predict; we employ these contextual agents as additional target training data
if they moved by at least 6m. On WOMD, Table 4 shows that data augmentation on the
standard validation set does improve minADE modestly, though it does not account for the majority of
the benefit of the model.
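A sketch of the two augmentations (agent dropout and random scene rotation); the dropout probability and rotation range follow the text, while function and variable names are illustrative.

```python
import numpy as np

def augment(agent_xy, is_predicted, drop_prob=0.1, rng=np.random):
    # agent_xy: [A, T, 2] positions already centered on the agent of interest;
    # is_predicted: [A] bool flags for agents that must be predicted.
    # Agent dropout: randomly remove non-predicted agents.
    keep = is_predicted | (rng.rand(len(agent_xy)) > drop_prob)
    agent_xy = agent_xy[keep]
    # Random global rotation in [-pi/2, pi/2] about the scene origin.
    theta = rng.uniform(-np.pi / 2, np.pi / 2)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return agent_xy @ rot.T, keep
```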
Argoverse classification label smoothing. Our best model on Argoverse uses D = 512 feature
dimensions, but naively scaling the model this way leads to severe overfitting on the classification
subtask. During training of our best model we employ label smoothing (0.1 · 1/6 for negatives and
0.9 + 0.1 · 1/6 for the positive target).
WOMD redundant trajectory combination. The AP evaluation metric on the WOMD expects that
exactly one of the trajectories is given a high probability and the other trajectories a low probability.
For example, the future prediction for a stationary agent is expected to have only one future with
a high probability score, with a low score for the rest – even though the trajectories may all be the
same, in this case, stationary. Our best model combines redundant trajectories together if they are
close in spatial location (less than 3.2m) at the final timestep.
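One plausible implementation of this combination step is a greedy merge by final-timestep distance; the text does not specify how the probabilities of merged trajectories are combined, so summing them here is an assumption.

```python
import numpy as np

def combine_redundant(trajs, probs, radius=3.2):
    # trajs: [K, T, 2], probs: [K]. Greedily merge trajectories whose final
    # positions lie within `radius` meters, accumulating their probabilities.
    order = np.argsort(-probs)
    kept, kept_p = [], []
    for k in order:
        end = trajs[k, -1]
        for i, j in enumerate(kept):
            if np.linalg.norm(end - trajs[j, -1]) < radius:
                kept_p[i] += probs[k]   # assumption: sum merged probabilities
                break
        else:
            kept.append(k)
            kept_p.append(probs[k])
    return trajs[kept], np.array(kept_p)
```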
Embedding of agents and road graph. To generate input features, we use sinusoidal positional
embeddings (Vaswani et al., 2017) to embed the time (for agents and the dynamic roadgraph) and
xyz-coordinates separately into D-dimensional features per dimension. We encode the type of
each object using a one-hot encoding (e.g. object type, lane type, etc), and concatenate any other
features provided in the data set such as yaw, width, length, height, and velocity. Dynamic road
graphs have a second one-hot encoding indicating state, like the traffic light state. Lastly, for agents
and the dynamic road graph, we add a binary indicator on whether the agent is hidden at that time
step. If the agent or dynamic road graph element is hidden, all input features (e.g. position, type,
velocity, state, etc.) are set to 0 before encoding, except for the time embedding (linearly spaced
values at the dataset's update rate starting at 0) and the hidden indicator.
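As a concrete example of the per-dimension sinusoidal embedding, a sketch in the style of Vaswani et al. (2017); the base frequency of 10000 and the exact feature layout are assumptions.

```python
import numpy as np

def sinusoidal_embedding(x, dim):
    # Embed a scalar coordinate (e.g., one of x/y/z or time) into `dim` features
    # using sine/cosine pairs at geometrically spaced frequencies (`dim` even).
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.asarray(x)[..., None] * freqs                          # [..., half]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # [..., dim]
```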
For agents and the dynamic road graph, we use a 2 layer MLP with a hidden and output dimension of
D to produce a final feature per agent or object and per time step. For the static road graph, we must
reduce a point cloud of up to 20,000 road graph points, each belonging to a smaller set of polylines,
to a single vector per polyline. Because some lanes can be extremely long, we break up any polyline
longer than 20 points into a new set of smaller polylines. Then, we apply a small PointNet (Qi et al.,
2017) architecture with a 2 layer MLP with a hidden and output dimension of D to each point, and
use max pooling per polyline to get a final feature per element.
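A minimal sketch of this per-polyline PointNet reduction (a shared 2-layer MLP applied per point, followed by max pooling over the polyline); the weights would be learned in practice.

```python
import numpy as np

def polyline_feature(points, w1, b1, w2, b2):
    # points: [N, 3] xyz for one polyline; w1: [3, D], w2: [D, D], b1/b2: [D].
    h = np.maximum(points @ w1 + b1, 0.0)   # shared per-point MLP layer 1, [N, D]
    h = np.maximum(h @ w2 + b2, 0.0)        # shared per-point MLP layer 2, [N, D]
    return h.max(axis=0)                    # max pool over the polyline, [D]
```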
Inference latency. Scene Transformer produces a prediction for every agent in the scene in a single
pass. Its inference speed depends on the number of agents in the scene. Our preliminary profiling of
inference speed ranged from 52 ms (32 agents) to 175 ms (128 agents) on the Waymo Open Motion
Dataset (128 agents, 91 timesteps for each agent, up to 1400 roadgraph elements). This was measured
on an Nvidia V100 using a standard TensorFlow inference graph in float32 without optimization
tricks, optimization tuning, etc. This is in line with the expected linear scaling of the factorized
attention modules.
Meta-Arch Name Input Operation Queries Keys/Values Across Atten Matrix Output Size # Param
Encoder
A Agents MLP + BN – – – – [A, T, D] 334080
B Dyna RG MLP + BN – – – – [GD , T, D] 337408
C Static RG MLP + BN – – – – [GS , T, D] 270592
Decoder
R Q Tile – – – – [F, A, T, D] 0
S R Concat – – – – [F, A, T, D+F ] 0
T S MLP – – – – [F, A, T, D] 68096
Table 5: Scene Transformer architecture and training details. The network receives as input
A agents across T time steps and K features. K is the total number of input features (e.g., 3-D
position, velocity, object type, bounding box size). A subset of these inputs are masked. GS and GD
are the maximum numbers of road graph (RG) elements and D is the total number of features. MLP and
BN denote multilayer perceptron and batch normalization (Ioffe & Szegedy, 2015), respectively. The
output of the network is Z1 and Z2 . Z1 corresponds to predicting the logits for classifying which one
of the F futures is most likely. Z2 corresponds to the predicted xyz-coordinates with their associated
uncertainties, and a single value for heading. In our model D = 256, K = 7 and F = 6 for a total of
15,296,136 parameters for both datasets. For the Waymo Open Motion Dataset Gs = 1400, Gd = 16,
T = 91, A = 128, and for Argoverse Gs = 256, Gd = 0, T = 50, and A = 64. All layers employ
ReLU nonlinearities.
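For intuition on the decoder rows R–T above (Tile, Concat, MLP), here is a minimal sketch of how the per-agent features could be replicated across the F candidate futures. Reading the [F, A, T, D+F] shape as a concatenated one-hot future indicator is our own interpretation, and the tensor names and the single Dense layer standing in for the MLP are illustrative, not the released implementation.

import tensorflow as tf

def tile_futures(q, num_futures=6, dim=256):
  # q: [A, T, D] encoded scene features (the decoder input "Q" in Table 5).
  a, t = tf.shape(q)[0], tf.shape(q)[1]
  r = tf.tile(q[tf.newaxis], [num_futures, 1, 1, 1])          # R: [F, A, T, D]
  # Concatenate a per-future one-hot indicator so each copy can specialize.
  onehot = tf.eye(num_futures)[:, tf.newaxis, tf.newaxis, :]  # [F, 1, 1, F]
  onehot = tf.tile(onehot, tf.stack([1, a, t, 1]))            # [F, A, T, F]
  s = tf.concat([r, onehot], axis=-1)                         # S: [F, A, T, D+F]
  return tf.keras.layers.Dense(dim)(s)                        # T: [F, A, T, D]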
Table 6: Transformer architecture. The network receives as input Xo and outputs Z. All MLPs
employ a ReLU nonlinearity. D is the number of feature dimensions; H is the number of attention
heads. In our model D = 256, H = 4, and k = 4.
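As an illustration of the factorized self-attention used in the transformer blocks, the sketch below applies multi-head self-attention along a single axis (time or agents) of the scene tensor; the helper name, the use of tf.keras.layers.MultiHeadAttention, and the omission of layer normalization and feed-forward sub-layers are our own simplifications rather than the released implementation.

import tensorflow as tf

def axis_self_attention(x, axis, num_heads=4, dim=256):
  # x: [A, T, D]. axis='time' attends across T independently per agent;
  # axis='agents' attends across A independently per time step.
  if axis == 'agents':
    x = tf.transpose(x, [1, 0, 2])                  # [T, A, D]
  mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                           key_dim=dim // num_heads)
  out = x + mha(x, x)                               # residual self-attention
  if axis == 'agents':
    out = tf.transpose(out, [1, 0, 2])
  return out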
Table 7: Marginal predictive performance on Waymo Open Motion Dataset motion prediction
for t = 3 seconds. Please see Table 2 for details.
Table 8: Joint predictive performance on Waymo Open Motion Dataset motion prediction for
t = 3 seconds. Please see Table 3 for details.
Table 9: Marginal predictive performance on Waymo Open Motion Dataset motion prediction
for t = 5 seconds. Please see Table 2 for details.
Table 10: Joint predictive performance on Waymo Open Motion Dataset motion prediction for
t = 5 seconds. Please see Table 3 for details.
Table 11: Marginal predictive performance on Waymo Open Motion Dataset motion prediction
for t = 8 seconds. Please see Table 2 for details. We additionally include "joint-as-marginal" results
for our standard MP-masked joint model and our multi-task joint model. These are joint models
evaluated as if they were marginal models (with no changes to the outputs).
Table 12: Joint predictive performance on Waymo Open Motion Dataset motion prediction for
t = 8 seconds. These results are the same as in Table 3 but are included here for completeness
alongside the other Appendix tables.
B.1 UNDERSTANDING THE TRADE-OFFS BETWEEN THE JOINT AND MARGINAL MODELS
Figure 5: Quantitative comparison of the marginal and joint prediction models. Left: The
marginal minADE broken down as a function of the number of predicted agents. When there are more
agents, producing internally consistent predictions is more challenging, and hence the joint model
performs slightly worse. Right: Overlap rate between pairs of agent predictions. The joint model
produces internally consistent predictions with lower inter-prediction overlap rates.
While the joint mp-only model performs better on the interactive prediction task than the marginal
model when evaluated with joint metrics, it performs worse on the motion prediction task when
evaluated with marginal metrics (see Table 11, "ours (joint)"). This is because the joint model is
trained to produce predictions that are internally consistent across agents, a strictly harder task,
whereas the marginal model is not. In this section, we examine this quality difference and the
internal consistency of the predictions from both models.
In the WOMD dataset, each example has a different number of agents to be predicted. By slicing
the marginal minADE results by the number of agents to be predicted (Figure 5, left), we find
that the joint model performs worse as the number of predicted agents grows, while the marginal
model performs the same. This is expected: when there are more agents, the joint model has a harder
task since it needs to produce internally consistent predictions. One might expect a joint model to need
an exponential number of trajectory combinations to perform competitively with the marginal model,
but we are encouraged to find that this is not the case. We believe this is because, when many agents
are interacting, the number of realistic scenarios is heavily constrained by those interactions: most
combinations of marginal predictions do not make sense together.
However, when more agents are to be predicted, interactions become more likely, and we would
expect the joint model to capture these interactions through internally consistent predictions. We
measure the models' internal consistency by selecting the best trajectory and measuring the
inter-prediction overlap rate (details below). We find that the joint model has a consistently lower
inter-prediction overlap rate, showing that it is able to capture the interactions between agents. The
ability to model interactions makes the model suitable for conditional motion prediction and
goal-conditioned prediction tasks, which we discuss in Section 4.4.
Measuring agent inter-prediction overlap. We measure predicted agent overlap on the best set of
trajectories produced by the model. For the joint model, this corresponds to the joint prediction with
the highest probability score. For the marginal model, we take the top-scoring trajectory for each
agent to obtain the best set. For every predicted agent in this set, we determine whether it overlaps
with any other predicted agent by comparing the rotated upright 3D bounding boxes constructed
from the predicted xyz and heading. The inter-prediction overlap rate is the number of predicted
agents involved in some overlap, divided by the total number of predicted agents. We count two
agents as overlapping if their intersection over union (IoU) exceeds 0.01.
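A minimal sketch of this overlap computation is given below. As a simplification, it treats the upright 3D boxes as yaw-rotated rectangles in the ground plane at a single time step (ignoring the z extent and the aggregation over the full trajectory); the use of shapely and the function names are our own illustrative choices.

import itertools
import numpy as np
from shapely.geometry import Polygon

def box_polygon(x, y, length, width, yaw):
  # Footprint of an upright box as a yaw-rotated rectangle in the x-y plane.
  dx, dy = length / 2.0, width / 2.0
  corners = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]])
  rot = np.array([[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]])
  return Polygon(corners @ rot.T + np.array([x, y]))

def inter_prediction_overlap_rate(boxes, iou_threshold=0.01):
  # boxes: list of (x, y, length, width, yaw) for each predicted agent,
  # taken from the best (highest-scoring) set of trajectories.
  polys = [box_polygon(*b) for b in boxes]
  overlapping = set()
  for i, j in itertools.combinations(range(len(polys)), 2):
    inter = polys[i].intersection(polys[j]).area
    union = polys[i].union(polys[j]).area
    if union > 0 and inter / union > iou_threshold:
      overlapping.update([i, j])
  return len(overlapping) / max(len(polys), 1)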
C SLICING RESULTS
Figure 6: Marginal minADE on a per-scene basis, broken down by different scene-level
statistics. Interestingly, both the marginal and joint models degrade above 20 m/s, where there is
minimal training data and agents are effectively out of distribution.
Figure 7: Analysis of the multi-task model for goal-conditioned motion prediction. Left: Cumu-
lative probability of the AV minADE for the goal-conditioned prediction (blue) and motion prediction
(black) masking strategies. Right: AV minADE for goal-conditioned prediction (blue) and motion
prediction (black) as a function of AV speed (averaged over the ground-truth trajectory).
D LOSS IMPLEMENTATION
The Scene Transformer model is both agent-permutation equivariant and scene-centric. These
properties allow us to easily switch between a marginal and a joint loss formulation (Figure 8). In
the marginal formulation, we reduce the loss over the best future for every agent separately; in
the joint formulation, we reduce the loss over the best future for all agents jointly. In practice, this
determines when the reduce_min operation is performed.
Our final loss combines the regression loss on the trajectory with the classification loss for selecting
the best trajectory, using a weighted linear combination of the two terms. The classification loss
weight was set to 0.1, while the regression losses have weight 1.0. The weights were determined
using the held-out validation set.
import tensorflow as tf

# agent_predictions and agent_gt are [F, A, T, 7] Tensors containing
# [x, y, z], uncertainty terms, and yaw.
# loss is assumed to be an [F, A] Tensor of per-agent losses computed from
# them, already reduced over the T time steps and the output dimensions.

# Marginal loss: keep only the best future per agent, i.e. take the min
# across the future dimension F independently for each agent.
marginal_loss = tf.reduce_min(loss, axis=0)

# Joint loss: sum over all agents to get one loss value per future; the min
# over the F futures is then taken jointly for the whole scene.
joint_loss = tf.reduce_sum(loss, axis=1)
Figure 8: Pseudo-code in TensorFlow (Abadi et al., 2015) demonstrating the joint versus marginal
loss formulation.
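For completeness, the weighted combination and the placement of the reduce_min described above can be sketched as follows; the function name, the mean over agents in the marginal case, and the explicit min over futures in the joint case are illustrative choices consistent with the text rather than the released implementation.

import tensorflow as tf

def total_loss(regression_loss, classification_loss, marginal=True):
  # regression_loss: [F, A] per-agent regression loss (reduced over time).
  # classification_loss: classification loss for selecting the best future.
  if marginal:
    # Best future per agent, then average over agents.
    reg = tf.reduce_mean(tf.reduce_min(regression_loss, axis=0))
  else:
    # Sum over agents per future, then best future for the whole scene.
    reg = tf.reduce_min(tf.reduce_sum(regression_loss, axis=1), axis=0)
  # Weights from the text: 1.0 for regression, 0.1 for classification.
  return 1.0 * reg + 0.1 * tf.reduce_mean(classification_loss)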