networks to extract spatial and temporal features from the input jointly. Moreover, within narrow constraints or even the complete absence of spatial attributes, the representative ability of these networks would be seriously hindered.

To take full advantage of spatial features, some researchers use convolutional neural networks (CNN) to capture adjacent relations in the traffic network, along with employing recurrent neural networks (RNN) on the time axis. By combining a long short-term memory (LSTM) network [Hochreiter and Schmidhuber, 1997] and a 1-D CNN, Wu and Tan [2016] presented a feature-level fused architecture, CLTFP, for short-term traffic forecasting. Although it adopted a straightforward strategy, CLTFP still made the first attempt to align spatial and temporal regularities. Afterwards, Shi et al. [2015] proposed the convolutional LSTM, an extended fully-connected LSTM (FC-LSTM) with embedded convolutional layers. However, the normal convolutional operation applied restricts the model to processing grid structures (e.g. images, videos) rather than general domains. Meanwhile, recurrent networks for sequence learning require iterative training, which introduces error accumulation step by step. Additionally, RNN-based networks (including LSTM) are widely known to be difficult to train and computationally heavy.

To overcome these issues, we introduce several strategies to effectively model the temporal dynamics and spatial dependencies of traffic flow. To fully utilize spatial information, we model the traffic network as a general graph instead of treating it separately (e.g. as grids or segments). To handle the inherent deficiencies of recurrent networks, we employ a fully convolutional structure on the time axis. Above all, we propose a novel deep learning architecture, the spatio-temporal graph convolutional network, for traffic forecasting tasks. This architecture comprises several spatio-temporal convolutional blocks, which combine graph convolutional layers [Defferrard et al., 2016] and convolutional sequence learning layers, to model spatial and temporal dependencies. To the best of our knowledge, this is the first time that purely convolutional structures have been applied to extract spatio-temporal features simultaneously from graph-structured time series in a traffic study. We evaluate our proposed model on two real-world traffic datasets. Experiments show that our framework outperforms existing baselines in prediction tasks with multiple preset prediction lengths and network scales.

2 Preliminary

2.1 Traffic Prediction on Road Graphs
Traffic forecasting is a typical time-series prediction problem, i.e. predicting the most likely traffic measurements (e.g. speed or traffic flow) in the next $H$ time steps given the previous $M$ traffic observations, as
$$\hat{v}_{t+1}, \ldots, \hat{v}_{t+H} = \arg\max_{v_{t+1}, \ldots, v_{t+H}} \log P(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t), \quad (1)$$
where $v_t \in \mathbb{R}^n$ is an observation vector of $n$ road segments at time step $t$, each element of which records the historical observation for a single road segment.

In this work, we define the traffic network on a graph and focus on structured traffic time series. The observation $v_t$ is not independent but linked by pairwise connections in the graph. Therefore, the data point $v_t$ can be regarded as a graph signal defined on an undirected graph (or a directed one) $\mathcal{G}$ with weights $w_{ij}$, as shown in Figure 1. At the $t$-th time step, in graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}, W)$, $\mathcal{V}_t$ is a finite set of vertices, corresponding to the observations from $n$ monitoring stations in a traffic network; $\mathcal{E}$ is a set of edges, indicating the connectedness between stations; and $W \in \mathbb{R}^{n \times n}$ denotes the weighted adjacency matrix of $\mathcal{G}_t$.

Figure 1: Graph-structured traffic data. Each $v_t$ indicates a frame of the current traffic status at time step $t$, which is recorded in a graph-structured data matrix.
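In practice, the formulation of Eq. (1) amounts to sliding a window of $M$ past frames over the series and pairing it with the following $H$ frames. A minimal NumPy sketch of this windowing, assuming the records sit in a plain array of shape $(T, n)$; the array contents, window sizes, and function name below are illustrative choices, not taken from the paper:

```python
import numpy as np

def make_windows(series: np.ndarray, M: int, H: int):
    """Slice a (T, n) traffic series into (X, Y) pairs.

    X has shape (num_samples, M, n): the M past observations v_{t-M+1}, ..., v_t.
    Y has shape (num_samples, H, n): the H future observations v_{t+1}, ..., v_{t+H}.
    """
    T = series.shape[0]
    xs, ys = [], []
    for t in range(M - 1, T - H):
        xs.append(series[t - M + 1 : t + 1])   # v_{t-M+1}, ..., v_t
        ys.append(series[t + 1 : t + 1 + H])   # v_{t+1}, ..., v_{t+H}
    return np.stack(xs), np.stack(ys)

# Example: 12 past 5-minute frames (one hour) paired with the next 9 frames (45 min).
speeds = np.random.rand(288, 228)              # one day of records for 228 stations
X, Y = make_windows(speeds, M=12, H=9)
print(X.shape, Y.shape)                        # (268, 12, 228) (268, 9, 228)
```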
2.2 Convolutions on Graphs
A standard convolution for regular grids is clearly not applicable to general graphs. Two basic approaches currently explore how to generalize CNNs to structured data forms: one expands the spatial definition of a convolution [Niepert et al., 2016], and the other manipulates the data in the spectral domain with graph Fourier transforms [Bruna et al., 2013]. The former rearranges the vertices into certain grid forms that can be processed by normal convolutional operations. The latter introduces a spectral framework to apply convolutions in the spectral domain, often named the spectral graph convolution. Several follow-up studies make the graph convolution more promising by reducing its computational complexity from $O(n^2)$ to linear [Defferrard et al., 2016; Kipf and Welling, 2016].

We introduce the graph convolution operator "$*_\mathcal{G}$" based on the notion of spectral graph convolution, as the multiplication of a signal $x \in \mathbb{R}^n$ with a kernel $\Theta$,
$$\Theta *_\mathcal{G} x = \Theta(L)x = \Theta(U \Lambda U^{T})x = U \Theta(\Lambda) U^{T} x, \quad (2)$$
where the graph Fourier basis $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_n - D^{-\frac{1}{2}} W D^{-\frac{1}{2}} = U \Lambda U^{T} \in \mathbb{R}^{n \times n}$ ($I_n$ is an identity matrix and $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$); $\Lambda \in \mathbb{R}^{n \times n}$ is the diagonal matrix of eigenvalues of $L$, and the filter $\Theta(\Lambda)$ is also a diagonal matrix. By this definition, a graph signal $x$ is filtered by a kernel $\Theta$ through multiplication between $\Theta$ and the graph Fourier transform $U^{T} x$ [Shuman et al., 2013].
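To make Eq. (2) concrete, the following NumPy sketch carries a signal through the full spectral pipeline: build the normalized Laplacian, take its eigendecomposition as the graph Fourier basis, scale each frequency component, and transform back. The diagonal filter values `theta` stand in for the learned kernel $\Theta(\Lambda)$; none of this is the authors' implementation.

```python
import numpy as np

def normalized_laplacian(W: np.ndarray) -> np.ndarray:
    """L = I_n - D^{-1/2} W D^{-1/2} for a weighted adjacency matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return np.eye(W.shape[0]) - np.diag(d_inv_sqrt) @ W @ np.diag(d_inv_sqrt)

def spectral_graph_conv(x: np.ndarray, W: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Eq. (2): U diag(theta) U^T x, with U the eigenvectors of L."""
    L = normalized_laplacian(W)
    _, U = np.linalg.eigh(L)        # graph Fourier basis (L is symmetric)
    x_hat = U.T @ x                 # graph Fourier transform of the signal
    return U @ (theta * x_hat)      # filter each frequency, then transform back

n = 5
W = np.random.rand(n, n)
W = (W + W.T) / 2                   # symmetric weights for an undirected graph
np.fill_diagonal(W, 0.0)
y = spectral_graph_conv(np.random.rand(n), W, theta=np.random.rand(n))
```

The explicit eigendecomposition is what makes this naive operator expensive, which is exactly the cost that the polynomial approximations cited above avoid.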
3 Proposed Model

3.1 Network Architecture
In this section, we elaborate on the proposed architecture of spatio-temporal graph convolutional networks (STGCN). As shown in Figure 2, STGCN is composed of several spatio-temporal convolutional blocks, each of which is formed as a "sandwich" structure with two gated sequential convolution layers and one spatial graph convolution layer in between. The details of each module are described as follows.

[Figure 2: Architecture of spatio-temporal graph convolutional networks. Left: the input $(v_{t-M+1}, \ldots, v_t)$ passes through two ST-Conv blocks and an output layer to produce $\hat{v}$. Middle: an ST-Conv block stacks a temporal gated convolution ($C = 64$), a spatial graph convolution ($C = 16$), and a second temporal gated convolution ($C = 64$). Right: a temporal gated convolution, i.e. a 1-D convolution followed by a GLU.]

By the Chebyshev polynomial approximation, the cost of Eq. (2) can be reduced to $O(K|\mathcal{E}|)$, as Eq. (3) shows [Defferrard et al., 2016].

1st-order Approximation. A layer-wise linear formulation can be defined by stacking multiple localized graph convolutional layers with the first-order approximation of the graph Laplacian [Kipf and Welling, 2016]. Consequently, a deeper architecture can be constructed to recover spatial information in depth without being limited to the explicit parameterization given by the polynomials. Due to the scaling and normalization in neural networks, we can further assume that $\lambda_{\max} \approx 2$, so that Eq. (3) can be simplified accordingly.
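A minimal NumPy sketch of such a first-order graph convolution, written in the renormalized form popularized by Kipf and Welling [2016]; this is an illustrative stand-in for the simplified kernel rather than the paper's exact equation:

```python
import numpy as np

def first_order_graph_conv(X: np.ndarray, W: np.ndarray, Theta: np.ndarray) -> np.ndarray:
    """One first-order (Kipf-and-Welling-style) graph convolution layer.

    X:     (n, C_in)     node signal with C_in channels
    W:     (n, n)        weighted adjacency matrix
    Theta: (C_in, C_out) trainable channel-mixing weights
    """
    n = W.shape[0]
    W_tilde = W + np.eye(n)                     # add self-loops (renormalization trick)
    d_inv_sqrt = np.diag(W_tilde.sum(axis=1) ** -0.5)
    A_hat = d_inv_sqrt @ W_tilde @ d_inv_sqrt   # D~^{-1/2} (W + I_n) D~^{-1/2}
    # Each layer mixes only 1-hop neighbors; stacking K such layers recovers a
    # K-hop receptive field without the explicit polynomial parameterization.
    return A_hat @ X @ Theta

n, c_in, c_out = 228, 1, 16
W = np.random.rand(n, n)
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
out = first_order_graph_conv(np.random.rand(n, c_in), W, np.random.rand(c_in, c_out))
print(out.shape)                                # (228, 16)
```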
3.3 Gated CNNs for Extracting Temporal Features
Although RNN-based models have become widespread in time-series analysis, recurrent networks for traffic prediction still suffer from time-consuming iterations, complex gate mechanisms, and slow response to dynamic changes. In contrast, CNNs have the advantages of fast training, simple structures, and no dependency constraints on previous steps. Inspired by [Gehring et al., 2017], we employ entirely convolutional structures on the time axis to capture the temporal dynamic behavior of traffic flows. This specific design allows parallel and controllable training procedures through multi-layer convolutional structures formed as hierarchical representations.

As Figure 2 (right) shows, the temporal convolutional layer contains a 1-D causal convolution with a width-$K_t$ kernel followed by gated linear units (GLU) as a non-linearity. For each node in graph $\mathcal{G}$, the temporal convolution explores $K_t$ neighbors of the input elements without padding, which shortens the length of the sequence by $K_t - 1$ each time. Thus, the input of the temporal convolution for each node can be regarded as a length-$M$ sequence with $C_i$ channels, $Y \in \mathbb{R}^{M \times C_i}$. The convolution kernel $\Gamma \in \mathbb{R}^{K_t \times C_i \times 2C_o}$ is designed to map the input $Y$ to a single output element $[P\ Q] \in \mathbb{R}^{(M - K_t + 1) \times (2C_o)}$ ($P$ and $Q$ split the channels in half with the same size). As a result, the temporal gated convolution can be defined as
$$\Gamma *_\mathcal{T} Y = P \odot \sigma(Q) \in \mathbb{R}^{(M - K_t + 1) \times C_o}, \quad (7)$$
where $P$, $Q$ are the inputs of the gates in the GLU, respectively, and $\odot$ denotes the element-wise Hadamard product. The sigmoid gate $\sigma(Q)$ controls which parts of the input $P$ of the current states are relevant for discovering compositional structure and dynamic variances in the time series. The non-linearity gates also contribute to exploiting the full input field through stacked temporal layers. Furthermore, residual connections are implemented among the stacked temporal convolutional layers. Similarly, the temporal convolution can be generalized to 3-D variables by applying the same convolution kernel $\Gamma$ to every node $Y_i \in \mathbb{R}^{M \times C_i}$ (e.g. sensor stations) in $\mathcal{G}$ equally, denoted as "$\Gamma *_\mathcal{T} \mathcal{Y}$" with $\mathcal{Y} \in \mathbb{R}^{M \times n \times C_i}$.
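A compact PyTorch-style sketch of the gated temporal convolution in Eq. (7). The module below is one reading of the description above (a shared 1-D kernel along time, no padding, channels split into $P$ and $Q$, then $P \odot \sigma(Q)$); the tensor layout (batch, channels, nodes, time) and all sizes are illustrative choices, not the authors' released code, and the residual connection mentioned in the text is omitted for brevity.

```python
import torch
import torch.nn as nn

class TemporalGatedConv(nn.Module):
    """1-D convolution along time followed by a GLU, applied per node (Eq. (7))."""

    def __init__(self, c_in: int, c_out: int, kt: int):
        super().__init__()
        # A (1, kt) kernel slides only along the time axis, shared across all n nodes.
        self.conv = nn.Conv2d(c_in, 2 * c_out, kernel_size=(1, kt))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, c_in, n, M) -> (batch, 2*c_out, n, M - kt + 1), no padding.
        pq = self.conv(x)
        p, q = pq.chunk(2, dim=1)          # split channels into P and Q
        return p * torch.sigmoid(q)        # GLU: P ⊙ σ(Q)

# Example: a batch of 8 graphs with 228 nodes, 12 time steps, 1 input channel.
x = torch.randn(8, 1, 228, 12)
out = TemporalGatedConv(c_in=1, c_out=64, kt=3)(x)
print(out.shape)                           # torch.Size([8, 64, 228, 10])
```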
3.4 Spatio-temporal Convolutional Block
In order to fuse features from both the spatial and temporal domains, the spatio-temporal convolutional block (ST-Conv block) is constructed to jointly process graph-structured time series. The block itself can be stacked or extended based on the scale and complexity of particular cases.

As illustrated in Figure 2 (mid), the spatial layer in the middle bridges the two temporal layers, which achieves fast spatial-state propagation from the graph convolution through the temporal convolutions. The "sandwich" structure also helps the network apply a bottleneck strategy to achieve scale compression and feature squeezing by downscaling and upscaling of the channels $C$ through the graph convolutional layer. Moreover, layer normalization is utilized within every ST-Conv block to prevent overfitting.

The input and output of ST-Conv blocks are all 3-D tensors. For the input $v^l \in \mathbb{R}^{M \times n \times C^l}$ of block $l$, the output $v^{l+1} \in \mathbb{R}^{(M - 2(K_t - 1)) \times n \times C^{l+1}}$ is computed by
$$v^{l+1} = \Gamma_1^l *_\mathcal{T} \operatorname{ReLU}\!\left(\Theta^l *_\mathcal{G} \left(\Gamma_0^l *_\mathcal{T} v^l\right)\right), \quad (8)$$
where $\Gamma_0^l$, $\Gamma_1^l$ are the upper and lower temporal kernels within block $l$, respectively; $\Theta^l$ is the spectral kernel of the graph convolution; and $\operatorname{ReLU}(\cdot)$ denotes the rectified linear unit.

After stacking two ST-Conv blocks, we attach an extra temporal convolution layer with a fully-connected layer as the output layer at the end (see the left of Figure 2). The temporal convolution layer maps the outputs of the last ST-Conv block to a single-step prediction. We then obtain a final output $Z \in \mathbb{R}^{n \times c}$ from the model and calculate the speed prediction for the $n$ nodes by applying a linear transformation across the $c$ channels as $\hat{v} = Zw + b$, where $w \in \mathbb{R}^c$ is a weight vector and $b$ is a bias. We use the L2 loss to measure the performance of our model. Thus, the loss function of STGCN for traffic prediction can be written as
$$\mathcal{L}(\hat{v}; W_\theta) = \sum_t \left\| \hat{v}(v_{t-M+1}, \ldots, v_t, W_\theta) - v_{t+1} \right\|^2, \quad (9)$$
where $W_\theta$ denotes all trainable parameters in the model; $v_{t+1}$ is the ground truth and $\hat{v}(\cdot)$ denotes the model's prediction.
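Read literally, Eq. (8) composes the operators defined above: a temporal gated convolution, a graph convolution with ReLU, a second temporal gated convolution, and layer normalization. The PyTorch-style sketch below follows that reading; the dense normalized adjacency `a_hat` stands in for the spectral kernel $\Theta^l$, and every size and name is an illustrative assumption rather than the reference implementation.

```python
import torch
import torch.nn as nn

class STConvBlock(nn.Module):
    """ST-Conv block of Eq. (8): temporal gated conv -> graph conv + ReLU ->
    temporal gated conv, followed by layer normalization.
    Tensor layout: (batch, channels, n_nodes, time)."""

    def __init__(self, n_nodes: int, c_in: int, c_spatial: int, c_out: int, kt: int):
        super().__init__()
        self.t_conv0 = nn.Conv2d(c_in, 2 * c_spatial, kernel_size=(1, kt))   # Γ_0^l
        self.theta = nn.Linear(c_spatial, c_spatial, bias=False)             # channel mixing for Θ^l
        self.t_conv1 = nn.Conv2d(c_spatial, 2 * c_out, kernel_size=(1, kt))  # Γ_1^l
        self.norm = nn.LayerNorm([n_nodes, c_out])

    @staticmethod
    def _glu(z: torch.Tensor) -> torch.Tensor:
        p, q = z.chunk(2, dim=1)
        return p * torch.sigmoid(q)                          # Eq. (7)

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # a_hat: (n, n) normalized adjacency standing in for the spectral graph kernel.
        h = self._glu(self.t_conv0(x))                       # Γ_0^l *T v^l
        h = torch.einsum("ij,bcjt->bcit", a_hat, h)          # aggregate over neighboring nodes
        h = torch.relu(self.theta(h.permute(0, 3, 2, 1)))    # mix channels, ReLU; (b, t, n, c)
        h = h.permute(0, 3, 2, 1)                            # back to (b, c, n, t)
        h = self._glu(self.t_conv1(h))                       # Γ_1^l *T (...)
        return self.norm(h.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)   # layer norm over (n, c)

# With kt = 3, each block trims the 12-step window by 2*(kt - 1) = 4 steps, as in Eq. (8).
n, a_hat = 228, torch.eye(228)                               # identity adjacency, shape demo only
block = STConvBlock(n_nodes=n, c_in=1, c_spatial=16, c_out=64, kt=3)
out = block(torch.randn(8, 1, n, 12), a_hat)
print(out.shape)                                             # torch.Size([8, 64, 228, 8])
```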
We now summarize the main characteristics of our model STGCN as follows:
• STGCN is a universal framework for processing structured time series. It is not only able to tackle traffic network modeling and prediction tasks but can also be applied to more general spatio-temporal sequence learning tasks.
• The spatio-temporal block combines graph convolutions and gated temporal convolutions, which can extract the most useful spatial features and capture the most essential temporal features coherently.
• The model is entirely composed of convolutional structures and therefore achieves parallelization over the input, with fewer parameters and faster training speed. More importantly, this economical architecture allows the model to handle large-scale networks with more efficiency.

4 Experiments

4.1 Dataset Description
We verify our model on two real-world traffic datasets, BJER4 and PeMSD7, collected by the Beijing Municipal Traffic Commission and the California Department of Transportation, respectively. Each dataset contains key attributes of traffic observations and geographic information with corresponding timestamps, as detailed below.

BJER4 was gathered from the major areas of the east Ring Road No. 4 routes in Beijing by double-loop detectors. There are 12 roads selected for our experiment. The traffic data are aggregated every 5 minutes. The time period used is from 1st July to 31st August, 2014, excluding weekends. We select the first month of historical speed records as the training set, and the rest serves as the validation and test sets, respectively.

PeMSD7 was collected from the Caltrans Performance Measurement System (PeMS) in real time by over 39,000 sensor stations deployed across the major metropolitan areas of the California state highway system [Chen et al., 2001]. The dataset is also aggregated into 5-minute intervals from 30-second data samples. We randomly select a medium and a large scale among District 7 of California, containing 228 and 1,026 stations, labeled PeMSD7(M) and PeMSD7(L), respectively.
Figure 4: Speed prediction in the morning peak and evening rush hours of the dataset PeMSD7.

Figure 5: Test RMSE versus the training time (left); test MAE versus the number of training epochs (right). (PeMSD7(M))

predictions during the morning peak and evening rush hours, as shown in Figure 4. It is easy to observe that our proposal STGCN captures the trend of the rush hours more accurately than the other methods, and it detects the end of the rush hours earlier than the others. Stemming from the efficient graph convolution and the stacked temporal convolution structures, our model is capable of responding quickly to dynamic changes in the traffic network, without the over-reliance on historical averages that most recurrent networks exhibit.

Training Efficiency and Generalization
To see the benefits of the convolution along the time axis in our proposal, we summarize the comparison of training time between STGCN and GCGRU in Table 3. For fairness, GCGRU consists of three layers with 64, 64, and 128 units respectively in the experiment on PeMSD7(M), and STGCN uses the default settings described in Section 4.3. Our model STGCN consumes only 272 seconds, while the RNN-type model GCGRU spends 3,824 seconds on PeMSD7(M). This 14-fold acceleration of training speed mainly benefits from applying the temporal convolution instead of recurrent structures, which achieves fully parallel training rather than relying exclusively on chain structures as RNNs do. For PeMSD7(L), GCGRU has to use half of the batch size since its GPU consumption exceeded the memory capacity of a single card (results marked with "*" in Table 2), while STGCN only needs to double the channels in the middle of the ST-Conv blocks. Even so, our model still consumes less than a tenth of the training time of GCGRU under this circumstance. Meanwhile, the advantages of the 1st-order approximation appear, since it is not restricted to the parameterization of polynomials: the model STGCN(1st) speeds up by around 20% on the larger dataset with satisfactory performance compared with STGCN(Cheb).

In order to further investigate the performance of the compared deep learning models, we plot the RMSE and MAE on the test set of PeMSD7(M) during the training process; see Figure 5. These figures also suggest that our model achieves a much faster training procedure and easier convergence. Thanks to the special design of the ST-Conv blocks, our model has superior performance in balancing time consumption and parameter settings. Specifically, the number of parameters in STGCN ($4.54 \times 10^5$) accounts for only around two thirds of that of GCGRU, and saves over 95% of the parameters compared to FC-LSTM.

5 Related Works
There are several recent deep learning studies that are also motivated by graph convolution in spatio-temporal tasks. Seo et al. [2016] introduced the graph convolutional recurrent network (GCRN) to jointly identify spatial structures and dynamic variation from structured sequences of data. The key challenge of that study is to determine the optimal combination of recurrent networks and graph convolution under specific settings. Based on the principles above, Li et al. [2018] successfully employed gated recurrent units (GRU) with graph convolution for long-term traffic forecasting. In contrast to these works, we build our model completely from convolutional structures; the ST-Conv block is specially designed to uniformly process structured data with residual connections and a bottleneck strategy inside; and more efficient graph convolution kernels are employed in our model as well.
6 Conclusion and Future Work
In this paper, we propose a novel deep learning framework, STGCN, for traffic prediction, integrating graph convolution and gated temporal convolution through spatio-temporal convolutional blocks. Experiments show that our model outperforms other state-of-the-art methods on two real-world datasets, indicating its great potential for exploring spatio-temporal structures from the input. It also achieves faster training, easier convergence, and fewer parameters, with flexibility and scalability. These features are quite promising and practical for scholarly development and large-scale industrial deployment. In the future, we will further optimize the network structure and parameter settings. Moreover, our proposed framework can be applied to more general spatio-temporal structured sequence forecasting scenarios, such as the evolution of social networks and preference prediction in recommendation systems.

References
[Ahmed and Cook, 1979] Mohammed S Ahmed and Allen R Cook. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. 1979.
[Bruna et al., 2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[Chen et al., 2001] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: mining loop detector data. Transportation Research Record: Journal of the Transportation Research Board, (1748):96–102, 2001.
[Chen et al., 2016] Quanjun Chen, Xuan Song, Harutoshi Yamada, and Ryosuke Shibasaki. Learning deep representation from big and heterogeneous data for traffic accident inference. In AAAI, pages 338–344, 2016.
[Defferrard et al., 2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
[Hammond et al., 2011] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Huang et al., 2014] Wenhao Huang, Guojie Song, Haikun Hong, and Kunqing Xie. Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems, 15(5):2191–2201, 2014.
[Jia et al., 2016] Yuhan Jia, Jianping Wu, and Yiman Du. Traffic speed prediction using deep learning method. In ITSC, pages 1217–1222. IEEE, 2016.
[Kipf and Welling, 2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[Li et al., 2015] Yexin Li, Yu Zheng, Huichu Zhang, and Lei Chen. Traffic prediction in a bike-sharing system. In SIGSPATIAL, page 33. ACM, 2015.
[Li et al., 2018] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.
[Lv et al., 2015] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 16(2):865–873, 2015.
[Niepert et al., 2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pages 2014–2023, 2016.
[Seo et al., 2016] Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. arXiv preprint arXiv:1612.07659, 2016.
[Shi et al., 2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
[Shuman et al., 2013] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
[Vlahogianni, 2015] Eleni I Vlahogianni. Computational intelligence and optimization for transportation big data: challenges and opportunities. In Engineering and Applied Sciences Optimization, pages 107–128. Springer, 2015.
[Williams and Hoel, 2003] Billy M Williams and Lester A Hoel. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. Journal of Transportation Engineering, 129(6):664–672, 2003.
[Wu and Tan, 2016] Yuankai Wu and Huachun Tan. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint arXiv:1612.01022, 2016.