
Figure 2: Boundary prediction comparison of (a) local information based and (b) global proposal information based methods.
However, it drops actionness information and only adopts boundary matching to capture low-level features, which cannot handle complex activities and cluttered backgrounds. Besides, different from our method shown in Fig. 1, it employs the same method as (Lin et al. 2018) to generate a boundary probability sequence instead of a map, which lacks a global scope for action instances with blurred boundaries and variable temporal durations. Fig. 2 illustrates the difference between local information and our global proposal information for boundary prediction.
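To make this distinction concrete, the toy sketch below contrasts the two scoring schemes; the sequence length and random scores are placeholders rather than outputs of any real model.

```python
# Toy contrast (a sketch, not the authors' code) between boundary scoring
# from local probability sequences and from global boundary maps.
import numpy as np

T = 100                              # assumed number of temporal locations
rng = np.random.default_rng(0)       # placeholder scores

# (a) Local information: independent per-location start/end probabilities,
# each estimated from a small temporal neighborhood only.
start_prob = rng.random(T)           # P(start at t)
end_prob = rng.random(T)             # P(end at t)

# (b) Global proposal information: dense T x T maps that score every
# (start, end) pair jointly, so blurred boundaries and variable durations
# are rated with the whole candidate proposal in view.
start_map = rng.random((T, T))       # start confidence of proposal (s, e)
end_map = rng.random((T, T))         # end confidence of proposal (s, e)

s, e = 10, 42                        # an example candidate proposal
local_score = start_prob[s] * end_prob[e]       # boundaries scored in isolation
global_score = start_map[s, e] * end_map[s, e]  # boundaries scored per proposal
```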
To address the aforementioned drawbacks, we propose a dense boundary generator (DBG) that employs global proposal features to predict the boundary map and explores action-aware features for action completeness analysis. In our framework, a dual stream BaseNet (DSB) takes spatial and temporal video representations as input to exploit the rich local behaviors within the video sequence, and is supervised via an actionness classification loss. DSB generates two types of features: a low-level dual stream feature and a high-level actionness score feature. In addition, a proposal feature generation (PFG) layer is designed to transform these two types of sequence features into matrix-like proposal features. An action-aware completeness regression (ACR) module takes the actionness score feature as input to generate a reliable completeness score map, and a temporal boundary classification (TBC) module produces temporal boundary score maps based on the dual stream feature. These three score maps are combined to generate proposals.
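As a concrete illustration, the following PyTorch-style skeleton sketches this pipeline; the channel sizes, sampling count, and head internals are illustrative assumptions rather than our exact configuration.

```python
# A minimal sketch of the DBG pipeline described above (shapes and module
# internals are assumptions for illustration, not the exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSB(nn.Module):
    """Dual stream BaseNet: temporal convolutions over the spatial and
    temporal streams, supervised by an actionness loss (loss omitted)."""
    def __init__(self, in_dim=200, hid=128):
        super().__init__()
        self.spatial = nn.Conv1d(in_dim, hid, 3, padding=1)
        self.temporal = nn.Conv1d(in_dim, hid, 3, padding=1)
        self.actionness = nn.Conv1d(2 * hid, 1, 1)

    def forward(self, rgb, flow):                      # (B, C, T) each
        dual = torch.cat([F.relu(self.spatial(rgb)),
                          F.relu(self.temporal(flow))], dim=1)
        action = torch.sigmoid(self.actionness(dual))  # (B, 1, T)
        return dual, action  # low-level dual stream / high-level actionness

def pfg(seq_feat, num_samples=8):
    """Proposal feature generation (sketch): sample the sequence feature at
    uniform locations between every (start, end) pair, turning a (B, C, T)
    sequence into a (B, C*num_samples, T, T) matrix-like proposal feature."""
    B, C, T = seq_feat.shape
    out = seq_feat.new_zeros(B, C * num_samples, T, T)
    for s in range(T):
        for e in range(s + 1, T):
            idx = torch.linspace(s, e, num_samples).long()
            out[:, :, s, e] = seq_feat[:, :, idx].flatten(1)
    return out

class Heads(nn.Module):
    """ACR regresses a completeness map from action-aware features; TBC
    classifies start/end boundary maps from the dual stream features."""
    def __init__(self, acr_in, tbc_in):
        super().__init__()
        self.acr = nn.Conv2d(acr_in, 1, 1)             # completeness map
        self.tbc = nn.Conv2d(tbc_in, 2, 1)             # start & end maps

    def forward(self, action_prop, dual_prop):
        completeness = torch.sigmoid(self.acr(action_prop))  # (B, 1, T, T)
        start_end = torch.sigmoid(self.tbc(dual_prop))       # (B, 2, T, T)
        # Combine the three score maps into one proposal confidence map.
        return completeness[:, 0] * start_end[:, 0] * start_end[:, 1]

# Example wiring (shapes only): batch 2, 200 channels per stream, 50 snippets.
dsb, heads = DSB(), Heads(acr_in=1 * 8, tbc_in=256 * 8)
rgb, flow = torch.randn(2, 200, 50), torch.randn(2, 200, 50)
dual, action = dsb(rgb, flow)                 # (2, 256, 50), (2, 1, 50)
score_map = heads(pfg(action), pfg(dual))     # (2, 50, 50)
```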
The main contributions of this paper are summarized as follows:
• We propose a fast and unified dense boundary genera-
tor (DBG) for temporal action proposal, which evaluates
dense boundary confidence maps for all proposals.
• We introduce auxiliary supervision via actionness classification to effectively facilitate action-aware features for the action-aware completeness regression.
• We design an efficient proposal feature generation layer to
capture global proposal features for subsequent regression
and classification modules.
• Experiments conducted on popular benchmarks like
ActivityNet-1.3 (Heilbron et al. 2015) and THUMOS14
(Idrees et al. 2017) demonstrate the superiority of our net-
work over the state-of-the-art methods.
Related Work
Action recognition. Early methods for video action recognition mainly relied on hand-crafted features such as HOF, HOG and MBH. Recent advances resort to deep convolutional networks to improve recognition accuracy. These networks can be divided into two categories: two-stream networks (Feichtenhofer, Pinz, and Zisserman 2016; Simonyan and Zisserman 2014; Wang et al. 2015; Wang et al. 2016) and 3D networks (Tran et al. 2015; Qiu, Yao, and Mei 2017; Carreira and Zisserman 2017). Two-stream networks explore video appearance and motion cues by separately passing RGB images and stacked optical flow through ConvNets pretrained on ImageNet. In contrast, 3D methods directly build hierarchical representations of spatio-temporal data with spatio-temporal filters.
Temporal action proposal. Temporal action proposal generation aims to detect action instances with temporal boundaries and confidence scores in untrimmed videos. Anchor-based methods generate proposals from a set of multi-scale anchors placed at regular temporal intervals. The work in (Shou, Wang, and Chang 2016) adopts the C3D network (Tran et al. 2015) as a binary classifier for anchor evaluation. (Heilbron, Niebles, and Ghanem 2016) proposes a sparse learning framework for scoring temporal anchors. (Gao et al. 2017) applies temporal regression to adjust the action boundaries. Boundary-based methods evaluate each temporal location in the video. (Zhao et al. 2017a) groups continuous high-score regions into proposals via a temporal watershed algorithm. (Lin et al. 2018) locates temporal boundaries with locally high probabilities and evaluates the global confidences of the candidate proposals generated from these boundaries. (Lin et al. 2019) proposes a boundary-matching mechanism for the confidence evaluation of densely distributed proposals in an end-to-end pipeline. MGG (Liu et al. 2019) combines anchor-based and boundary-based methods to accurately generate temporal action proposals.
Temporal action detection. Temporal action detection involves generating temporal proposals and recognizing their action classes, and existing methods can be divided into two patterns, i.e., one-stage (Lin, Zhao, and Shou 2017; Long et al. 2019) and two-stage (Shou, Wang, and Chang 2016; Gao, Yang, and Nevatia 2017; Zhao et al. 2017b; Xu, Das, and Saenko 2017; Chao et al. 2018). Two-stage methods first generate candidate proposals and then classify them. (Chao et al. 2018) improves two-stage temporal action detection by addressing both receptive field alignment and context feature extraction. Among one-stage methods, (Lin, Zhao, and Shou 2017) skips proposal generation by directly detecting action instances in untrimmed videos, and (Long et al. 2019) introduces Gaussian kernels to dynamically optimize the temporal scale of each action proposal.