Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Networks
Pavlo Molchanov Xiaodong Yang Shalini Gupta Kihwan Kim Stephen Tyree Jan Kautz
NVIDIA
{pmolchanov,xiaodongy,shalinig,kihwank,styree,jkautz}@nvidia.com
addresses the challenge of early detection of gestures, resulting in zero or negative lag, which is a crucial element for responsive user interfaces. We present a new multi-modal hand gesture dataset¹ with 25 classes for comparing our algorithm against state-of-the-art methods and human subject performance.

¹ https://research.nvidia.com/publication/online-detection-and-classification-dynamic-hand-gestures-recurrent-3d-convolutional
2. Related Work
Many hand-crafted spatio-temporal features for effective video analysis have been introduced in the area of gesture and action recognition [33, 36, 39]. They typically capture shape, appearance, and motion cues via image gradients and optical flow. Ohn-Bar and Trivedi [27] evaluate several global features for automotive gesture recognition. A number of video classification systems successfully employ improved dense trajectories [39] and Fisher vector [30] representations, which are widely regarded as state-of-the-art local features and aggregation techniques for video analysis. Features for depth sensors are usually designed according to the specific characteristics of the depth data. For instance, random occupancy patterns [40] utilize point clouds and super normal vectors [42] employ surface normals.

In contrast to hand-crafted features, there is a growing trend toward feature representations learned by deep neural networks. Neverova et al. [25] employ CNNs to combine color and depth data from hand regions and upper-body skeletons to recognize sign language gestures. Molchanov et al. [22, 23] apply a 3D-CNN to the whole video sequence and introduce space-time video augmentation techniques to avoid overfitting. In the context of action recognition, Simonyan and Zisserman [34] propose separate CNNs for the spatial and temporal streams that are late-fused and that explicitly use optical flow. Tran et al. [37] employ a 3D-CNN to analyze a series of short video clips and average the network's responses for all clips. Most previous methods either employ pre-segmented video sequences or treat detection and classification as separate problems.

To the best of our knowledge, none of the previous methods for hand gesture recognition address the problem of early gesture recognition to achieve the zero or negative lag necessary for designing effective gesture interfaces. Early detection techniques have been proposed for classifying facial expressions and articulated body motion [12, 32], as well as for predicting future events based on incoming video streams [15, 16]. The predicted motions in many of these methods are aided by the appearance of their environments (i.e., road or parking lot), something we cannot rely on for gesture recognition. Recently, connectionist temporal classification has been shown to be effective for classification of unsegmented handwriting and speech [9, 10]. We demonstrate the applicability of CTC for gesture recognition from unsegmented video streams.

3. Method

In this section, we describe the architecture and training of our algorithm for multi-modal dynamic hand gesture detection and classification.

3.1. Network Architecture

We propose a recurrent 3D convolutional neural network (R3DCNN) for dynamic hand gesture recognition, illustrated in Fig. 1. The architecture consists of a deep 3D-CNN for spatio-temporal feature extraction, a recurrent layer for global temporal modeling, and a softmax layer for predicting class-conditional gesture probabilities.

Figure 1: Classification of dynamic gestures with R3DCNN. A gesture video is presented in the form of short clips C_t to a 3D-CNN for extracting local spatial-temporal features, f_t. These features are input to a recurrent network, which aggregates transitions across several clips. The recurrent network has a hidden state h_{t-1}, which is computed from the previous clips. The updated hidden state for the current clip, h_t, is input into a softmax layer to estimate class-conditional probabilities, s_t, of the various gestures. During training, CTC is used as the cost function.

We begin by formalizing the operations performed by the network. We define a video clip as a volume $C_t \in \mathbb{R}^{k \times l \times c \times m}$ of $m \geq 1$ sequential frames with $c$ channels of size $k \times l$ pixels ending at time $t$. Each clip is transformed into a feature representation $f_t$ by a 3D-CNN $\mathcal{F}$:

$$\mathcal{F}: \mathbb{R}^{k \times l \times c \times m} \to \mathbb{R}^q, \quad \text{where } f_t = \mathcal{F}(C_t),$$

by applying spatio-temporal filters to the clip. A recurrent layer computes a hidden state vector $h_t \in \mathbb{R}^d$ as a function of the hidden state after the previous clip $h_{t-1}$ and the feature representation of the current clip $f_t$:

$$h_t = R(W_{in} f_t + W_h h_{t-1}),$$

with weight matrices $W_{in} \in \mathbb{R}^{d \times q}$ and $W_h \in \mathbb{R}^{d \times d}$, and truncated rectified linear unit $R: \mathbb{R}^d \to \mathbb{R}^d$, $R(x) = \min(\max(0, x), 4)$, to limit gradient explosion [29] during training. Finally, a softmax layer transforms the hidden state vector $h_t$ into class-conditional probabilities $s_t$ of $w$ classes:

$$s_t = S(W_s h_t + b),$$

with weights $W_s \in \mathbb{R}^{w \times d}$, bias $b \in \mathbb{R}^w$, and a softmax function $S: \mathbb{R}^w \to [0,1]^w$, where $[S(x)]_i = e^{x_i} / \sum_k e^{x_k}$.

We perform classification by splitting the entire video $V$ into $T$ clips of length $m$ and computing the set of class-conditional probabilities $\mathcal{S} = \{s_0, s_1, \ldots, s_{T-1}\}$ for each individual clip. For offline gesture classification, we average the probabilities of all the clips belonging to a pre-segmented gesture, $s_{avg} = \frac{1}{T} \sum_{s \in \mathcal{S}} s$, and the predicted class is $\hat{y} = \arg\max_i [s_{avg}]_i$ across all gesture classes $i$. When predicting online with unsegmented streams, we consider only clip-wise probabilities $s_t$.

We combine multiple modalities by averaging the class-conditional probabilities estimated by the modality-specific networks. During online operation, we average probabilities across modalities for the current clip only. As an alternative to the softmax layer, we additionally consider computing the final classification score with a support vector machine (SVM) [6] classifier operating on features $f_t$ or $h_t$ extracted by the R3DCNN. We average the features across video clips and normalize by their $\ell_2$-norms to form a single representation for the entire video.
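To make the data flow above concrete, the following is a minimal NumPy sketch of the forward pass and the offline clip averaging. It is illustrative only: the paper's implementation is in Theano, the 3D-CNN $\mathcal{F}$ is stubbed out with random features, and the dimensions q, d, w and all weights are placeholder assumptions rather than the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d, w = 512, 256, 25            # feature, hidden, and class dimensions (assumed)

# Placeholder parameters; in the paper these are learned with BPTT.
W_in = rng.normal(0, 0.01, (d, q))
W_h  = rng.normal(0, 0.01, (d, d))
W_s  = rng.normal(0, 0.01, (w, d))
b    = np.zeros(w)

def cnn_features(clip):
    """Stub for the 3D-CNN F: maps a clip of shape (m, c, k, l) to a q-dim feature f_t."""
    return rng.normal(size=q)     # stand-in for F(C_t)

def truncated_relu(x, cap=4.0):
    """R(x) = min(max(0, x), 4), used to limit exploding activations/gradients."""
    return np.minimum(np.maximum(x, 0.0), cap)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def r3dcnn_forward(clips):
    """Return per-clip class probabilities s_t for a list of clips."""
    h = np.zeros(d)                               # initial hidden state
    probs = []
    for clip in clips:
        f = cnn_features(clip)                    # f_t = F(C_t)
        h = truncated_relu(W_in @ f + W_h @ h)    # h_t = R(W_in f_t + W_h h_{t-1})
        probs.append(softmax(W_s @ h + b))        # s_t = S(W_s h_t + b)
    return np.stack(probs)

# Offline classification of a pre-segmented gesture: average over clips, then argmax.
video = [rng.normal(size=(8, 1, 112, 112)) for _ in range(10)]   # T=10 clips of m=8 frames
s = r3dcnn_forward(video)
s_avg = s.mean(axis=0)
y_hat = int(np.argmax(s_avg))
```

Multi-modal fusion would then simply average s_avg (or, during online operation, the clip-wise s_t) produced by the per-modality networks before taking the argmax.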
3.2. Training

Let $\mathcal{X} = \{V_0, V_1, \ldots, V_{P-1}\}$ be a mini-batch of training examples in the form of weakly-segmented gesture videos $V_i$.² Each video consists of $T$ clips, making $\mathcal{X}$ a set of $N = T \cdot P$ clips. Class labels $y_i$ are drawn from the alphabet $\mathcal{A}$ to form a vector of class labels $\mathbf{y}$ with size $|\mathbf{y}| = P$.

² Weakly-segmented videos contain the preparation, nucleus, and retraction phases and frames from the no gesture class.

Pre-training the 3D-CNN. We initialize the 3D-CNN with the C3D network [37] trained on the large-scale Sports-1M [13] human action recognition dataset. The network has 8 convolutional layers of 3×3×3 filters and 2 fully-connected layers trained on 16-frame clips. We append a softmax prediction layer to the last fully-connected layer and fine-tune by back-propagation with negative log-likelihood to predict gesture classes from individual clips $C_i$.

Training the full model. After fine-tuning the 3D-CNN, we train the entire R3DCNN with back-propagation-through-time (BPTT) [41]. BPTT is equivalent to unrolling the recurrent layers, transforming them into a multi-layer feed-forward network, applying standard gradient-based back-propagation, and averaging the gradients to consolidate updates to weights duplicated by unrolling.

We consider two training cost functions: negative log-likelihood for the entire video and connectionist temporal classification (CTC) for online sequences. The negative log-likelihood function for a mini-batch of videos is:

$$\mathcal{L}_v = -\frac{1}{P} \sum_{i=0}^{P-1} \log p(y_i \mid V_i),$$

where $p(y_i \mid V_i) = [s_{avg}]_{y_i}$ is the probability of gesture label $y_i$ given gesture video $V_i$ as predicted by the R3DCNN.
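As a small worked example, and a sketch only, reusing the placeholder shapes from the earlier snippet, this loss can be computed directly from the clip-wise softmax outputs:

```python
import numpy as np

def video_nll(clip_probs, labels):
    """Video-level negative log-likelihood.

    clip_probs: array of shape (P, T, w), per-clip softmax outputs for P videos of T clips.
    labels:     array of shape (P,), integer gesture labels y_i.
    Returns L_v = -1/P * sum_i log p(y_i | V_i), where p(y_i | V_i) is read from
    the clip-averaged prediction s_avg of each video.
    """
    s_avg = clip_probs.mean(axis=1)                    # (P, w)
    picked = s_avg[np.arange(len(labels)), labels]     # [s_avg]_{y_i}
    return -np.mean(np.log(picked))
```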
Connectionist temporal classification. CTC is a cost function designed for sequence prediction in unsegmented or weakly segmented input streams [9, 10]. CTC is applied in this work to identify and correctly label the nucleus of the gesture, while assigning the no gesture class to the remaining clips, addressing the alignment of class labels to particular clips in the video. In this work we consider only the CTC forward algorithm.

We extend the dictionary of existing gestures with a no gesture class: $\mathcal{A}' = \mathcal{A} \cup \{\textit{no gesture}\}$. Consequently, the softmax layer outputs a class-conditional probability for this additional no gesture class. Instead of averaging predictions across clips in a pre-segmented gesture, the network computes the probability of observing a particular gesture (or no gesture) $k$ at time $t$ in an input sequence $\mathcal{X}$: $p(k, t \mid \mathcal{X}) = s_t^k \;\; \forall t \in [0, N)$.

We define a path $\pi$ as a possible mapping of the input sequence $\mathcal{X}$ into a sequence of class labels $\mathbf{y}$. The probability of observing path $\pi$ is $p(\pi \mid \mathcal{X}) = \prod_t s_t^{\pi_t}$, where $\pi_t$ is the class label predicted at time $t$ in path $\pi$.

Paths are mapped into a sequence of event labels $\mathbf{y}$ by operator $\mathcal{B}$ as $\mathbf{y} = \mathcal{B}(\pi)$, condensing repeated class labels and removing no gesture labels, e.g., $\mathcal{B}([-, 1, 2, -, -]) = \mathcal{B}([1, 1, -, 2, -]) = [1, 2]$, where 1, 2 are actual gesture classes and "$-$" is no gesture. Under $\mathcal{B}$, many paths $\pi$ result in the same event sequence $\mathbf{y}$. The probability of observing a particular sequence $\mathbf{y}$ given an input sequence $\mathcal{X}$ is the sum of the conditional probabilities of all paths $\pi$ mapping to that sequence, $\mathcal{B}^{-1}(\mathbf{y}) = \{\pi : \mathcal{B}(\pi) = \mathbf{y}\}$:

$$p(\mathbf{y} \mid \mathcal{X}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} p(\pi \mid \mathcal{X}).$$

Computation of $p(\mathbf{y} \mid \mathcal{X})$ is simplified by dynamic programming. First, we create an assistant vector $\dot{\mathbf{y}}$ by adding a no gesture label before and after each gesture label in $\mathbf{y}$, so that $\dot{\mathbf{y}}$ contains $|\dot{\mathbf{y}}| = P' = 2P + 1$ labels. Then, we compute a forward variable $\alpha \in \mathbb{R}^{N \times P'}$, where $\alpha_t(u)$ is the combined probability of all mappings of events up to clip $t$ and event $u$. The transition function for $\alpha$ is:

$$\alpha_t(u) = s_t^{\dot{y}_u} \big( \alpha_{t-1}(u) + \alpha_{t-1}(u-1) + \beta_{t-1}(u-2) \big),$$

where

$$\beta_t(u) = \begin{cases} \alpha_t(u), & \text{if } \dot{y}_{u+1} = \textit{no gesture} \text{ and } \dot{y}_u \neq \dot{y}_{u+2}, \\ 0, & \text{otherwise}, \end{cases}$$

and $\dot{y}_u$ denotes the class label of event $u$. The forward variable is initialized with $\alpha_0(0) = s_0^{\dot{y}_0}$, the probability of a path beginning with $\dot{y}_0 = \textit{no gesture}$, and $\alpha_0(1) = s_0^{\dot{y}_1}$, the probability of a path starting with the first actual event $\dot{y}_1$. Since a valid path cannot begin with a later event, we initialize $\alpha_0(i) = 0 \;\; \forall i > 1$. At each time step $t > 0$, we consider paths in which the event $u$ is currently active (with probability $s_t^{\dot{y}_u}$) and (1) remains unchanged from the previous time $t-1$ ($\alpha_{t-1}(u)$), (2) changes from no gesture to the next actual gesture or vice versa ($\alpha_{t-1}(u-1)$), or (3) transitions from one actual gesture to the next while skipping no gesture if the two gestures have distinct labels ($\beta_{t-1}(u-2)$). Finally, any valid path $\pi$ must end at time $N-1$ either with the last actual gesture $\dot{y}_{P'-2}$ or with the trailing no gesture $\dot{y}_{P'-1}$, hence $p(\mathbf{y} \mid \mathcal{X}) = \alpha_{N-1}(P'-2) + \alpha_{N-1}(P'-1)$.

Using this computation for $p(\mathbf{y} \mid \mathcal{X})$, the CTC loss is:

$$\mathcal{L}_{CTC} = -\ln p(\mathbf{y} \mid \mathcal{X}),$$

expressed in the log domain [9]. While CTC is used as a training cost function only, it affects the architecture of the network by adding the extra no gesture class label. For pre-segmented video classification, we simply remove the no gesture output and renormalize probabilities by the $\ell_1$-norm after modality fusion.
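The forward recursion above is short enough to state directly in code. The following NumPy sketch works in the probability domain for readability (a practical implementation, as noted above, works in the log domain for numerical stability), and the index of the no gesture class is an arbitrary assumption:

```python
import numpy as np

NO_GESTURE = 0   # index of the added no-gesture class (assumed)

def ctc_forward_nll(probs, labels):
    """CTC forward algorithm as described above.

    probs:  array of shape (N, w), clip-wise softmax outputs s_t (w includes no gesture).
    labels: non-empty list of gesture class indices y for the sequence (no gesture excluded).
    Returns -ln p(y | X).
    """
    N = probs.shape[0]
    # Assistant sequence y_dot: a no-gesture label before and after every gesture label.
    y_dot = [NO_GESTURE]
    for y in labels:
        y_dot += [y, NO_GESTURE]
    P_prime = len(y_dot)                       # P' = 2P + 1

    alpha = np.zeros((N, P_prime))
    alpha[0, 0] = probs[0, y_dot[0]]           # path starting with no gesture
    alpha[0, 1] = probs[0, y_dot[1]]           # path starting with the first gesture

    for t in range(1, N):
        for u in range(P_prime):
            a = alpha[t - 1, u]
            if u >= 1:
                a += alpha[t - 1, u - 1]
            # beta term: skip the intermediate no-gesture label when two
            # consecutive gestures carry distinct labels.
            if u >= 2 and y_dot[u - 1] == NO_GESTURE and y_dot[u] != y_dot[u - 2]:
                a += alpha[t - 1, u - 2]
            alpha[t, u] = probs[t, y_dot[u]] * a

    # Valid paths end with the last actual gesture or with the trailing no gesture.
    p_y = alpha[N - 1, P_prime - 2] + alpha[N - 1, P_prime - 1]
    return -np.log(p_y)
```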
Learning rule. To optimize the network parameters $\mathcal{W}$ with respect to either of the loss functions, we use stochastic gradient descent (SGD) with a momentum term $\mu = 0.9$. We update each parameter of the network $\theta \in \mathcal{W}$ at every back-propagation step $i$ by:

$$\theta_i = \theta_{i-1} + v_i - \gamma \lambda \theta_{i-1},$$
$$v_i = \mu v_{i-1} - \lambda J\!\left(\left.\frac{\partial E}{\partial \theta}\right|_{batch}\right),$$

where $\lambda$ is the learning rate, $\left.\frac{\partial E}{\partial \theta}\right|_{batch}$ is the gradient value of the chosen cost function $E$ with respect to the parameter $\theta$ averaged over the mini-batch, and $\gamma$ is the weight decay parameter. To prevent gradient explosion in the recurrent layers during training, we apply a soft gradient clipping operator $J(\cdot)$ [29] with a threshold of 10.
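A minimal sketch of one such update step, using the hyperparameter values quoted in this section and assuming the norm-rescaling form of the clipping operator from [29]:

```python
import numpy as np

def clip_gradient(g, threshold=10.0):
    """Soft gradient clipping J(.): rescale the gradient when its norm exceeds the threshold."""
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

def sgd_momentum_step(theta, velocity, grad,
                      lr=3e-4, momentum=0.9, weight_decay=0.005):
    """One update of the rule above:
       v_i     = mu * v_{i-1} - lr * J(dE/dtheta|batch)
       theta_i = theta_{i-1} + v_i - weight_decay * lr * theta_{i-1}
    """
    velocity = momentum * velocity - lr * clip_gradient(grad)
    theta = theta + velocity - weight_decay * lr * theta
    return theta, velocity
```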
Regularization. We apply a number of regularization techniques to reduce overfitting. We train with weight decay ($\gamma = 0.5\%$) on all weights in the network. We apply drop-out [11] to the fully-connected layers of the 3D-CNN at a rate of $p = 75\%$, rescaling the remaining activations by a factor of $1/(1-p)$. Additionally, we find that dropping feature maps in the convolutional layers improves generalization in pre-trained networks. For this, we randomly set 10% of the feature maps of each convolutional layer to 0 and rescale the activations of the other neurons accordingly.
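A sketch of the feature-map drop-out described above, assuming channel-first activations; at test time this function would simply be skipped:

```python
import numpy as np

def drop_feature_maps(activations, drop_ratio=0.1, rng=np.random.default_rng()):
    """Randomly zero whole feature maps of a conv layer and rescale the rest.

    activations: array of shape (channels, depth, height, width) from a 3D conv layer.
    Each channel is dropped with probability drop_ratio; surviving channels are
    rescaled by 1 / (1 - drop_ratio) so the expected activation is unchanged.
    """
    keep = rng.random(activations.shape[0]) >= drop_ratio
    mask = keep[:, None, None, None] / (1.0 - drop_ratio)
    return activations * mask
```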
Implementation. We train our gesture classifier in Theano [2] with cuDNN3 on an NVIDIA DIGITS DevBox with four Titan X GPUs.

We fine-tune the 3D-CNN for 16 epochs with an initial learning rate of $\lambda = 3 \cdot 10^{-3}$, reduced by a factor of 10 after every 4 epochs. Next, we train the R3DCNN end-to-end for an additional 100 epochs with a constant learning rate of $\lambda = 3 \cdot 10^{-4}$. All network parameters without pre-trained initializations are randomly sampled from a zero-mean Gaussian distribution with standard deviation 0.01.

Each video of a weakly-segmented gesture is stored with 80 frames of 120×160 pixels. We train with frames of size 112×112 generated by random crops. Videos from the test set are evaluated with the central crop of each frame. To increase variability in the training examples, we apply the following data augmentation steps to each video in addition to cropping: random spatial rotation (±15°) and scaling (±20%), temporal scaling (±20%), and jittering (±3 frames). The parameters for each augmentation step are drawn from a uniform distribution with a specified range. Since recurrent connections can learn the specific order of gesture videos in the training set, we randomly permute the training gesture videos for each training epoch.
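The sketch below shows one way the augmentation parameters might be sampled and how the temporal part (scaling and jitter) can be applied by resampling frame indices; the spatial rotation, scaling, and cropping would be applied per frame and are omitted here:

```python
import numpy as np

def sample_augmentation(rng=np.random.default_rng()):
    """Draw one set of augmentation parameters from the uniform ranges above."""
    return {
        "rotation_deg":   rng.uniform(-15.0, 15.0),
        "spatial_scale":  rng.uniform(0.8, 1.2),
        "temporal_scale": rng.uniform(0.8, 1.2),
        "jitter_frames":  int(rng.integers(-3, 4)),
    }

def temporally_augment(video, temporal_scale, jitter_frames, out_len=80):
    """Temporal scaling and jitter: shift and rescale frame indices, then resample."""
    n = video.shape[0]
    center = n / 2 + jitter_frames
    idx = center + (np.arange(out_len) - out_len / 2) / temporal_scale
    idx = np.clip(np.round(idx), 0, n - 1).astype(int)
    return video[idx]
```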
We use CNNs pre-trained on three-channel RGB images. To apply them to one-channel depth or IR images, we sum the convolutional kernels for the three channels of the first layer to obtain one kernel. Similarly, to employ the pre-trained CNN with two-channel inputs (e.g., optical flow), we remove the third channel of each kernel and rescale the first two by a factor of 1.5.
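A sketch of this first-layer adaptation on a dummy weight tensor; the kernel layout (out_channels × in_channels × t × k × k) is an assumption about a C3D-like first layer:

```python
import numpy as np

def adapt_first_layer(weights, in_channels):
    """Adapt RGB-pretrained first-layer conv kernels, shaped
    (out_channels, 3, t, k, k), to inputs with 1 or 2 channels."""
    if in_channels == 1:                  # depth / IR: sum the three RGB kernels
        return weights.sum(axis=1, keepdims=True)
    if in_channels == 2:                  # optical flow: drop one channel, rescale the rest
        return weights[:, :2] * 1.5
    return weights

# Example with a dummy first layer: 64 kernels of 3x3x3 over 3 input channels.
w_rgb   = np.random.randn(64, 3, 3, 3, 3)
w_depth = adapt_first_layer(w_rgb, 1)     # (64, 1, 3, 3, 3)
w_flow  = adapt_first_layer(w_rgb, 2)     # (64, 2, 3, 3, 3)
```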
For the 3D-CNN, we find that splitting a gesture into non-overlapping clips of m = 8 frames yields the best combination of classification accuracy, computational complexity, and prediction latency. To work with clips of size m = 8 frames on the C3D network [37] (originally trained with m = 16 frames), we remove temporal pooling after the last convolutional layer. Since data transfer and inference on a single 8-frame clip takes less than 30 ms on an NVIDIA Titan X, we can predict at a faster rate than clips are accumulated.
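For online operation, this clip-wise design maps naturally onto a simple streaming loop. The sketch below assumes a hypothetical `model.step(clip, state)` wrapper around the forward pass sketched earlier; it is not part of the paper's released code:

```python
import numpy as np

CLIP_LEN = 8          # m = 8 frames per clip
frame_buffer = []

def on_new_frame(frame, model, state):
    """Accumulate frames; every m-th frame, run one R3DCNN step on the new clip.

    `model` is assumed to expose step(clip, state) -> (probs, state), i.e. one
    3D-CNN feature extraction plus recurrent update. Because inference on a clip
    takes well under the ~267 ms it takes to accumulate 8 frames at 30 fps, the
    loop keeps up with the incoming stream.
    """
    frame_buffer.append(frame)
    if len(frame_buffer) < CLIP_LEN:
        return None, state                    # no new prediction yet
    clip = np.stack(frame_buffer)
    frame_buffer.clear()
    probs, state = model.step(clip, state)    # clip-wise class probabilities s_t
    return probs, state
```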
4. Dataset

Recently, several public dynamic gesture datasets have been introduced [5, 18, 19, 27]. The datasets differ in the complexity of gestures, the number of subjects and gesture classes, and the types of sensors used for data collection. Among them, the ChaLearn dataset [5] provides the largest number of subjects and samples, but its 20 gesture classes, derived from the Italian sign language, are quite different from the set of gestures common for user interfaces.
The VIVA challenge dataset [27] provides driver hand gestures performed by a small number of subjects (8) against a plain background and from a single viewpoint.

Given the limitations of existing datasets, to validate our proposed gesture recognition algorithm, we acquired a large dataset of 25 gesture types, each intended for human-computer interfaces and recorded by multiple sensors and viewpoints. We captured continuous data streams, containing a total of 1532 dynamic hand gestures, indoors in a car simulator with both bright and dim artificial lighting (Fig. 2). A total of 20 subjects participated in data collection, some with two recorded sessions and some with partial sessions. Subjects performed gestures with their right hand while observing the simulator's display and controlling the steering wheel with their left hand. An interface on the display prompted subjects to perform each gesture with an audio description and a 5 s sample video of the gesture. Gestures were prompted in random order with each type requested 3 times over the course of a full session.

Figure 2: Environment for data collection. (Top) Driving simulator with main monitor displaying simulated driving scenes and a user interface for prompting gestures, (A) a SoftKinetic depth camera (DS325) recording depth and RGB frames, and (B) a DUO 3D camera capturing stereo IR. Both sensors capture 320×240 pixels at 30 frames per second. (Bottom) Examples of each modality, from left: RGB, optical flow, depth, IR-left, and IR-disparity.

Gestures (Fig. 3) include moving either the hand or two fingers up, down, left or right; clicking with the index finger; beckoning; opening or shaking the hand; showing the index finger, or two or three fingers; pushing the hand up, down, out or in; rotating two fingers clockwise or counter-clockwise; pushing two fingers forward; closing the hand twice; and showing "thumb up" or "OK".

We used the SoftKinetic DS325 sensor to acquire front-view color and depth videos and a top-mounted DUO 3D sensor to record a pair of stereo-IR streams. In addition, we computed dense optical flow [7] from the color stream and the IR disparity map from the IR-stereo pair [4]. We randomly split the data by subject into training (70%) and test (30%) sets, resulting in 1050 training and 482 test videos.

Table 1: Comparison of modalities and their combinations.

  Sensor         Accuracy
  Depth          80.3%
  Optical flow   77.8%
  Color          74.1%
  IR image       63.5%
  IR disparity   57.8%

  Fusion accuracies for the evaluated modality combinations: 66.2%, 79.3%, 81.5%, 82.0%, 82.0%, 82.4%, 82.6%, 83.2%, 83.4%, and 83.8% (all modalities combined).

Combining different modalities of the same sensor (e.g., color and optical flow) also improves the accuracy. The best gesture recognition accuracy (83.8%) is observed for the combination of all modalities.
Figure 3: Twenty-five dynamic hand gesture classes. Some gestures were adopted from existing commercial systems [1] or
popular datasets [23, 27]. Each column shows a different gesture class (0−24). The top and bottom rows show the starting
and ending depth frames, respectively, of the nucleus phase for each class. (Note that we did not crop the start and end
frames in the actual training and evaluation data.) Yellow arrows indicate the motion of each hand gesture. (A more detailed
description of each gesture is available in the supplementary video.)
The two-stream CNN [34] utilizes the pre-trained VGG-Net [35]. We fine-tune its spatial stream with the color modality and the temporal stream with optical flow, each from our gesture dataset. We also compare against the C3D [37] method, which is trained with the Sports-1M [13] dataset and fine-tuned with the color or depth modalities of our dataset.

Lastly, we evaluate human performance by asking six subjects to label each of the 482 gesture videos in the test set after viewing the corresponding front-view SoftKinetic color video. Prior to the experiment, each subject familiarized themselves with all 25 gesture types. Gestures were presented in random order to each subject for labelling. To be consistent with machine classifiers, human subjects viewed each gesture video only once, but were not restricted in the time allowed to decide each label.

The results of these comparisons are shown in Table 2. Among the individual modalities, the best results are achieved by depth, followed by optical flow and color. This could be because the depth sensor is more robust to indoor lighting changes and more easily excludes the noisy background scene, relative to the color sensor. Optical flow explicitly encapsulates motion, which is important to recognize dynamic gestures. Unlike the two-stream network for action classification [34], its accuracy for gesture recognition is not improved by combining the spatial and temporal streams. We conjecture that videos for action classification can be associated with certain static objects or scenes, e.g., sports or ceremonies, which is not the case for dynamic hand gestures. Although C3D captures both shape and motion cues in each clip, the temporal relationship between clips is not considered. Our approach achieves the best performance in each individual modality and significantly outperforms other methods with combined modalities, while it is still below human accuracy (88.4%).

Table 2: Comparison of our method to the state-of-the-art methods and human predictions with various modalities.

  Method                     Modality          Accuracy
  HOG+HOG2 [27]              color             24.5%
  Spatial stream CNN [34]    color             54.6%
  iDT-HOG [39]               color             59.1%
  C3D [37]                   color             69.3%
  Ours                       color             74.1%
  HOG+HOG2 [27]              depth             36.3%
  SNV [42]                   depth             70.7%
  C3D [37]                   depth             78.8%
  Ours                       depth             80.3%
  iDT-HOF [39]               opt flow          61.8%
  Temporal stream CNN [34]   opt flow          68.0%
  iDT-MBH [39]               opt flow          76.8%
  Ours                       opt flow          77.8%
  HOG+HOG2 [27]              color + depth     36.9%
  Two-stream CNNs [34]       color + opt flow  65.6%
  iDT [39]                   color + opt flow  73.4%
  Ours                       all               83.8%
  Human                      color             88.4%

Design choices. We analyze the individual components of our proposed R3DCNN algorithm (Table 3). First, to understand the utility of the 3D-CNN, we substitute it with a 2D-CNN initialized with the pre-trained 16-layer VGG-Net [35] and train similarly to the 3D-CNN. We also assess the importance of the recurrent network, the CTC cost function, and feature map drop-out. Classification accuracies for these experiments are listed in Table 3.

Table 3: Comparison of 2D-CNN and 3D-CNN trained with different architectures on depth or color data. (CTC* denotes training without drop-out of feature maps.)

           Color                Depth
           2D-CNN    3D-CNN     2D-CNN    3D-CNN
  No RNN   55.6%     67.2%      68.1%     73.3%
  RNN      57.9%     72.0%      64.7%     79.5%
  CTC      65.6%     74.1%      69.1%     80.3%
  CTC*     59.5%     66.5%      67.0%     75.6%
Table 4: Accuracy of a linear SVM (C = 1) trained on features extracted from different networks and layers (final fully-connected layer fc and recurrent layer rnn).

              Clip-wise C3D [37]    R3DCNN
  Modality    fc                    fc       rnn
  Color       69.3%                 73.0%    74.1%
  Depth       78.8%                 79.9%    80.1%

However, we observe (columns 1-2, Table 4) that following fc by a recurrent layer and training on full videos (R3DCNN) improves the accuracy of the extracted features. A plausible explanation is that the recurrent layer helps the preceding convolutional network to learn more general features. Moreover, features from the recurrent layer, when coupled with an SVM classifier, demonstrate a further improvement in performance (column 3, Table 4).
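A sketch of this evaluation using scikit-learn's LinearSVC (which wraps LIBLINEAR [6]); the feature arrays are random placeholders standing in for features actually extracted from the fc or rnn layers:

```python
import numpy as np
from sklearn.svm import LinearSVC

def video_descriptor(clip_features):
    """Average clip-level features of shape (T, q) over the video and L2-normalize."""
    f = clip_features.mean(axis=0)
    return f / (np.linalg.norm(f) + 1e-12)

rng = np.random.default_rng(0)
# Stand-ins for per-video clip features (100 videos, T = 10 clips, 256-dim features).
clip_feats = [rng.normal(size=(10, 256)) for _ in range(100)]
X_train = np.stack([video_descriptor(f) for f in clip_feats])
y_train = rng.integers(0, 25, size=100)      # placeholder gesture labels

svm = LinearSVC(C=1.0)                       # linear SVM with C = 1, as in Table 4
svm.fit(X_train, y_train)
```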
Figure 4: A comparison of the gesture recognition performance of R3DCNN trained with (middle) and without (bottom)
CTC. (The no gesture class is not shown for CTC.) The various colors and line types indicate different gesture classes.
Table 5: Results for the SKIG RGBD gesture dataset.
References

[1] Bayerische Motoren Werke AG. Gesture control interface in BMW 7 Series. https://www.bmw.com/.
[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference, 2010.
[3] S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In ACM CHI, pages 181–186, 1991.
[4] Code Laboratories Inc. Duo3D SDK. https://duo3d.com/.
[5] S. Escalera, X. Baró, J. Gonzalez, M. A. Bautista, M. Madadi, M. Reyes, V. Ponce-López, H. J. Escalante, J. Shotton, and I. Guyon. ChaLearn Looking at People Challenge 2014: dataset and results. In ECCVW, 2014.
[6] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. Liblinear: A library for large linear classification. Journal of Mach. Learn. Research, 2008.
[7] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scand. Conf. on Im. Anal., 2003.
[8] D. M. Gavrila. The visual analysis of human movement: a survey. CVIU, 73(1):82–98, 1999.
[9] A. Graves. Supervised sequence labelling with recurrent neural networks. Springer, 2012.
[10] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376, 2006.
[11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[12] M. Hoai and F. De la Torre. Max-margin early event detectors. In CVPR, 2012.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[14] A. Kendon. Current issues in the study of gesture. In The biological foundations of gestures: motor and semiotic aspects, pages 23–47. Lawrence Erlbaum Associates, 1986.
[15] K. Kim, D. Lee, and I. Essa. Gaussian process regression flow for analysis of motion trajectories. In ICCV, 2011.
[16] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
[17] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[18] L. Liu and L. Shao. Learning discriminative representations from RGB-D video data. In IJCAI, 2013.
[19] G. Marin, F. Dominio, and P. Zanuttigh. Hand gesture recognition with leap motion and kinect devices. In ICIP, 2014.
[20] R. B. Miller. Response time in man-computer conversational transactions. AFIPS, 1968.
[21] S. Mitra and T. Acharya. Gesture recognition: a survey. In IEEE Systems, Man, and Cybernetics, 2007.
[22] P. Molchanov, S. Gupta, K. Kim, and J. Kautz. Hand gesture recognition with 3D convolutional neural networks. In CVPRW, 2015.
[23] P. Molchanov, S. Gupta, K. Kim, and K. Pulli. Multi-sensor system for driver's hand-gesture recognition. In IEEE Automatic Face and Gesture Recognition, 2015.
[24] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. Moddrop: adaptive multi-modal gesture recognition. arXiv, 2014.
[25] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. Multi-scale deep learning for gesture detection and localization. In ECCVW, 2014.
[26] N. Nishida and H. Nakayama. Multimodal gesture recognition using multistream recurrent neural network. In PSIVT, pages 1–14, 2015.
[27] E. Ohn-Bar and M. Trivedi. Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE ITS, 15(6):1–10, 2014.
[28] M. Osadchy, Y. L. Cun, and M. L. Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Mach. Learn. Research, 8:1197–1215, 2007.
[29] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[30] F. Perronnin, J. Sanchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In ECCV, 2010.
[31] L. Pigou, A. v. d. Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre. Beyond temporal pooling: recurrence and temporal convolutions for gesture recognition in video. arXiv, 2015.
[32] M. Ryoo. Human activity prediction: early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[33] X. Shen, G. Hua, L. Williams, and Y. Wu. Dynamic hand gesture recognition: an exemplar based approach from motion divergence fields. Image and Vis. Computing, 2012.
[34] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition. In NIPS, 2014.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[36] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012.
[37] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[38] P. T. Tung and L. Q. Ngoc. Elliptical density shape model for hand gesture recognition. In ACM SoICT, 2014.
[39] H. Wang, D. Oneata, J. Verbeek, and C. Schmid. A robust and efficient video representation for action recognition. IJCV, 2015.
[40] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3D action recognition with random occupancy patterns. In ECCV, 2012.
[41] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[42] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In CVPR, 2014.