2016 IEEE Conference on Computer Vision and Pattern Recognition

Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks

Pavlo Molchanov Xiaodong Yang Shalini Gupta Kihwan Kim Stephen Tyree Jan Kautz
NVIDIA
{pmolchanov,xiaodongy,shalinig,kihwank,styree,jkautz}@nvidia.com

Abstract

Automatic detection and classification of dynamic hand gestures in real-world systems intended for human computer interaction is challenging as: 1) there is a large diversity in how people perform gestures, making detection and classification difficult; 2) the system must work online in order to avoid noticeable lag between performing a gesture and its classification; in fact, a negative lag (classification before the gesture is finished) is desirable, as feedback to the user can then be truly instantaneous. In this paper, we address these challenges with a recurrent three-dimensional convolutional neural network that performs simultaneous detection and classification of dynamic hand gestures from multi-modal data. We employ connectionist temporal classification to train the network to predict class labels from in-progress gestures in unsegmented input streams. In order to validate our method, we introduce a new challenging multi-modal dynamic hand gesture dataset captured with depth, color and stereo-IR sensors. On this challenging dataset, our gesture recognition system achieves an accuracy of 83.8%, outperforms competing state-of-the-art algorithms, and approaches human accuracy of 88.4%. Moreover, our method achieves state-of-the-art performance on the SKIG and ChaLearn 2014 benchmarks.

1. Introduction

Hand gestures and gesticulations are a common form of human communication. It is therefore natural for humans to use this form of communication to interact with machines as well. For instance, touch-less human computer interfaces can improve comfort and safety in vehicles. Computer vision systems are useful tools in designing such interfaces. Recent work using deep convolutional neural networks (CNNs) with video sequences has significantly advanced the accuracy of dynamic hand gesture [22, 23, 25] and action [13, 34, 37] recognition. CNNs are also useful for combining multi-modal data inputs [23, 25], a technique which has proved useful for gesture recognition in challenging lighting conditions [23, 27].

However, real-world systems for dynamic hand gesture recognition present numerous open challenges. First, these systems receive continuous streams of unprocessed visual data, where gestures from known classes must be simultaneously detected and classified. Most prior work, e.g., [21, 23, 25, 27], regards gesture segmentation and classification separately. Two classifiers, a detection classifier to distinguish between gesture and no gesture and a recognition classifier to identify the specific gesture type, are often trained separately and applied in sequence to the input data streams. There are two reasons for this: (1) to compensate for variability in the duration of the gesture, and (2) to reduce noise due to unknown hand motions in the no gesture class. However, this limits the system's accuracy to that of the upstream detection classifier. Additionally, since both problems are highly interdependent, it is advantageous to address them jointly. A similar synergy was shown to be useful for joint face detection and pose estimation [28].

Second, dynamic hand gestures generally contain three temporally overlapping phases: preparation, nucleus, and retraction [8, 14], of which the nucleus is most discriminative. The other two phases can be quite similar for different gesture classes and hence less useful or even detrimental to accurate classification. This motivates designing classifiers which rely primarily on the nucleus phase.

Finally, humans are acutely perceptive of the response time of user interfaces, with lags greater than 100 ms perceived as annoying [3, 20]. This presents the challenge of detecting and classifying gestures immediately upon or before their completion to provide rapid feedback.

In this paper, we present an algorithm for joint segmentation and classification of dynamic hand gestures from continuous depth, color and stereo-IR data streams. Building on the recent success of CNN classifiers for gesture recognition, we propose a network that employs a recurrent three-dimensional (3D) CNN with connectionist temporal classification (CTC) [10]. CTC enables gesture classification to be based on the nucleus phase of the gesture without requiring explicit pre-segmentation. Furthermore, our network
addresses the challenge of early detection of gestures, resulting in zero or negative lag, which is a crucial element for responsive user interfaces. We present a new multi-modal hand gesture dataset¹ with 25 classes for comparing our algorithm against state-of-the-art methods and human subject performance.

¹ https://research.nvidia.com/publication/online-detection-and-classification-dynamic-hand-gestures-recurrent-3d-convolutional

2. Related Work

Many hand-crafted spatio-temporal features for effective video analysis have been introduced in the area of gesture and action recognition [33, 36, 39]. They typically capture shape, appearance, and motion cues via image gradients and optical flow. Ohn-Bar and Trivedi [27] evaluate several global features for automotive gesture recognition. A number of video classification systems successfully employ improved dense trajectories [39] and Fisher vector [30] representations, which are widely regarded as state-of-the-art local features and aggregation techniques for video analysis. Features for depth sensors are usually designed according to the specific characteristics of the depth data. For instance, random occupancy patterns [40] utilize point clouds and super normal vectors [42] employ surface normals.

In contrast to hand-crafted features, there is a growing trend toward feature representations learned by deep neural networks. Neverova et al. [25] employ CNNs to combine color and depth data from hand regions and upper-body skeletons to recognize sign language gestures. Molchanov et al. [22, 23] apply a 3D-CNN on the whole video sequence and introduce space-time video augmentation techniques to avoid overfitting. In the context of action recognition, Simonyan and Zisserman [34] propose separate CNNs for the spatial and temporal streams that are late-fused and that explicitly use optical flow. Tran et al. [37] employ a 3D-CNN to analyze a series of short video clips and average the network's responses for all clips. Most previous methods either employ pre-segmented video sequences or treat detection and classification as separate problems.

To the best of our knowledge, none of the previous methods for hand gesture recognition address the problem of early gesture recognition to achieve the zero or negative lag necessary for designing effective gesture interfaces. Early detection techniques have been proposed for classifying facial expressions and articulated body motion [12, 32], as well as for predicting future events based on incoming video streams [15, 16]. The predicted motions in many of these methods are aided by the appearance of their environments (i.e., road or parking lot), something we cannot rely on for gesture recognition. Recently, connectionist temporal classification has been shown to be effective for classification of unsegmented handwriting and speech [9, 10]. We demonstrate the applicability of CTC for gesture recognition from unsegmented video streams.

Figure 1: Classification of dynamic gestures with R3DCNN. A gesture video is presented in the form of short clips C_t to a 3D-CNN for extracting local spatial-temporal features, f_t. These features are input to a recurrent network, which aggregates transitions across several clips. The recurrent network has a hidden state h_{t-1}, which is computed from the previous clips. The updated hidden state for the current clip, h_t, is input into a softmax layer to estimate class-conditional probabilities s_t of the various gestures. During training, CTC is used as the cost function.

3. Method

In this section, we describe the architecture and training of our algorithm for multi-modal dynamic hand gesture detection and classification.

3.1. Network Architecture

We propose a recurrent 3D convolutional neural network (R3DCNN) for dynamic hand gesture recognition, illustrated in Fig. 1. The architecture consists of a deep 3D-CNN for spatio-temporal feature extraction, a recurrent layer for global temporal modeling, and a softmax layer for predicting class-conditional gesture probabilities.

We begin by formalizing the operations performed by the network. We define a video clip as a volume $C_t \in \mathbb{R}^{k \times l \times c \times m}$ of $m \geq 1$ sequential frames with $c$ channels of size $k \times l$ pixels ending at time $t$. Each clip is transformed into a feature representation $f_t$ by a 3D-CNN $\mathcal{F}$:

$$\mathcal{F} : \mathbb{R}^{k \times l \times c \times m} \to \mathbb{R}^{q}, \quad f_t = \mathcal{F}(C_t),$$

by applying spatio-temporal filters to the clip. A recurrent layer computes a hidden state vector $h_t \in \mathbb{R}^{d}$ as a function of the hidden state after the previous clip $h_{t-1}$ and the feature representation of the current clip $f_t$:

$$h_t = \mathcal{R}(W_{\text{in}} f_t + W_h h_{t-1}),$$
with weight matrices $W_{\text{in}} \in \mathbb{R}^{d \times q}$ and $W_h \in \mathbb{R}^{d \times d}$, and truncated rectified linear unit $\mathcal{R} : \mathbb{R}^{d} \to \mathbb{R}^{d}$, $\mathcal{R}(x) = \min(\max(0, x), 4)$, to limit gradient explosion [29] during training. Finally, a softmax layer transforms the hidden state vector $h_t$ into class-conditional probabilities $s_t$ of $w$ classes:

$$s_t = \mathcal{S}(W_s h_t + b),$$

with weights $W_s \in \mathbb{R}^{w \times d}$, bias $b \in \mathbb{R}^{w}$, and a softmax function $\mathcal{S} : \mathbb{R}^{w} \to \mathbb{R}^{w}_{[0,1]}$, where $[\mathcal{S}(x)]_i = e^{x_i} / \sum_k e^{x_k}$.

We perform classification by splitting the entire video $V$ into $T$ clips of length $m$ and computing the set of class-conditional probabilities $S = \{s_0, s_1, ..., s_{T-1}\}$ for each individual clip. For offline gesture classification, we average the probabilities of all the clips belonging to a pre-segmented gesture, $s_{\text{avg}} = \frac{1}{T}\sum_{s \in S} s$, and the predicted class is $\hat{y} = \arg\max_i\,([s_{\text{avg}}]_i)$, across all gesture classes $i$. When predicting online with unsegmented streams, we consider only clip-wise probabilities $s_t$.

We combine multiple modalities by averaging the class-conditional probabilities estimated by the modality-specific networks. During online operation, we average probabilities across modalities for the current clip only. As an alternative to the softmax layer, we additionally consider computing the final classification score with a support vector machine (SVM) [6] classifier operating on features $f_t$ or $h_t$ extracted by the R3DCNN. We average the features across video clips and normalize by their $\ell_2$-norms to form a single representation for the entire video.
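To make the clip-wise computation concrete, the following NumPy sketch implements the recurrence, truncated ReLU, softmax, and offline probability averaging described above. It is an illustration only, not the authors' Theano implementation; the 3D-CNN feature extractor extract_clip_features and the weight matrices are assumed to be provided.

import numpy as np

def truncated_relu(x, cap=4.0):
    # R(x) = min(max(0, x), 4), used to limit gradient explosion
    return np.minimum(np.maximum(x, 0.0), cap)

def softmax(z):
    # numerically stable softmax over class scores
    e = np.exp(z - z.max())
    return e / e.sum()

def r3dcnn_offline_predict(clips, extract_clip_features, W_in, W_h, W_s, b):
    """Average clip-wise class probabilities of a pre-segmented gesture video.

    clips: list of T clips, each of shape (k, l, c, m)
    extract_clip_features: assumed 3D-CNN mapping a clip to a q-dim feature f_t
    W_in: (d, q), W_h: (d, d), W_s: (w, d), b: (w,)
    """
    h = np.zeros(W_h.shape[0])                   # initial hidden state
    probs = []
    for clip in clips:
        f = extract_clip_features(clip)          # f_t = F(C_t)
        h = truncated_relu(W_in @ f + W_h @ h)   # h_t = R(W_in f_t + W_h h_{t-1})
        probs.append(softmax(W_s @ h + b))       # s_t = S(W_s h_t + b)
    s_avg = np.mean(probs, axis=0)               # offline: average over all clips
    return int(np.argmax(s_avg)), s_avg          # predicted class and probabilities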
3.2. Training

Let $X = \{V_0, V_1, ..., V_{P-1}\}$ be a mini-batch of training examples in the form of weakly-segmented gesture videos $V_i$.² Each video consists of $T$ clips, making $X$ a set of $N = T \cdot P$ clips. Class labels $y_i$ are drawn from the alphabet $A$ to form a vector of class labels $y$ with size $|y| = P$.

² Weakly-segmented videos contain the preparation, nucleus, and retraction phases and frames from the no gesture class.

Pre-training the 3D-CNN. We initialize the 3D-CNN with the C3D network [37] trained on the large-scale Sports-1M [13] human action recognition dataset. The network has 8 convolutional layers of 3×3×3 filters and 2 fully-connected layers trained on 16-frame clips. We append a softmax prediction layer to the last fully-connected layer and fine-tune by back-propagation with negative log-likelihood to predict gesture classes from individual clips $C_i$.

Training the full model. After fine-tuning the 3D-CNN, we train the entire R3DCNN with back-propagation-through-time (BPTT) [41]. BPTT is equivalent to unrolling the recurrent layers, transforming them into a multi-layer feed-forward network, applying standard gradient-based back-propagation, and averaging the gradients to consolidate updates to weights duplicated by unrolling.

We consider two training cost functions: negative log-likelihood for the entire video and connectionist temporal classification (CTC) for online sequences. The negative log-likelihood function for a mini-batch of videos is:

$$\mathcal{L}_v = -\frac{1}{P} \sum_{i=0}^{P-1} \log\big(p(y_i \mid V_i)\big),$$

where $p(y_i \mid V_i) = [s_{\text{avg}}]_{y_i}$ is the probability of gesture label $y_i$ given gesture video $V_i$ as predicted by the R3DCNN.

Connectionist temporal classification. CTC is a cost function designed for sequence prediction in unsegmented or weakly segmented input streams [9, 10]. CTC is applied in this work to identify and correctly label the nucleus of the gesture, while assigning the no gesture class to the remaining clips, addressing the alignment of class labels to particular clips in the video. In this work we consider only the CTC forward algorithm.

We extend the dictionary of existing gestures with a no gesture class: $A' = A \cup \{\text{no gesture}\}$. Consequently, the softmax layer outputs a class-conditional probability for this additional no gesture class. Instead of averaging predictions across clips in a pre-segmented gesture, the network computes the probability of observing a particular gesture (or no gesture) $k$ at time $t$ in an input sequence $X$: $p(k, t \mid X) = s_t^k \;\; \forall t \in [0, N)$.

We define a path $\pi$ as a possible mapping of the input sequence $X$ into a sequence of class labels $y$. The probability of observing path $\pi$ is $p(\pi \mid X) = \prod_t s_t^{\pi_t}$, where $\pi_t$ is the class label predicted at time $t$ in path $\pi$.

Paths are mapped into a sequence of event labels $y$ by an operator $\mathcal{B}$ as $y = \mathcal{B}(\pi)$, condensing repeated class labels and removing no gesture labels, e.g., $\mathcal{B}([-, 1, 2, -, -]) = \mathcal{B}([1, 1, -, 2, -]) = [1, 2]$, where 1, 2 are actual gesture classes and "−" is no gesture. Under $\mathcal{B}$, many paths $\pi$ result in the same event sequence $y$. The probability of observing a particular sequence $y$ given an input sequence $X$ is the sum of the conditional probabilities of all paths $\pi$ mapping to that sequence, $\mathcal{B}^{-1}(y) = \{\pi : \mathcal{B}(\pi) = y\}$:

$$p(y \mid X) = \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi \mid X).$$
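A minimal sketch of the collapsing operator B, assuming the no gesture class is encoded as label 0 (an index choice of ours, not fixed by the paper):

def ctc_collapse(path, no_gesture=0):
    """B: condense repeated labels, then remove no-gesture labels.

    path: per-clip class labels, e.g. [0, 1, 1, 0, 2]; returns the event sequence.
    """
    events = []
    prev = None
    for label in path:
        if label != prev and label != no_gesture:
            events.append(label)
        prev = label
    return events

# B([-,1,2,-,-]) == B([1,1,-,2,-]) == [1,2], with "-" encoded as 0
assert ctc_collapse([0, 1, 2, 0, 0]) == [1, 2]
assert ctc_collapse([1, 1, 0, 2, 0]) == [1, 2]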
Computation of $p(y \mid X)$ is simplified by dynamic programming. First, we create an assistant vector $\dot{y}$ by adding a no gesture label before and after each gesture clip in $y$, so that $\dot{y}$ contains $|\dot{y}| = P' = 2P + 1$ labels. Then, we compute a forward variable $\alpha \in \mathbb{R}^{N \times P'}$ where $\alpha_t(u)$ is the combined probability of all mappings of events up to clip $t$ and event $u$. The transition function for $\alpha$ is:

$$\alpha_t(u) = s_t^{\dot{y}_u} \big( \alpha_{t-1}(u) + \alpha_{t-1}(u-1) + \beta_{t-1}(u-2) \big),$$

where

$$\beta_t(u) = \begin{cases} \alpha_t(u), & \text{if } \dot{y}_{u+1} = \text{no gesture and } \dot{y}_u \neq \dot{y}_{u+2} \\ 0, & \text{otherwise} \end{cases}$$

and $\dot{y}_u$ denotes the class label of event $u$. The forward variable is initialized with $\alpha_0(0) = s_0^{\dot{y}_0}$, the probability of a path beginning with $\dot{y}_0 = \text{no gesture}$, and $\alpha_0(1) = s_0^{\dot{y}_1}$, the probability of a path starting with the first actual event $\dot{y}_1$. Since a valid path cannot begin with a later event, we initialize $\alpha_0(i) = 0 \;\; \forall i > 1$. At each time step $t > 0$, we consider paths in which the event $u$ is currently active (with probability $s_t^{\dot{y}_u}$) and (1) remains unchanged from the previous time $t-1$ ($\alpha_{t-1}(u)$), (2) changes from no gesture to the next actual gesture or vice versa ($\alpha_{t-1}(u-1)$), or (3) transitions from one actual gesture to the next while skipping no gesture if the two gestures have distinct labels ($\beta_{t-1}(u-2)$). Finally, any valid path $\pi$ must end at time $N-1$ with the last actual gesture $\dot{y}_{P'-1}$ or with no gesture $\dot{y}_{P'}$, hence $p(y \mid X) = \alpha_{N-1}(P'-1) + \alpha_{N-1}(P')$.

Using this computation for $p(y \mid X)$, the CTC loss is:

$$\mathcal{L}_{\text{CTC}} = -\ln\big(p(y \mid X)\big),$$

expressed in the log domain [9]. While CTC is used as a training cost function only, it affects the architecture of the network by adding the extra no gesture class label. For pre-segmented video classification, we simply remove the no gesture output and renormalize probabilities by the $\ell_1$-norm after modality fusion.
4209

Authorized licensed use limited to: Nat. Inst. of Elec. & Info. Tech (NIELIT). Downloaded on September 11,2022 at 16:19:08 UTC from IEEE Xplore. Restrictions apply.
Learning rule. To optimize the network parameters $\mathcal{W}$ with respect to either of the loss functions, we use stochastic gradient descent (SGD) with a momentum term $\mu = 0.9$. We update each parameter of the network $\theta \in \mathcal{W}$ at every back-propagation step $i$ by:

$$\theta_i = \theta_{i-1} + v_i - \gamma\lambda\theta_{i-1},$$
$$v_i = \mu v_{i-1} - \lambda\, \mathcal{J}\!\left(\left\langle \frac{\delta E}{\delta \theta} \right\rangle_{\text{batch}}\right),$$

where $\lambda$ is the learning rate, $\langle \frac{\delta E}{\delta \theta} \rangle_{\text{batch}}$ is the gradient value of the chosen cost function $E$ with respect to the parameter $\theta$ averaged over the mini-batch, and $\gamma$ is the weight decay parameter. To prevent gradient explosion in the recurrent layers during training, we apply a soft gradient clipping operator $\mathcal{J}(\cdot)$ [29] with a threshold of 10.
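A NumPy sketch of this update rule for a single parameter tensor. The soft clipping operator J is realized here as gradient-norm rescaling in the spirit of [29], which is our reading of it, and the default hyper-parameters are the values reported in this section.

import numpy as np

def clip_gradient(grad, threshold=10.0):
    # soft clipping: rescale the gradient if its norm exceeds the threshold
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

def sgd_momentum_step(theta, velocity, grad_batch,
                      lr=3e-4, momentum=0.9, weight_decay=0.005):
    """One update of a single parameter tensor theta.

    grad_batch: gradient of the cost averaged over the mini-batch
    """
    velocity = momentum * velocity - lr * clip_gradient(grad_batch)  # v_i
    theta = theta + velocity - weight_decay * lr * theta             # theta_i
    return theta, velocity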
Regularization. We apply a number of regularization techniques to reduce overfitting. We train with weight decay ($\gamma = 0.5\%$) on all weights in the network. We apply drop-out [11] to the fully-connected layers of the 3D-CNN at a rate of $p = 75\%$, rescaling the remaining activations by a factor of $1/(1 - p)$. Additionally, we find that dropping feature maps in the convolutional layers improves generalization in pre-trained networks. For this, we randomly set 10% of the feature maps of each convolutional layer to 0 and rescale the activations of the other neurons accordingly.
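A sketch of feature map drop-out for one convolutional activation volume; the (maps, depth, height, width) layout and the inverted-dropout rescaling are our assumptions:

import numpy as np

def drop_feature_maps(activations, drop_rate=0.10, rng=None):
    """Randomly zero whole feature maps and rescale the surviving ones.

    activations: array of shape (num_maps, depth, height, width)
    """
    rng = rng or np.random.default_rng()
    num_maps = activations.shape[0]
    keep = rng.random(num_maps) >= drop_rate             # True for surviving maps
    mask = keep.astype(activations.dtype) / (1.0 - drop_rate)
    return activations * mask[:, None, None, None]       # broadcast over the volume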

Implementation. We train our gesture classifier in Theano [2] with cuDNN3 on an NVIDIA DIGITS DevBox with four Titan X GPUs.

We fine-tune the 3D-CNN for 16 epochs with an initial learning rate of $\lambda = 3 \cdot 10^{-3}$, reduced by a factor of 10 after every 4 epochs. Next, we train the R3DCNN end-to-end for an additional 100 epochs with a constant learning rate of $\lambda = 3 \cdot 10^{-4}$. All network parameters without pre-trained initializations are randomly sampled from a zero-mean Gaussian distribution with standard deviation 0.01.

Each video of a weakly-segmented gesture is stored with 80 frames of 120×160 pixels. We train with frames of size 112×112 generated by random crops. Videos from the test set are evaluated with the central crop of each frame. To increase variability in the training examples, we apply the following data augmentation steps to each video in addition to cropping: random spatial rotation (±15°) and scaling (±20%), temporal scaling (±20%), and jittering (±3 frames). The parameters for each augmentation step are drawn from a uniform distribution with a specified range. Since recurrent connections can learn the specific order of gesture videos in the training set, we randomly permute the training gesture videos for each training epoch.
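A sketch of how one set of augmentation parameters could be sampled from the stated ranges; how each parameter is then applied to a video (interpolation, resampling) is left to the training pipeline and is not specified here.

import numpy as np

def sample_augmentation(rng=None):
    """Draw one set of augmentation parameters from uniform ranges."""
    rng = rng or np.random.default_rng()
    return {
        "rotation_deg":   rng.uniform(-15.0, 15.0),   # random spatial rotation
        "spatial_scale":  rng.uniform(0.8, 1.2),      # +/-20% spatial scaling
        "temporal_scale": rng.uniform(0.8, 1.2),      # +/-20% temporal scaling
        "jitter_frames":  int(rng.integers(-3, 4)),   # +/-3 frame temporal jitter
    }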
We use CNNs pre-trained on three-channel RGB images. To apply them to one-channel depth or IR images, we sum the convolutional kernels for the three channels of the first layer to obtain one kernel. Similarly, to employ the pre-trained CNN with two-channel inputs (e.g., optical flow), we remove the third channel of each kernel and rescale the first two by a factor of 1.5.
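A sketch of this first-layer kernel adaptation, assuming kernels stored as (output channels, input channels, time, height, width); the layout is our assumption:

import numpy as np

def adapt_first_layer(kernels, target_channels):
    """Adapt RGB-pretrained first-layer kernels to 1- or 2-channel inputs.

    kernels: array of shape (out_channels, 3, t, h, w) from the RGB model
    """
    if target_channels == 1:
        # depth / IR: sum the three RGB kernels into a single-channel kernel
        return kernels.sum(axis=1, keepdims=True)
    if target_channels == 2:
        # optical flow: drop the third channel and rescale the first two by 1.5
        return 1.5 * kernels[:, :2]
    raise ValueError("expected 1 or 2 target channels")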
For the 3D-CNN, we find that splitting a gesture into non-overlapping clips of $m = 8$ frames yields the best combination of classification accuracy, computational complexity and prediction latency. To work with clips of size $m = 8$ frames on the C3D network [37] (originally trained with $m = 16$ frames), we remove temporal pooling after the last convolutional layer. Since data transfer and inference on a single 8-frame clip takes less than 30 ms on an NVIDIA Titan X, we can predict at a faster rate than clips are accumulated.

4. Dataset

Recently, several public dynamic gesture datasets have been introduced [5, 18, 19, 27]. The datasets differ in the complexity of gestures, the number of subjects and gesture classes, and the types of sensors used for data collection. Among them, the ChaLearn dataset [5] provides the largest number of subjects and samples, but its 20 gesture classes, derived from the Italian sign language, are quite different from the set of gestures common for user interfaces. The VIVA challenge dataset [27] provides driver hand gestures performed by a small number of subjects (8) against a plain background and from a single viewpoint.

Given the limitations of existing datasets, to validate our proposed gesture recognition algorithm, we acquired a large dataset of 25 gesture types, each intended for human-computer interfaces and recorded by multiple sensors and viewpoints. We captured continuous data streams, containing a total of 1532 dynamic hand gestures, indoors in a car simulator with both bright and dim artificial lighting (Fig. 2). A total of 20 subjects participated in data collection, some with two recorded sessions and some with partial sessions. Subjects performed gestures with their right hand while observing the simulator's display and controlling the steering wheel with their left hand. An interface on the display prompted subjects to perform each gesture with an audio description and a 5 s sample video of the gesture. Gestures were prompted in random order with each type requested 3 times over the course of a full session.

Figure 2: Environment for data collection. (Top) Driving simulator with main monitor displaying simulated driving scenes and a user interface for prompting gestures, (A) a SoftKinetic depth camera (DS325) recording depth and RGB frames, and (B) a DUO 3D camera capturing stereo IR. Both sensors capture 320×240 pixels at 30 frames per second. (Bottom) Examples of each modality, from left: RGB, optical flow, depth, IR-left, and IR-disparity.

Gestures (Fig. 3) include moving either the hand or two fingers up, down, left or right; clicking with the index finger; beckoning; opening or shaking the hand; showing the index finger, or two or three fingers; pushing the hand up, down, out or in; rotating two fingers clockwise or counterclockwise; pushing two fingers forward; closing the hand twice; and showing "thumb up" or "OK".

We used the SoftKinetic DS325 sensor to acquire front-view color and depth videos and a top-mounted DUO 3D sensor to record a pair of stereo-IR streams. In addition, we computed dense optical flow [7] from the color stream and the IR disparity map from the IR-stereo pair [4]. We randomly split the data by subject into training (70%) and test (30%) sets, resulting in 1050 training and 482 test videos.

5. Results

We analyze the performance of R3DCNN for dynamic gesture recognition and early detection.

5.1. Offline Gesture Recognition

Modality fusion. We begin by evaluating our proposed R3DCNN classifier for a variety of input modalities contained in our dataset: color (front view), optical flow from color (front view), depth (front view), stereo IR (top view), and IR disparity (top view) (bottom row of Fig. 2). We train a separate network for each modality and, when fusing modalities, average their class-conditional probability vectors.³ Table 1 contains the accuracy for various combinations of sensor modalities. Observe that fusing any pair of sensors improves individual results. In addition, combining different modalities of the same sensor (e.g., color and optical flow) also improves the accuracy. The best gesture recognition accuracy (83.8%) is observed for the combination of all modalities.

³ Attempts to learn a parameterized fusion layer resulted in overfitting.

Table 1: Comparison of modalities and their combinations.

  Sensor         Accuracy
  Depth          80.3%
  Optical flow   77.8%
  Color          74.1%
  IR image       63.5%
  IR disparity   57.8%

  Fusion accuracies for the evaluated modality combinations: 66.2%, 79.3%, 81.5%, 82.0%, 82.0%, 82.4%, 82.6%, 83.2%, 83.4%, and 83.8% (all modalities combined).
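Late fusion therefore amounts to an element-wise average of the modality-specific probability vectors; a minimal sketch, with the no gesture handling of Sec. 3.2 included as an option (index 0 for no gesture is our convention):

import numpy as np

def fuse_modalities(prob_vectors, drop_no_gesture=False, no_gesture=0):
    """Average class-conditional probabilities from modality-specific networks.

    prob_vectors: list of per-modality probability vectors for the same clip or video
    """
    fused = np.mean(prob_vectors, axis=0)
    if drop_no_gesture:
        fused = np.delete(fused, no_gesture)   # remove the no-gesture output
        fused = fused / fused.sum()            # renormalize by the L1 norm
    return fused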
Comparisons. We compare our approach to state-of-the-art methods: HOG+HOG² [27], improved dense trajectories (iDT) [39], super normal vector (SNV) [42], two-stream CNNs [34], and convolutional 3D (C3D) [37], as well as human labeling accuracy.

To compute the HOG+HOG² [27] descriptors, we resample the videos to 32 frames and tune the parameters of the SVM classifier via grid search to maximize accuracy. For iDT [39], we densely sample and track interest points at multiple spatial scales, and compute HOG, histogram of optical flow (HOF), and motion boundary histogram (MBH) descriptors from each track. We employ Fisher vectors (FV) [30] to aggregate each type of iDT descriptor using 128 Gaussian components, as well as a spatio-temporal pyramid [17] of FV to encode the space and time information.


Figure 3: Twenty-five dynamic hand gesture classes. Some gestures were adopted from existing commercial systems [1] or
popular datasets [23, 27]. Each column shows a different gesture class (0−24). The top and bottom rows show the starting
and ending depth frames, respectively, of the nucleus phase for each class. (Note that we did not crop the start and end
frames in the actual training and evaluation data.) Yellow arrows indicate the motion of each hand gesture. (A more detailed
description of each gesture is available in the supplementary video.)

Among the CNN-based methods, we compare against the two-stream network for video classification [34], which utilizes the pre-trained VGG-Net [35]. We fine-tune its spatial stream with the color modality and the temporal stream with optical flow, each from our gesture dataset. We also compare against the C3D [37] method, which is trained with the Sports-1M [13] dataset and fine-tuned with the color or depth modalities of our dataset.

Lastly, we evaluate human performance by asking six subjects to label each of the 482 gesture videos in the test set after viewing the corresponding front-view SoftKinetic color video. Prior to the experiment, each subject familiarized themselves with all 25 gesture types. Gestures were presented in random order to each subject for labelling. To be consistent with machine classifiers, human subjects viewed each gesture video only once, but were not restricted in the time allowed to decide each label.

Table 2: Comparison of our method to the state-of-the-art methods and human predictions with various modalities.

  Method                     Modality            Accuracy
  HOG+HOG² [27]              color               24.5%
  Spatial stream CNN [34]    color               54.6%
  iDT-HOG [39]               color               59.1%
  C3D [37]                   color               69.3%
  Ours                       color               74.1%
  HOG+HOG² [27]              depth               36.3%
  SNV [42]                   depth               70.7%
  C3D [37]                   depth               78.8%
  Ours                       depth               80.3%
  iDT-HOF [39]               opt. flow           61.8%
  Temporal stream CNN [34]   opt. flow           68.0%
  iDT-MBH [39]               opt. flow           76.8%
  Ours                       opt. flow           77.8%
  HOG+HOG² [27]              color + depth       36.9%
  Two-stream CNNs [34]       color + opt. flow   65.6%
  iDT [39]                   color + opt. flow   73.4%
  Ours                       all                 83.8%
  Human                      color               88.4%

The results of these comparisons are shown in Table 2. Among the individual modalities, the best results are achieved by depth, followed by optical flow and color. This could be because the depth sensor is more robust to indoor lighting change and more easily precludes the noisy background scene, relative to the color sensor. Optical flow explicitly encapsulates motion, which is important to recognize dynamic gestures. Unlike in action classification [34], the accuracy of the two-stream network for gesture recognition is not improved by combining the spatial and temporal streams. We conjecture that videos for action classification can be associated with certain static objects or scenes, e.g., sports or ceremonies, which is not the case for dynamic hand gestures. Although C3D captures both shape and motion cues in each clip, the temporal relationship between clips is not considered. Our approach achieves the best performance in each individual modality and significantly outperforms other methods with combined modalities, while still falling below human accuracy (88.4%).

Table 3: Comparison of 2D-CNN and 3D-CNN trained with different architectures on depth or color data. (CTC* denotes training without drop-out of feature maps.)

            Color                 Depth
            2D-CNN    3D-CNN      2D-CNN    3D-CNN
  No RNN    55.6%     67.2%       68.1%     73.3%
  RNN       57.9%     72.0%       64.7%     79.5%
  CTC       65.6%     74.1%       69.1%     80.3%
  CTC*      59.5%     66.5%       67.0%     75.6%

Design choices. We analyze the individual components of our proposed R3DCNN algorithm (Table 3). First, to understand the utility of the 3D-CNN we substitute it with a 2D-CNN initialized with the pre-trained 16-layer VGG-Net [35] and train similarly to the 3D-CNN. We also assess the importance of the recurrent network, the CTC cost function and feature map drop-out. Classification accuracies for these experiments are listed in Table 3.

When the recurrent network is absent, i.e., "No RNN" in Table 3, the CNN is directly connected to the softmax layer, and the network is trained with a negative log-likelihood cost function. When a recurrent layer with d = 256 hidden neurons is present, we train using the negative log-likelihood and CTC cost functions, denoted "RNN" and "CTC" in Table 3, respectively. We observe consistently superior performance with the 3D-CNN versus the 2D-CNN for all sensor types and network configurations. This suggests that local motion information extracted by the spatio-temporal kernels of the 3D-CNN is important for dynamic hand gesture recognition. Notice also that adding global temporal modeling via the RNN into the classifier generally improves accuracy, and the best accuracy for all sensors is obtained with the CTC cost function, regardless of the type of CNN employed.

Finally, we evaluate the effect of feature map drop-out, which involves randomly setting entire maps to zero while training. This technique has been shown to provide little or no improvement when training a CNN from scratch [11]. However, when a network pre-trained on a larger dataset with more classes is fine-tuned for a smaller domain with fewer training examples and classes, not all of the original feature maps are likely to exhibit strong activations for the new inputs. This can lead to overfitting during fine-tuning. The accuracies of the various classifier architectures, trained with and without feature map drop-out, are denoted by "CTC" and "CTC*" in Table 3, respectively. They show improved accuracy for all modalities and networks with feature map drop-out, with a greater positive effect for the 3D-CNN.

Table 4: Accuracy of a linear SVM (C = 1) trained on features extracted from different networks and layers (final fully-connected layer fc and recurrent layer rnn).

  Modality   Clip-wise C3D [37] (fc)   R3DCNN (fc)   R3DCNN (rnn)
  Color      69.3%                     73.0%         74.1%
  Depth      78.8%                     79.9%         80.1%

Recurrent layer as a regularizer for feature extractors. Tran et al. [37] perform video classification with a linear SVM classifier learned on features extracted from the fully connected layers of the C3D network. Features for each individual clip are averaged to form a single representation for the entire video. In Table 4, we compare the performance of the features extracted from the C3D network fine-tuned on gesture clips with the features from R3DCNN trained with CTC on entire gesture videos. Features extracted from each clip are normalized by the ℓ2-norm. Since R3DCNN connects a C3D architecture to a recurrent network, fc layer features in both networks are computed by the same architecture, each with weights fine-tuned for gesture recognition. However, we observe (columns 1-2, Table 4) that following fc by a recurrent layer and training on full videos (R3DCNN) improves the accuracy of the extracted features. A plausible explanation is that the recurrent layer helps the preceding convolutional network to learn more general features. Moreover, features from the recurrent layer, when coupled with an SVM classifier, demonstrate a further improvement in performance (column 3, Table 4).
5.2. Early Detection in Online Operation

We now analyze the performance of our method for online gesture detection, including early detection, when applied to unsegmented input streams and trained with the CTC cost function. R3DCNN receives input video streams sequentially as 8-frame clips and outputs class-conditional probabilities after processing each clip. Generally the nucleus of the gesture spans multiple clips, potentially enabling gesture classification before processing all clips.

Online operation. Fig. 4 shows ground truth labels and network predictions during continuous online operation on a video sequence collected outside of our previously described dataset. The ground truth in the top row shows the hand-labeled nucleus phase of each gesture. In most cases, both networks, R3DCNN trained with negative log-likelihood ("RNN") and with CTC ("CTC"), predict the correct class before the gesture ends. However, the network trained with CTC produces significantly fewer false positives. The two networks also behave differently when the same gesture is performed sequentially, e.g., observe that three instances of the same gesture occur at 13−17 s and 27−31 s. The CTC network yields an individual peak for each repetition, whereas the RNN merges them into a single activation.

Detection. To detect the presence of any one of the 25 gestures relative to no gesture, we compare the highest current class-conditional probability output by R3DCNN to a threshold τ ∈ [0, 1]. When the detection threshold is exceeded, a classification label is assigned to the most probable class. We evaluate R3DCNN trained with and without CTC on the test set with hand-annotated gesture nuclei. We compute the area under the curve (AUC) [12] of the receiver operating characteristic (ROC) curve. The ROC plots the true positive detection rate (TPR), i.e., the network fires during the nucleus of a gesture, versus the false positive rate (FPR), i.e., the network fires outside of the nucleus, for a range of threshold values. With the depth modality, CTC results in a better AUC (0.91) versus without CTC (0.69) due to fewer false positives. With modality fusion the AUC increases to 0.93.
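A sketch of this clip-wise detection rule; τ = 0.3 is one operating point used in the paper, and index 0 for the no gesture class is our convention:

import numpy as np

def detect_gesture(clip_probs_per_modality, tau=0.3, no_gesture=0):
    """Online detection for the current clip.

    clip_probs_per_modality: list of per-modality probability vectors s_t
    Returns the detected class index, or None if no gesture is detected.
    """
    fused = np.mean(clip_probs_per_modality, axis=0)   # average across modalities
    gesture_probs = fused.copy()
    gesture_probs[no_gesture] = 0.0                    # consider gesture classes only
    best = int(np.argmax(gesture_probs))
    return best if gesture_probs[best] >= tau else None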

Figure 4: A comparison of the gesture recognition performance of R3DCNN trained with (middle) and without (bottom)
CTC. (The no gesture class is not shown for CTC.) The various colors and line types indicate different gesture classes.
We also compute the normalized time to detect (NTtD) [12] at a detection threshold (τ = 0.3) with a TPR = 88% and FPR = 15%. The distribution of the NTtD values for various gesture types is shown in Fig. 5. The average NTtD across all classes is 0.56. In general, static gestures require the largest portion of the nucleus to be seen before classification, while dynamic gestures are classified on average within 70% of their completion. Intuitively, the meaning of a static gesture is clear only when the hand is in the final position.

Figure 5: NTtD and gesture length for different classes. Static gestures are marked by red bars.
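As we understand the metric from [12], NTtD measures the fraction of the gesture nucleus that has elapsed when the detector first fires; a sketch for a single gesture instance, with frame-level indexing as our assumption:

def normalized_time_to_detect(fire_frame, nucleus_start, nucleus_end):
    """Fraction of the nucleus observed when the detector first fires.

    fire_frame: frame index at which the detection threshold is first exceeded
    nucleus_start, nucleus_end: hand-annotated nucleus boundaries (frames)
    Returns a value in [0, 1]; e.g. 0.56 means detection after 56% of the nucleus.
    """
    length = nucleus_end - nucleus_start
    elapsed = min(max(fire_frame - nucleus_start, 0), length)
    return elapsed / length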
5.3. Evaluation on Previously Published Datasets

Finally, we evaluate our method on two benchmark datasets: SKIG [18] and ChaLearn 2014 [5]. SKIG contains 1080 RGBD hand gesture sequences by 6 subjects collected with a Kinect sensor. There are 10 gesture categories, each performed with 3 hand postures, 3 backgrounds, and 2 illumination conditions. Table 5 shows classification accuracies, including the state-of-the-art result established by the MRNN method [26]. Our method outperforms existing methods both with and without the optical flow modality.

Table 5: Results for the SKIG RGBD gesture dataset.

  Method             Modality                       Accuracy
  Liu & Shao [18]    color + depth                  88.7%
  Tung & Ngoc [38]   color + depth                  96.5%
  Ours               color + depth                  97.7%
  MRNN [26]          color + depth + optical flow   97.8%
  Ours               color + depth + optical flow   98.6%

The ChaLearn 2014 dataset contains more than 13K RGBD videos of 20 upper-body Italian sign language gestures performed by 20 subjects. A comparison of results is presented in Table 6, including Pigou et al. [31] with state-of-the-art classification accuracy of 97.2% and Jaccard score 0.91. On the classification task, our method (with color, depth and optical flow modalities) outperforms this method with an accuracy of 98.2%. For early detection on the color modality, with threshold τ = 0.3 (AUC = 0.98), we observed: TPR = 94.8%, FPR = 0.8%, and NTtD = 0.41, meaning our method is able to classify gestures within 41% of completion, neglecting inference time.

Table 6: Results on the ChaLearn 2014 dataset. Accuracy is reported on pre-segmented videos. (* The ideal Jaccard score is computed using ground truth localizations, i.e., the class prediction is propagated for the ground truth gesture duration, representing an upper bound on the Jaccard score.)

  Method                 Modality                       Accuracy   Jaccard
  Neverova et al. [24]   color + depth + skeleton       -          0.85
  Pigou et al. [31]      color + depth + skeleton       97.2%      0.91
  Ours, CTC              color                          97.4%      0.97*
  Ours, CTC              depth                          93.6%      0.92*
  Ours, CTC              optical flow                   95.0%      0.94*
  Ours, RNN              color + depth                  96.6%      0.96*
  Ours, CTC              color + depth                  97.5%      0.97*
  Ours, RNN              color + depth + optical flow   97.4%      0.97*
  Ours, CTC              color + depth + optical flow   98.2%      0.98*

6. Conclusion

In this paper, we proposed a novel recurrent 3D convolutional neural network classifier for dynamic gesture recognition. It supports online gesture classification with zero or negative lag, effective modality fusion, and training with weakly segmented videos. These improvements over the state-of-the-art are demonstrated on a new dataset of dynamic hand gestures and on other benchmarks.

References

[1] Bayerische Motoren Werke AG. Gesture control interface in BMW 7 Series. https://www.bmw.com/.
[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference, 2010.
[3] S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In ACM CHI, pages 181–186, 1991.
[4] Code Laboratories Inc. Duo3D SDK. https://duo3d.com/.
[5] S. Escalera, X. Baró, J. Gonzalez, M. A. Bautista, M. Madadi, M. Reyes, V. Ponce-López, H. J. Escalante, J. Shotton, and I. Guyon. ChaLearn Looking at People Challenge 2014: dataset and results. In ECCVW, 2014.
[6] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: a library for large linear classification. Journal of Mach. Learn. Research, 2008.
[7] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scand. Conf. on Im. Anal., 2003.
[8] D. M. Gavrila. The visual analysis of human movement: a survey. CVIU, 73(1):82–98, 1999.
[9] A. Graves. Supervised sequence labelling with recurrent neural networks. Springer, 2012.
[10] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376, 2006.
[11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[12] M. Hoai and F. De la Torre. Max-margin early event detectors. In CVPR, 2012.
[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[14] A. Kendon. Current issues in the study of gesture. In The biological foundations of gestures: motor and semiotic aspects, pages 23–47. Lawrence Erlbaum Associates, 1986.
[15] K. Kim, D. Lee, and I. Essa. Gaussian process regression flow for analysis of motion trajectories. In ICCV, 2011.
[16] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
[17] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[18] L. Liu and L. Shao. Learning discriminative representations from RGB-D video data. In IJCAI, 2013.
[19] G. Marin, F. Dominio, and P. Zanuttigh. Hand gesture recognition with leap motion and kinect devices. In ICIP, 2014.
[20] R. B. Miller. Response time in man-computer conversational transactions. AFIPS, 1968.
[21] S. Mitra and T. Acharya. Gesture recognition: a survey. In IEEE Systems, Man, and Cybernetics, 2007.
[22] P. Molchanov, S. Gupta, K. Kim, and J. Kautz. Hand gesture recognition with 3D convolutional neural networks. In CVPRW, 2015.
[23] P. Molchanov, S. Gupta, K. Kim, and K. Pulli. Multi-sensor system for driver's hand-gesture recognition. In IEEE Automatic Face and Gesture Recognition, 2015.
[24] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. ModDrop: adaptive multi-modal gesture recognition. arXiv, 2014.
[25] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. Multi-scale deep learning for gesture detection and localization. In ECCVW, 2014.
[26] N. Nishida and H. Nakayama. Multimodal gesture recognition using multistream recurrent neural network. In PSIVT, pages 1–14, 2015.
[27] E. Ohn-Bar and M. Trivedi. Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE ITS, 15(6):1–10, 2014.
[28] M. Osadchy, Y. L. Cun, and M. L. Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Mach. Learn. Research, 8:1197–1215, 2007.
[29] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[30] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[31] L. Pigou, A. v. d. Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre. Beyond temporal pooling: recurrence and temporal convolutions for gesture recognition in video. arXiv, 2015.
[32] M. Ryoo. Human activity prediction: early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[33] X. Shen, G. Hua, L. Williams, and Y. Wu. Dynamic hand gesture recognition: an exemplar based approach from motion divergence fields. Image and Vis. Computing, 2012.
[34] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition. In NIPS, 2014.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale visual recognition. In ICLR, 2015.
[36] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012.
[37] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[38] P. T. Tung and L. Q. Ngoc. Elliptical density shape model for hand gesture recognition. In ACM SoICT, 2014.
[39] H. Wang, D. Oneata, J. Verbeek, and C. Schmid. A robust and efficient video representation for action recognition. IJCV, 2015.
[40] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3D action recognition with random occupancy patterns. In ECCV, 2012.
[41] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[42] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In CVPR, 2014.
