CNN and Stacked LSTM Model
for Indian Sign Language Recognition
C. Aparna(B) and M. Geetha(B)
Computer Science and Engineering, Amrita Vishwa Vidyapeetham,
Amritapuri, Kollam 690525, India
aparna1994c@gmail.com, geetham@am.amrita.edu
Abstract. In this paper, we propose a deep learning approach to sign language
recognition using a convolutional neural network (CNN) and long short-term
memory (LSTM). The architecture uses a pretrained CNN for feature extraction,
and the extracted features are passed to an LSTM that captures spatio-temporal
information. A second LSTM is stacked on top to increase the accuracy. Few deep
learning models capture temporal information, and only a small number of papers
deal with sign language recognition using deep learning architectures such as
CNN and LSTM. The algorithm was tested on an Indian sign language (ISL) dataset,
and we present the performance evaluation on this dataset. The literature shows
that capturing temporal information with deep learning models is still an open
research problem.
Keywords: Indian sign language recognition · Convolutional Neural
Networks (CNN) · Long Short Term Memory (LSTM) · Deep learning
1 Introduction
Sign language is the most effective way for the deaf to communicate with the
hearing majority. It is expressed through various hand shapes, body movements,
and facial expressions, and there are different signs associated with each
movement. Sign language includes static, dynamic and continuous signs: static
signs are used for alphabets, dynamic signs for isolated words, and continuous
signs for sentences. Sign language recognition is a research area that involves
pattern recognition, computer vision, and natural language processing (NLP).
There are a few existing deep learning models that address sign language recog-
nition [2,3,9–12]. We propose a deep learning solution for Indian sign language
recognition. Since the Indian sign language dictionary was standardized only
recently, not much work has been done in the area of Indian sign language
recognition. We propose a deep learning architecture that uses a CNN-based LSTM
for Indian sign language recognition. When a deaf person tries to communicate
with a hearing person who does not know sign language, a framework that
translates sign language to speech is helpful.
Fig. 1. Examples of isolated words
2 Related Work
In the work by Oscar Koller et al., a hybrid CNN-HMM model combines the strong
discriminative abilities of CNNs with the sequence-modelling capabilities of
HMMs by interpreting the outputs of the CNN in a Bayesian fashion. They were the
first to apply a hybrid CNN-HMM model to sign language and gesture recognition,
and on three challenging benchmark datasets they obtained an improvement of over
15% over the state of the art [1]. In the work by Runpeng et al. [2], a
real-world continuous sign language recognition system was developed using deep
neural networks: a CNN with temporal convolution and pooling extracts a
spatio-temporal representation from the video, and an RNN with LSTM (long
short-term memory) models the feature sequence. Geetha et al. proposed a method
that approximates the gesture image boundary with a polygon and recognizes open
and closed finger gestures very efficiently [3]. Brandon Garcia et al. developed
a CNN-based fingerspelling translator for ASL; their model correctly classifies
the letters a-e and a-k [4]. In the work by Eleni et al. [5], gesture
recognition involves segmentation and feature extraction with frame-level
classification, and both a CNN and a CNN-LSTM combination are trained and
tested. Ce Li et al. proposed hand gesture recognition for mobile devices using
discriminant learning: they incorporate a Fisher criterion into BiLSTM and BiGRU
networks, termed F-BiLSTM and F-BiGRU, to improve the traditional softmax loss
and thereby the mobile gesture recognition performance. They recognize arm
movements using a particular wearable device, and their main contributions are
extensive experiments showing superior performance of the proposed method
compared to the state-of-the-art BiLSTM and BiGRU on three motion recognition
databases, together with a large hand gesture database for mobile hand gesture
recognition [6]. Juan C. Núñez et al. used supervised learning with a CNN-LSTM
combination for training and addressed the problem of overfitting [7]. Our
method differs in that we use the CNN only for feature extraction and the LSTM
for recognition.
3 System Architecture
We propose a new deep learning architecture for sign language recognition using
a CNN and LSTMs. In the proposed method, the CNN (convolutional neural network)
performs feature extraction and the LSTM (long short-term memory) performs the
prediction. The input video is split into sequences of 60 frames; the batch size
used is 8 and the number of epochs is 300. For testing, 6 words were used, each
with 10 videos. Figure 2 shows the CNN-LSTM architecture of the proposed method.
[Block diagram: training videos → CNN feature extraction → training with stacked LSTMs → learned model; test videos → CNN feature extraction → testing with stacked LSTMs → output]
Fig. 2. CNN-LSTM architecture of proposed method
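The preprocessing step described above (splitting each input video into a sequence of 60 frames) could look roughly like the sketch below. It assumes OpenCV and evenly spaced frame sampling; the function name extract_frames and the 299 x 299 frame size (the Inception V3 input size) are our own choices, not details given in the paper.

```python
# Sketch of the preprocessing step: sample a fixed-length sequence of
# 60 frames from a sign video (assumes OpenCV; names are illustrative).
import cv2
import numpy as np

def extract_frames(video_path, seq_length=60, size=(299, 299)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    # Sample seq_length frames evenly spaced over the whole video.
    idx = np.linspace(0, len(frames) - 1, seq_length).astype(int)
    return np.array([frames[i] for i in idx])

# Example: sequence = extract_frames("app_01.mp4")  # shape (60, 299, 299, 3)
```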
3.1 Feature Extraction
A convolutional neural network (CNN or ConvNet) is a multilayer neural network
consisting of layers that first recognize features such as edges and then
recombine these features. The neurons of the network have weights and biases;
each neuron receives some inputs, performs a dot product, and follows it with a
non-linearity. A ConvNet takes images as input, which allows certain properties
to be encoded into the architecture. The main difference is that a hidden-layer
neuron is connected only to a subset of the neurons in the previous layer [5];
because of this connectivity it is capable of learning features directly. A
ConvNet has convolution, pooling, fully connected and loss layers. The
convolution layer performs the main computations, the pooling layer reduces the
spatial size, and in the fully connected layer the neurons are fully connected
to the neurons of the previous layer. The last layer is the loss layer, which
computes the error.
Here the CNN is used only for feature extraction; we do not use the fully
connected layer. The CNN used is a pretrained Inception V3 network, developed by
Google for image classification. During feature extraction, for each video the
model uses a sequence length of 60 and a class limit of 6; the batch size is 8
and the number of epochs is 300. First, every frame of the video is run through
the Inception net and the output is taken from the final pooling layer, giving a
2048-dimensional feature vector per frame that is passed on to the LSTM. These
extracted features are converted to sequences of 60 frames; the 60 frames are
taken together, the output is saved, and the LSTM models are trained on it. For
the LSTM stage we used a 4096-wide LSTM, a 1024-unit dense layer, and some
dropout in between. After training, the model is given the test data, and the
output obtained is the correctly classified isolated word.
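A minimal sketch of this feature-extraction stage, assuming the Keras Applications implementation of Inception V3; the 2048-dimensional vector per frame is taken from the final (global average) pooling layer, as described above. Variable and function names are illustrative.

```python
# Sketch: per-frame 2048-d features from a pretrained Inception V3
# (Keras Applications), taken from the final pooling layer.
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# include_top=False with pooling='avg' yields the 2048-d pooled output.
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def video_to_sequence(frames):
    """frames: array of shape (60, 299, 299, 3) -> features of shape (60, 2048)."""
    x = preprocess_input(frames.astype("float32"))
    return extractor.predict(x, verbose=0)
```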
3.2 Classification
LSTMs are widely used in natural language processing and are based on the
concept of memory cells. A memory cell can retain its value for a long or a
short time as a function of its inputs, which allows the cell to remember what
is important. Since RNNs suffer from the vanishing gradient problem and struggle
with long-term dependencies, we used an LSTM instead of a plain RNN for
classification; LSTMs deal with this problem and allow higher accuracy on longer
sequences of data. For this we used a single LSTM layer that is 4096 wide, a
1024-unit dense layer, and some dropout in between; the dropout rate used is
0.5. With dropout, randomly selected neurons are ignored during training. The
model used is a sequential model, and the activation functions used are softmax
and ReLU: softmax is used for the final multi-class classification, whereas ReLU
improves the network by speeding up training. One more LSTM is stacked with the
first LSTM to obtain higher accuracy. After feature extraction, the features are
given to the LSTM for training, and the output obtained from training and
validation is given to testing, where the words are classified. Figure 3 shows
the architecture diagram of the stacked LSTM. Here the LSTM is used for temporal
feature extraction. An LSTM has three gates: the input gate, the forget gate and
the output gate. The input gate decides what information should flow in, the
forget gate decides what should be kept in memory and what should be forgotten,
and the output gate controls the flow of the output. The cell uses sigmoid and
tanh activation functions. Sigmoid functions, whose curve is S-shaped, are used
on the gates because their values lie between 0 and 1; tanh, the hyperbolic
tangent activation function, is similar to the logistic sigmoid but has a range
from −1 to 1.
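The stacked LSTM classifier described in this subsection could be sketched as follows. The sizes the paper states (a 4096-wide LSTM, a 1024-unit dense layer, dropout of 0.5, six output classes, softmax and ReLU activations, Adam optimizer) are used directly; the width of the second, stacked LSTM is an assumption.

```python
# Sketch of the stacked-LSTM classifier (Keras Sequential API).
# Sizes follow the paper where stated (4096-wide LSTM, 1024-unit dense,
# dropout 0.5, 6 output classes); the width of the second LSTM is assumed.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_model(seq_length=60, feat_dim=2048, num_classes=6):
    model = Sequential([
        LSTM(4096, return_sequences=True, input_shape=(seq_length, feat_dim)),
        Dropout(0.5),
        LSTM(4096),                      # second, stacked LSTM (width assumed)
        Dropout(0.5),
        Dense(1024, activation="relu"),
        Dropout(0.5),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```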
[Block diagram: input → LSTM (ReLU, softmax, Adam optimizer) → LSTM (ReLU, softmax, Adam optimizer) → output]
Fig. 3. Architecture diagram of stacked LSTM
4 Results and Discussion
4.1 Dataset Used
We tested our algorithm on an Indian sign language dataset. The videos were
recorded with a Nikon camera at 25 fps. The ISL dataset contains 6 words, each
with 20 videos for training, and a separate set of 10 videos per word for
testing. The six words used are app, bull, early morning (EM), energy, crocodile
and kitchen. Different variations of these videos are used in order to avoid
issues with complex backgrounds. The dataset of 6 words with 20 videos each was
given as the training input and was randomly split into 70% and 30% for train
and test. The Adam optimizer is used. The Keras framework was used since it is
one of the fastest-growing deep learning frameworks; this open-source library,
written in Python, runs on top of TensorFlow. Training took two days on an
Nvidia GeForce 940MX GPU. After training, the accuracy obtained on the training
set was 94%. A challenge faced by our model was to eliminate issues of
hand segmentation in complex backgrounds. Figure 4 shows the word "app" from the
dataset. In the experiments, the signers wore dark clothes with long sleeves and
stood before a dark curtain under normal lighting. Around 156 videos were taken
in total. The videos first go through a preprocessing step in which each video
is divided into frames (sequences).

Fig. 4. Dataset which shows the word "app"
The frames are then passed to the CNN, where feature extraction is done and a
2048-dimensional vector is obtained per frame. These features are given to the
LSTM for training and testing. A stacked LSTM is used, that is, two LSTMs are
stacked so that a higher accuracy is obtained; the training accuracy obtained
was 94%. In general, existing systems do not deal with temporal feature
extraction from video for isolated sign language words using a CNN-based LSTM.
The challenging factors in sign language are the manual parameters such as hand
shape, movement and orientation, differentiating the skin from the background
colour, the large vocabulary, and temporal feature extraction from video.
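Training as described (Keras on top of TensorFlow, Adam optimizer, batch size 8, 300 epochs, random 70/30 split) might look like the sketch below; the arrays X and y are placeholders standing in for the extracted feature sequences and one-hot word labels, and build_model refers to the earlier classifier sketch.

```python
# Sketch of the training setup described above: Adam optimizer,
# batch size 8, 300 epochs, random 70/30 split.
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the extracted feature sequences
# (num_videos, 60, 2048) and the one-hot word labels (num_videos, 6).
X = np.random.rand(120, 60, 2048).astype("float32")
y = np.eye(6)[np.random.randint(0, 6, size=120)]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30)

model = build_model()  # stacked-LSTM model from the earlier sketch
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=8, epochs=300)
```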
4.2 Performance Evaluation
Comparison of words with other words: Table 1 and Fig. 5 show the prediction
probabilities of the "app" test videos against the other words bull, kitchen,
energy, early morning and crocodile.
Table 1. Prediction probability of “app” with others
Kitchen Bull Energy Early morning Crocodile APP
App 1 0.01 0.016 6.63 × 10⁻³ 4.21 × 10⁻³ 6.94 × 10⁻⁴ 0.99997
App 2 7.2 × 10⁻³ 7.11 × 10⁻³ 3.59 × 10⁻³ 7.77 × 10⁻³ 6.54 × 10⁻⁴ 0.999
App 3 8.55 × 10⁻³ 0.024 5.77 × 10⁻³ 0.02 7.61 × 10⁻⁴ 0.999
App 4 0.015 5.55 × 10⁻³ 8.09 × 10⁻³ 5.17 × 10⁻³ 4.86 × 10⁻⁴ 0.999
App 5 0.011 4.04 × 10⁻³ 3.47 × 10⁻³ 0.02 3.85 × 10⁻⁴ 0.99
App 6 0.011 5.94 × 10⁻³ 8.95 × 10⁻³ 3.59 × 10⁻³ 5.56 × 10⁻⁴ 0.99
App 7 9.23 × 10⁻³ 5.05 × 10⁻³ 4.43 × 10⁻³ 0.01 7.11 × 10⁻⁴ 0.99
App 8 9.5 × 10⁻³ 4.63 × 10⁻³ 7.76 × 10⁻³ 6.4 × 10⁻³ 4.15 × 10⁻⁴ 0.99
App 9 0.011 0.011 8.53 × 10⁻³ 2.75 × 10⁻³ 6.5 × 10⁻⁴ 0.99
App 10 0.011 3.61 × 10⁻³ 6.19 × 10⁻³ 5.09 × 10⁻³ 3.45 × 10⁻⁴ 0.99998
Fig. 5. Prediction probability of “app” with others
Table 2. Prediction probability of “Bull” with others
App Kitchen Early morning Energy Crocodile Bull
Bull 1 0.01 0.02 0.02 0.01 0.038 0.69
Bull 2 0.9 0.02 0.0003 0.04 0.012 0.89
Bull 3 0.02 0.0032 0.01 0.011 0.014 0.66
Bull 4 0.99 0.011 0.08 0.014 0.016 0.02
Bull 5 0.9 0.01 0.03 0.06 0.02 0.0001
Bull 6 0.04 0.01 0.01 0.04 0.06 0.55
Bull 7 0.9 0.06 0.01 0.04 0.01 0.01
Bull 8 0.9 0.06 0.08 0.01 0.01 0.07
Bull 9 0.9 0.02 0.02 0.06 0.01 0.01
Bull 10 0.9 0.01 0.0003 0.02 0.01 0.72
Table 3. Prediction probability of “Early morning” with others
App Kitchen Crocodile Energy Bull Early morning
Early morning 1 0.9 0.0001 0.09 0.01 0.06 0.001
Early morning 2 0.8 0.0002 0.08 0.0001 0.06 0.01
Early morning 3 0.9 0.0002 0.04 0.0002 0.02 0.1
Early morning 4 0.01 0.03 0.01 0.06 0.02 0.0002
Early morning 5 0.9 0.03 0.01 0.06 0.06 0.0005
Early morning 6 0.9 0.06 0.03 0.01 0.06 0.001
Early morning 7 0.04 0.04 0.07 0.02 0.06 0.03
Early morning 8 0.9 0.05 0.04 0.02 0.06 0.0041
Early morning 9 0.9 0.0001 0.08 0.06 0.01 0.014
Early morning 10 0.9 0.06 0.08 0.02 0.06 0.01
Table 4. Prediction probability of “Kitchen” with others
App Bull Early morning Energy Crocodile Kitchen
Kitchen 1 0.9 0.02 0.06 0.01 0.01 0.02
Kitchen 2 0.9 0.02 0.04 0.02 0.03 0.06
Kitchen 3 0.02 0.02 0.04 0.07 0.06 0.02
Kitchen 4 0.9 0.02 0.09 0.02 0.03 0.01
Kitchen 5 0.9 0.02 0.07 0.01 0.02 0.06
Kitchen 6 0.9 0.02 0.06 0.06 0.06 0.01
Kitchen 7 0.9 0.02 0.06 0.04 0.01 0.06
Kitchen 8 0.9 0.02 0.01 0.02 0.01 0.03
Kitchen 9 0.9 0.01 0.05 0.02 0.01 0.09
Kitchen 10 0.9 0.02 0.06 0.02 0.02 0.06
Table 5. Prediction probability of “Crocodile” with others
App Kitchen Early morning Energy Bull Crocodile
Crocodile 1 0.9 0.0002 0.004 0.0002 0.02 0.03
Crocodile 2 0.9 0.06 0.04 0.02 0.02 0.06
Crocodile 3 0.9 0.02 0.02 0.02 0.02 0.03
Crocodile 4 0.01 0.07 0.05 0.05 0.07 0.58
Crocodile 5 0.9 0.06 0.09 0.02 0.01 0.01
Crocodile 6 0.9 0.04 0.07 0.06 0.02 0.09
Crocodile 7 0.04 0.01 0.07 0.07 0.04 0.58
Crocodile 8 0.9 0.06 0.05 0.02 0.09 0.06
Crocodile 9 0.9 0.06 0.03 0.07 0.09 0.07
Crocodile 10 0.9 0.06 0.08 0.02 0.06 0.01
Table 6. Prediction probability of “Energy” with others
App Kitchen Early morning Bull Crocodile Energy
Energy 1 0.9 0.02 0.0002 0.01 0.01 0.01
Energy 2 0.9 0.02 0.01 0.01 0.04 0.01
Energy 3 0.9 0.0001 0.0001 0.05 0.09 0.09
Energy 4 0.9 0.03 0.06 0.01 0.09 0.01
Energy 5 0.9 0.01 0.01 0.02 0.02 0.01
Energy 6 0.9 0.06 0.02 0.02 0.07 0.01
Energy 7 0.9 0.01 0.01 0.04 0.02 0.06
Energy 8 0.9 0.03 0.001 0.01 0.02 0.02
Energy 9 0.9 0.06 0.02 0.07 0.07 0.01
Looking at Table 1 and the graph (Fig. 5), we can conclude that the "app"
samples are predicted with a high probability of around 0.99, whereas the
prediction probabilities of the "app" test videos for the other words are very
low. A table and a graph are drawn from these values. Accuracy is calculated as

Accuracy = number of correctly classified words / total number of words.

Table 2 shows the prediction probabilities of the word "bull" against the other
words, while Tables 3, 4, 5 and 6 show the prediction probabilities of "early
morning", "kitchen", "crocodile" and "energy".
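The per-word values in Tables 1-6 are the softmax outputs of the trained model for each test video, and the accuracy follows the formula above. A hedged sketch of how such a row could be produced; the class order, file name and helper functions are assumptions carried over from the earlier sketches.

```python
# Sketch: per-word softmax probabilities for one test video (as reported
# in Tables 1-6) and the accuracy over the whole test set.
import numpy as np

words = ["kitchen", "bull", "energy", "early morning", "crocodile", "app"]  # assumed order

seq = video_to_sequence(extract_frames("app_01.mp4"))      # (60, 2048)
probs = model.predict(seq[np.newaxis, ...], verbose=0)[0]  # (6,) softmax row
for w, p in zip(words, probs):
    print(f"{w}: {p:.5f}")

# Accuracy = correctly classified words / total words, over all test videos.
def accuracy(predicted_labels, true_labels):
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```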
5 Conclusion
Here we proposed a deep learning based sign language recognition method for
isolated words using a CNN-based LSTM model. One more LSTM is stacked in order
to increase the accuracy. The CNN is used for feature extraction, and its output
is passed directly to the LSTM for training and testing. The accuracy obtained
is higher than with other methods. In future work we plan to include sentences,
so that we can check whether words slide over other words in a sentence.
References
1. Koller, O., et al.: Deep sign: hybrid CNN-HMM for continuous sign language recog-
nition. In: Proceedings of the British Machine Vision Conference (2016)
2. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous
sign language recognition by staged optimization. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (2017)
3. Geetha, M., et al.: Gesture recognition for American sign language with polygon
approximation. In: 2011 IEEE International Conference on Technology for Educa-
tion. IEEE (2011)
4. Garcia, B., Viesca, S.A.: Real-time American sign language recognition with con-
volutional neural networks. In: Convolutional Neural Networks for Visual Recog-
nition, vol. 2 (2016)
5. Tsironi, E., et al.: An analysis of convolutional long short-term memory recurrent
neural networks for gesture recognition. Neurocomputing 268, 76–86 (2017)
6. Li, C., et al.: Deep Fisher discriminant learning for mobile hand gesture recognition.
Pattern Recognit. 77, 276–288 (2018)
7. Nunez, J.C., et al.: Convolutional neural networks and long short-term memory for
skeleton-based human activity and hand gesture recognition. Pattern Recognit. 76,
80–94 (2018)
8. Aloysius, N., Geetha, M.: A review on deep convolutional neural networks. In:
2017 International Conference on Communication and Signal Processing (ICCSP).
IEEE (2017)
9. Bantupalli, K., Xie, Y.: American sign language recognition using deep learning
and computer vision. In: 2018 IEEE International Conference on Big Data (Big
Data). IEEE (2018)
10. Taskiran, M., Killioglu, M., Kahraman, N.: A real-time system for recognition
of American sign language by using deep learning. In: 2018 41st International
Conference on Telecommunications and Signal Processing (TSP). IEEE (2018)
11. Nguyen, H.B.D., Do, H.N.: Deep learning for American sign language fingerspelling
recognition system. In: 2019 26th International Conference on Telecommunications
(ICT). IEEE (2019)
12. Soodtoetong, N., Gedkhaw, E.: The efficiency of sign language recognition using
3D convolutional neural networks. In: 2018 15th International Conference on Elec-
trical Engineering/Electronics, Computer, Telecommunications and Information
Technology (ECTI-CON). IEEE (2018)