
Silent Expressions: Two-Handed Indian Sign Language Recognition Using MediaPipe and Machine Learning


Riya Awalkar1; Aditi Sah2; Renuka Barahate3; Yash Kharche4
Department of Computer Science & Engineering,
Sandip University
Nashik, India
Ms. Ashwini Magar
Project Guide
Department of Computer Science & Engineering,
Sandip University
Nashik, India

Abstract:
Indian Sign Language (ISL) is an essential communication medium for individuals with hearing and speech
impairments. This research introduces an efficient ISL recognition system that integrates deep learning with
real-time hand tracking. Utilizing MediaPipe Hands for landmark detection and a Convolutional Neural
Network (CNN) for classification, the model enhances recognition accuracy by incorporating two-hand
detection. Additionally, pyttsx3 is used for speech synthesis, providing audio output for detected gestures. The
system is designed to function in diverse environments, ensuring accessibility. Experimental evaluations
demonstrate high accuracy, and the framework is adaptable for future enhancements, such as multi-language
recognition and dynamic gesture interpretation.

Keywords: Indian sign language, deep learning, MediaPipe, LSTM, CNN, sign recognition.
1. Introduction

Communication plays a fundamental role in human interaction, and sign language is a vital tool for individuals with hearing and speech impairments. Indian Sign Language (ISL) is widely used across India, yet automated tools for its recognition remain limited. Advancements in artificial intelligence and deep learning have facilitated real-time sign language recognition, reducing the communication barrier for the deaf and mute communities.

Traditional sign recognition systems relied on sensor-based gloves or manual mapping techniques, which are costly and cumbersome. Computer vision-based approaches using deep learning provide a more scalable and efficient alternative. This study utilizes MediaPipe Hands for hand tracking and a CNN-based model for gesture classification, while pyttsx3 enables real-time speech conversion.

The main contributions of this research include:

• Development of a deep learning-based ISL recognition system that does not require external sensors.
• A dataset of ISL alphabets and digits collected using MediaPipe Hands.
• A classification model leveraging CNN for accurate static gesture recognition.
• Integration of text-to-speech conversion to enhance accessibility.
• Future scalability to incorporate American Sign Language (ASL) and British Sign Language (BSL) recognition.

The rest of this paper is structured as follows: Section 2 covers related research, Section 3 explains the methodology, Section 4 presents results and evaluations, Section 5 discusses challenges, Section 6 outlines future research directions, and Section 7 concludes the study.

2. Associated Research

The visual-spatial language known as Indian Sign Language originated in India. Indian Sign Language has its own phonology, morphology, and grammar, making it a natural language. It makes use of the body/head, hands, arms, and facial expressions to produce semantic information that conveys words and emotions.

An approach for identifying and detecting Indian Sign Language gestures from grayscale images was proposed by Nandy et al. [6]. Their method involves converting a video source with signing gestures into grayscale frames, from which a directional histogram is used to extract features. Finally, the signs are categorized into one of the pre-established classes based on their attributes using clustering. The authors concluded that the 36-bin histogram approach was more accurate than the 18-bin histogram method after achieving a 100% sign identification rate in their investigation.

Mekala et al. [4] presented a neural network architecture to generate text in real time from a video stream and to recognize and track sign language. Framing, image pre-processing, feature extraction based on hand position and movement, and other stages make up the system architecture. The hand's point of interest (POI) serves as a representation of these hand characteristics [4]. The authors' neural network design, which included CNN layers that predicted the signs, employed the 55 distinct features extracted using this method as input. They claimed to have achieved a 100% recognition rate and 48% noise immunity after training and testing the model on the entire English alphabet, from A to Z.

Using a self-created dataset of 1200 samples of ten static signs or letters, Chen [2] suggested a model. Pre-processing was done first using edge segmentation to identify the hand's edges in order to recognize the gestures, and then RGB images were converted to the YUV or YIQ color space in order to segment the skin color [2]. The convex hull approach was then used to identify the fingers on the previously identified hand. Lastly, neural networks were employed for classification. This model's ultimate accuracy was 98.2%.

Sharma et al. [8] created a system for communicating with people who have hearing or speech impairments based on Indian Sign Language. After capturing the image, the data was first pre-processed in a Matlab environment to transform it from RGB to grayscale [8]. The image's edges were then identified using a 3 × 3 filter and a Sobel detector. The reduced image with 600 elements was then subjected to a hierarchical centroid technique, yielding 124 features. Neural networks and KNNs were the classification methods employed. This methodology yielded an accuracy of 97.10%.

By employing a sensor glove for signing, analyzing the signs, and presenting the results as a coherent phrase, Agarwal et al. [1] sought to close the gap between individuals with speech impairment and those with normal speaking abilities. The sensor gloves were used by the individuals to make the movements [1]. After the gestures were compared against the database, the identified gesture was passed to a parser in order to produce a sentence. The application's accuracy in version 1 was 33.33%. A keyword denoting the required tense was added in version 2, resulting in 100% accuracy when handling simple and continuous tenses.

Wazalwar and Shrawankar [11] suggested a technique that uses segmentation and framing to translate sign language from input video.
They employed the P2DHMM algorithm for hand tracking and the CamShift technique for tracking. The signs were identified using a Haar Cascade classifier. Following the recognition of the sign, each word was given a tag by the WordNet POS tagger, and the LALR parser constructed the phrase and supplied the output as text, resulting in a meaningful English sentence.

In order to recognize signs and gestures, Shivashankara and Srinath [9] created a system based on American Sign Language (ASL). The performance of skin color clustering was optimized by the authors' model, which made use of the YCbCr color space [9]. This model was applied to the pre-processing of images. To identify the gesture, the centroid of the hand in the pre-processed image was located, and the gesture was then identified by its peak offset. This model's overall accuracy was 93.05%.

A system that demonstrated how sign language and its translation into spoken language are sequence-to-sequence mappings rather than one-to-one mappings was put forward by Camgoz et al. [12]. By mimicking the tokenization and embedding processes of conventional neural machine translation, the authors presented a novel vision technique. An attention-based encoder and decoder that models the conditional likelihood of producing spoken language from a given signing video is integrated with the CNN architecture [12] in the neural machine translation stage, which converts sign videos to spoken language. Beginning with word embedding, the authors converted a sparse vector into a denser form that placed words with related meanings closer together. The conditional probability was maximized through the encoder-decoder phase. Encoding produced a fixed-size vector of the sign videos' features. In the decoding stage, the inputs were the word embedding and the previous hidden state, which aided word prediction. Additionally, the authors included an attention mechanism in the decoding phase to circumvent the issues of vanishing gradients and long-term dependencies. They also produced PHOENIX14T, a continuous sign language translation dataset.

A unique method for identifying Indian Sign Language (ISL) in real time was put forth by Mariappan and Gomathi [13]. They suggested an approach that uses OpenCV's skin segmentation feature to identify and track signs within a region of interest (ROI). They used a fuzzy C-means (FCM) clustering algorithm to predict the sign. For training and testing, they gathered a dataset comprising 50 sentences and 80 words from ten distinct individuals. They also applied morphological operations to the binary image produced after skin segmentation to improve the features, and filtering to the colored images to lessen noise from the digital source. Using the FCM approach, they recognized 40 words from ISL with a 75% accuracy rate.

A system for continuous sign language recognition using an LSTM model with Leap Motion was proposed by Mittal et al. [14]. For sign sentence recognition, they used a four-gated LSTM cell with a 2D CNN architecture, giving each word a specific label. A forget gate that received the output at time t-1 produced output at the output gate and returned the label of the recognized word when the three basic gates were utilized as input gates. In order to signal the change between two successive signs in the video, they introduced a unique symbol, $. When they came across the $ sign, which denoted the change between two signs, they employed the RESET flag [14]. Using a 3-layer LSTM model for sign sentence recognition, they trained this improved LSTM model over a dataset and reached a maximum accuracy of 72.3%. The accuracy attained for the recognition of sign words was 89.50%.

In order to identify sign language from videos, including non-manual components like the mouth and eyebrow alignment, De Coster et al. [15] employed a transformer network.
To identify signs using various neural networks, they proposed a pose transformer network, a video transformer network, and a multimodal transformer network.

The skeleton-based graph technique was employed by Jiang et al. [15] to detect isolated signs in their multi-model-based sign language recognition system. The authors suggested a SAM-SLR framework to identify isolated signs, and they employed SL-GCN and SSTCN models to produce skeleton key points for feature extraction [15]. The AUTSL dataset was used to assess the suggested framework.

BLSTM-3DRN was proposed by Liao et al. [16] to recognize dynamic sign language. A bi-directional LSTM model serialized in three stages (hand localization, spatiotemporal feature extraction, and gesture identification) over DEVISIGN_D (Chinese hand sign language) was employed by the authors [16].

I3D, a ResNet with B-LSTM, was presented by Adaloglou et al. [17] for the continuous recognition of syntax construction in sign language. The authors used the suggested framework with three annotation levels on various RGB + D data, particularly Greek sign language [17]. The performance comparison of various deep learning models, particularly CNN-LSTM combinations, across various datasets is displayed in Table 1.

Table 1 Performance Comparison of Deep Learning Models for Sign Language Recognition

Author | Methodology | Dataset | Accuracy
Mittal et al. (2019) | 2D-CNN and Modified LSTM with Leap Motion sensor | ASL | 89.50%
Aparna and Geetha (2019) | CNN and 2-layer LSTM | Custom dataset (6 signs) | 94%
Jiang et al. (2021) | DCNN with SL-GCN using RGB-D modalities | AUTSL | 98%
Liao et al. (2019) | 3D-ConvNet with BLSTM | DEVISIGN_D | 89.8%
Adaloglou et al. (2021) | Inflated 3D ConvNet with BLSTM | RGB + D | 89.74%

De Coster et al. [15] also combined the pose transformer network with a pose LSTM. This work uses video frames to identify the signs. The suggested methodology was evaluated on the Flemish Sign Language corpus, with keypoints estimated using OpenPose. For 100 classes, its accuracy was 74.7%.

The suggested research project presented a method for recognizing Indian Sign Language that combines multiple deep learning algorithms, including LSTM and GRU, and does not necessitate a particular setting or camera configuration for inference. The data collection and trials took into account the current Indian situation. The four distinct LSTM and GRU combinations were employed in the simulation, and the isolated signs were taken into account in the experiments.
3. Methodology

3.1 Data Collection

To develop an effective model, a dataset of ISL alphabets and digits is collected using a webcam setup. The dataset comprises images of different hand positions, orientations, and lighting conditions to improve generalization. Multiple subjects participate in the data collection process to introduce variability in hand shapes and sizes. Data is collected in multiple environments to ensure robustness against background noise.

Fig 1 Sample Dataset
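As a concrete illustration of this capture step, the sketch below saves labeled webcam frames into one folder per class; the class label, key bindings, folder layout, and sample count are illustrative assumptions rather than the authors' actual capture tool.

# Sketch of a webcam capture loop for building the ISL dataset.
# Assumptions: one folder per class label and a fixed sample count;
# the authors' actual capture tool and folder layout are not specified.
import os
import cv2

LABEL = "A"                      # hypothetical class being recorded
OUT_DIR = os.path.join("dataset", LABEL)
os.makedirs(OUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)        # default webcam
count = 0
while count < 200:               # illustrative number of samples per class
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("capture", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):          # press 's' to save the current frame
        cv2.imwrite(os.path.join(OUT_DIR, f"{LABEL}_{count:04d}.jpg"), frame)
        count += 1
    elif key == ord("q"):        # press 'q' to stop early
        break

cap.release()
cv2.destroyAllWindows()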

3.2 Hand Tracking with MediaPipe

MediaPipe Hands is used for detecting and tracking hand landmarks in real-time. It provides 21 key points per hand, which serve as input features for further processing. The use of two-hand detection enhances recognition accuracy, particularly for signs requiring both hands. MediaPipe's efficient processing pipeline enables real-time detection with minimal computational overhead.
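The snippet below shows one way to read the 21 landmarks per hand from MediaPipe Hands with two-hand detection enabled; packing them into a fixed-length, zero-padded vector is an assumed feature layout, not a detail specified in the paper.

# Sketch: extracting 21 (x, y, z) landmarks per hand with MediaPipe Hands.
# The zero-padded 126-value feature vector is an assumed layout.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def landmarks_from_frame(frame_bgr, hands):
    # Return a flat vector: 2 hands x 21 points x (x, y, z) = 126 values.
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    features = np.zeros(2 * 21 * 3, dtype=np.float32)
    if result.multi_hand_landmarks:
        for h, hand in enumerate(result.multi_hand_landmarks[:2]):
            for i, lm in enumerate(hand.landmark):
                base = (h * 21 + i) * 3
                features[base:base + 3] = (lm.x, lm.y, lm.z)
    return features

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
    ok, frame = cap.read()
    if ok:
        print(landmarks_from_frame(frame, hands).shape)  # (126,)
cap.release()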

3.3 Feature Extraction and Preprocessing

Extracted key points are normalized and fed into a CNN model for classification. Data augmentation techniques such as rotation, flipping, and brightness adjustments are applied to enhance robustness. The dataset is preprocessed by converting images to grayscale and resizing them to a fixed dimension to ensure uniformity. Hand segmentation is performed using background subtraction to reduce noise.
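A minimal sketch of this preprocessing and augmentation step, written with OpenCV and NumPy, is given below; the 64 x 64 target size, the rotation and brightness ranges, and the flip probability are illustrative assumptions rather than values reported by the authors.

# Sketch of grayscale conversion, resizing, normalization, and augmentation.
# IMG_SIZE and all augmentation ranges are assumptions.
import cv2
import numpy as np

IMG_SIZE = 64  # assumed fixed dimension

def preprocess(frame_bgr):
    # Grayscale, resize to IMG_SIZE x IMG_SIZE, scale pixel values to [0, 1].
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (IMG_SIZE, IMG_SIZE))
    return resized.astype(np.float32) / 255.0

def augment(img):
    # Random rotation, optional horizontal flip, and brightness jitter.
    h, w = img.shape
    angle = np.random.uniform(-15, 15)                        # assumed range
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    out = cv2.warpAffine(img, M, (w, h))
    if np.random.rand() < 0.5:
        out = cv2.flip(out, 1)                                # horizontal flip
    return np.clip(out * np.random.uniform(0.7, 1.3), 0.0, 1.0)

# A training sample for the CNN would then be augment(preprocess(frame))[..., np.newaxis].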
3.4 CNN Model for Classification

A deep learning model using a CNN architecture is trained to classify hand gestures into corresponding ISL alphabets and digits. The network consists of convolutional layers, batch normalization, pooling layers, and fully connected layers to optimize classification performance. The activation functions used include ReLU for non-linearity and Softmax for multi-class classification.

The training process involves backpropagation with categorical cross-entropy loss and the Adam optimizer for efficient convergence. The dataset is split into training, validation, and testing sets, ensuring proper generalization of the model. Hyperparameter tuning is performed to optimize learning rates, batch size, and the number of convolutional layers.
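The Keras definition below is one plausible instantiation of this architecture (convolution, batch normalization, pooling, and fully connected layers with ReLU and Softmax, compiled with Adam and categorical cross-entropy); the filter counts, input shape, and the assumption of 36 classes (A-Z plus 0-9) are illustrative, not the authors' published configuration.

# Sketch of the CNN classifier described in Section 3.4.
# Filter counts, input shape, and 36 output classes (A-Z + 0-9) are assumptions.
from tensorflow.keras import layers, models

NUM_CLASSES = 36   # 26 alphabets + 10 digits (assumed)
IMG_SIZE = 64      # assumed grayscale input dimension

def build_model():
    model = models.Sequential([
        layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30)  # illustrative call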
3.5 Audio Output with pyttsx3

Once a sign is recognized, the corresponding alphabet or digit is converted into speech using the pyttsx3 text-to-speech library, enabling real-time audio feedback. This feature makes the system interactive and beneficial for users who rely on auditory feedback for communication.

Fig 2 Architecture Diagram
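A hedged sketch of this speech step is shown below; the speaking rate and the direct use of the class label as the spoken text are assumptions.

# Sketch of the text-to-speech step from Section 3.5 using pyttsx3.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)   # assumed comfortable speaking rate

def speak_prediction(label: str) -> None:
    # Voice a recognized ISL alphabet or digit, e.g. "A" or "7".
    engine.say(label)
    engine.runAndWait()

speak_prediction("A")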
4. Experiments and Results

The model is trained and evaluated using a dataset of ISL signs. Accuracy metrics such as precision, recall, and F1-score are used to assess performance. The integration of MediaPipe significantly improves detection speed and accuracy. The model achieves an accuracy of approximately 92% on the test set, demonstrating its effectiveness in recognizing ISL gestures. A comparative study with traditional approaches, such as feature-based classification and template matching, highlights the superiority of the CNN-based approach. Additionally, real-time performance is tested using a webcam, achieving a frame rate of 25 FPS, making the system suitable for practical applications.

4.1. Dataset

The dataset was collected using MediaPipe Hands, which detects 21 key points per hand. The collected data includes:

• Alphabets (A-Z): Each alphabet is recorded with variations in position, orientation, and lighting.
• Digits (0-9): Multiple variations of each digit were captured.
• Two-Hand Gestures: Certain ISL signs require both hands, and these were carefully recorded.
• Different Backgrounds: Data was collected under different lighting and background conditions to improve model generalization.
• Multiple Participants: Different users contributed to the dataset to introduce diversity in hand shapes, sizes, and orientations.

4.2. Results

• Accuracy: The model achieves an accuracy of approximately 91% on the test set.

Fig 3 Training and Validation Accuracy

Fig 4 Training and Validation Loss

• Precision and Recall: The precision and recall scores for different classes (alphabets and digits) range between 89% and 95%.
• Comparison with Baseline Models: Our CNN model outperforms traditional feature-based methods and achieves better recognition rates compared to HOG+SVM approaches.
• Real-Time Performance: The system achieves an average frame rate of 25 FPS, ensuring smooth real-time interaction.
• User Testing: A small-scale user study was conducted with 10 participants, reporting a 90% satisfaction rate in terms of accuracy and response time.
• Confusion Matrix Analysis: A confusion matrix is used to analyze misclassified gestures, highlighting common errors in similar-looking signs.

Fig 5 Precision and Recall

Fig 6 Confusion Matrix

Fig 7 Sample Output

Fig 8 Sample Output
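The per-class precision, recall, and F1-scores and the confusion matrix reported above can be computed with standard scikit-learn utilities, as in the sketch below; model, x_test, y_test, and the class-name list are placeholders for the trained classifier and held-out data, not artifacts released by the authors.

# Sketch: computing precision, recall, F1-score, and the confusion matrix
# for the trained classifier. model, x_test, and y_test are placeholders.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(model, x_test, y_test_onehot, class_names):
    y_true = np.argmax(y_test_onehot, axis=1)
    y_pred = np.argmax(model.predict(x_test), axis=1)
    print(classification_report(y_true, y_pred, target_names=class_names))
    return confusion_matrix(y_true, y_pred)

# Example: cm = evaluate(model, x_test, y_test, [*"ABCDEFGHIJKLMNOPQRSTUVWXYZ", *"0123456789"])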

5. Discussion and Limitations

Despite promising results, several challenges remain in ISL recognition:

• Hand occlusions: Overlapping hands may lead to incorrect landmark detection.
• Lighting variations: Poor lighting conditions impact feature extraction and recognition accuracy.
• Similar gestures: Some ISL signs have subtle differences, making classification challenging.
• Real-time deployment: Optimization is required for mobile or embedded system deployment.

6. Future Enhancements

• Multi-Language Sign Recognition: Expanding the model to support American Sign Language (ASL) and British Sign Language (BSL) alongside Indian Sign Language (ISL). The goal is to develop an intelligent system capable of automatically detecting the sign language being used and classifying the correct alphabet or digit accordingly.
• Transformer-Based Models: Exploring transformer architectures such as Vision Transformers (ViTs) and self-attention mechanisms for improved accuracy and contextual understanding of hand gestures.
• Dynamic Gesture Recognition: Extending the system to recognize dynamic signs and full words rather than just static alphabets and digits, using LSTMs or 3D CNNs.
• Deployment on Edge Devices: Optimizing the model for real-time execution on mobile and embedded devices, ensuring accessibility for a broader user base.
• Improved Occlusion Handling: Enhancing the robustness of hand tracking algorithms to mitigate issues related to occlusions and overlapping gestures.

This expansion will significantly increase the model's usability and inclusivity, making it a universal solution for sign language communication across different regions.

7. Conclusions

This research presents a vision-based ISL recognition system leveraging deep learning techniques for accurate and real-time sign language interpretation. The combination of MediaPipe Hands for hand tracking and a CNN model for classification ensures efficiency and robustness. With an accuracy exceeding 90%, the system demonstrates potential for real-world applications. Future work will focus on enhancing dynamic gesture recognition, integrating multiple sign languages, and optimizing real-time deployment on mobile and embedded devices.

This work serves as a step toward bridging the communication gap for the hearing- and speech-impaired community using advanced AI-driven solutions.

References

1. Agarwal, S.R.; Agrawal, S.B.; Latif, A.M. Sentence Formation in NLP Engine on the Basis of Indian Sign Language using Hand Gestures. Int. J. Comput. Appl. 2015, 116, 18–22.
2. Chen, J.K. Sign Language Recognition with Unsupervised Feature Learning; CS229 Project Final Report; Stanford University: Stanford, CA, USA, 2011.
3. Manware, A.; Raj, R.; Kumar, A.; Pawar, T. Smart Gloves as a Communication Tool for the Speech Impaired and Hearing Impaired. Int. J. Emerg. Technol. Innov. Res. 2017, 4, 78–82.
4. Mekala, P.; Gao, Y.; Fan, J.; Davari, A. Real-time sign language recognition based on neural network architecture. In Proceedings of the IEEE 43rd Southeastern Symposium on System Theory, Auburn, AL, USA, 14–16 March 2011.
5. Ministry of Statistics & Programme Implementation. Available online: https://pib.gov.in/PressReleasePage.aspx?PRID=1593253 (accessed on 5 January 2022).
6. Nandy, A.; Prasad, J.; Mondal, S.; Chakraborty, P.; Nandi, G. Recognition of Isolated Indian Sign Language Gesture in Real Time. Commun. Comput. Inf. Sci. 2010, 70, 102–107.
7. Papastratis, I.; Chatzikonstantinou, C.; Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Artificial Intelligence Technologies for Sign Language. Sensors 2021, 21, 5843.
8. Sharma, M.; Pal, R.; Sahoo, A. Indian sign language recognition using neural networks and KNN classifiers. J. Eng. Appl. Sci. 2014, 9, 1255–1259.
9. Shivashankara, S.; Srinath, S. American Sign Language Recognition System: An Optimal Approach. Int. J. Image Graph. Signal Process. 2018, 10, 18–30.
10. Wadhawan, A.; Kumar, P. Sign language recognition systems: A decade systematic literature review. Arch. Comput. Methods Eng. 2021, 28, 785–813.
11. Wazalwar, S.S.; Shrawankar, U. Interpretation of sign language into English using NLP techniques. J. Inf. Optim. Sci. 2017, 38, 895–910.
12. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
13. Muthu Mariappan, H.; Gomathi, V. Real-Time Recognition of Indian Sign Language. In Proceedings of the International Conference on Computational Intelligence in Data Science, Haryana, India, 6–7 September 2019.
14. Mittal, A.; Kumar, P.; Roy, P.P.; Balasubramanian, R.; Chaudhuri, B.B. A Modified LSTM Model for Continuous Sign Language Recognition Using Leap Motion. IEEE Sens. J. 2019, 19, 7056–7063.
15. De Coster, M.; Herreweghe, M.V.; Dambre, J. Sign Language Recognition with Transformer Networks. In Proceedings of the Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 13–15 May 2020; pp. 6018–6024.
16. Liao, Y.; Xiong, P.; Min, W.; Min, W.; Lu, J. Dynamic Sign Language Recognition Based on Video Sequence with BLSTM-3D Residual Networks. IEEE Access 2019, 7, 38044–38054.
17. Adaloglou, N.; Chatzis, T. A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition. IEEE Trans. Multimed. 2022, 24, 1750–1762.
