Report
on
REAL TIME BI-DIRECTIONAL SIGN LANGUAGE INTERPRETER
by
SIDDHANT GARG
229302317
Bachelor of Technology
in
Information Technology
May 2025
CERTIFICATE
Date: 17-04-2025
This is to certify that the minor project titled REAL TIME BI-DIRECTIONAL SIGN LANGUAGE
INTERPRETER is a record of the bonafide work done by SIDDHANT GARG (229302317)
submitted in partial fulfilment of the requirements for the award of the Degree of Bachelor of
Technology in Information Technology of Manipal University Jaipur, during the academic year 2024-
25.
ABSTRACT
Sign language serves as a vital means of communication for individuals who are hearing or speech
impaired. However, a considerable communication gap persists between those who use sign language
and the majority of the population who do not. To address this issue and promote inclusive interaction,
we propose a Real-Time Bi-Directional Sign Language Detection System that facilitates seamless
communication between sign language users and non-users. The proposed system comprises two integral
modules: (a) a sign-to-text-to-speech pipeline that employs computer vision and deep learning
techniques to detect and interpret hand gestures, convert them into text, and subsequently synthesize
speech output; and (b) a speech-to-text-to-sign module that uses advanced speech recognition technologies
to transcribe spoken words into text, which is then translated into corresponding sign language gestures
using a predefined gesture dataset. This dual-mode functionality ensures that communication flows
naturally in both directions, bridging the divide between spoken and visual language systems.
To ensure high accuracy and efficiency, our model utilizes Long Short-Term Memory (LSTM)
networks for dynamic gesture recognition, capable of understanding complex temporal patterns in sign
sequences. Additionally, optimized speech processing algorithms are integrated to handle real-time audio
input with minimal latency. The system architecture is designed to work seamlessly in real-time
environments, making it practical for everyday use in schools, workplaces, healthcare, and public spaces.
Extensive testing and evaluation have demonstrated that the system achieves 100% accuracy in both
translation directions under controlled conditions, highlighting its robustness and potential as a reliable
assistive communication tool. By enabling two-way communication between the hearing-impaired
community and the general population, this system represents a significant step toward digital inclusivity
and societal integration, empowering users through technology-driven accessibility.
LIST OF FIGURES
Figure No Figure Title Page No
1. Flow-chart Sign to text to speech 5
2. Flow-chart Speech to text to sign 6
3. Sample Dataset 7
4. LSTM Layer 8
5. System Predicting Alphabets 9
6. Media-player of predicted output 9
7. Training Loss and Validation Loss 10
8. Training Accuracy and Validation Accuracy 11
9. Confusion Matrix 12
10. Classification Report 13
11. Asking user to say the word/phrase for prediction 14
12. Confirming the phrase spoken 14
13. Showing expected output and accuracy 14
14. Grid showing “HELLO” in sign language 14
15. Accuracy of speech recognition with 15 pre-defined phrases 15
Table of Contents
Page No
Chapter 1 INTRODUCTION
Chapter 2 BACKGROUND DETAILS
Chapter 3 METHODOLOGY
Chapter 4 RESULTS
REFERENCES 16
1. Introduction
1.1 Problem statement:
Clear and effective communication is a fundamental part of human interaction. People
typically communicate through two primary modes: verbal and non-verbal. However, a
significant portion of the global population encounters difficulties in communication due to
hearing and speech disabilities. This challenge is often compounded by the general public’s
limited familiarity with sign language, making it especially difficult for individuals with
hearing impairments to interact with society. According to the 76th round of the National
Sample Survey (NSS) conducted by the National Statistical Office, 2.2% of India’s
population was reported to have a disability between July and December 2018. This
included 2.3% in rural areas, 2.0% in urban zones, 2.4% of males, 1.9% of females, 2.23%
of Scheduled Caste (SC) individuals, and 1.92% of those belonging to Scheduled Tribes
(ST). For those who are deaf, hard of hearing, or unable to speak, sign language serves as a
vital mode of communication. It provides them with a way to express emotions, convey
information, and engage in social interactions, bypassing the limitations posed by auditory
or verbal communication. Bridging the gap between sign language users and those
unfamiliar with it requires innovative technological interventions. Sign language, being a
gestural and visual form of communication, plays a crucial role in helping the specially
abled access educational opportunities, employment, healthcare, and public services, thereby
promoting inclusivity and participation in everyday life.
1.2 Objective:
• Enable bi-directional communication between sign language and spoken language users
through real-time translation.
• Leverage deep learning models (LSTM) for accurate recognition of dynamic hand gestures
in sign language.
• Utilize computer vision techniques to detect and process hand gestures from live video
input.
• Translate text into sign language using a predefined dataset of sign gestures.
• Convert recognized signs into audible speech via text-to-speech synthesis for better
interaction.
• Ensure real-time processing with minimal latency for smooth, natural conversations.
• Achieve high accuracy in both gesture and speech recognition through robust training and
evaluation.
• Promote digital inclusivity and accessibility for individuals with hearing or speech
impairments.
1.3 Scope:
The scope of this project encompasses the design and development of a Real-Time
Bi-Directional Sign Language Interpreter capable of translating between sign
language and spoken language to facilitate inclusive communication. The project
covers key areas such as computer vision for hand gesture detection, deep learning
(specifically LSTM networks) for gesture recognition, speech recognition for
converting spoken words into text, and text-to-speech (TTS) and sign language
rendering for output generation. It also involves the creation of a predefined dataset
of sign gestures, real-time processing pipelines, and a user-friendly interface that
ensures accessibility. The system is aimed at bridging communication gaps in
practical settings such as educational institutions, healthcare facilities, workplaces,
and public service environments, where interaction between the hearing-impaired
and non-sign language users is essential.
2. Background Details
American Sign Language (ASL) is mostly used in the United States and Canada as a visual-
spatial language. ASL is a natural language complete with its own phonology (that is, the way small
visual units are combined to form larger meaningful units), morphology (that is, the way meaningful
signs are formed), syntax, and grammar separate from those of spoken English. It uses hand shapes, hand and arm movements, facial expressions, and body language to represent semantic meanings, emotions, and abstract concepts. Though ASL is widely used, it is specific to the Deaf culture of North America, making it central to communication within deaf and hard-of-hearing communities. Several studies have been conducted on ASL.
(Kothadiya et al. 2022) proposed a dataset-driven neural network architecture for sign language recognition and tracking, together with a model for real-time text generation from video. The system is multi-phase, consisting of frame generation, image pre-processing, hand movement analysis, and location-based feature extraction, among other stages. Hand attributes were defined using points of interest (POI) on the hand. With this approach, 55 distinct features were derived and fed into a neural network made up of CNN layers that predicted the signs. The model was trained and tested on the English alphabets A to Z and achieved 100% accuracy with 48% immunity to noise [6].
(Natarajan et al. 2022) The authors set up a framework for sign language recognition, translation, and production using the MediaPipe library and a hybrid Convolutional Neural Network + Bi-directional Long Short-Term Memory model. A hybrid of NMT, MediaPipe, and a Dynamic GAN is adopted for video rendering of spoken sentences. To obtain good recognition accuracy and visual quality, the authors experimented with several multilingual benchmark sign corpora, achieving above 95% classification accuracy. The proposed model obtained an average Bilingual Evaluation Understudy (BLEU) score of 38.06, strong human evaluation scores, an average Fréchet Inception Distance to video (FID2vid) of 3.46, an average Structural Similarity Index Measure (SSIM) of 0.921, an average Inception Score of 8.4, an average Peak Signal-to-Noise Ratio (PSNR) of 29.73, an average Fréchet Inception Distance (FID) of 14.06, and an average Temporal Consistency Metric (TCM) score of 0.715, demonstrating the effectiveness of the proposed work [9].
(Alzubaidi, Otoom, and Abu Rwaq 2023) proposed an assistive device that helps specially-abled people communicate with others. The authors built an electronic glove that uses an MPU6050 sensor to track hand movements and potentiometers to monitor finger positions. An Arduino board then determines the meaning of each gesture and plays the voice of the corresponding word. The highest accuracy, achieved with the Decision Tree algorithm, was 98% [2].
(Adewale and Olamiti 2018) The authors translated ASL into text and speech using unsupervised feature learning. The developed framework performs data capture via a Kinect sensor, feature extraction from a Region of Interest (ROI), and supervised and unsupervised classification of images with K-Nearest Neighbour (KNN). The system achieved 78% accuracy with unsupervised feature learning [1].
(Mean Foong, Low, and La 2009) presented a template-based recognition approach to convert voice to sign language. The V2S system was first trained with speech patterns based on a generic spectral parameter set, and a database was used to store these spectral parameters as templates. Speech is recognized by matching the parameter set of the input against the stored templates, and the corresponding sign language is then displayed in video format. Results showed that the system achieved an 80.3% recognition rate [7].
(Munde et al. 2024) In this method, an artificial neural network built on CNNs is used to identify hand gestures, and finger-spelt American Sign Language is recognized with high precision using deep-learning techniques. For new users, the parametric depth techniques provide 83-85% accuracy, previously trained repetitions reach 99.99%, and a simple multilevel approach gives the system 98% accuracy [8].
(Sultana et al. 2012) In this proposed model, speech from the Bangla language is converted to text
using the Speech Application Program Interface (SAPI) where the configuration of SAPI compared
pronunciations from continuous Bangla speech against a precompiled grammar file. After a match
was found, SAPI returned the Bangla words in English characters. An average recognition rate of about 78% was obtained [10].
(Bharti et al. 2019) The automated system first recognizes speech, then converts it into text, matches the tokenized text against a library of visual sign words (videos of signs), concatenates all the matched videos according to the recognized text, and finally shows the merged video to the deaf or speech-impaired person. Compared with state-of-the-art approaches, the system was found to perform best, with 67% accuracy in offline mode and 93% while working online [4].
(Dua et al. 2022) The major aim of the authors was to develop a speech-to-text recognition system that recognizes infrequent tonal speech signals of Gurbani hymns using CNNs. In addition to Praat for speech segmentation, six layers of 2D convolution and 2D max pooling and dense layers of 256 units (implemented with Google's TensorFlow framework) were used in this work. This architecture gave 89.15% accuracy [5].
(Athira, Sruthi, and Lijiya 2022) proposed a system that can recognize a wide range of gestures in real-time videos, including single-handed static and dynamic gestures, double-handed static gestures, and finger-spelling words of Indian Sign Language (ISL). The model successfully
recognized finger spelling alphabets with 91% accuracy and single-handed dynamic words with
89% accuracy [3].
3. System Design and Methodology:
A bi-directional sign language detector is a technology that detects hand gestures and converts them into their text and voice equivalents. It also converts voice into the equivalent ASL gestures, bridging the communication gap for people with hearing or speech impairments.
3.2 Flowchart Speech to Sign Language Detection
The flow chart (Figure 2) explains speech-to-sign-language detection. As shown in the figure, the device opens the microphone to capture input, adjusts for ambient noise, and checks whether speech is recognized. If not, it captures input from the microphone again. If speech is recognized, the system speaks the recognized text using pyttsx3 and displays the corresponding letters using OpenCV. For each character in the text, it checks whether the corresponding letter is in the gesture dictionary; if not, it notifies the user of the unsupported character and moves on to the next one. If the letter is in the dictionary, the corresponding sign image is displayed with OpenCV. The displayed image window is then closed and the microphone is re-initialized to capture the next utterance; otherwise the process stops.
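As an illustration, a minimal Python sketch of this pipeline is given below using the speech_recognition, pyttsx3, and OpenCV libraries. The letter-image layout (letters/A.png to letters/Z.png), the display delay, and the retry loop are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the speech-to-sign pipeline (assumed file layout: letters/A.png ... letters/Z.png).
import os
import cv2
import pyttsx3
import speech_recognition as sr

def listen_once(recognizer):
    """Capture one utterance from the microphone and return the recognized text, or None."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)    # compensate for ambient noise
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)      # online speech recognition
    except sr.UnknownValueError:
        return None                                    # speech not recognized -> caller retries

def show_signs(text, letter_dir="letters"):
    """Speak the recognized text, then display the sign image for each supported character."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
    for ch in text.upper():
        path = os.path.join(letter_dir, f"{ch}.png")
        if not os.path.exists(path):                   # character not in the gesture dictionary
            print(f"Unsupported character: {ch!r}")
            continue
        cv2.imshow("Sign", cv2.imread(path))
        cv2.waitKey(800)                               # show each letter briefly
    cv2.destroyAllWindows()

if __name__ == "__main__":
    r = sr.Recognizer()
    text = None
    while text is None:                                # keep listening until speech is recognized
        text = listen_once(r)
    show_signs(text)
```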
The dataset used in the proposed system is tailor-made for the application. It includes a minimum of
30 sample images for each of the 26 English alphabets. A real-time gesture data acquisition setup is
utilized to generate this dataset for sign language recognition. Through the webcam, the system
captures specific hand gestures and stores them into designated folders. Each alphabet has its own
dedicated directory, totaling 26 folders from A to Z. The system automatically scans the folders to
determine the current file count, allowing it to assign unique filenames and maintain systematic
organization. A rectangular Region of Interest (ROI) is defined on the video frame to isolate the
hand gesture from the background. This focused approach helps eliminate distractions, reduces
noise, and enhances the dataset’s overall quality, enabling the machine learning model to learn only
the relevant features of each gesture. When a particular key is pressed on the keyboard, the system
captures the frame and saves it into the respective directory—such as pressing ‘a’ for the letter A,
‘b’ for B, and so on.
Figure 3: Sample of Dataset (letters C, F, R, N, L)
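A possible sketch of this data-collection step is given below; the dataset directory name, ROI coordinates, and key-to-letter mapping are assumptions made for illustration.

```python
# Sketch of the gesture-capture setup: pressing 'a' saves the ROI into dataset/A, 'b' into dataset/B, etc.
import os
import cv2

DATA_DIR = "dataset"
ROI = (100, 100, 400, 400)                       # x1, y1, x2, y2 of the region of interest

for letter in map(chr, range(ord("A"), ord("Z") + 1)):
    os.makedirs(os.path.join(DATA_DIR, letter), exist_ok=True)   # one folder per letter

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = ROI
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)     # show the ROI to the user
    cv2.imshow("Capture", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == 27:                                                # Esc quits
        break
    if ord("a") <= key <= ord("z"):
        letter = chr(key).upper()
        folder = os.path.join(DATA_DIR, letter)
        count = len(os.listdir(folder))                          # current file count -> unique filename
        cv2.imwrite(os.path.join(folder, f"{letter}_{count}.jpg"), frame[y1:y2, x1:x2])

cap.release()
cv2.destroyAllWindows()
```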
The system uses MediaPipe to extract the key features of the captured images. A total of 21 key points are extracted from each image, covering the fingertips, palm, and wrist. The system first converts the image from BGR (Blue, Green, Red) to RGB (Red, Green, Blue) as required by MediaPipe, detects the hand landmarks using the model, and then converts the frame back to BGR format for visualization.
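The landmark-extraction step can be sketched as follows with the MediaPipe Hands solution; the helper name extract_keypoints and the zero-padding for missing hands are assumptions, but the BGR-to-RGB conversion and the 21 × 3 = 63 feature values follow the description above.

```python
# Sketch of keypoint extraction with MediaPipe Hands: 21 landmarks x (x, y, z) = 63 features per frame.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

def extract_keypoints(frame_bgr, hands):
    """Return a 63-value feature vector for the first detected hand (zeros if no hand is found)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)             # MediaPipe expects RGB input
    results = hands.process(rgb)
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        mp_draw.draw_landmarks(frame_bgr, hand, mp_hands.HAND_CONNECTIONS)   # draw on the BGR frame
        return np.array([[p.x, p.y, p.z] for p in hand.landmark]).flatten()  # 21 * 3 = 63 values
    return np.zeros(21 * 3)

# Typical usage: hands = mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
```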
• Samples: Represents the total count of gesture sequences, each tied to a specific sign
language symbol.
• Sequence Length: Denotes the number of frames per sequence (typically 30 frames).
• Features: Refers to the set of features obtained from each frame (e.g., 63 numerical
values).
To effectively analyze this time-series data, the model employs a stack of three LSTM layers:
1. First LSTM Layer: Contains 64 units and captures basic temporal dependencies within
the sequence; its outputs are passed as sequences to the next layer.
2. Second LSTM Layer: With 128 units, this layer learns more complex temporal patterns
and continues outputting sequences.
3. Third LSTM Layer: This layer condenses the sequential information into a final
representation using 64 units and produces a non-sequential output for classification.
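A sketch of this stack in Keras is shown below; the input shape of (30, 63) follows the sequence length and feature count described above, while the placement of the dropout (0.3) and L2 (0.001) regularization mentioned in the training discussion is an assumption.

```python
# Sketch of the three-layer LSTM stack: input is one sequence of 30 frames x 63 keypoint features.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.regularizers import l2

SEQ_LEN, NUM_FEATURES, NUM_CLASSES = 30, 63, 26       # 26 letters A-Z

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, NUM_FEATURES),
         kernel_regularizer=l2(0.001)),               # basic temporal dependencies
    Dropout(0.3),
    LSTM(128, return_sequences=True,
         kernel_regularizer=l2(0.001)),               # more complex temporal patterns
    Dropout(0.3),
    LSTM(64),                                         # condenses the sequence into one vector
    Dense(NUM_CLASSES, activation="softmax"),         # one probability per letter
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```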
Figure 4: LSTM Layers
The extracted keypoint sequences are passed as input to the trained LSTM (Long Short-Term Memory) model, which detects the hand gesture and produces the final sign-to-text prediction. If the predicted gesture differs from the last predicted gesture, it is added to the sentence list to avoid duplicate consecutive predictions and to ensure smooth formation of English letters. The system updates its hand-gesture output only when a new prediction is made.
Figure 5: System predicting the alphabets ‘Y’ and ‘C’ with their accuracy
The system predicts hand gestures only when the confidence exceeds 80%. If not, it waits for
additional frames to ensure accuracy. Once a gesture is confidently recognized, it is converted to
speech using the pyttsx3 Text-to-Speech (TTS) library, allowing offline audio playback of the
detected alphabet.
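The prediction step can be sketched as follows; the helper name, the 30-frame window handling, and speaking letters one at a time are assumptions consistent with the behaviour described above (80% confidence threshold, suppression of duplicate consecutive predictions, pyttsx3 output).

```python
# Sketch of the prediction step: 30-frame keypoint window, 80% confidence gate, duplicate suppression.
import numpy as np
import pyttsx3

LETTERS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
THRESHOLD = 0.8
engine = pyttsx3.init()
sequence, sentence = [], []

def on_new_frame(keypoints, model):
    """Append the latest 63-value keypoint vector and predict once 30 frames are available."""
    sequence.append(keypoints)
    if len(sequence) < 30:
        return                                                    # wait for additional frames
    window = np.expand_dims(np.array(sequence[-30:]), axis=0)     # shape (1, 30, 63)
    probs = model.predict(window, verbose=0)[0]
    if probs.max() < THRESHOLD:
        return                                                    # not confident enough yet
    letter = LETTERS[int(np.argmax(probs))]
    if not sentence or sentence[-1] != letter:                    # skip duplicate consecutive predictions
        sentence.append(letter)
        engine.say(letter)                                        # offline TTS of the detected alphabet
        engine.runAndWait()
```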
Figure 7: Training Loss and Validation Loss
Figure 8: Training Accuracy and Validation Accuracy
Figures (7) and (8) depict the performance of our LSTM model during training and testing with 5-fold cross-validation. Each of the five splits has both a training set and a validation set, so there are four curves per split: training loss, training accuracy, validation loss, and validation accuracy. The loss curves show the progress of the model during training, where categorical cross-entropy is being optimized. The decrease in training loss during optimization indicates that the model is learning. Validation loss also decreases, but in some cases overfitting may occur; this is managed and minimized by dropout layers (0.3), L2 regularization (0.001), ReduceLROnPlateau, and EarlyStopping over the epochs. The accuracy curves show the model learning: training accuracy improves with epochs while validation accuracy plateaus once the model generalizes to unseen data. If the validation loss starts increasing while the training loss keeps decreasing, this indicates overfitting, which is controlled by adjusting the learning rate and regularization. The model's performance across the different partitions is compared in order to select the version with the lowest validation loss and highest accuracy, confirming that the model is robust. Cross-validation minimizes the model's dependency on a single partition, since it is trained and tested on different dataset splits. The results confirm that our model effectively learns sign language patterns.
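For reference, a sketch of the 5-fold training loop with these callbacks is given below; build_model refers to the LSTM stack sketched earlier, and the callback patience values and learning-rate factor are assumed, not taken from the report.

```python
# Sketch of 5-fold cross-validated training with ReduceLROnPlateau and EarlyStopping.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

def cross_validate(build_model, X, y, epochs=100):
    """X has shape (samples, 30, 63); y is one-hot over the 26 letters."""
    histories = []
    for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True).split(X), start=1):
        callbacks = [
            ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),   # shrink LR when val loss stalls
            EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
        ]
        model = build_model()                                                # fresh model per fold
        history = model.fit(X[train_idx], y[train_idx],
                            validation_data=(X[val_idx], y[val_idx]),
                            epochs=epochs, callbacks=callbacks, verbose=0)
        histories.append(history)                                            # curves as in Figures 7 and 8
        print(f"Fold {fold}: best val_loss = {min(history.history['val_loss']):.4f}")
    return histories
```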
In Figure (9), the confusion matrix summarizes how well the model classifies the 26 English letters. Most of the model's predictions are correct, with very few misclassifications, which is why the diagonal entries are the strongest. Letters C (6), J (5), and U (5) are the most frequently correctly classified letters, showing that the model is very confident about these classes. The slight misclassifications seen in some cases may be associated with overlapping features of certain signs and with image quality. In summary, the model shows high precision and recall, which indicates that it generalizes well. Nonetheless, the remaining incorrect classifications can be addressed by augmenting the data, tuning the hyperparameters, or increasing the dataset size for greater robustness and real-world use.
Figure 10: Classification Report
Figure (10) shows that the system achieves 100% precision for all 26 English letters. The precision, recall, and F1-score metrics of the classifier are all 1.00, showing that there are no false positives or false negatives and that the model identifies each sign correctly. The support column shows the number of test samples per class for the different letters, which is essential to ensure a balanced evaluation. The similarity of the macro and weighted averages indicates consistency and fairness in the classification process. These results indicate that our model offers both predictability and reliability in real-time sign language translation.
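The confusion matrix and classification report in Figures 9 and 10 can be reproduced with scikit-learn as sketched below; the function name and the one-hot label format are assumptions.

```python
# Sketch of the evaluation step producing the confusion matrix and classification report.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

def evaluate(model, X_test, y_test_onehot):
    y_true = np.argmax(y_test_onehot, axis=1)
    y_pred = np.argmax(model.predict(X_test, verbose=0), axis=1)
    labels = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
    print(confusion_matrix(y_true, y_pred))                                # Figure 9
    print(classification_report(y_true, y_pred, target_names=labels))      # Figure 10
```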
Figure 11: Asking user to say the word/phrase for prediction
Figure 15: Accuracy of Speech recognition of 15 pre-defined words/phrases
Figure (15) shows the speech recognition accuracy for the specified phrases as spoken by different people, providing varied inputs to test the system's capabilities. The tested phrases are on the x-axis, and the y-axis shows the accuracy percentage on a scale from 96 to 104. The graph shows accuracy close to one hundred percent across all phrases, from simpler words like “dog” or “Hello” to longer phrases like “open the door” or “turn off the light.” The near-exact accuracy on all phrases demonstrates that the speech recognition system is powerful enough to achieve such precision without considerable fluctuation.
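The report does not state how the per-phrase accuracy in Figure 15 was computed; one plausible way, sketched below purely as an assumption, is a character-level similarity between the expected phrase and the recognized transcript.

```python
# Hypothetical per-phrase accuracy: similarity between the expected and recognized phrase (0-100).
from difflib import SequenceMatcher

def phrase_accuracy(expected, recognized):
    return 100.0 * SequenceMatcher(None, expected.lower(), recognized.lower()).ratio()

print(phrase_accuracy("open the door", "open the door"))            # 100.0
print(phrase_accuracy("turn off the light", "turn of the light"))   # slightly below 100
```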
6. REFERENCES
1. Adewale, Victoria, and Adejoke Olamiti. 2018. “Conversion of Sign Language To Text And
Speech Using Machine Learning Techniques.” JOURNAL OF RESEARCH AND REVIEW IN
SCIENCE 5(1). doi: 10.36108/jrrslasu/8102/50(0170).
2. Alzubaidi, Mohammad A., Mwaffaq Otoom, and Areen M. Abu Rwaq. 2023. “A Novel
Assistive Glove to Convert Arabic Sign Language into Speech.” ACM Transactions on Asian and
Low-Resource Language Information Processing 22(2):1–16. doi: 10.1145/3545113.
3. Athira, P. K., C. J. Sruthi, and A. Lijiya. 2022. “A Signer Independent Sign Language
Recognition with Co-Articulation Elimination from Live Videos: An Indian Scenario.” Journal of
King Saud University - Computer and Information Sciences 34(3):771–81. doi:
10.1016/j.jksuci.2019.05.002.
4. Bharti, Ritika, Sarthak Yadav, Sourav Gupta, and Rajitha B. 2019. “Automated Speech to
Sign Language Conversion Using Google API and NLP.” SSRN Electronic Journal. doi:
10.2139/ssrn.3575439.
5. Dua, Sakshi, Sethuraman Sambath Kumar, Yasser Albagory, Rajakumar Ramalingam,
Ankur Dumka, Rajesh Singh, Mamoon Rashid, Anita Gehlot, Sultan S. Alshamrani, and Ahmed
Saeed AlGhamdi. 2022. “Developing a Speech Recognition System for Recognizing Tonal Speech
Signals Using a Convolutional Neural Network.” Applied Sciences 12(12):6223. doi:
10.3390/app12126223.
6. Kothadiya, Deep, Chintan Bhatt, Krenil Sapariya, Kevin Patel, Ana-Belén Gil-González,
and Juan M. Corchado. 2022. “Deepsign: Sign Language Detection and Recognition Using Deep
Learning.” Electronics 11(11):1780. doi: 10.3390/electronics11111780.
7. Mean Foong, Oi, Tang Jung Low, and Wai Wan La. 2009. “V2S: Voice to Sign Language
Translation System for Malaysian Deaf People.” Pp. 868–76 in Visual Informatics: Bridging
Research and Practice. Vol. 5857, Lecture Notes in Computer Science, edited by H. Badioze
Zaman, P. Robinson, M. Petrou, P. Olivier, H. Schröder, and T. K. Shih. Berlin, Heidelberg:
Springer Berlin Heidelberg.
8. Munde, Mansi, Ganesh Jadhav, Sushma Gunjal, Kamlesh Mahale, and Aditya Kale. 2024.
“A Real-Time Sign Language to Text Conversion System for Enhanced Communication
Accessibility.” doi: 10.15157/QR.2024.2.1.7-13.
9. Natarajan, B., E. Rajalakshmi, R. Elakkiya, Ketan Kotecha, Ajith Abraham, Lubna
Abdelkareim Gabralla, and V. Subramaniyaswamy. 2022. “Development of an End-to-End Deep
Learning Framework for Sign Language Recognition, Translation, and Video Generation.” IEEE
Access 10:104358–74. doi: 10.1109/ACCESS.2022.3210543.
10. Sultana, Shaheena, M. A. H. Akhand, Prodip Kumer Das, and M. M. Hafizur Rahman. 2012.
“Bangla Speech-to-Text Conversion Using SAPI.” Pp. 385–90 in 2012 International Conference on
Computer and Communication Engineering (ICCCE). Kuala Lumpur, Malaysia: IEEE.
11. 76th round of the National Sample Survey (NSS).
https://des.delhi.gov.in/sites/default/files/report_on_survey_of_persons_with_disabilities.pdf
12. About American sign language
https://en.wikipedia.org/wiki/Sign_language#:~:text=Wherever%20communities%20of%20people
%20with,some%20form%20of%20legal%20recognition.
13. Types of sign language. https://www.ai-media.tv/knowledge-hub/insights/sign-language-
alphabets/