0% found this document useful (0 votes)
21 views21 pages

Report SLD

The project report details the development of a Real-Time Bi-Directional Sign Language Interpreter aimed at bridging communication gaps between sign language users and non-users. It employs deep learning techniques, particularly Long Short-Term Memory (LSTM) networks, to accurately recognize hand gestures and convert them into text and speech, while also translating spoken language into sign language. The system has demonstrated 100% accuracy in controlled testing, promoting digital inclusivity and accessibility for individuals with hearing or speech impairments.

Uploaded by

Anshika Gupta
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views21 pages

Report SLD

The project report details the development of a Real-Time Bi-Directional Sign Language Interpreter aimed at bridging communication gaps between sign language users and non-users. It employs deep learning techniques, particularly Long Short-Term Memory (LSTM) networks, to accurately recognize hand gestures and convert them into text and speech, while also translating spoken language into sign language. The system has demonstrated 100% accuracy in controlled testing, promoting digital inclusivity and accessibility for individuals with hearing or speech impairments.

Uploaded by

Anshika Gupta
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

A Project Report

on

REAL TIME BI-DIRECTIONAL


SIGN LANGUAGE INTERPRETER
carried out as part of the Minor Project IT3270 Submitted

by

SIDDHANT GARG
229302317

in partial fulfilment for the award of the degree of

Bachelor of Technology
in

Information Technology

Under the Guidance of


Dr. Kavita Jhajharia

Department of Information Technology

MANIPAL UNIVERSITY JAIPUR


RAJASTHAN, INDIA

May 2025
CERTIFICATE

Date: 17-04-2025

This is to certify that the minor project titled REAL TIME BI-DIRECTIONAL SIGN LANGUAGE
INTERPRETER is a record of the bonafide work done by SIDDHANT GARG (229302317)
submitted in partial fulfilment of the requirements for the award of the Degree of Bachelor of
Technology in Information Technology of Manipal University Jaipur, during the academic year 2024-
25.

Dr. Kavita Jhajharia


Assistant Professor
Project Guide, Department of Information Technology
Manipal University Jaipur

Dr. Pratistha Mathur


HoD, Department of Information Technology
Manipal University Jaipur
ABSTRACT

Sign language serves as a vital means of communication for individuals who are hearing or speech
impaired. However, a considerable communication gap persists between those who use sign language
and the majority of the population who do not. To address this issue and promote inclusive interaction,
we propose a Real-Time Bi-Directional Sign Language Detection System that facilitates seamless
communication between sign language users and non-users. The proposed system comprises two integral
modules: (a) a sign-to-text-to-speech pipeline that employs computer vision and deep learning
techniques to detect and interpret hand gestures, convert them into text, and subsequently synthesize
speech output; and (b) a speech-to-text-sign module that uses advanced speech recognition technologies
to transcribe spoken words into text, which is then translated into corresponding sign language gestures
using a predefined gesture dataset. This dual-mode functionality ensures that communication flows
naturally in both directions, bridging the divide between spoken and visual language systems.

To ensure high accuracy and efficiency, our model utilizes Long Short-Term Memory (LSTM)
networks for dynamic gesture recognition, capable of understanding complex temporal patterns in sign
sequences. Additionally, optimized speech processing algorithms are integrated to handle real-time audio
input with minimal latency. The system architecture is designed to work seamlessly in real-time
environments, making it practical for everyday use in schools, workplaces, healthcare, and public spaces.
Extensive testing and evaluation have demonstrated that the system achieves 100% accuracy in both
translation directions under controlled conditions, highlighting its robustness and potential as a reliable
assistive communication tool. By enabling two-way communication between the hearing-impaired
community and the general population, this system represents a significant step toward digital inclusivity
and societal integration, empowering users through technology-driven accessibility.
LIST OF FIGURES
Figure No Figure Title Page No
1. Flow-chart Sign to text to speech 5
2. Flow-chart Speech to text to sign 6
3. Sample Dataset 7
4. LSTM Layer 8
5. System Predicting Alphabets 9
6. Media-player of predicted output 9
7. Training Loss and Validation Loss 10
8. Training Accuracy and Validation Accuracy 11
9. Confusion Matrix 12
10. Classification Report 13
11. Asking user to say the word/phrase for prediction 14
12. Confirming the phrase spoken 14
13. Showing expected output and accuracy 14
14. Grid showing “HELLO” in sign language 14
15. Accuracy of speech recognition with 15 pre-defined phrases 15
Table of Content

Page No

Chapter 1 INTRODUCTION

1.1 Problem Statement 2

1.2 Objectives of the Project 2

1.3 Scope of Report 3

Chapter 2 BACKGROUND OVERVIEW 3

Chapter 3 METHODOLOGY

3.1 Flowchart Sign Language Detection 4

3.2 Flowchart Speech to Sign Language 6

3.3 Model Architecture 7

3.4 Speech to text and sign 8

Chapter 4 RESULTS

4.1 Sign to speech and text conversion 8

4.2 Speech to text and sign conversion 13

Chapter 5 FUTURE WORK AND CONCLUSION 15

REFERENCES 16

Page | 1
1. Introduction
1.1 Problem statement:
Clear and effective communication is a fundamental part of human interaction. People
typically communicate through two primary modes: verbal and non-verbal. However, a
significant portion of the global population encounters difficulties in communication due to
hearing and speech disabilities. This challenge is often compounded by the general public’s
limited familiarity with sign language, making it especially difficult for individuals with
hearing impairments to interact with society. According to the 76th round of the National
Sample Survey (NSS) conducted by the National Statistical Office, 2.2% of India’s
population was reported to have a disability between July and December 2018. This
included 2.3% in rural areas, 2.0% in urban zones, 2.4% of males, 1.9% of females, 2.23%
of Scheduled Caste (SC) individuals, and 1.92% of those belonging to Scheduled Tribes
(ST). For those who are deaf, hard of hearing, or unable to speak, sign language serves as a
vital mode of communication. It provides them with a way to express emotions, convey
information, and engage in social interactions, bypassing the limitations posed by auditory
or verbal communication. Bridging the gap between sign language users and those
unfamiliar with it requires innovative technological interventions. Sign language, being a
gestural and visual form of communication, plays a crucial role in helping the especially
abled access educational opportunities, employment, healthcare, and public services, thereby
promoting inclusivity and participation in everyday life.

1.2 Objective:

• Enable bi-directional communication between sign language and spoken language users
through real-time translation.

• Leverage deep learning models (LSTM) for accurate recognition of dynamic hand gestures
in sign language.

• Utilize computer vision techniques to detect and process hand gestures from live video
input.

• Integrate speech-to-text conversion using speech recognition to interpret spoken


language input.

• Translate text into sign language using a predefined dataset of sign gestures.

• Convert recognized signs into audible speech via text-to-speech synthesis for better
interaction.

• Ensure real-time processing with minimal latency for smooth, natural conversations.

• Achieve high accuracy in both gesture and speech recognition through robust training and
evaluation.

• Promote digital inclusivity and accessibility for individuals with hearing or speech
impairments.

• Develop a user-friendly interface suitable for deployment in real-life scenarios like


classrooms, hospitals, and public spaces.

Page | 2
1.3 Scope:

The scope of this project encompasses the design and development of a Real-Time
Bi-Directional Sign Language Interpreter capable of translating between sign
language and spoken language to facilitate inclusive communication. The project
covers key areas such as computer vision for hand gesture detection, deep learning
(specifically LSTM networks) for gesture recognition, speech recognition for
converting spoken words into text, and text-to-speech (TTS) and sign language
rendering for output generation. It also involves the creation of a predefined dataset
of sign gestures, real-time processing pipelines, and a user-friendly interface that
ensures accessibility. The system is aimed at bridging communication gaps in
practical settings such as educational institutions, healthcare facilities, workplaces,
and public service environments, where interaction between the hearing-impaired
and non-sign language users is essential.

2. Background Details

American Sign Language (ASL) is mostly used in the United States and Canada as a visual-
spatial language. ASL is a natural language complete with its own phonology (that is, the way small
visual units are combined to form larger meaningful units), morphology (that is, the way meaningful
signs are formed), syntax, and grammar separate from that of spoken English. It uses hand signs,
movements, arm movements, facial expressions, and body language to represent semantic
meanings, emotions, and abstract concepts. Though ASL is used widely, it is also specific to the
Deaf culture of North America, making it important for communication within deaf and hard-of-
hearing communities. Several studies have been conducted on the ASL.

(Kothadiya et al. 2022) proposed a dataset-based revolutionary neural network architecture for sign
language recognition and tracking and used a model for real-time text generation on video. The
architecture of this system is multi-phase, consisting of frame generation, image pre-processing,
hand movement analysis, and location-based feature extraction phases, among others. Hand
attributes were defined using a point of interest (POI) of the hand. Employing this approach, 55
distinct features were derived and applied as input within the neural network architecture which was
made up of CNN layers that forecasted the signs. The proposed model was trained and tested on
English alphabets (from A to Z) and achieved 100% accuracy and 48% of immunity to noise [6].

(Natarajan et al. 2022) The authors have set up the framework for handling sign language
recognition, translation, and production tasks using MediaPipe as the library, and a hybrid
Convolutional Neural Network + Bi-directional Long Short-Term Memory. A model hybrid NMT,
MediaPipe, and Dynamic GAN are adopted for video presentation of spoken sentences. For the
production of good recognition accuracy and visual quality, he experimented with various
multilingual benchmark sign corpuses, which resulted in above 95% classification accuracy.
The proposed model had an average Bilingual Evaluation understudy score of 38.06, excellent
human evaluation scores, 3.46 average Fréchet Inception Distance to videos FID2vid score, 0.921
average Structural Similarity Index Measure SSIM values, 8.4 average Inception Score, 29.73
average Peak Signal-to-Noise Ratio PSNR score, 14.06 average Fréchet Inception Distance FID
score, and an average 0.715 Temporal Consistency Metric TCM Score which is a proof of
the proposed work. [9].

Page | 3
(Alzubaidi, Otoom, and Abu Rwaq 2023) proposed a system that provides an assistive device that
helps special people to communicate with others. The author made an electronic glove that uses an
MPU6050 sensor to track hand movements and potentiometers to monitor finger positions. Arduino
board is then used to determine the meaning of gestures and get the voice of the corresponding
word. The highest accuracy achieved with the Decision Tress Algorithm was 98% [2].

(Adewale and Olamiti 2018) The author has translated ASL into text and speech applying
unsupervised feature learning in the paper. The developed framework undergoes data capturing via
a KINECT sensor, feature extraction from Region of Interest (ROI), and supervised and
unsupervised classification of images with K-Nearest Neighbour(KNN). The system achieved a
78% accuracy of unsupervised feature learning. [1].

(Mean Foong, Low, and La 2009) in the paper “presented a Template-based recognition in this
study to convert voice to sign language. The V2S system was first trained with speech pattern which
is based on a generic spectral parameter set. The Database is used to store the spectral parameters as
templates. Recognition of speech is performed by matching the parameter set of the input with the
templates stored in the database and finally displaying the sign language in video format. Results
showed that the system has an 80.3% recognition rate [7].

(Munde et al. 2024) In this method, an artificial neural network is used for the identification of hand
gestures; this is introduced with CNNs. The finger-spelt American Sign Language is very precisely
identified using deep-learning techniques. Where for new users, parametric depth techniques
provide 83-85% accuracy, repetitions previously trained give 99.99%, in an underwhelming and
unsophisticated manner of a multilevel approach provides a 98% accuracy to the system [8].

(Sultana et al. 2012) In this proposed model, speech from the Bangla language is converted to text
using the Speech Application Program Interface (SAPI) where the configuration of SAPI compared
pronunciations from continuous Bangla speech against a precompiled grammar file. After a match
was found, SAPI returned Bangla words in English characters. An average of about a 78%
recognition rate was obtained [10].

(Bharti et al. 2019) the automated system first recognizes speech, the second converts
it into text, the third matches the tokenized text with the library of visual sign words (videos of
sign languages), the fourth conceits all the matched videos according to the text recognized and
finally shows the merged video to the deaf/dumb person. System accuracy is
compared to the state of art approaches; found to be the best mode with accuracy at 67% in
offline mode and 93% while working online. [4].
(Dua et al. 2022) In the paper, the major aim of the author was to develop a speech-to-text
recognition system that recognizes infrequent speech signals (tonal) of Gurbani hymns using
CNNs. In addition to Praat for speech segmentation, six layers of 2DConv, 2DMax Pooling, and
256 units of dense layers (Google's TensorFlow service) were also used in this research work. This
architecture gave 89.15% accuracy. [5].

(Athira, Sruthi, and Lijiya 2022) Proposed a system that can recognize a wide range of gestures in
real-time videos which includes single-handed static and dynamic gestures, double-handed static
gestures, and finger spelling words of Indian Sign Language (ISL). The model successfully
recognized finger spelling alphabets with 91% accuracy and single-handed dynamic words with
89% accuracy [3].

Page | 4
3. System Design and Methodology:

A Bi-Directional Sign Language detector is a technology that detects hand gestures and converts
them into text and voice equivalents. It also converts voice to its equivalent ASL gestures. Sign
language bridges the gap for visually challenged people.

3.1 Flowchart Sign Language Detection


The flowchart (Figure 1) outlines the gesture detection process in the sign language system.
It starts with real-time video capture, identifying the hand Region of Interest (ROI).
MediaPipe extracts key landmark points from the ROI, which are stored in a buffer of 30
frames. Once full, a pre-trained model predicts the gesture. If the prediction accuracy
exceeds 80%, the corresponding character and confidence score are displayed and spoken
using Text-To-Speech (TTS). Otherwise, the system continues collecting frames.

Figure 1: Flowchart Sign to Text to Speech

Page | 5
3.2 Flowchart Speech to Sign Language Detection
The flow chart (Figure 2) explains speech-to-sign language detection. The flow chart can be
seen in the figure the device opens the Microphone to capture input. The device checks for
ambient noise and checks if speech is recognized? If not, then again capture the input from
the microphone. If speech is recognized, then speaks recognized text using pyttsx3 and
displays corresponding letters using OpenCV. For each character in the text, it checks if the
corresponding letter is in dictionary? If not, then notify the unsupported characters and
display the next corresponding letter. If the letter is in the dictionary, then display the
corresponding letter using OpenCV. Close the displayed image window and again initialize
the microphone to detect the next letter else stop.

Figure 2: Flowchart Speech to Text to sign

Page | 6
The dataset used in the proposed system is tailor-made for the application. It includes a minimum of
30 sample images for each of the 26 English alphabets. A real-time gesture data acquisition setup is
utilized to generate this dataset for sign language recognition. Through the webcam, the system
captures specific hand gestures and stores them into designated folders. Each alphabet has its own
dedicated directory, totaling 26 folders from A to Z. The system automatically scans the folders to
determine the current file count, allowing it to assign unique filenames and maintain systematic
organization. A rectangular Region of Interest (ROI) is defined on the video frame to isolate the
hand gesture from the background. This focused approach helps eliminate distractions, reduces
noise, and enhances the dataset’s overall quality, enabling the machine learning model to learn only
the relevant features of each gesture. When a particular key is pressed on the keyboard, the system
captures the frame and saves it into the respective directory—such as pressing ‘a’ for the letter A,
‘b’ for B, and so on.

C F R N L
Figure 3: Sample of Dataset

The system uses Mediapipe to extract the key features of the captured images. There are a total of
21 different key points that are extracted in the image considering fingertip, palm, and writs. The
system first converts the image from BGR (Blue, Green, Red) to RGB (Red, Green, Blue) as needed
by mediapipe, detecting hand landmarks using the model, and then converting the frames back to
the BGR format for visualization.

3.3 MODEL ARCHITECTURE


Every sign language gesture in the system is modeled as a sequence of feature vectors.
These vectors are extracted from pre-processed frames stored in .npy files, with each frame
containing 63 unique attributes such as key point coordinates and other relevant features.
The input data is structured as a three-dimensional tensor with the format (samples,
sequence_length, features), where:

• Samples: Represents the total count of gesture sequences, each tied to a specific sign
language symbol.
• Sequence Length: Denotes the number of frames per sequence (typically 30 frames).
• Features: Refers to the set of features obtained from each frame (e.g., 63 numerical
values).

To effectively analyze this time-series data, the model employs a stack of three LSTM layers:

1. First LSTM Layer: Contains 64 units and captures basic temporal dependencies within
the sequence; its outputs are passed as sequences to the next layer.
2. Second LSTM Layer: With 128 units, this layer learns more complex temporal patterns
and continues outputting sequences.
3. Third LSTM Layer: This layer condenses the sequential information into a final
representation using 64 units and produces a non-sequential output for classification.

Page | 7
Figure 4: LSTM Layers

3.4 SPEECH TO TEXT AND SIGN LANGUAGE


The system combines speech recognition with image processing to convert spoken language
into its corresponding sign language visualizations. It begins by initializing the microphone
using sr.Microphone(), applying ambient noise calibration, and capturing audio input. The
speech input is then processed through the Google Speech Recognition API, which converts
the spoken words into text and formats it in lowercase for consistency. The converted text is
then confirmed aloud using the pyttsx3 text-to-speech engine. A predefined mapping is
maintained that links each letter from A to Z and each digit from 0 to 9 with their respective
sign language images. The recognized text is parsed one character at a time—if a match is
found in the dictionary, the related image is opened using OpenCV, resized for uniform
appearance, and briefly displayed before moving on to the next symbol. This operation runs
continuously in a loop, enabling ongoing speech recognition, voice confirmation, and visual
representation of signs. The system is designed to handle multiple exceptions gracefully,
such as failures in network connectivity, errors in speech recognition, or missing image files.

4. RESULTS AND DISCUSSION


This section briefly describes the outcome of the Real Time Bi-Directional Sign Language
interpreter, which converts sign language to text, text to speech, speech to text, and text back to sign
language. The system effectively bridges the gap for especially abled people. The performance of
the system in recognizing and translating American sign language to text and speech, as well as
converting speech to text and text back to sign language, is evaluated and discussed.

4.1 Sign to Text and Speech Conversion


Firstly, the system converts sign language to text. With the help of Mediapipe and Open-Cv,
the system detects a particular hand gesture and predicts the corresponding gesture to its
American Sign Language alphabet. The system uses Mediapipe to detect hand gestures,
which are then converted to key points. The extracted key points (hand landmark positions)

Page | 8
are passed as input to a trained LSTM (Long Short-Term Memory) model. For the final
prediction, the system used a Convolutional Neural Network (CNN) model to detect hand
gestures and predict sign language to text. If the predicted gesture is different from the last
predicted gesture, it gets added to the sentence list to avoid duplicate consecutive predictions
and ensures smooth English alphabets formation. The system updates its prediction of hand
gesture only when a new prediction is made

Figure 5: System predicted alphabets ‘Y’ and ‘C’ which their accuracy

The system predicts hand gestures only when the confidence exceeds 80%. If not, it waits for
additional frames to ensure accuracy. Once a gesture is confidently recognized, it is converted to
speech using the pyttsx3 Text-to-Speech (TTS) library, allowing offline audio playback of the
detected alphabet.

Figure 6: Screen short of media player of predicted output

Page | 9
Figure 7: Training Loss and Validation Loss

Page | 10
Figure 8: Training Accuracy and Validation Accuracy

Page | 11
The figure (7) and figure (8) depict the performance during the training and testing of our LSTM
model using 5-fold cross-validation. Each of the 5 splits have both a validation and training
obtained, so there are four graphs per split: training loss, training accuracy, validation loss, and
validation accuracy. The loss curves show the progress of the model during training, where
categorical cross-entropy is being optimized. The decrease in training loss during the model
optimization process indicates that the model is learning. Validation loss also decreases, but there
are some cases where overfitting may occur. This is managed and minimized by drop out layers
(0.3), L2 regularization (0.001), ReduceLROnPlateau, and EarlyStopping over the epochs. The
accuracy curves indicate the model learning where training accuracy improves with epochs while
the validation accuracy plateaus if the model generalizes to unseen data. If the validation loss starts
increasing while the training loss decreases, then this indicates overfitting, this is controlled by
changing the learning rate and regularization. The model's performance spread across different
partitions are compared to selecting the version with the least validation loss and greatest accuracy
in order to confirm the model is robust. Cross-validation minimizes the model dependency on a
single partition, so it is trained and tested on different dataset splits. The results confirm that our
model effectively learns sign language patter

Figure 9: Confusion Matrix

In Figure (9), the confusion matrix summarizes how good the model is at classifying the 26 English
alphabets in terms of performance. The model’s predictions are correct with very few of them being
incorrectly classified, which is why the diagonals are the strongest. Letters C (6), J (5), and U (5)
are the most frequently correct classified letters and show the model is very sure about these classes.
The slight misclassifications seen in some of the cases may be associated with overlapping features
of certain signs and image quality. In summary, it has already been identified that the model has a
great deal of precision and recall which means it is very likely that the model has good
generalizations. Nonetheless, there are some classifications that are done incorrectly, and this
suggests that there are also some classifications that are done incorrectly that can be addressed by
augmenting the data, tuning the hyperparameters, or increasing the dataset size for increased
robustness and real-world use.

Page | 12
Figure 10: Classification Report

The figure (10) shows that the system achieves 100% precision for all 26 English letters. The
precision, recall, and F1-score metrics used in the classifier show all 1.00, which shows that there is
no false positive or false negative. It shows that the model can work perfectly by identifying each
sign correctly. The support column thus shows the number of test samples per class, with different
letters that are essential to ensure a balanced evaluation. The similarity of the values obtained both
from macro and weighted averages to each other indicates consistency and fairness in the
classification process. These accurate results indicate that our model guarantees both predictability
and reliability in real-time sign language translation.

4.2 Speech to Text and Sign Conversion


This part of the model successfully converts speech to text and the maps spoken words to
their matching sign language gestures. A set of 15 pre-defined phrases (hello, dog, open the
door, good morning, computer, turn off the light, thank you, please help, good night, what is
your name, stop, peanut, left, welcome, python) were used as test inputs to check the
accuracy of the speech to text conversion. The system takes spoken words(fig) and converts
them into text using the Google Speech Recognition API (fig). However, the system can
recognise and convert any spoken word to text and its equivalent sign language gesture.
Recognition accuracy is computed via the Levenshtein distance between the expected and
recognized text. Levenshtein distance helps in recognizing the performance of speech
recognition even if minor errors exist.
After converting speech into text, the system processes every letter in the recognized text
individually. These letters match their respective ASL representations, which are stored as
images in a pre-defined dictionary. Once internal mapping is successfully performed from
the recognized letters to their respective ASL images, the arranged content appears in
structured format on a grid layout(fig.15). The grid display relies entirely on graphics to sign
back the spoken phrase. Still, it goes well for persons who are hearing impaired to get the
message across with the aid of sign language. The process bridges the gap of communication
between spoken language and sign language, thus enhancing accessibility and
communication.
The system correctly maps each spoken word to text and its parallel sign language gestures.
Yet if speech recognition errors occurred, the displayed signs reflected the incorrect
transcription.

Page | 13
Figure 11: Asking user to say the word/phrase for prediction

Figure 12: Confirming the spoken phrase

Figure 13: Showing expected output and Accuracy

Figure 14: Grid representation of Hello Sign language

Page | 14
Figure 15: Accuracy of Speech recognition of 15 pre-defined words/phrases

In Figure (15), users can see the speech recognition accuracy of specified phrases that different
people try to say. There are varying inputs to test the system's capabilities. The tested phrases are on
the x-axis and on the y-axis is the accuracy percentage which ranges from 96 to 104. The graph
shows a percentage close to one hundred across all phrases especially for the simpler words like
“dog” or Hello” to the more advanced phrases like “open the door” or “turn of the light.” The close
to exact accuracy on all phrases showcases the accuracy and strength of the speech recognition
system as poweful enough to acheive such precision without considerable fluctuation.

5. CONCLUSION AND FUTURE WORK


In the proposed system, we have successfully developed and demonstrated a real-time bi-
directional sign language detection system designed to bridge the communication gap between
individuals who use sign language and those who do not. This model is an inclusive solution that
facilitates seamless interaction by converting sign language gestures into text and synthesized
speech, and conversely, transforming spoken language into corresponding sign language
gestures. The implementation leverages cutting-edge technologies including computer vision,
deep learning, and natural language processing (NLP) to ensure high accuracy, fast processing,
and a user-friendly interface. Our experimental results confirm that the system achieves 100%
accuracy in translating in both directions for the data set tested, making it highly reliable for real-
world use. It performs exceptionally well in recognizing all 26 English alphabets and a predefined
set of spoken phrases, thus providing a powerful framework for bi-directional communication.
Moreover, integrating support for multiple sign language systems will enhance cross-cultural and
international usability. We also plan to develop a mobile or embedded version of this system to
make it portable and accessible to a broader audience, especially in remote or resource-limited
settings. Optimization for smartphones, Raspberry Pi, or wearable devices such as smart gloves and
AR glasses would make the technology more usable in daily life scenarios. In conclusion, our bi-
directional sign language detection system represents a significant step towards an inclusive future
where technological innovation helps remove communication barriers. With further research,
enhancements, and integration with modern devices, it has the potential to be widely adopted in
education, healthcare, customer service, public administration, and beyond.

Page | 15
6. REFERENCES

1. Adewale, Victoria, and Adejoke Olamiti. 2018. “Conversion of Sign Language To Text And
Speech Using Machine Learning Techniques.” JOURNAL OF RESEARCH AND REVIEW IN
SCIENCE 5(1). doi: 10.36108/jrrslasu/8102/50(0170).
2. Alzubaidi, Mohammad A., Mwaffaq Otoom, and Areen M. Abu Rwaq. 2023. “A Novel
Assistive Glove to Convert Arabic Sign Language into Speech.” ACM Transactions on Asian and
Low-Resource Language Information Processing 22(2):1–16. doi: 10.1145/3545113.
3. Athira, P. K., C. J. Sruthi, and A. Lijiya. 2022. “A Signer Independent Sign Language
Recognition with Co-Articulation Elimination from Live Videos: An Indian Scenario.” Journal of
King Saud University - Computer and Information Sciences 34(3):771–81. doi:
10.1016/j.jksuci.2019.05.002.
4. Bharti, Ritika, Sarthak Yadav, Sourav Gupta, and Rajitha B. 2019. “Automated Speech to
Sign Language Conversion Using Google API and NLP.” SSRN Electronic Journal. doi:
10.2139/ssrn.3575439.
5. Dua, Sakshi, Sethuraman Sambath Kumar, Yasser Albagory, Rajakumar Ramalingam,
Ankur Dumka, Rajesh Singh, Mamoon Rashid, Anita Gehlot, Sultan S. Alshamrani, and Ahmed
Saeed AlGhamdi. 2022. “Developing a Speech Recognition System for Recognizing Tonal Speech
Signals Using a Convolutional Neural Network.” Applied Sciences 12(12):6223. doi:
10.3390/app12126223.
6. Kothadiya, Deep, Chintan Bhatt, Krenil Sapariya, Kevin Patel, Ana-Belén Gil-González,
and Juan M. Corchado. 2022. “Deepsign: Sign Language Detection and Recognition Using Deep
Learning.” Electronics 11(11):1780. doi: 10.3390/electronics11111780.
7. Mean Foong, Oi, Tang Jung Low, and Wai Wan La. 2009. “V2S: Voice to Sign Language
Translation System for Malaysian Deaf People.” Pp. 868–76 in Visual Informatics: Bridging
Research and Practice. Vol. 5857, Lecture Notes in Computer Science, edited by H. Badioze
Zaman, P. Robinson, M. Petrou, P. Olivier, H. Schröder, and T. K. Shih. Berlin, Heidelberg:
Springer Berlin Heidelberg.
8. Munde, Mansi, Ganesh Jadhav, Sushma Gunjal, Kamlesh Mahale, and Aditya Kale. 2024.
“A Real-Time Sign Language to Text Conversion System for Enhanced Communication
Accessibility.” doi: 10.15157/QR.2024.2.1.7-13.
9. Natarajan, B., E. Rajalakshmi, R. Elakkiya, Ketan Kotecha, Ajith Abraham, Lubna
Abdelkareim Gabralla, and V. Subramaniyaswamy. 2022. “Development of an End-to-End Deep
Learning Framework for Sign Language Recognition, Translation, and Video Generation.” IEEE
Access 10:104358–74. doi: 10.1109/ACCESS.2022.3210543.
10. Sultana, Shaheena, M. A. H. Akhand, Prodip Kumer Das, and M. M. Hafizur Rahman. 2012.
“Bangla Speech-to-Text Conversion Using SAPI.” Pp. 385–90 in 2012 International Conference on
Computer and Communication Engineering (ICCCE). Kuala Lumpur, Malaysia: IEEE.
11. 76th round of the National Sample Survey (NSS).
https://des.delhi.gov.in/sites/default/files/report_on_survey_of_persons_with_disabilities.pdf
12. About American sign language
https://en.wikipedia.org/wiki/Sign_language#:~:text=Wherever%20communities%20of%20people
%20with,some%20form%20of%20legal%20recognition.
13. Types of sign language. https://www.ai-media.tv/knowledge-hub/insights/sign-language-
alphabets/

Page | 16
Page | 17

You might also like