Sign Language Detection Using Computer Vision
Bachelor of Technology
in
Computer Science & Engineering
I hereby declare that the project work entitled "Sign Language Detection Using Computer Vision" is an authentic work carried out by me under the guidance of Prof. Sunila, Department of Computer Science & Engineering, in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science & Engineering, and that it has not been submitted anywhere else for any other degree.
Date: Signature
Jayant Kumar
200010130050
CERTIFICATE
This is to certify that Jayant Kumar (200010130050) is a student of B.Tech (CSE), Department of Computer Science & Engineering, Guru Jambheshwar University of Science & Technology, Hisar, and has completed the project entitled "Sign Language Detection Using Computer Vision".
List of Figures
List of Tables
Abstract
This project presents a method for sign language detection using computer vision techniques, MediaPipe hand tracking, hand gesture detection, and Google Teachable Machine.
Sign language is a vital communication tool for the deaf and hard-of-hearing community, but it remains largely inaccessible to those unfamiliar with it.
The approach addresses this gap by leveraging advanced technologies to accurately interpret and translate sign language gestures into text or speech in real time.
I employ MediaPipe, a robust framework developed by Google, for real-time hand tracking and gesture recognition. MediaPipe's hand detection module identifies key landmarks on the hands, allowing precise tracking of finger positions and movements.
These hand landmarks are then processed to detect specific sign language gestures.
To enhance the accuracy and reliability of the system, I integrate Google Teachable Machine, a machine learning platform that allows the creation of custom models without extensive coding.
By training models on a comprehensive dataset of sign language gestures, the system learns to recognize and differentiate among diverse signs with high accuracy.
The proposed system was evaluated through extensive testing, showing promising results in terms of accuracy, speed, and user-friendliness.
It demonstrates the potential of real-time sign language interpretation, offering a practical solution for bridging communication barriers and fostering inclusivity for the deaf and hard-of-hearing community.
Contents
Page No.
1. Introduction
1.1 Problem Definition
1.2 Objectives
2. Existing System
3. Problems in the existing system
4. Proposed System
5. Advantages of the proposed system
6. Software requirement specification document
7. Design of the proposed system
8. Implementation (Coding)
9. Testing
10. User's Manual
11. Conclusions
12. References / Bibliography
13. Plagiarism Report
1. Introduction
Deaf and hard-of-hearing people mainly use sign language, communicating through hand gestures. People who are not familiar with it often cannot understand it, creating communication barriers. By enabling real-time sign language recognition and translation, machine learning (ML) paves the way for innovative solutions to bridge this gap.
American Sign Language (ASL) is a natural language, complete with its own syntax, grammar, and lexicon. Signers use hand gestures, facial expressions, and body movements to convey meaning.
The main components of a sign are:
1. Handshape: the configuration of the fingers and hand.
2. Orientation: the direction the palm and fingers face.
3. Location: where the sign is produced relative to the body.
4. Movement: the motion of the hands while signing.
5. Non-manual signals: facial expressions and body postures that provide additional grammatical information.
Machine learning, a branch of artificial intelligence (AI), involves algorithms that recognize patterns and make decisions based on data. In the context of sign language recognition, ML models can be trained to recognize and interpret hand movements from visual data.
1. Computer Vision: This technology allows computers to interpret and process visual information from the world. For sign language recognition, computer vision algorithms analyze video footage to detect and track hand movements and posture.
2. MediaPipe: A framework developed by Google for real-time hand tracking; its hand detection module identifies key landmarks on the hands, enabling precise tracking of finger positions and movements.
3. Google Teachable Machine: This platform allows the creation of custom machine learning models without extensive coding. By training models on a large dataset of sign language gestures, the system learns to recognize individual signs with high accuracy.
The process of recognizing sign language using machine learning involves the following steps:
1. Data collection: Gather a large collection of video clips and images of different signs. This dataset should include different hand shapes, gestures, and orientations.
2. Preprocessing: Use computer vision techniques to preprocess the video data. This includes detecting the hand, segmenting the hand region, and identifying key landmarks.
3. Feature extraction: Extract features such as hand position, motion trajectory, and orientation from the preprocessed data. These features are the essential inputs for machine learning models.
4. Model training: Use the extracted features to train machine learning models. Common models include convolutional neural networks (CNNs) for image data and recurrent neural networks (RNNs) for sequential data.
5. Real-time recognition: Deploy the trained model in a real-time system. The model processes live video, detects hand gestures, and translates them into text or speech (a sketch of this step appears after this list).
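The listing below is a minimal sketch of this real-time recognition step, assuming MediaPipe for hand-landmark extraction and OpenCV for webcam capture; the classify_landmarks() helper is a hypothetical stand-in for the trained classifier (for example a CNN or a Teachable Machine export), so this is an illustration rather than the project's final implementation.

import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def classify_landmarks(landmark_vector):
    # Hypothetical placeholder for the trained model; in practice this would call
    # model.predict(...) or a forward pass on the 63-value landmark vector.
    return "A"

cap = cv2.VideoCapture(0)  # open the default webcam
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB images, while OpenCV delivers BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            vector = np.array([[p.x, p.y, p.z] for p in lm]).flatten()  # 21 landmarks x 3 coords
            label = classify_landmarks(vector)
            cv2.putText(frame, label, (10, 40), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("Sign detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
            break
cap.release()
cv2.destroyAllWindows()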
1.1 Problem Definition
Background
Sign language is a vital means of communication for the deaf and hard-of-hearing community, utilizing hand gestures, facial expressions, and body movements. Despite its significance, sign language remains largely inaccessible to individuals who do not understand it, creating a significant communication barrier. Traditional methods of sign language interpretation, such as human interpreters, are not always available and can be expensive.
There is a pressing need for an automated system that can accurately and efficiently translate sign language into text or speech in real time.
Problem Statement
The primary aim is to develop an automated system for real-time sign language detection and translation using machine learning. This system must be able to accurately recognize and interpret hand gestures corresponding to different sign language words and phrases, providing text or speech output that can be easily understood by people who do not know sign language.
2. Variability
Different Sign Languages: Each sign language, such as American Sign Language (ASL)
and British Sign Language (BSL), has its unique set of gestures and grammatical rules.
This necessitates the system to be versatile and adaptable to multiple sign languages.
Individual Differences: Variations in signing styles, speeds, and personal nuances
among individuals can lead to inconsistencies. Additionally, factors like hand size, shape,
and flexibility can affect how gestures are performed and perceived.
Environmental Factors: Changes in lighting conditions, background noise, and visual
distractions can significantly impact the system’s accuracy. The system must be robust
enough to handle different environments and conditions.
3. Real-time Processing
High Computational Demand: Processing video feeds in real time requires efficient
algorithms and powerful hardware. The system must analyze frames quickly to provide
instantaneous translations without lag.
Optimization: Balancing accuracy and speed is critical. The algorithms must be
optimized to ensure that they are fast enough for real-time use while maintaining high
accuracy levels.
4. Non-manual Signals
Facial Expressions: Facial expressions play a crucial role in sign language, conveying
emotions, questions, negations, and other grammatical elements. Capturing and
interpreting these expressions accurately is essential but challenging.
5. Training Data
Diverse Dataset: A large and diverse dataset is required to train machine learning models
effectively. The dataset must include various sign languages, gestures, and environmental
conditions to ensure the model’s robustness.
Data Collection and Annotation: Collecting and annotating a comprehensive dataset is
time-consuming and resource-intensive. Ensuring the quality and accuracy of the data is
crucial for effective model training.
Balancing the Dataset: The dataset should be balanced to include an equal representation
of different gestures and non-manual signals to prevent bias in the trained models.
1.2 Objectives
The objective is to develop robust computer vision algorithms to accurately detect hand movements and positions.
The aim is to achieve high precision in identifying diverse handshapes, orientations, movements, and locations.
The objective is to extract distinctive features from hand gestures, including handshape, motion trajectory, orientation, and location.
The goal is to ensure that the extracted features comprehensively capture the nuances of different sign language gestures for accurate recognition.
The goal is to train machine learning models on a comprehensive and diverse dataset of sign language gestures.
The aim is to employ advanced techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to achieve high recognition accuracy.
4. Real-Time Translation
The objective is to develop a system capable of processing video feeds and translating detected gestures into text or speech in real time.
The aim is to optimize the system for speed and efficiency to ensure a seamless and instantaneous user experience.
5. User-Friendly Interface
The objective is to design an intuitive and accessible user interface that enables easy interaction for both sign language users and non-users.
The goal is to create a clear and responsive interface that accurately presents translations of gestures.
The objective is to conduct rigorous testing and validation of the system under various conditions, such as different lighting, backgrounds, and user variations.
The aim is to ensure that the system maintains high accuracy and performance across diverse scenarios and real-world environments.
The purpose is to bridge the communication gap between sign language users and people unfamiliar with sign language, fostering greater inclusivity.
The goal is to design the system to be scalable and adaptable for future upgrades and expansions.
The aim is to plan for continuous updates and enhancements based on user feedback and technological advancements, ensuring long-term relevance and effectiveness.
By achieving these objectives, the project aims to develop a highly effective and inclusive sign language detection system that leverages the power of machine learning to facilitate real-time communication and foster greater inclusivity for the deaf and hard-of-hearing community.
2. Existing System
The existing sign language detection system is a basic machine learning system: it only detects signs from explicitly provided input images and does not run efficiently in real time.
The pipeline consists of three stages: Data Processing, Training, and Gesture Classification. The block diagram is simplified to abstract some of the minutiae:
• Data Processing: The load_data.py script contains functions to load the raw image data and save it as NumPy arrays in file storage. The process_data.py script loads the image data from data.npy and preprocesses each image by resizing/rescaling it and applying filters and ZCA whitening to enhance features. During training, the processed image data was split into training, validation, and testing sets and written to storage. Training also involves a load_dataset.py script that loads the relevant data split into a Dataset class. To use the trained model for classifying gestures, an individual image is loaded from the filesystem and processed in the same way.
• Training: The training loop for the model is contained in train_model.py. The model is trained with hyperparameters obtained from a config file that lists the learning rate, batch size, image filtering, and number of epochs. The configuration used to train the model is saved along with the model architecture for future evaluation and tweaking. Within the training loop, the training and validation datasets are loaded as DataLoaders and the model is trained using the Adam optimizer with cross-entropy loss. The model is evaluated on the validation set every epoch, and the model with the best validation accuracy is saved to storage for further evaluation and use. Upon finishing training, the training and validation error and loss are saved to disk, along with a plot of error and loss over training (an illustrative sketch of this loop appears after this list).
• Classify Gesture: After a model has been trained, it can be used to classify a new ASL gesture that is available as a file on the filesystem. The user inputs the filepath of the gesture image, and the test_data.py script passes the filepath to process_data.py, which loads and preprocesses the image in the same way as the training data.
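The listing below is a condensed sketch of the training loop described above (Adam optimizer, cross-entropy loss, checkpointing the model with the best validation accuracy), written in PyTorch; the model, the DataLoader objects, and the hyperparameter values are assumptions for illustration and not the original train_model.py.

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=20, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:  # one pass over the training split
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # Evaluate on the validation split after every epoch.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1)
                correct += (preds == labels.to(device)).sum().item()
                total += labels.size(0)
        accuracy = correct / total
        if accuracy > best_acc:  # keep only the best checkpoint
            best_acc = accuracy
            torch.save(model.state_dict(), "best_model.pt")
    return best_acc

A similarly minimal sketch of single-image classification follows; the label list, 64 x 64 grayscale input size, and preprocessing are assumptions based on the ASL Alphabet dataset described in the next subsection (26 letters plus space, delete, and nothing).

import numpy as np
import torch
from PIL import Image

LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["space", "delete", "nothing"]

def classify_gesture(model, filepath, size=(64, 64), device="cpu"):
    # Load and preprocess the image the same way as the training data.
    image = np.asarray(Image.open(filepath).convert("L").resize(size), dtype=np.float32) / 255.0
    tensor = torch.from_numpy(image).unsqueeze(0).unsqueeze(0).to(device)  # shape (1, 1, H, W)
    model.eval()
    with torch.no_grad():
        index = model(tensor).argmax(dim=1).item()
    return LABELS[index]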
Fig 2 (a) Flowchart of the process
Data Collection: The primary source of data for this project was the compiled dataset of American Sign Language (ASL) called the ASL Alphabet from Kaggle. The dataset comprises 29 classes of images: 26 for the letters A–Z and 3 for space, delete, and nothing. This data is solely of the user Akash gesturing in ASL, with the images taken from his laptop's webcam. These photos were then cropped, rescaled, and labeled for use.
Figure 2: Examples of images from the Kaggle dataset used for training: (a) ASL letter A, (b) ASL letter E, (c) ASL letter H, (d) ASL letter Y.
Test sets of images were taken with a webcam under different lighting conditions, backgrounds, and use of the dominant/non-dominant hand. These images were then cropped and preprocessed.
Fig 2 (b) Example hand gesture images
Data Pre-processing: The data preprocessing was done using the Pillow image-processing library and the sklearn.decomposition module, which is useful for its matrix optimization and decomposition functionality.
Edge Enhancement: Edge enhancement is an image-filtering technique that makes edges more defined. This is achieved by increasing the contrast in local regions of the image that are detected as edges. This makes the border between the hand and fingers and the background much clearer and more distinct, which can help the neural network identify the hand and its boundaries.
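As a simple illustration (not the original preprocessing script), Pillow's built-in edge-enhancement filter can be applied as follows; the filenames are placeholders.

from PIL import Image, ImageFilter

image = Image.open("hand_gesture.jpg")
enhanced = image.filter(ImageFilter.EDGE_ENHANCE_MORE)  # sharpen hand/finger boundaries
enhanced.save("hand_gesture_edges.jpg")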
Image Whitening: ZCA whitening is a technique based on the singular value decomposition of the data's covariance matrix. The algorithm decorrelates the data and removes redundant information, allowing the neural network to look for more complex and sophisticated relationships and to uncover the underlying structure of the patterns it is being trained on. After whitening, the data has zero mean and its covariance matrix is the identity.
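A compact NumPy sketch of ZCA whitening over a batch of flattened images is shown below; the epsilon term for numerical stability and the variable names are assumptions, and the snippet illustrates the technique rather than reproducing the project's exact code.

import numpy as np

def zca_whiten(X, eps=1e-5):
    # X: (n_samples, n_features) array of flattened, rescaled images.
    X = X - X.mean(axis=0)  # zero-center each feature
    cov = np.cov(X, rowvar=False)  # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)  # singular value decomposition
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA whitening matrix
    return X @ W  # decorrelated data with (approximately) identity covariance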
Fig 2 (d) Hand image after preprocessing
Machine Learning Model
Overall Structure: The model used in this classification task is a fairly basic implementation of a Convolutional Neural Network (CNN). As the project requires classification of images, a CNN is the go-to architecture. The basis for the model design came from the paper Using Deep Convolutional Networks for Gesture Recognition in American Sign Language, which accomplished a similar ASL gesture classification task [4]. The convolutional blocks are repeated three times and followed by fully connected layers that eventually classify into the required categories. The kernel sizes are maintained at 3 × 3 throughout the model. The dropout layers on the fully connected layers were omitted at first to allow for faster training and to establish a baseline without dropout.
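A PyTorch sketch of this kind of architecture is given below; the channel counts, the 64 x 64 grayscale input size, and the 29 output classes are assumptions for illustration rather than the exact layer sizes of the original model.

import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, num_classes=29):
        super().__init__()
        def block(in_ch, out_ch):
            # One convolutional block: 3x3 convolution -> ReLU -> 2x2 max pooling.
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))  # repeated 3 times
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),  # assumes 64x64 inputs reduced to 8x8 by three poolings
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))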
Fig 2 (e) Process of the model
3. Problems in the existing system
Table of Contents
3.1. Introduction
3.1.1 Overview
3.5. Model Training
3.5.1 Overfitting
3.6. Real-time Implementation
3.6.1 Latency
3.1. Introduction
3.1.1 Overview
Sign language detection is a complex task that involves interpreting hand gestures, facial
expressions, and body language to translate them into verbal language or text. With
advancements in machine learning and computer vision, researchers are developing systems
that can recognize and interpret sign language from images. These systems hold great
potential for improving accessibility and communication for the deaf and hard-of-hearing
communities.
Accurate sign language recognition systems can transform various sectors by making them
more inclusive. In education, such systems can assist teachers and students by providing real-
time translation. In healthcare, they can enable better communication between patients and
medical professionals. In public services, they can facilitate interactions in places like banks,
government offices, and transportation hubs.
This document aims to provide a comprehensive exploration of the challenges and solutions
in developing machine learning models for sign language detection from photos. It will cover
data challenges, model complexity, feature extraction, model training, real-time
implementation, cultural and linguistic variations, evaluation and benchmarking, case studies,
and future directions.
Problem - High-quality, labeled datasets are crucial for training machine learning models.
However, the availability of such datasets for sign language detection is limited.
Collecting these datasets requires capturing a wide range of signs performed by various
individuals under different conditions. This process is resource-intensive and time-
consuming. Existing datasets, such as RWTH-PHOENIX-Weather 2014 for German Sign
Language and the American Sign Language (ASL) dataset, are valuable but insufficient
for comprehensive model training.
Problem - In many datasets, some signs are more frequently used and thus more
represented than others, leading to class imbalance. This imbalance can cause models to
perform poorly on underrepresented signs. For instance, common signs like "hello" or
"thank you" might be well-represented, while less common signs might be scarce,
resulting in biased model performance.
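One standard mitigation for such imbalance, shown here purely as an illustration and not as part of any existing system, is to weight the training loss inversely to class frequency, for example:

import numpy as np
import torch
import torch.nn as nn

def make_weighted_loss(labels, num_classes):
    # labels: 1-D array of integer class labels from the training set.
    counts = np.bincount(labels, minlength=num_classes).astype(np.float32)
    weights = counts.sum() / (num_classes * np.maximum(counts, 1.0))  # inverse-frequency weights
    return nn.CrossEntropyLoss(weight=torch.tensor(weights))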
Problem - Many signs have subtle differences, making it difficult for models to
distinguish between them. For example, the signs for “I love you” and “rock on” in ASL
are very similar, involving slight differences in finger positioning. Detecting these
nuances requires highly sensitive models that can capture fine-grained details.
Problem - Sign language includes both static poses and dynamic gestures involving
movement. Capturing and interpreting these movements from a single image is
challenging, as it lacks temporal information. Dynamic gestures, such as those involving
movement from one hand position to another, require understanding the sequence of
frames.
Problem - Accurately detecting and isolating hands and fingers in images is complex,
especially in cluttered or noisy backgrounds. Misidentification can lead to poor feature
extraction and inaccurate recognition. For example, overlapping hands or occlusions by
objects can hinder accurate detection.
3.5.1 Overfitting
Problem - Overfitting occurs when a machine learning model learns the training data too well, including the noise and outliers, which leads to poor generalization on new, unseen data. This is a common issue in deep learning, especially with complex models and limited data.
Causes -
High model complexity: Deep neural networks with numerous parameters can fit the training data
perfectly but fail to generalize.
Limited data: Insufficient training data can cause the model to memorize rather than learn
patterns.
Lack of regularization: Absence of techniques to penalize complexity can lead to overfitting.
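Two standard countermeasures to the causes listed above are dropout and weight decay; the snippet below is a generic PyTorch illustration with assumed layer sizes and hyperparameters, not the configuration of any particular existing system.

import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zero activations during training to reduce co-adaptation
    nn.Linear(128, 29),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty on weights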
Problem - Training deep learning models for sign language detection is computationally
intensive, requiring substantial processing power and memory. Limited access to
powerful GPUs or TPUs can hinder the training process and slow down research
progress.
3.6. Real-time Implementation
3.6.1 Latency
Problem - For practical applications, sign language detection systems need to operate in
real-time, meaning they must process and interpret signs with minimal delay. High
latency can make these systems less effective and frustrating to use.
Problem - Sign languages are not universal; different regions and communities use distinct
versions of sign language, each with its own vocabulary, grammar, and nuances. This diversity
presents a significant challenge for developing models that can accurately recognize and interpret
signs across different languages and dialects.
Challenges:
Dataset diversity: Collecting comprehensive datasets that cover the variations in different sign
languages.
Model generalization: Ensuring models can generalize across different dialects and regional
variations.
Community involvement: Engaging with diverse communities to gather data and validate models.
4. Proposed System