Final Project 1
On
“EDGE AI FOR REAL-TIME SIGN LANGUAGE TRANSLATION”
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
Submitted by:
Team Members
NAME: Shivaraj Y USN: 2KA22IS048
NAME: Iranna S J USN: 2KA22IS017
NAME: Puneeth A USN: 2KA22IS039
NAME: Akash P C USN: 2KA23IS401
Smt Kamala & Sri Venkappa M Agadi College of Engineering & Technology
Department of Information Science & Engineering
Lakshmeshwar-582116
2024-2025
Certificate
This is to certify that the Project Phase I work entitled “EDGE AI FOR REAL-TIME
SIGN LANGUAGE TRANSLATION” is a bona fide work carried out by
Shivaraj Yaliwal (2KA22IS048), Iranna S J (2KA22IS017), Puneeth Akki
(2KA22IS039), and Akash P C (2KA23IS401), in partial fulfillment of the
requirements for the award of the degree of Bachelor of Engineering in
Information Science and Engineering of Visvesvaraya Technological
University, Belagavi, during the year 2024-2025. It is certified that all the
corrections/suggestions indicated for internal assessment have been incorporated
in the report. This report has been approved as it satisfies the academic
requirements in respect of project phase I work prescribed for the Bachelor of
Engineering degree.
ABSTRACT
The "Edge AI for Real-Time Sign Language Translation" project addresses communication
barriers between deaf and hearing communities by leveraging edge computing and artificial
intelligence. The system aims to deliver low-latency (≤100ms), accurate, and privacy-preserving
translation of sign language into text or speech, and vice versa, using resource-constrained edge devices.
By integrating a hybrid CNN-LSTM architecture, the model captures both spatial (handshape,
orientation) and temporal (motion trajectory) features of sign language gestures, ensuring robust
recognition. This approach aligns with advancements in lightweight AI models optimized for edge
deployment, such as TensorFlow Lite, which minimizes computational overhead while maintaining
accuracy.
While existing solutions like Sign-Speak and Signapse demonstrate the feasibility of AI-driven
sign language translation, this project advances the field by optimizing latency and model efficiency
for edge devices. Challenges such as limited dataset diversity and environmental variability remain, but
future work aims to expand language support (e.g., ASL, ISL, Libras) and improve continuous sign
recognition through advanced sequence models. By bridging technological gaps in accessibility, this
system empowers deaf individuals to communicate autonomously, fostering inclusivity in societal and
professional contexts.
CONTENTS
1 Introduction
2 Literature Survey
3 Problem Identification
4 Objectives
5 Methodology
6 References
Chapter: 1
Introduction
The selection of Edge AI for this initiative is a strategic design choice, meticulously
crafted to overcome common limitations associated with traditional cloud-based AI solutions.
Cloud-dependent systems often introduce inherent latency due to the round-trip data
transmission to and from remote servers. For conversational interfaces, even minimal delays
can disrupt the natural flow of interaction, leading to frustration and hindering effective
dialogue. Furthermore, transmitting sensitive visual data, such as sign language gestures, to
external servers raises considerable privacy and security concerns, as this information could
potentially be compromised or misused. By performing data processing locally on the device,
Edge AI directly mitigates these issues, eliminating network latency for inference and ensuring
that sensitive communication data remains localized, thereby significantly enhancing privacy
and security. This approach positions the project as a highly practical and user-centric solution,
prioritizing both performance and trust.
The overall goals of this project are ambitious and precisely defined. A primary
objective is to enable real-time translation with an exceptionally low latency target of less than
or equal to 100 milliseconds (ms). This specific latency benchmark is not merely a technical
specification but a fundamental requirement for maintaining the natural rhythm and
interactivity of human conversation. In human-computer interaction, 100ms is widely
considered the threshold for "instantaneous" feedback, where delays are barely perceptible.
Achieving this demanding target on resource-constrained devices, such as a Raspberry Pi 4,
necessitates significant engineering effort in model optimization and hardware utilization.
Beyond speed, the project aims to achieve accurate gesture recognition by capturing both
spatial and temporal features of sign language, ensuring the fidelity of translation. Furthermore,
it is designed to be optimized for edge deployment through a lightweight AI model, to ensure
privacy through a privacy-first design, and to support diverse sign languages, including American
Sign Language (ASL) and Indian Sign Language (ISL). These objectives collectively
underscore the project's commitment to creating a truly functional, accessible, and reliable
conversational aid.
Chapter: 2
Literature Survey
Real-time sign language translation (SLT) has emerged as a critical tool for bridging
communication gaps between deaf and hearing communities. Recent advancements in Edge AI
have enabled low-latency, privacy-preserving solutions by processing data locally on resource-
constrained devices. This survey examines existing approaches, challenges, and innovations in
this domain.
Edge AI Innovations:
Recent Advancements:
Future Directions:
Despite progress, gaps remain. Expanding dataset diversity for underrepresented sign
languages and improving continuous gesture recognition are key priorities. Further
optimization of edge models, such as quantization and pruning, could reduce latency and
hardware dependencies. Integration with IoT ecosystems and multilingual support (e.g., ASL,
BSL, ISL) are also critical for scalability.
In conclusion, Edge AI has revolutionized real-time SLT by balancing accuracy, latency, and
privacy. Continued innovation in model efficiency and dataset inclusivity will drive broader
adoption, empowering deaf communities through seamless communication.
Chapter: 3
Problem Identification
Chapter: 4
Objectives
Core Objectives:
The following objectives outline the specific, measurable, achievable, relevant, and
time-bound goals guiding the development of an Edge AI system for real-time sign language
translation:
inherently exhibit bias, underperforming for underrepresented groups and thereby
undermining the project's core mission of accessibility and inclusion.
strengths of target edge processors, or vice-versa, to meet stringent power and memory
constraints.
Chapter: 5
Methodology
Sign language serves as a vital visual-gestural language, fundamental for communication within
the Deaf community. Despite its significance, traditional communication methods often present
substantial barriers, hindering seamless interaction between signing and non-signing
individuals. Real-time sign language translation (SLT) systems aim to bridge these
communication divides, yet they confront inherent challenges. These include the considerable
variability of signs across individuals and regional dialects, the critical need for nuanced
contextual understanding, and the stringent demands for real-time processing to maintain
natural conversational flow. Overcoming these complexities necessitates the development of
robust and highly adaptable technological solutions.
Real-time sign language translation systems are complex, typically involving several integrated
components designed to interpret the multifaceted nature of sign language. These components
include gesture recognition, which identifies specific hand shapes, movements, and
orientations; facial expression analysis, crucial for interpreting non-manual markers (NMMs)
that convey grammatical information, emotion, and contextual nuances; and body posture and
gaze tracking, which provide additional contextual cues. Accurate interpretation also relies
heavily on context understanding, integrating linguistic and situational context, which is
particularly vital given the inherent variability and potential ambiguity in natural sign
languages.
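To make the composition of these components concrete, the following Python sketch shows one way they could be wired into a single per-frame pipeline. The class and attribute names (SignTranslationPipeline, FrameFeatures, and the individual model callables) are illustrative assumptions for this report, not parts of an existing library or a finalized implementation.

```python
# Illustrative sketch only: component names and interfaces are assumptions,
# not part of the project's actual codebase or any specific library.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np


@dataclass
class FrameFeatures:
    """Features extracted from one video frame."""
    hand_landmarks: Optional[np.ndarray] = None   # hand shape/orientation cues
    face_expression: Optional[np.ndarray] = None  # non-manual markers (NMMs)
    body_pose: Optional[np.ndarray] = None        # posture and gaze cues


class SignTranslationPipeline:
    """Composes gesture, facial, and posture analysis into one per-frame pass."""

    def __init__(self, gesture_model, face_model, pose_model, sequence_decoder):
        self.gesture_model = gesture_model        # spatial hand-shape recognizer
        self.face_model = face_model              # facial expression analyzer
        self.pose_model = pose_model              # body posture / gaze tracker
        self.sequence_decoder = sequence_decoder  # temporal model producing text
        self.history: List[FrameFeatures] = []    # sliding window of past frames

    def process_frame(self, frame: np.ndarray) -> Optional[str]:
        features = FrameFeatures(
            hand_landmarks=self.gesture_model(frame),
            face_expression=self.face_model(frame),
            body_pose=self.pose_model(frame),
        )
        self.history.append(features)
        # Decode only once enough temporal context has accumulated.
        if len(self.history) >= 16:
            return self.sequence_decoder(self.history[-16:])
        return None
```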
The selection of appropriate edge computing platforms is a critical decision in the development
of real-time sign language translation systems, requiring a careful balance of computational
power, energy efficiency, cost, and physical form factor. Platforms such as the NVIDIA Jetson
Series are widely recognized for their powerful Graphics Processing Units (GPUs), making
them suitable for complex deep learning models that benefit from parallel processing. Examples
include the Jetson Nano and Jetson Xavier NX. Conversely, the Google Coral Series, featuring
Tensor Processing Units (TPUs), is optimized for TensorFlow Lite inference, offering high
efficiency for specific types of models. Other platforms, such as Intel Movidius Myriad X or
custom Application-Specific Integrated Circuits (ASICs), may also be considered based on
specific project requirements and optimization targets.
For on-device AI inference, the choice of software frameworks and libraries is equally
important. TensorFlow Lite, specifically optimized for mobile and edge devices, supports
various quantization techniques to reduce model size and accelerate inference. PyTorch Mobile
offers similar capabilities, allowing direct deployment of PyTorch models to mobile and edge
platforms. Additionally, ONNX Runtime provides a cross-platform inference engine that
supports models from diverse frameworks, offering considerable flexibility in deployment.
These frameworks are instrumental in enabling model compression and efficient execution,
which are crucial for operating within resource-constrained environments.
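As an illustration of on-device execution, the short sketch below runs a converted TensorFlow Lite model through the standard tf.lite.Interpreter API; the model file name and the dummy input are placeholders rather than artifacts of this project.

```python
# Minimal sketch of on-device inference with TensorFlow Lite.
# The model file name and input contents are placeholders, not project artifacts.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="sign_translator.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# A dummy tensor standing in for a preprocessed camera input window.
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])
print("Predicted class:", int(np.argmax(predictions)))
```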
Table 1: Comparison of candidate edge computing platforms

| Platform | Processor | Compute (TOPS) | Typical Power Consumption (W) | Approximate Cost ($) | Key Strengths | Weaknesses |
|---|---|---|---|---|---|---|
| NVIDIA Jetson Nano | GPU | 0.5 | 5-10 | 50-100 | Parallel processing, wide framework support | Higher power for sustained loads |
| NVIDIA Jetson Xavier NX | GPU | 21 | 10-20 | 400-600 | High performance, complex model support | Higher cost, increased power consumption |
| Google Coral Dev Board | NPU (Edge TPU) | 4 | 2-5 | 60-100 | High efficiency for TensorFlow Lite, low power | Limited framework support, specific optimizations |
| Intel Movidius Myriad X | VPU | 4 | 1-2 | 50-150 | Ultra-low power, compact form factor | Lower raw compute than GPUs |
Effective real-time sign language translation hinges on robust data acquisition, preprocessing,
and augmentation strategies. Multi-modal data collection is essential for capturing the full
richness and complexity of sign language. This typically involves RGB video for hand gestures
and facial expressions, depth data for three-dimensional hand pose and spatial information, and
Inertial Measurement Unit (IMU) data for precise motion tracking. Beyond the modalities,
ensuring diversity and representation within datasets is critical. Sign language exhibits
significant variations across signing styles, regional dialects, age, gender, and physical
characteristics. Datasets must account for this diversity to prevent bias and ensure the
generalizability of trained models. Throughout the data collection process, rigorous ethical
considerations are paramount, including obtaining informed consent, ensuring privacy
protection, and fostering fair representation of all sign language communities.
Once collected, raw data requires meticulous preprocessing. Normalization techniques, such as
scaling pixel values or standardizing joint coordinates, are applied to improve model training
stability. Segmentation is crucial for isolating individual signs or phrases from continuous
signing streams, often utilizing temporal segmentation methods. Noise reduction techniques are
also necessary to address issues like background clutter, lighting variations, and sensor noise
that can compromise data quality.
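These preprocessing steps can be sketched in Python as follows; the target resolution, the motion-energy threshold, and the function names are illustrative assumptions rather than tuned project parameters.

```python
# Sketch of the preprocessing steps described above; thresholds and shapes
# are illustrative assumptions rather than tuned project values.
import numpy as np
import cv2


def normalize_frame(frame: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Resize and scale pixel values to [0, 1] for stable training."""
    resized = cv2.resize(frame, size)
    return resized.astype(np.float32) / 255.0


def standardize_keypoints(keypoints: np.ndarray) -> np.ndarray:
    """Zero-center joint coordinates and scale them to unit variance."""
    mean = keypoints.mean(axis=0, keepdims=True)
    std = keypoints.std(axis=0, keepdims=True) + 1e-6
    return (keypoints - mean) / std


def segment_by_motion(frames: np.ndarray, threshold: float = 0.02):
    """Crude temporal segmentation: keep the span of frames whose
    inter-frame difference exceeds a motion-energy threshold."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    active = np.where(diffs > threshold)[0]
    if active.size == 0:
        return None
    return frames[active[0]: active[-1] + 2]  # inclusive of the last active frame
```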
The selection of appropriate neural network architectures is fundamental to developing
effective real-time sign language translation systems. Convolutional Neural Networks (CNNs)
are highly effective for spatial feature extraction from image and video frames, identifying
elements such as hand shapes and facial features. Three-dimensional CNNs are particularly
useful for capturing the spatio-temporal dynamics inherent in signs. Recurrent Neural Networks
(RNNs) and their variants, such as Long Short-Term Memory (LSTM) networks, are well-
suited for processing sequential data, capturing temporal dependencies in sign movements and
sequences. Transformers have gained increasing popularity due to their ability to model long-
range dependencies and contextual relationships, which is highly relevant for understanding
full sign sentences and discourse. However, their computational cost can pose a significant
challenge for edge deployment. Often, hybrid architectures, combining elements like CNNs for
initial feature extraction followed by LSTMs or Transformers for temporal modeling, yield the
best performance.
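A minimal Keras sketch of such a hybrid architecture is shown below, with a small per-frame CNN wrapped in a TimeDistributed layer and followed by an LSTM. The layer sizes, the 16-frame window, and the 50-sign output vocabulary are assumptions made only for illustration, not the project's final configuration.

```python
# Minimal Keras sketch of a hybrid CNN-LSTM recognizer: a small CNN extracts
# spatial features per frame, and an LSTM models the temporal gesture dynamics.
# All layer sizes, the 16-frame window, and the 50-sign vocabulary are assumptions.
import tensorflow as tf

SEQ_LEN, HEIGHT, WIDTH, CHANNELS, NUM_SIGNS = 16, 112, 112, 3, 50

frame_encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu",
                           input_shape=(HEIGHT, WIDTH, CHANNELS)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
])

model = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(frame_encoder,
                                    input_shape=(SEQ_LEN, HEIGHT, WIDTH, CHANNELS)),
    tf.keras.layers.LSTM(64),                          # temporal modeling of the sign
    tf.keras.layers.Dense(NUM_SIGNS, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```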
To meet the stringent real-time and resource constraints of edge devices, chosen models must
undergo significant optimization through various compression techniques. Quantization
involves reducing the precision of model weights and activations, for example, from 32-bit
floating-point to 8-bit integers. This technique substantially decreases model size and
accelerates inference, proving highly effective for NPU-based edge devices. Pruning involves
removing redundant connections or neurons from the neural network without a significant loss
of accuracy. Knowledge distillation is another powerful method, where a smaller "student"
model is trained to mimic the behavior of a larger, more complex "teacher" model, thereby
transferring knowledge while reducing the overall model size. Furthermore, Neural
Architecture Search (NAS) can be employed to automatically design efficient network
architectures tailored specifically for given hardware constraints.
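The sketch below illustrates post-training integer quantization with the TensorFlow Lite converter; the tiny stand-in model and the random calibration data are placeholders for the project's trained network and representative real samples.

```python
# Sketch of post-training integer quantization with the TensorFlow Lite
# converter. The stand-in model and random calibration data below are
# placeholders for the project's trained network and real sample inputs.
import numpy as np
import tensorflow as tf

# Stand-in model (the real system would use its trained translator here).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(112, 112, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(50, activation="softmax"),
])

calibration_frames = np.random.rand(32, 112, 112, 3).astype(np.float32)


def representative_dataset():
    # A few representative inputs let the converter calibrate activation
    # ranges so weights and activations can be stored as 8-bit integers.
    for frame in calibration_frames:
        yield [frame[np.newaxis, ...]]


converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("sign_translator_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Quantized in this way, the model can be handed to NPU-oriented runtimes such as the Edge TPU delegate, provided all of its operations are int8-compatible.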
The successful deployment of a real-time Edge AI sign language translation system relies
heavily on robust system integration and an optimized deployment pipeline. Integrating
multiple sensors, such as RGB cameras, depth sensors, and Inertial Measurement Units (IMUs),
requires robust hardware interfaces and drivers. A critical aspect of this integration is designing
an efficient data pipeline for capturing, buffering, and transmitting multi-modal data to the
processing unit. Ensuring precise synchronization of data streams from these disparate sensors
is paramount to avoid misinterpretations, as even slight temporal misalignment can lead to
incorrect translation.
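A simple way to approximate such synchronization in software is nearest-timestamp matching, sketched below; the buffer layout and the 15 ms tolerance are assumptions for illustration only.

```python
# Sketch of timestamp-based alignment of two sensor streams (e.g., RGB frames
# and IMU samples). The buffer layout and the 15 ms tolerance are assumptions.
from bisect import bisect_left


def align_streams(rgb_samples, imu_samples, tolerance_s=0.015):
    """Pair each RGB frame with the IMU sample closest to it in time.

    Each stream is a list of (timestamp_seconds, payload) tuples sorted by time.
    Frames with no IMU sample within the tolerance are dropped.
    """
    imu_times = [t for t, _ in imu_samples]
    aligned = []
    for t_frame, frame in rgb_samples:
        i = bisect_left(imu_times, t_frame)
        # Candidates: the IMU samples just before and just after the frame.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(imu_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(imu_times[j] - t_frame))
        if abs(imu_times[best] - t_frame) <= tolerance_s:
            aligned.append((t_frame, frame, imu_samples[best][1]))
    return aligned
```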
Evaluating the performance of real-time edge sign language translation systems requires a
comprehensive set of metrics that extend beyond traditional machine learning accuracy. Key
technical metrics include translation accuracy, often measured by Word Error Rate (WER) or
BLEU score for text output, or direct sign recognition accuracy. End-to-end latency,
representing the time from sign capture to translated output, is critical for enabling natural, real-
time interaction. Throughput, defined as the number of signs or frames processed per second,
quantifies the system's processing capacity. Resource efficiency is assessed through power
consumption, which is crucial for portability and battery life, and model size, indicating the
memory footprint of the deployed model. Finally, resource utilization, encompassing CPU,
GPU, NPU, and memory usage, provides insights into hardware efficiency.
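End-to-end latency and throughput can be estimated by timing each pass through the full pipeline, as in the sketch below; run_pipeline and the frame source are hypothetical placeholders rather than components of this project.

```python
# Sketch of measuring end-to-end latency and throughput around an inference
# callable. `run_pipeline` and the frame source are assumed placeholders.
import time
import statistics


def benchmark(run_pipeline, frames):
    """Time each sign-to-output pass and report latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for frame in frames:
        t0 = time.perf_counter()
        run_pipeline(frame)  # capture -> preprocess -> infer -> output
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    print(f"mean latency : {1000 * statistics.mean(latencies):.1f} ms")
    print(f"p95 latency  : {1000 * sorted(latencies)[int(0.95 * len(latencies))]:.1f} ms")
    print(f"throughput   : {len(frames) / elapsed:.1f} frames/s")
```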
Beyond technical benchmarks, the system's practical utility and acceptance are heavily
dependent on user experience and usability assessment. User satisfaction can be gauged through
surveys and qualitative feedback on translation quality, responsiveness, and ease of use.
Adaptability measures how well the system adjusts to different users, environments, and
signing styles. Effective error handling, where the system communicates uncertainty or
potential misinterpretations to the user, is also vital. Ultimately, the system's accessibility must
be ensured, making it usable by individuals with diverse needs and abilities.
Evaluating the performance of real-time edge sign language translation systems extends beyond
traditional machine learning accuracy metrics. While translation accuracy is fundamental,
practical utility is equally contingent on real-time performance indicators such as end-to-end
latency and throughput, as well as resource efficiency metrics like power consumption and
model size. A system that is highly accurate but too slow or power-intensive is not truly viable
for portable, real-time applications. Furthermore, the true measure of success for assistive
technologies lies in their user experience, encompassing user satisfaction, adaptability to
diverse signing styles, and effective error handling. Ethical considerations, including bias
detection and mitigation, also form an integral part of a comprehensive evaluation. Therefore,
performance assessment must be multi-faceted, integrating technical key performance
indicators with human-centered design principles and ethical audits. This comprehensive
approach acknowledges that the system's value is determined by its ability to deliver a
performant, usable, and ethically responsible solution that genuinely bridges communication
gaps.
Table 2 outlines key performance indicators (KPIs) for real-time Edge AI SLT, providing a
structured framework for comprehensive evaluation.
Table 2: Key performance indicators for real-time Edge AI sign language translation

| KPI Category | Specific KPI | Target Range/Threshold | Measurement Method |
|---|---|---|---|
| Technical Performance | Translation Accuracy | <15% WER / >0.7 BLEU | Standardized datasets, human evaluation |
| Technical Performance | End-to-End Latency | <200 ms | System timing, sensor-to-output |
| Technical Performance | Throughput | >30 frames/second | Frames processed per unit time |
| Resource Efficiency | Power Consumption | <5 W (for portable devices) | Power meter readings during operation |
| Resource Efficiency | Model Size | <100 MB | File size of deployed model |
| Resource Efficiency | Resource Utilization (CPU/GPU/NPU) | <80% average during inference | On-device monitoring tools |
| User Experience | User Satisfaction Score | >4.0 on 5-point Likert scale | User surveys, qualitative feedback |
| User Experience | Adaptability | High (across diverse signers/styles) | Performance across varied user demographics |
| User Experience | Error Handling Clarity | Clear communication of uncertainty | User feedback, observation of error messages |
| Ethical Compliance | Bias Detection Rate | <5% (for specific demographics) | Performance comparison across demographic groups |
| Ethical Compliance | Privacy Compliance | Full adherence to regulations | Security audits, data flow analysis |
This table is crucial for establishing a comprehensive and actionable framework for evaluating
the success of an Edge AI SLT system. It ensures that evaluation goes beyond just raw machine
learning accuracy to include critical real-time performance metrics (latency, throughput),
resource efficiency (power consumption), and crucially, user-centric aspects (user satisfaction,
adaptability). This provides a holistic view of system performance. By suggesting "Target
Range/Thresholds," it provides concrete, measurable goals for development and optimization,
allowing for objective assessment of whether the system meets the practical and operational
requirements for real-world deployment. The inclusion of KPIs from different categories
(technical, resource, user, ethical) reinforces the understanding that building such a system
requires an interdisciplinary approach, where AI model performance, embedded systems
engineering, and human-computer interaction are all intertwined. It formalizes the
understanding that true "performance" is multi-dimensional and encompasses societal impact.
The development of real-time sign language translation systems faces significant challenges,
particularly concerning ethical implications. Data bias, stemming from inadequate
representation in datasets, can lead to models that perform poorly for certain demographics or
signing styles, potentially exacerbating communication inequalities. Mitigating this requires
rigorous data collection protocols and advanced bias detection techniques. While on-device
processing inherently mitigates some privacy risks by keeping data local, data collection and
model training still involve sensitive visual data. Therefore, robust data anonymization and
secure processing are essential. Misinterpretation and discrimination are serious concerns, as
errors in translation can lead to misunderstandings, and biased systems could inadvertently
discriminate against users. Ethical guidelines and continuous human oversight are critical to
address these issues.
The development of sign language translation systems carries profound ethical responsibilities,
particularly concerning data bias and potential misinterpretation. Sign language is a
fundamental aspect of identity and communication for the Deaf community.
Table 3: Key challenges and proposed mitigation strategies

| Challenge Category | Specific Challenge | Proposed Mitigation Strategy | Relevant Snippet IDs |
|---|---|---|---|
| Technical | Real-time Latency | Model optimization (quantization, pruning), Hardware acceleration, Efficient inference pipelines | S_S3, S_S6, S_S7, S_S18 |
| Technical | Computational Complexity | Efficient neural network architectures, Hardware acceleration (GPUs, NPUs) | S_S4, S_S10, S_S17, S_S18 |
| Technical | Memory Constraints | Model compression, Efficient data structures, On-device memory management | S_S6, S_S18 |
| Technical | Sensor Synchronization | Robust middleware, Precise timestamping, Hardware-level synchronization | S_S5, S_S13 |
| Data | Data Scarcity | Multi-modal data collection, Extensive data augmentation, Synthetic data generation | S_S5, S_S12, S_S2 |
| Data | Data Variability | Diverse dataset collection, Robust feature extraction, Adaptive models | S_S2, S_S15 |
| Data | Noise and Inconsistencies | Advanced preprocessing, Noise reduction algorithms | S_S19 |
| Ethical | Data Bias | Diverse and representative data collection, Bias detection and mitigation techniques | S_S8, S_S15, S_S19, S_S20 |
| Ethical | Privacy Concerns | On-device processing, Federated learning, Data anonymization | S_S1, S_S8, S_S9 |
| Ethical | Misinterpretation/Discrimination | Ethical AI guidelines, Human-in-the-loop validation, Continuous monitoring | S_S20 |
| Deployment | Power Management | Low-power hardware, Dynamic voltage/frequency scaling, Intelligent sensor activation | S_S7, S_S13, S_S18 |
| Deployment | Thermal Constraints | Efficient heat dissipation design, Thermal throttling management | S_S13 |
| Deployment | Robustness (environmental) | Adaptive models, Sensor calibration, Redundancy in sensing | S_S13 |
| Deployment | Model/Software Updates | Over-the-Air (OTA) update mechanisms, Continuous integration/delivery | S_S22 |
Chapter: 6
References
1. Abdulhamied, R. M., Nasr, M. M., & Abdulkader, S. N. (2023). Real-time recognition of American sign language using long-short term memory neural network and hand detection. Indonesian Journal of Electrical Engineering and Computer Science, 30(1), 545–556.
2. Gan, S., Yin, Y., Jiang, Z., Xie, L., & Lu, S. (2023). Towards Real-Time Sign Language Recognition and Translation on Edge Devices. In Proceedings of the 31st ACM International Conference on Multimedia (pp. 4509–4517). ACM.
3. Papatsimouli, M., Sarigiannidis, P., & Fragulis, G. F. (2023). A Survey of Advancements in Real-Time Sign Language Translators: Integration with IoT Technology. Technologies, 11(4), 83.
4. Joseph, T., Kumar, S., Mary Anita, E. A., Kim, J. H., & Nagar, A. (2025). Explainable Real-Time Sign Language to Text Translation. In Fifth Congress on Intelligent Systems (pp. 213–242). Springer.
Group Photo