Adaptive Object Detection for Indoor Navigation
Assistance: A Performance Evaluation of Real-Time
Algorithms
Abhinav Pratap, Sushant Kumar, Suchinton Chakravarty
Department of Computer Science and Engineering, ASET, Amity University, Noida, India
[email protected], [email protected], [email protected]

Abstract— This study compares real-time object detection algorithms aimed at enhancing indoor navigation for visually impaired individuals. Object detection models, including YOLO, SSD, Faster R-CNN, and Mask R-CNN, are evaluated based on their accuracy, processing speed, and adaptability to confined indoor settings. Navigation assistance for visually impaired users presents unique challenges, such as accurately identifying and tracking objects within dynamic spaces. By focusing on these specific challenges, this research advances the understanding of adaptive machine learning applications that can improve indoor navigation systems. Additionally, the study explores how each algorithm addresses trade-offs between precision and processing efficiency, which are critical for real-time usability in assistive technologies. Our findings highlight that selecting algorithms with an optimal balance of accuracy, speed, and flexibility is essential for creating inclusive navigation solutions. This comparative analysis provides actionable insights for designing efficient and responsive systems that cater to the unique needs of visually impaired individuals in indoor environments.

Keywords—Object Detection, Indoor Navigation, YOLO, SSD, Faster R-CNN, Real-Time Processing, Accessibility

I. INTRODUCTION

In today's technology-driven society, there is a growing need to enhance navigation options for individuals with visual impairments. Indoor navigation, in particular, presents unique challenges due to the dynamic and confined nature of these spaces. By integrating advanced real-time object detection and tracking algorithms, technology can play a crucial role in assisting visually impaired individuals, offering precise, timely, and adaptive navigation support.

This research focuses on evaluating key object detection algorithms—YOLO, SSD, Faster R-CNN, and Mask R-CNN—to determine their effectiveness for indoor navigation applications. While these algorithms have been extensively studied in general computer vision contexts, their specific performance in real-time indoor environments, aimed at accessibility for visually impaired users, remains underexplored. Our study examines each algorithm across essential criteria, including detection accuracy, processing speed, and adaptability, to identify the models most suited to assistive navigation systems.

Building on prior work in object detection and machine learning, this research goes further by exploring the domain-specific adaptations required for indoor navigation. Through the integration of these algorithms with machine learning, we aim to address the obstacles that visually impaired individuals encounter in real-time object recognition and tracking, providing insights into the development of efficient, accessible navigation systems. This study emphasizes the importance of balancing accuracy, speed, and adaptability when selecting and implementing detection and tracking algorithms, paving the way for inclusive and impactful solutions in assistive technology.

II. DATASET

The Indoor Objects Detection dataset was created with the aim of assisting persons with visual impairments in their day-to-day lives. The dataset supports the ‘Object Detection for Blind People’ project under the AI Builders 2022 initiative, whose ultimate goal is the detection of indoor objects, underscoring its relevance and applicability across various domains.

The choice to utilize the Indoor Objects Detection dataset for testing our models stems from its intrinsic relevance to our research focus on enhancing navigation for individuals with visual impairments. The dataset specifically targets indoor environments, aligning seamlessly with the challenges visually impaired persons face in navigating confined spaces. By encompassing 7331 labeled objects across 10 different indoor classes, including doors, cabinets, and furniture, it provides a distinct and broad set of scenarios for testing the effectiveness of our models, and its bounding box annotations facilitate the precise object detection that navigation assistance systems require. Leveraging this dataset allows us to tailor our models to the unique demands of indoor settings, fostering the creation of more accurate and adaptive solutions for real-time navigation assistance. The average areas for the classes are shown in Figure 1, and the co-occurrence matrix of class objects in the dataset makes it all the more useful for our application.

Figure 1 - Co-Occurrence Matrix of class objects in the Dataset
III. RELATED WORK

The world of object detection models is diverse and varied, and examining the findings of multiple studies gives a comprehensive picture of these models. In a 2021 review published in IEEE Xplore [2], researchers analyzed the efficiency of different models: YOLOv1 stood out with an impressive detection speed of just 0.02 seconds, Faster R-CNN was also solid with a speed of 1.1 seconds, and SSD512 showed efficiency at 0.125 seconds. This comparison underscores the trade-off between speed and accuracy that is essential for real-time applications.

A review published by Springer Science+Business Media in 2020 [3] highlighted the significance of deep convolutional neural networks (DCNNs) in object detection. Models like Faster R-CNN and YOLOv3 showed advantages in dealing with objects of different sizes, including small ones. However, limitations were also found, such as slow training speed and difficulties with dense or tiny objects. This detailed understanding helps researchers choose the models that best suit their specific needs.

In 2020, an IEEE Access paper presented Tinier-YOLO [11], a model designed for real-time object detection in constrained environments. Compared to lightweight models such as SqueezeNet-SSD and MobileNet-SSD, Tinier-YOLO demonstrated better real-time performance. This highlights the importance of model design in striking a balance between accuracy and computational efficiency, which is crucial in resource-constrained situations.

These observations collectively enrich our comprehension of object detection models, emphasizing the necessity for a customized approach aligned with the particular requirements of the application. This may involve prioritizing rapid real-time detection, managing multi-scale objects, or optimizing for confined environments.

IV. ALGORITHMS

A. YOLO (You Only Look Once)

YOLO, standing for You Only Look Once, revolutionizes real-time object detection by employing a grid-based approach. The input image is divided into an SxS grid, where every grid cell is tasked with directly predicting bounding boxes and class probabilities. This strategy eliminates the need for a multi-step process, enabling YOLO to achieve high speed and efficiency. The algorithm's single-pass detection system contributes to its suitability for real-time applications, making it a favored choice in various domains.

Figure 2 - Single-stage YOLO architecture

A comprehensive working model of YOLO for a single frame is explained in Table 1:

Table 1 - Working of YOLO

STEP                       DESCRIPTION
Input Image                Original image to be processed.
Grid Division              The image is divided into an SxS grid; every cell has the responsibility of object detection.
Bounding Box Prediction    Each grid cell makes predictions regarding bounding boxes, class probabilities, and confidence scores.
Non-Maximum Suppression    Redundant detections are filtered out using NMS, based on confidence scores.
Output                     The final output comprises bounding boxes, class probabilities, and confidence scores.

While the explanation above uses a single input image for simplicity, YOLO is commonly used in batch-processing mode. Instead of processing images one by one, YOLO can efficiently process multiple frames simultaneously; this parallel processing allows for real-time detection in video streams by handling multiple frames concurrently.
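As an illustration of this single-pass, batched usage, the following minimal sketch runs YOLOv5 on one frame and on a small batch. It assumes a PyTorch environment with the publicly available ultralytics/yolov5 hub model; the file names and confidence threshold are placeholders, not necessarily the configuration used in our experiments.

import torch

# Load a pretrained YOLOv5 model (the small variant, chosen for speed).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.25  # placeholder confidence threshold applied before NMS

# Single frame: grid division, box/class prediction, and NMS (the steps
# of Table 1) all happen inside this one forward pass.
results = model("indoor_scene.jpg")
results.print()          # per-class detection summary
boxes = results.xyxy[0]  # [x1, y1, x2, y2, confidence, class]

# Batched mode: several frames are processed in one pass, which is what
# makes real-time detection on video streams feasible.
batch = model(["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"])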
B. Faster R-CNN (Region-based Convolutional Neural Network)

Faster R-CNN is a two-stage object detection framework that incorporates a Region Proposal Network (RPN) for generating region proposals and a Convolutional Neural Network (CNN) for feature extraction and classification. Both stages are described below:

● Region Proposal Network (RPN): The RPN suggests potential object regions within an image. It accomplishes this by systematically moving a small window (anchor) across the feature map and predicting the presence or absence of an object. Concurrently, it refines the anchor boxes according to predicted offsets.

● Feature Extraction and Classification (Fast R-CNN): Following the generation of region proposals, a CNN is used to extract features from each proposed region. These features are then employed for both classification and bounding box regression. Notably, this CNN shares convolutional layers with the RPN, improving computational efficiency.

While Faster R-CNN excels in accuracy and precision, its slower processing speed and complexity may pose challenges in real-time navigation systems. For applications requiring instant responses and rapid object detection, such as autonomous navigation or assistive technologies for the visually impaired, its speed limitations may be a critical factor. However, in controlled environments where high accuracy is prioritized over speed, Faster R-CNN could still find applications in navigation systems. The choice between Faster R-CNN and other algorithms depends on the explicit requirements and constraints of the navigation scenario.
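To make the two-stage pipeline concrete, the sketch below runs a pretrained Faster R-CNN from torchvision. The torchvision implementation and its COCO weights are assumptions made for illustration, not the setup used in our experiments; the image path and score threshold are placeholders.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained two-stage detector: the RPN proposes regions, then a shared
# CNN head classifies them and regresses the boxes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode returns detections rather than losses

image = to_tensor(Image.open("indoor_scene.jpg").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]  # dict with "boxes", "labels", "scores"

keep = output["scores"] > 0.5  # placeholder confidence cut-off
print(output["boxes"][keep], output["labels"][keep])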
C. SSD (Single Shot MultiBox Detector)

SSD, or Single Shot MultiBox Detector, is an object detection technique optimized for a trade-off between accuracy and speed. It operates by employing a predefined set of boxes with various aspect ratios at each cell of the feature map while simultaneously predicting both bounding box offsets and class scores. The working of SSD in detail:

● Feature Extraction: The input image is fed into a base Convolutional Neural Network (CNN), often based on an architecture such as VGG16. This CNN extracts feature maps that represent the hierarchical properties of the input image.

● Default Box Assignment: At each position in the feature maps, default boxes of different aspect ratios are assigned. The SSD model then makes predictions for each of these default boxes, allowing it to handle objects of different shapes and sizes.

● Prediction: For each default box, the SSD model simultaneously predicts bounding box offsets and class scores. The offsets adjust the dimensions of the default box to better fit the actual location and size of the object, while the class scores represent the confidence that each object class appears within the adjusted box.

● Post-processing: After these predictions, a confidence threshold is applied to filter out predictions with low confidence scores. Additionally, Non-Maximum Suppression (NMS) is employed to eliminate redundant bounding box predictions, retaining only the most accurate ones.

SSD's balanced accuracy and speed make it appropriate for real-time navigation systems, and its ability to handle various object scales in a single pass is advantageous, especially in dynamic environments. However, the potential decrease in accuracy for small objects might be a consideration depending on the specific requirements of a navigation application, and the slight increase in computational intensity compared to YOLO should be weighed against the benefits of its versatile and efficient single-shot approach.
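As with the other detectors, a pretrained single-shot model can illustrate this pipeline end to end. The sketch below uses torchvision's SSD300 with a VGG16 backbone as an assumed stand-in for an SSD implementation; default-box generation, offset regression, and NMS all run inside the single forward pass, and the threshold is a placeholder.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Single-shot detector: one forward pass predicts offsets and class
# scores for every default box, then NMS prunes the redundant ones.
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

image = to_tensor(Image.open("indoor_scene.jpg").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]

confident = output["scores"] > 0.4  # placeholder confidence threshold
print(output["boxes"][confident], output["labels"][confident])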
D. Mask R-CNN

Mask R-CNN builds on the Faster R-CNN framework by adding a branch that predicts segmentation masks. This allows Mask R-CNN not only to detect objects but also to precisely outline and segment each distinct instance found within the detected regions. In more detail:

● Region Proposal: As in Faster R-CNN, Mask R-CNN starts by generating region proposals using the Region Proposal Network (RPN).

● Bounding Box Prediction: The bounding box prediction branch operates the same way as in Faster R-CNN, refining the proposed bounding boxes and classifying the objects.

● Segmentation Mask Prediction: The additional segmentation mask branch runs in parallel to predict pixel-level masks for each proposed bounding box. These masks provide detailed instance-level segmentation of the detected objects.

● Post-Processing: After the predictions are made, post-processing steps such as Non-Maximum Suppression (NMS) are applied to filter and refine the final set of bounding boxes and segmentation masks.

Mask R-CNN's strength lies in tasks that demand detailed instance segmentation. While its accuracy in delineating objects is beneficial, its computational demands may be a consideration for real-time navigation systems, particularly in resource-constrained environments. Depending on the specific requirements of a navigation application, the trade-off between accuracy and computational efficiency should be carefully assessed to determine the suitability of Mask R-CNN.
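For completeness, the sketch below runs torchvision's pretrained Mask R-CNN, again an illustrative assumption rather than our exact setup. It exposes the extra mask branch: alongside boxes, labels, and scores, the model returns one soft pixel mask per detected instance.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("indoor_scene.jpg").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]

keep = output["scores"] > 0.5  # placeholder confidence cut-off
# "masks" holds per-instance soft masks in [0, 1]; binarizing at 0.5
# yields the pixel-level instance segmentation described above.
masks = output["masks"][keep, 0] > 0.5
print(output["boxes"][keep].shape, masks.shape)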
V. ANALYSIS

Our analysis employs the Indoor Objects Detection dataset, focusing on evaluating each model's performance across key parameters. All four models were analyzed for our use case: we ran 107 indoor images from the test set through each model and calculated the speed, total time, and average prediction time. We chose this dataset because it is well suited to indoor navigation, and we deliberately limited our models to it, since it is specifically intended for navigation in indoor spaces. The test results are informative and can be drawn on when building a navigation system for visually impaired individuals.

Two primary metrics were utilized to gauge the models' detection performance:

1. Intersection over Union (IoU): Measures the overlap between the predicted bounding boxes and the ground-truth boxes. The IoU is calculated using the formula:

   IoU = Area of Overlap / Area of Union

2. Mean Average Precision (mAP): Provides an aggregated measure of the model's precision across various IoU thresholds, computed as the mean of the average precisions at different recall levels. The formula for average precision (AP) at a particular IoU threshold is:

   AP = (TP at IoU) / (TP at IoU + FP at IoU)
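A minimal sketch of both metrics follows. The box coordinates and the list of per-prediction best-IoU values are illustrative placeholders, and the AP helper implements the simplified single-threshold formula above rather than a full COCO-style evaluation.

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). Compute the intersection rectangle
    # first; it is empty when the boxes do not overlap.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

def average_precision(best_ious, threshold=0.5):
    # best_ious: the best IoU against ground truth for each prediction.
    # A prediction counts as a true positive when it clears the
    # threshold, mirroring AP = TP / (TP + FP) at a fixed IoU.
    tp = sum(1 for v in best_ious if v >= threshold)
    return tp / len(best_ious) if best_ious else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, approx. 0.143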
Table 2 - Validation Accuracy Metrics

Model          Data Set    Avg. IoU    mAP
YOLOv5         COCO128     0.5
SSD                        0.5:0.95    0.195
Faster R-CNN   2017 COCO   0.5:0.95    0.353
Mask R-CNN                 0.5:0.95    0.327

Table 3 - Comparison of Models

Model          Accuracy    Total Time    Avg. Time
YOLOv5         90.1%       38.25 sec     0.357 sec
SSD            79.8%       77.27 sec     0.72 sec
Faster R-CNN   69%         703.73 sec    6.57 sec
Mask R-CNN     61.06%      790.80 sec    7.60 sec
The analysis conducted on the four object detection models—YOLOv5, SSD, Faster R-CNN, and Mask R-CNN—provides valuable insights into their performance on the selected Indoor Objects Detection dataset. The evaluation focused on accuracy, total processing time, and average prediction time, all of which are critical factors for the development of a navigation system designed specifically for visually impaired people within indoor environments.

A. Accuracy

● YOLOv5 (You Only Look Once): Demonstrates the highest accuracy among the models, reaching 90.1%. This signifies YOLOv5's effectiveness in accurately detecting indoor objects within the specified dataset.
● SSD (Single Shot MultiBox Detector): Achieves a commendable accuracy of 79.8%, showcasing its capability to balance accuracy and speed for indoor object detection.
● Faster R-CNN (Region-based Convolutional Neural Network): Displays an accuracy of 69%, representing a trade-off between accuracy and speed compared to YOLOv5 and SSD.
● Mask R-CNN: Exhibits a lower accuracy of 61.06%, likely influenced by the additional complexity introduced by its instance segmentation capabilities.

B. Average Prediction Time

● YOLOv5: Boasts the lowest average time at 0.357 seconds, reinforcing its suitability for real-time applications.
● SSD: Presents an average time of 0.72 seconds, indicating a reasonable balance between speed and accuracy.
● Faster R-CNN: Exhibits an average time of 6.57 seconds, roughly an order of magnitude slower than SSD.
● Mask R-CNN: Records an average time of 7.60 seconds, reflecting the computational demands associated with its instance segmentation capability.
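The timing figures in Table 3 can be reproduced with a loop of the following shape; "model" stands in for any of the four detectors above and the image list for the 107 test images, so the names here are placeholders rather than our exact measurement harness.

import time

def benchmark(model, image_paths):
    # Total wall-clock time across all images, plus the per-image mean
    # reported as "Avg. Time" in Table 3.
    start = time.perf_counter()
    for path in image_paths:
        model(path)  # one prediction per test image
    total = time.perf_counter() - start
    return total, total / len(image_paths)

# Hypothetical usage: total_sec, avg_sec = benchmark(model, test_images)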
The analysis highlights YOLOv5 as a top performer in both speed and accuracy, making it a strong choice for developing an indoor navigation system for individuals with visual impairments. SSD also stands out with its balanced accuracy and speed, making it a viable option as well. However, Faster R-CNN and Mask R-CNN, while accurate, require more computational power, limiting their real-time use in this specific case. These insights provide a solid foundation for designing and implementing an efficient navigation system that caters to the unique needs of individuals with visual impairments in indoor environments.

VI. CONCLUSION

In our exploration of real-time object detection models for indoor navigation assistance, we tested YOLOv5, SSD, Faster R-CNN, and Mask R-CNN. YOLOv5 emerged as a standout performer, demonstrating an optimal balance of speed and accuracy: it achieved an impressive 90.1% accuracy with an average prediction time of just 0.357 seconds. SSD also showcased a credible performance, delivering 79.8% accuracy.

Nevertheless, the compromise between accuracy and computational speed became evident with Faster R-CNN and Mask R-CNN. Although these models are designed for high-precision detection, they demanded far more processing time, underscoring the challenge of balancing precision with real-time efficiency.

The practical implications of our findings are clear. Given its balance of speed and accuracy, YOLOv5 emerges as a robust choice for real-time indoor navigation for visually impaired users. SSD, offering a commendable equilibrium between accuracy and speed, also stands out as a viable option. In contrast, Faster R-CNN and Mask R-CNN, despite their accuracy merits, present computational demands that may limit their applicability in time-sensitive navigation tasks.

This research not only contributes valuable insights into the comparative performance of object detection models but also offers nuanced guidance for the development of assistive technologies. The emphasis on real-time efficiency, alongside accuracy considerations, paves the way for the implementation of more effective and adaptive indoor navigation systems tailored to the unique needs of the visually impaired.

VII. REFERENCES

[1] C. P. Papageorgiou, M. Oren and T. Poggio, "A general framework for object detection," Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Bombay, India, 1998, pp. 555-562, doi: 10.1109/ICCV.1998.710772.
[2] A. K. Shetty, I. Saha, R. M. Sanghvi, S. A. Save and Y. J. Patel, "A Review: Object Detection Models," 2021 6th International Conference for Convergence in Technology (I2CT), Maharashtra, India, 2021, pp. 1-8, doi: 10.1109/I2CT51068.2021.9417895.
[3] Y. Xiao, Z. Tian, J. Yu et al., "A review of object detection based on deep learning," Multimedia Tools and Applications 79, 23729-23791 (2020), doi: 10.1007/s11042-020-08976-6.
[4] X. Zou, "A Review of Object Detection Techniques," 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), Xiangtan, China, 2019, pp. 251-254, doi: 10.1109/ICSGEA.2019.00065.
[5] C. Papageorgiou and T. Poggio, "A Trainable System for Object Detection," International Journal of Computer Vision 38, 15-33 (2000), doi: 10.1023/A:1008162616689.
[6] M. Eckert, M. Blex and C. Friedrich, "Object Detection Featuring 3D Audio Localization for Microsoft HoloLens - A Deep Learning-based Sensor Substitution Approach for the Blind," 2018, pp. 555-561, doi: 10.5220/0006655605550561.
[7] P. Malhotra and E. Garg, "Object Detection Techniques: A Comparison," 2020 7th International Conference on Smart Structures and Systems (ICSSS), Chennai, India, 2020, pp. 1-4, doi: 10.1109/ICSSS49621.2020.9202254.
[8] T. Diwan, G. Anirudh and J. V. Tembhurne, "Object detection using YOLO: challenges, architectural successors, datasets and applications," Multimedia Tools and Applications 82, 9243-9275 (2023), doi: 10.1007/s11042-022-13644-y.
[9] S. Hong, B. Roh, K.-H. Kim, Y. Cheon and M. Park, "PVANet: Lightweight Deep Neural Networks for Real-time Object Detection," 2016.
[10] Z. Chen, R. Khemmar, B. Decoux, A. Atahouet and J.-Y. Ertaud, "Real Time Object Detection, Tracking, and Distance and Motion Estimation based on Deep Learning: Application to Smart Mobility," 2019, pp. 1-6, doi: 10.1109/EST.2019.8806222.
[11] W. Fang, L. Wang and P. Ren, "Tinier-YOLO: A Real-Time Object Detection Method for Constrained Environments," IEEE Access, vol. 8, pp. 1935-1944, 2020, doi: 10.1109/ACCESS.2019.2961959.
[12] "An Indoor Navigation System for Visually Impaired People Using a Path Finding Algorithm and a Wearable Cap," IEEE Xplore, 1 Apr. 2018.
[13] A. L. Fraga, X. Yu, W.-J. Yi and J. Saniie, "Indoor Navigation System for Visually Impaired People using Computer Vision," 2022 IEEE International Conference on Electro Information Technology (eIT), Mankato, MN, USA, 2022.
[14] F. Foroughi, Z. Chen and J. Wang, "A CNN-Based System for Mobile Robot Navigation in Indoor Environments via Visual Localization with a Small Dataset," World Electric Vehicle Journal, vol. 12, no. 3, p. 134, 2021, doi: 10.3390/wevj12030134.