Enhancement of Robustness in Object Detection Module for Advanced Driver Assistance Systems
Abstract — A unified system integrating a compact object detector and a surrounding environmental condition classifier for enhancing the robustness of the object detection scheme in advanced driver assistance systems (ADAS) is proposed in this paper. ADAS are invented to improve traffic safety and effectiveness in autonomous driving systems, where object detection plays an extremely important role. However, modern object detectors integrated into ADAS are still unstable due to high latency and the variation of environmental contexts in the deployment phase. Our system is proposed to address the aforementioned problems. The proposed system includes two main components: (1) a compact one-stage object detector which is expected to perform at an accuracy comparable to state-of-the-art object detectors, and (2) an environmental condition classifier that helps to send a warning signal to the cloud in case the self-driving car needs human actions due to the significance of the situation. The empirical results prove the reliability and the scalability of the proposed system in realistic scenarios.

Keywords — ADAS, object detection, autonomous driving, deep learning, intelligent systems.

Fig. 1. Advanced driver assistance systems (ADAS) [8].

I. INTRODUCTION

Recent technological breakthroughs of convolutional neural networks (CNNs) and the outstanding evolution of Graphics Processing Units (GPUs), which boost the performance of parallel computation, have made deep learning the dominant approach for various computer vision tasks. CNN-based object detection, in particular, has attracted a large body of researchers over the past decade because of its applicability. Generally, there exist two genres of object detectors. One-stage object detectors show high inference speed and considerable accuracy; the most popular of this type is YOLO [1, 2]. Because of their superiority in inference speed, one-stage object detectors are generally integrated into many real-time object detection systems and mobile devices. Two-stage object detectors, such as Faster R-CNN [3], alternatively show higher object recognition and localization accuracy but with more expensive computational cost and diminished speed.

Autonomous driving vehicles, on the other hand, have been considered the future of technology, as they have drawn huge attention over the last decade. Many studies regarding autonomous robots have been conducted, such as autonomous drones [4] and self-driving cars [5-7]. Advanced driver assistance systems (ADAS), consequently, are introduced to improve traffic effectiveness, prevent traffic accidents, and facilitate fully autonomous driving in the near future. However, modern object detectors deployed in ADAS are still unstable due to many factors, including how to select an appropriate network architecture that can properly balance the speed-accuracy trade-off. Inference speed is vastly important because deployment to mobile devices imposes stringent requirements on latency and computational resources, while mobile devices usually support only limited built-in hardware resources. We therefore need to reduce latency by using relatively small networks while maintaining accuracy as much as possible.

The goal of this paper is to enhance the robustness of the object detection module in ADAS by determining an object detection network that is able to efficiently balance the trade-off between inference speed and detection accuracy. In this paper, we propose a YOLO-based object detector constructed based on YOLOv2 [1] with only 17 convolutional layers in its backbone in order to achieve low latency as well as favorable detection accuracy. We also find that a critical reason for the instability of object detectors during the deployment phase is the variation of environmental contexts: an object detector trained on certain weather scenes may not perform properly in other weather scenes that it is not trained to work with. To address this problem, we further propose an environmental condition classifier and a communication protocol between the system and the cloud via an internet connection, so that the system is able to send a warning signal to the cloud when needed.
In driving scenarios, object detection models are popularly trained and tested on driving datasets such as BDD100K [9] and Cityscapes [10]. However, those datasets include image data collected in certain areas, which means that an object detector trained on the mentioned datasets may not perform as designed when it is tested on another dataset containing image data gathered from another city with very different scene contexts. The problem can be seen from a simpler perspective when an autonomous car operates in many different weather conditions in a day, or in different light conditions between daytime and nighttime; for instance, the

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on May 06,2025 at 06:40:26 UTC from IEEE Xplore. Restrictions apply.

Fig. 3. Our proposed object detector (Backbone17-Det).

A. YOLO-based Object Detector

The core of an object detection module is the object detection network. Generally, there are two genres of object detectors: one-stage object detectors, which show high inference speed and considerable accuracy, and two-stage object detectors, which yield higher object recognition and localization accuracy but with expensive computational cost and diminished speed [6]. Because of their superiority in speed, one-stage object detectors are generally integrated into real-time object detection systems and mobile devices such as mobile phones, autonomous vehicles, and drones. Since this research is in the direction of enhancing the robustness of object detectors in ADAS, the one-stage detection scheme is chosen as the main type of network to investigate in this paper. In one-stage object detectors, an input image is passed through a backbone network to extract features and produce a final feature map with distilled information of objects; object recognition and localization of bounding boxes are performed directly on this feature map in a single pipeline without any other post-processing steps.

In this study, we propose an efficient one-stage object detector based on our prior work [6], whose architecture was constructed based on the classical object detection algorithm YOLO. Table I summarizes the backbone architectures of YOLOv2 [1] (Darknet-19) and our proposed object detector (Backbone17). In our prior work, we performed experiments to verify the effectiveness of additional modules such as the residual module [11], the channel attention module (SE) [12], and the spatial attention module (CBAM) [13] in order to choose the ones which are truly beneficial for improving our network performance. The experiments showed that the network with SE outperforms the one with CBAM in our case [6]. Therefore, we integrate residual connections and SE into our network backbone architecture. Our network (Backbone17-Det) is visualized in Fig. 3: the residual connection is applied once at every level of feature map resolution, while SE is applied before every downscaling step; specifically, SE is applied right after layer 5, layer 8, layer 13, and layer 17.

TABLE I. DARKNET-19 AND BACKBONE17 ARCHITECTURES
(Filters / Stride per network)

Layers  Types    Darknet-19      Backbone17        Output Resolution
1       Conv     3 x 3 x 32      3 x 3 x 32        608 x 608
        Maxpool  2 x 2 / 2       -                 304 x 304
2       Conv     3 x 3 x 64      3 x 3 x 64 / 2    304 x 304
        Maxpool  2 x 2 / 2       -                 152 x 152
3       Conv     3 x 3 x 128     3 x 3 x 128 / 2   152 x 152
4       Conv     1 x 1 x 64      1 x 1 x 64        152 x 152
5       Conv     3 x 3 x 128     3 x 3 x 128       152 x 152
        Maxpool  2 x 2 / 2       -                 76 x 76
6       Conv     3 x 3 x 256     3 x 3 x 256 / 2   76 x 76
7       Conv     1 x 1 x 128     1 x 1 x 128       76 x 76
8       Conv     3 x 3 x 256     3 x 3 x 256       76 x 76
        Maxpool  2 x 2 / 2       -                 38 x 38
9       Conv     3 x 3 x 512     3 x 3 x 512 / 2   38 x 38
10      Conv     1 x 1 x 256     1 x 1 x 256       38 x 38
11      Conv     3 x 3 x 512     3 x 3 x 512       38 x 38
12      Conv     1 x 1 x 256     1 x 1 x 256       38 x 38
13      Conv     3 x 3 x 512     3 x 3 x 512       38 x 38
        Maxpool  2 x 2 / 2       -                 19 x 19
14      Conv     3 x 3 x 1024    3 x 3 x 1024 / 2  19 x 19
15      Conv     1 x 1 x 512     1 x 1 x 512       19 x 19
16      Conv     3 x 3 x 1024    3 x 3 x 1024      19 x 19
17      Conv     1 x 1 x 512     3 x 3 x 1024      19 x 19
18      Conv     3 x 3 x 1024    -                 19 x 19
19      Conv     1 x 1 x 1000    -                 19 x 19

B. Environmental Condition Classifier

During onboard deployment, the object detector of an autonomous driving car is expected to operate at a performance level similar to that validated in the training and testing stages, even under varying and complex environmental conditions, but this seems infeasible. In fact, there are multiple factors that an onboard object detector relies on directly and heavily, such as traffic density, road type, and time of day. Those factors directly impact the object detector, and instability and degradation of its performance without warning are occasionally inevitable. This may lead the whole system to implement unsafe and risky actions due to unreliable object detection responses. The researchers in [14] introduce a cascaded neural network that monitors the performance of the object detector by predicting the quality of its mean average precision (mAP) on a sliding window of the input frames; the proposed cascaded network exploits internal features from the deep neural network of the object detector. Similarly, but in a much simpler manner, we address this problem by proposing an environmental condition classifier to recognize significant environmental changes and send a warning signal to the cloud whenever one occurs. Because this system requires only an acceptable internet connection between the mobile device and the cloud, this connection protocol may help the authorities take crucial actions from the station, instead of leaving the car to handle the situation by itself, in case the car needs human actions due to the significance of the situation. The environmental condition classifier is integrated along with the object detector in our system. Fig. 4 describes the diagram of our system.

Fig. 4. Diagram of our system.

IV. EXPERIMENTS

In this section, we first provide the experimental constraints; then, the experimental results of the object detector as well as the environmental condition classifier are provided in order to
Fig. 5. Person detection results using our object detector (Backbone17-Det); red boxes indicate the ground truth, green boxes depict detection results.

prove the applicability and the scalability of our system. We also compare our results with other methods in terms of object detection efficiency.

A. Experimental Constraints

We propose the system in the scenario that we do not possess a real autonomous driving car. Therefore, the experiments are conducted under several constraints:

1) Object detector: Our object detector is not tested on a mobile device but on a PC. Hence, we only evaluate the accuracy of our object detector, not its inference speed, though we expect that our detector is able to obtain real-time inference speed on mobile devices due to its compactness.

2) Environmental condition classifier: Our system is validated on the BDD100K dataset [9]. Since this dataset does not supply image data in different weather conditions, we simulate the variation of surrounding environmental contexts by using the change of light conditions at different times of day, as described in Fig. 2.

B. Object Detector

Our object detection network, Backbone17-Det, is trained on BDD100K [9]. BDD100K is a large-scale dataset with over 100K videos, and its 2D bounding box annotations include 10 object categories: bus, traffic light, traffic sign, person, bike, truck, motor, car, train, and rider. For the sake of simplicity, we only concentrate on experiments and analysis on the person object category. The person data includes more than 22K images (91K person objects) for training and about 3.2K images (13K person objects) for testing. We train our network with a multi-scale training strategy; that is, the input size scales from 320x320 to 608x608 during training, and a specific resolution is chosen randomly for the input batch at the beginning of every iteration. The network is trained on 2 GeForce GTX TITAN X graphics cards; the batch size is 6 for training and 2 for testing. The training IoU threshold is 0.3. The total number of training epochs is 120; the learning rate is 10^-4 for the first 2 epochs and is dropped gradually to 10^-6 by the end of training.

The proposed object detection scheme is evaluated in terms of the Average Precision (AP) metric. We follow the PASCAL VOC convention by reporting AP at IoU = 0.5 (AP50). Our network with input size 608x608 produces favorable detection outcomes of 43.6 AP, corresponding to 7,978 detected objects out of 13,262 ground-truth objects at IoU = 0.5, and 60.4 AP, corresponding to 10,608 detected objects at IoU = 0.3 (AP30). In order to compare with the results in [15] in terms of the AP50 metric, the two-stage object detection model Faster R-CNN [3] is trained and tested on BDD100K person data. The AP for our proposed scheme is 43.6, while Faster R-CNN shows 45.4. Note that the input image resolutions are 1000x600 and 608x608 for Faster R-CNN and our proposed scheme, respectively. Table II shows the performance comparison. Our person detection results are shown in Fig. 5; red boxes present the labels and the detection boxes are indicated in green.

TABLE II. OBJECT DETECTION PERFORMANCE COMPARISON

Model   AP30 (%)   AP50 (%)   True Positives    Labels
Our     60.4       -          10,608 (80.0%)    13,262
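The multi-scale training strategy of Section IV.B (a square input resolution between 320x320 and 608x608 chosen randomly for every batch) can be sketched as follows. This is our illustration, not the authors' training code; the step size of 32 is an assumption borrowed from YOLOv2, whose backbone downsamples by a factor of 32, so that every candidate size stays divisible by 32.

```python
import random

# Candidate input sides for multi-scale training: 320, 352, ..., 608.
# The paper states only the 320x320..608x608 range; the step of 32 is
# our YOLOv2-based assumption.
SCALES = list(range(320, 608 + 1, 32))

def pick_batch_resolution(rng=random):
    """Randomly pick one square input resolution for the next batch."""
    side = rng.choice(SCALES)
    return (side, side)

# At the beginning of every iteration, the input batch would be resized
# to the chosen resolution before the forward pass.
```

Keeping every candidate size divisible by the network stride guarantees that the final feature map (e.g., 19 x 19 at 608 x 608) always has integer spatial dimensions.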
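The true-positive counts in Table II come from matching detections to ground-truth boxes at a given IoU threshold. A minimal sketch of such a matching is shown below; it is our illustration of the standard greedy protocol, not the authors' evaluation code, and it assumes axis-aligned boxes given as (x1, y1, x2, y2) with detections carrying a confidence score.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(detections, ground_truths, iou_thresh=0.5):
    """Greedily match detections (highest confidence first) to not-yet-
    matched ground-truth boxes; each match is one true positive."""
    matched = [False] * len(ground_truths)
    tp = 0
    for det in sorted(detections, key=lambda d: -d["score"]):
        best, best_iou = -1, iou_thresh
        for i, gt in enumerate(ground_truths):
            if matched[i]:
                continue
            v = iou(det["box"], gt)
            if v >= best_iou:
                best, best_iou = i, v
        if best >= 0:
            matched[best] = True
            tp += 1
    return tp
```

A matching of this kind, run at IoU = 0.5 and IoU = 0.3, underlies the true-positive and label counts reported above.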
Algorithm: Light Condition Classification
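The body of the light condition classification algorithm is not reproduced here. As a purely hypothetical illustration of the idea described for the environmental condition classifier (classify the current light condition from frame brightness and send a warning signal to the cloud when the condition changes), one might sketch it as follows; the brightness thresholds, class names, and message fields are all our assumptions, not the paper's algorithm.

```python
import json

# Hypothetical thresholds on mean pixel intensity (0-255) separating
# coarse light conditions; the paper's actual values are not given here.
THRESHOLDS = [(80, "night"), (150, "twilight")]

def classify_light(mean_brightness):
    """Map a frame's mean brightness to a coarse light condition."""
    for limit, label in THRESHOLDS:
        if mean_brightness < limit:
            return label
    return "daytime"

def warning_message(prev, curr, frame_id):
    """Build the warning payload for the cloud when the environmental
    condition changes (illustrative fields only)."""
    if prev == curr:
        return None
    return json.dumps({"frame": frame_id, "from": prev, "to": curr})
```

A transition such as daytime to night would then produce a single warning message, which the station can use to decide whether human action is required.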
[10] M. Cordts et al., "The Cityscapes Dataset for Semantic Urban Scene Understanding," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), United States, 2016, pp. 3213-3223.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
[12] J. Hu, L. Shen, and G. Sun, "Squeeze-and-Excitation Networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018.
[13] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018, pp. 3-19.
[14] Q. M. Rahman, N. Sünderhauf, and F. Dayoub, "Online Monitoring of Object Detection Performance Post-Deployment," arXiv preprint arXiv:2011.07750, 2020.
[15] B. Wilson, J. Hoffman, and J. Morgenstern, "Predictive Inequity in Object Detection," arXiv preprint arXiv:1902.11097, 2019.