RRPN: Radar Region Proposal Network For Object Detection in Autonomous Vehicles
ABSTRACT

Region proposal algorithms play an important role in most state-of-the-art two-stage object detection networks by hypothesizing object locations in the image. Nonetheless, region proposal algorithms are known to be the bottleneck in most two-stage object detection networks, increasing the processing time for each image and resulting in slow networks not suitable for real-time applications such as autonomous driving vehicles. In this paper we introduce RRPN, a Radar-based real-time region proposal algorithm for object detection in autonomous driving vehicles. RRPN generates object proposals by mapping Radar detections to the image coordinate system and generating pre-defined anchor boxes for each mapped Radar detection point. These anchor boxes are then transformed and scaled based on the object's distance from the vehicle, to provide more accurate proposals for the detected objects. We evaluate our method on the newly released NuScenes dataset [1] using the Fast R-CNN object detection network [2]. Compared to the Selective Search object proposal algorithm [3], our model operates more than 100× faster while at the same time achieving higher detection precision and recall. Code has been made publicly available at https://github.com/mrnabati/RRPN.

Index Terms— Region Proposal Network, Autonomous Driving, Object Detection

1. INTRODUCTION

Real-time object detection is one of the most challenging problems in building perception systems for autonomous vehicles. Most self-driving vehicles take advantage of several sensors such as cameras, Radars and LIDARs. Having different types of sensors provides an advantage in tasks such as object detection and may result in more accurate and reliable detections, but at the same time it makes designing a real-time perception system more challenging.

Radars are one of the most popular sensors used in autonomous vehicles and have been studied for a long time in different automotive applications. Authors in [4] were among the first researchers discussing such applications for Radars, providing a detailed approach for utilizing them on vehicles. While Radars can provide accurate range and range-rate information on the detected objects, they are not suitable for tasks such as object classification. Cameras, on the other hand, are very effective sensors for object classification, making Radar and camera sensor fusion a very interesting topic in autonomous driving applications. Unfortunately, there have been very few studies in this area in recent years, mostly due to the lack of a publicly available dataset with annotated and synchronized camera and Radar data in an autonomous driving setting.

2D object detection has seen significant progress over the past few years, resulting in very accurate and efficient algorithms mostly based on convolutional neural networks [2, 5, 6, 7]. These methods usually fall under two main categories, one-stage and two-stage algorithms. One-stage algorithms treat object detection as a regression problem and learn the class probabilities and bounding boxes directly from the input image [8]. YOLO [9] and SSD [7] are among the most popular algorithms in this category. Two-stage algorithms such as [2, 6], on the other hand, use a Region Proposal Network (RPN) in the first stage to generate regions of interest, and then use these proposals in the second stage to perform classification and bounding box regression. One-stage algorithms usually reach lower accuracy, but are much faster than their two-stage counterparts. The bottleneck in two-stage algorithms is usually the RPN, which has to process every single image to generate ROIs for the object classifier. This makes two-stage object detection algorithms unsuitable for applications such as autonomous driving, where it is extremely important for the perception system to operate in real time.

In this paper we propose Radar Region Proposal Network (RRPN), a real-time RPN based on Radar detections in autonomous vehicles. By relying only on Radar detections to propose regions of interest, we bypass the computationally expensive vision-based region proposal step, while improving detection accuracy. We demonstrate the effectiveness of our approach on the newly released NuScenes dataset [1], featuring data from Radars and cameras among other sensors integrated on a vehicle. When used in the Fast R-CNN object detection network, our proposed method achieves higher mean Average Precision (AP) and mean Average Recall (AR) compared to the Selective Search algorithm originally used in Fast R-CNN, while operating more than 100× faster.
2. RELATED WORK

Authors in [10] discussed the application of Radars in navigation for autonomous vehicles, using an extended Kalman filter to fuse the Radar and vehicle control signals for estimating vehicle position. In [11], the authors proposed a correlation-based pattern matching algorithm, in addition to a range window, to detect and track objects in front of a vehicle. In [12], Ji et al. proposed an attention selection system based on Radar detections to find candidate targets, and employed a classification network to classify those objects. They generated a single attention window for each Radar detection and used a Multi-layer In-place Learning Network (MILN) as the classifier.

Authors in [13] proposed a LIDAR and vision-based pedestrian detection system using both a centralized and a decentralized fusion architecture. In the former, the authors proposed a feature-level fusion system where features from the LIDAR and vision spaces are combined in a single vector which is classified using a single classifier. In the latter, two classifiers are employed, one per sensor feature space. More recently, Choi et al. [14] proposed a multi-sensor fusion system addressing the fusion of 14 sensors integrated on a vehicle. This system uses an Extended Kalman Filter to process the observations from individual sensors and is able to detect and track pedestrians, bicyclists and vehicles.

Vision-based object proposal algorithms have been very popular among object detection networks. Authors in [3] proposed the Selective Search algorithm, diversifying the search for objects by using a variety of complementary image partitionings. Despite its high accuracy, Selective Search is computationally expensive, operating at 2-7 seconds per image. Edge Boxes [15] is another vision-based object proposal algorithm, using edges to detect objects. Edge Boxes is faster than Selective Search, with a run time of 0.25 seconds per image, but it is still considered very slow for real-time applications such as autonomous driving.
3. RADAR REGION PROPOSAL NETWORK

We propose RRPN, a real-time algorithm that uses Radar detections to generate object proposals for object detection and classification in autonomous vehicles. The generated proposals can be used in any two-stage object detection network such as Fast R-CNN. Relying only on Radar detections to generate object proposals makes for an extremely fast RPN, making it suitable for autonomous driving applications. Aside from being an RPN for an object detection algorithm, the proposed network also inherently acts as a sensor fusion algorithm by fusing the Radar and camera data to obtain higher accuracy and reliability. The objects' range and range-rate information obtained from the Radar can be easily associated with the proposed regions of interest, providing accurate depth and velocity information for the detected objects.

RRPN also provides an attention mechanism to focus the underlying computational resources on the more important parts of the input data. While in other object detection applications the entire image may be of equal importance, in an autonomous driving application more attention needs to be given to objects on the road. For example, in a highway driving scenario the perception system needs to be able to detect all the vehicles on the road, but there is no need to dedicate resources to detecting a picture of a vehicle on a billboard. A Radar-based RPN focuses only on the physical objects surrounding the vehicle, hence inherently creating an attention mechanism focusing on the parts of the input image that are more important.

The proposed RRPN consists of three steps: perspective transformation, anchor generation and distance compensation, each discussed individually in the following sections.

3.1. Perspective Transformation

The first step in generating ROIs is mapping the Radar detections from the vehicle coordinates to the camera-view coordinates. Radar detections are reported in a bird's eye view perspective, as shown in Fig. 1 (a), with the object's range and azimuth measured in the vehicle's coordinate system. By mapping these detections to the camera-view coordinates, we are able to associate the objects detected by the Radars with those seen in the images obtained by the camera.

In general, the projective relation between a 3D point P = [X; Y; Z; 1] and its image p = [x; y; 1] in the camera-view plane can be expressed as

p = HP, \qquad H = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \end{bmatrix} \qquad (1)

In an autonomous driving application, the matrix H can be obtained from the calibration parameters of the camera.
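As a rough illustration of this step, the sketch below projects Radar detections through a 3x4 projection matrix H, assumed to be available from the camera calibration. The detections are assumed to be given as (x, y, z) points in the vehicle frame; since Radars do not measure height, a fixed z value would typically be assumed. The function name and array layout are hypothetical and not taken from the released code.

import numpy as np

def radar_to_image(radar_xyz, H):
    """Project Radar detections from vehicle coordinates to image pixels.

    radar_xyz : (N, 3) array of detections in the vehicle frame, in meters
                (z is an assumed fixed height, since Radar does not report it).
    H         : (3, 4) projection matrix from the camera calibration.
    Returns an (N, 2) array of pixel coordinates (the points of interest).
    """
    n = radar_xyz.shape[0]
    # Homogeneous coordinates P = [X, Y, Z, 1]^T for every detection.
    P = np.hstack([radar_xyz, np.ones((n, 1))])     # (N, 4)
    p = (H @ P.T).T                                 # (N, 3), i.e. p = HP
    # Normalize by the third (scale) component to obtain pixel coordinates.
    return p[:, :2] / p[:, 2:3]

Each returned pixel location then serves as a point of interest around which anchors are generated in the next step.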
Fig. 1: Generating anchors of different shapes and sizes for each Radar detection, shown here as the blue circle. (a) Bird's eye view, (b) centered anchors, (c) right-aligned anchors, (d) bottom-aligned anchors, (e) left-aligned anchors.
3.2. Anchor Generation

Once the Radar detections are mapped to the image coordinates, we have the approximate location of every detected object in the image. These mapped Radar detections, hereafter called Points of Interest (POI), provide valuable information about the objects in each image, without any processing of the image itself. Having this information, a simple approach for proposing ROIs would be to introduce a bounding box centered at every POI. One problem with this approach is that Radar detections are not always mapped to the center of the detected objects in the image. Another problem is that Radars do not provide any information about the size of the detected objects, so proposing a fixed-size bounding box for objects of different sizes would not be effective.

We use the idea of anchor bounding boxes from Faster R-CNN [6] to alleviate these problems. For every POI, we generate several bounding boxes with different sizes and aspect ratios centered at the POI, as shown in Fig. 1 (b). We use 4 different sizes and 3 different aspect ratios to generate these anchors.

To account for the fact that the POI is not always mapped to the center of the object in the image coordinates, we also generate translated versions of the anchors. These translated anchors provide more accurate bounding boxes when the POI is mapped towards the right, left or bottom of the object, as shown in Fig. 1 (c)-(e).
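The sketch below shows one way such anchors could be generated for a single POI. The specific pixel sizes, aspect ratios and translation offsets are illustrative assumptions, since the paper only states that 4 sizes and 3 aspect ratios are used, and the helper name is hypothetical.

import numpy as np

def generate_anchors(poi_x, poi_y,
                     sizes=(32, 64, 128, 256),   # 4 base sizes in pixels (assumed values)
                     ratios=(0.5, 1.0, 2.0)):    # 3 aspect ratios, height/width (assumed values)
    """Generate anchor boxes (x1, y1, x2, y2) around a single POI.

    For each size/ratio pair, one box is centered at the POI and three
    translated copies place the POI on the right edge, left edge or
    bottom edge of the box (cf. Fig. 1 (b)-(e)).
    """
    anchors = []
    for s in sizes:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            # Center offsets relative to the POI:
            # centered, right-aligned, left-aligned, bottom-aligned.
            for dx, dy in [(0.0, 0.0), (-w / 2, 0.0), (w / 2, 0.0), (0.0, -h / 2)]:
                cx, cy = poi_x + dx, poi_y + dy
                anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)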
3.3. Distance Compensation

The distance of each object from the vehicle plays an important role in determining its size in the image. Generally, objects' sizes in an image have an inverse relationship with their distance from the camera. Radar detections include the range information for every detected object, which is used in this step to scale all generated anchors. We use the following formula to determine the scaling factor applied to the anchors:

S_i = \alpha \frac{1}{d_i} + \beta \qquad (2)

where d_i is the distance to the i-th object, and \alpha and \beta are two parameters used to adjust the scale factor. These parameters are learned by maximizing the Intersection Over Union (IOU) between the generated bounding boxes and the ground truth bounding boxes in each image, as shown in Eq. 3:

\operatorname*{argmax}_{\alpha, \beta} \; \sum_{i=1}^{N} \sum_{j=1}^{M_i} \max_{1 \le k \le A_i} \mathrm{IOU}^{i}_{jk}(\alpha, \beta) \qquad (3)

where N is the number of training images, M_i is the number of ground truth boxes in image i, A_i is the number of anchors generated for image i, and IOU^i_{jk} is the overlap between ground truth box j and anchor k in image i.
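A minimal sketch of this scaling step is shown below; it applies Eq. 2 to all anchors generated for one POI. Scaling the boxes about the POI (so that their alignment is preserved) and the function name are assumptions rather than details taken from the paper.

import numpy as np

def scale_anchors(anchors, poi_x, poi_y, distance, alpha, beta):
    """Scale a POI's anchors (x1, y1, x2, y2) according to its Radar range.

    Implements S_i = alpha * (1 / d_i) + beta (Eq. 2): anchors of nearby
    objects are enlarged and anchors of distant objects are shrunk.
    Boxes are scaled about the POI so centered/right/left/bottom alignment
    is preserved (an assumed design choice).
    """
    s = alpha / distance + beta
    scaled = anchors.astype(float)
    scaled[:, [0, 2]] = poi_x + (scaled[:, [0, 2]] - poi_x) * s
    scaled[:, [1, 3]] = poi_y + (scaled[:, [1, 3]] - poi_y) * s
    return scaled

The parameters alpha and beta can then be fitted, for example, with a simple grid search that maximizes the IOU objective of Eq. 3 over the training set; the paper does not specify the optimization procedure used.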
4. EXPERIMENTS AND RESULTS

4.1. Dataset

To evaluate the proposed RPN, we use the recently released NuScenes dataset. NuScenes is a publicly available large-scale dataset for autonomous driving, featuring a full sensor suite including Radars, cameras, LIDAR and GPS units. Having 3D bounding boxes for 25 object classes and 1.3M Radar sweeps, NuScenes is the first large-scale dataset to publicly provide synchronized and annotated camera and Radar data collected in highly challenging driving situations. To use this dataset in our application, we have converted all 3D bounding boxes to 2D and also merged some of the similar classes, such as child, adult and police officer. The classes used in our experiments are Car, Truck, Person, Motorcycle, Bicycle and Bus.

The NuScenes dataset includes images from 6 different cameras on the front, sides and back of the vehicle. The Radar detections are obtained from four corner Radars and one front Radar. We use two subsets of the samples available in the dataset for our experiments. The first subset contains data from the front camera and front Radar only, with 23k samples; we refer to this subset as NS-F. The second subset contains data from the rear camera and two rear Radars, in addition to all the samples from NS-F. This subset has 45k images and we call it NS-FB. Since front Radars usually have a longer range than the corner Radars, NS-F gives us more accurate detections for objects far away from the vehicle. On the other hand, NS-FB includes samples from the rear camera and Radar that are more challenging for our network. We further split each dataset with a ratio of 0.85-0.15 for training and testing, respectively.
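The 3D-to-2D box conversion mentioned above can be approximated by projecting the eight corners of each annotated 3D box into the image and taking the axis-aligned extent of the projected points. The sketch below assumes the corners are already expressed in the camera frame and that the camera intrinsic matrix is available; it is a simplified illustration rather than the exact conversion used in the released code (for instance, boxes only partially in front of the camera are not handled carefully here).

import numpy as np

def box_3d_to_2d(corners_3d, cam_intrinsic, img_w, img_h):
    """Convert a 3D box to a 2D box by projecting its 8 corners.

    corners_3d    : (8, 3) box corners in the camera frame (z pointing forward).
    cam_intrinsic : (3, 3) camera intrinsic matrix.
    Returns (x1, y1, x2, y2) clipped to the image, or None if the box
    lies entirely behind the camera.
    """
    if np.all(corners_3d[:, 2] <= 0):
        return None
    pts = (cam_intrinsic @ corners_3d.T).T       # (8, 3)
    pts = pts[:, :2] / pts[:, 2:3]               # perspective division
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    # Clip the box to the image boundaries.
    x1, x2 = np.clip([x1, x2], 0, img_w - 1)
    y1, y2 = np.clip([y1, y2], 0, img_h - 1)
    return float(x1), float(y1), float(x2), float(y2)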
Fig. 2: Detection results. Top row: ground truth, middle row: Selective Search, bottom row: RRPN.
4.2. Experiment Settings

We compare the results of detection using RRPN proposals with those of the Selective Search algorithm [3], which uses a variety of complementary image partitionings to find objects in images. In both RRPN and Selective Search, we limit the number of object proposals to 2000 per image. In the tables below, X101 and R101 denote the ResNeXt-101 [17] and ResNet-101 [16] backbone networks used in the Fast R-CNN detector, respectively.

The evaluation metrics used in our experiments are the same metrics used in the COCO dataset [18], namely mean Average Precision (AP) and mean Average Recall (AR). We also report the AP calculated at 0.5 and 0.75 IOU thresholds, as well as AR for small, medium and large object areas.
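These COCO-style metrics can be computed with the pycocotools package once the ground truth annotations and the detector outputs have been exported to COCO JSON format. The file names below are placeholders, and the paper does not state which evaluation implementation was used.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and detections in COCO JSON format (placeholder paths).
coco_gt = COCO("ns_f_test_annotations.json")
coco_dt = coco_gt.loadRes("rrpn_fast_rcnn_detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # prints AP, AP50, AP75 and AR for small/medium/large objects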
Table 1: Detection results for the NS-F and NS-FB datasets

method             | AP    | AP50  | AP75  | AR    | ARs   | ARm   | ARl
SS + X101 - F      | 0.368 | 0.543 | 0.406 | 0.407 | 0.000 | 0.277 | 0.574
SS + R101 - F      | 0.418 | 0.628 | 0.450 | 0.464 | 0.001 | 0.372 | 0.316
RRPN + X101 - F    | 0.419 | 0.652 | 0.463 | 0.478 | 0.041 | 0.406 | 0.573
RRPN + R101 - F    | 0.430 | 0.649 | 0.485 | 0.486 | 0.040 | 0.412 | 0.582
SS + X101 - FB     | 0.332 | 0.545 | 0.352 | 0.382 | 0.001 | 0.291 | 0.585
SS + R101 - FB     | 0.336 | 0.548 | 0.357 | 0.385 | 0.001 | 0.291 | 0.591
RRPN + X101 - FB   | 0.354 | 0.592 | 0.369 | 0.420 | 0.202 | 0.391 | 0.510
RRPN + R101 - FB   | 0.355 | 0.590 | 0.370 | 0.421 | 0.211 | 0.391 | 0.514

Table 2: Per-class AP for the NS-F and NS-FB datasets

method             | Car   | Truck | Person | Motorcycle | Bicycle | Bus
SS + X101 - F      | 0.424 | 0.509 | 0.117  | 0.288      | 0.190   | 0.680
SS + R101 - F      | 0.472 | 0.545 | 0.155  | 0.354      | 0.241   | 0.722
RRPN + X101 - F    | 0.428 | 0.501 | 0.212  | 0.407      | 0.304   | 0.660
RRPN + R101 - F    | 0.442 | 0.516 | 0.220  | 0.434      | 0.306   | 0.664
SS + X101 - FB     | 0.390 | 0.415 | 0.122  | 0.292      | 0.179   | 0.592
SS + R101 - FB     | 0.392 | 0.420 | 0.121  | 0.291      | 0.191   | 0.600
RRPN + X101 - FB   | 0.414 | 0.449 | 0.174  | 0.294      | 0.215   | 0.579
RRPN + R101 - FB   | 0.418 | 0.447 | 0.171  | 0.305      | 0.214   | 0.572

4.3. Results

The Fast R-CNN object detection results for the two RPN networks on the NS-F and NS-FB datasets are shown in Table 1. According to these results, RRPN outperforms Selective Search in almost all metrics. Table 2 shows the per-class AP results for the NS-F and NS-FB datasets. For the NS-F dataset, RRPN outperforms Selective Search in the Person, Motorcycle and Bicycle classes by a wide margin, while following Selective Search closely in the other classes. For the NS-FB dataset, RRPN outperforms Selective Search in all classes except the Bus class.

Figure 2 shows selected examples of the object detection results, with the first row showing the ground truth and mapped Radar detections. The next two rows show the detected bounding boxes obtained using the region proposals from Selective Search and RRPN, respectively. According to these figures, RRPN is very successful in proposing accurate bounding boxes even under hard circumstances such as object occlusion and overlap. In our experiments, RRPN was able to generate proposals for between 70 and 90 images per second, depending on the number of Radar detections, while Selective Search took 2-7 seconds per image.

5. CONCLUSION

We presented RRPN, a real-time region proposal network for object detection in autonomous driving applications. By relying only on Radar detections to propose ROIs, our method is extremely fast while at the same time achieving higher precision and recall compared to the Selective Search algorithm. Additionally, RRPN inherently performs as a sensor fusion algorithm, fusing the data obtained from Radars with vision data to obtain faster and more accurate detections. We evaluated RRPN on the NuScenes dataset and compared the results to the Selective Search algorithm. Our experiments show that RRPN operates more than 100× faster than Selective Search, while resulting in better detection average precision and recall.
6. REFERENCES

[1] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom, "nuScenes: A multimodal dataset for autonomous driving," 2019.

[2] Ross Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015.

[3] Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders, "Selective Search for Object Recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, Sep. 2013.

[4] Dale M. Grimes and Trevor Owen Jones, "Automotive Radar: A Brief Review," Proceedings of the IEEE, vol. 62, no. 6, pp. 804–822, 1974.

[5] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems, 2016, pp. 379–387.

[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Neural Information Processing Systems (NIPS), Jun. 2015.

[7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single Shot MultiBox Detector," in European Conference on Computer Vision, Dec. 2016.

[8] Petru Soviany and Radu Tudor Ionescu, "Optimizing the trade-off between single-stage and two-stage object detectors using image difficulty prediction," arXiv preprint arXiv:1803.08707, 2018.

[12] Zhengping Ji and Danil Prokhorov, "Radar-vision fusion for object classification," in Proceedings of the 11th International Conference on Information Fusion (FUSION 2008), vol. 2, pp. 265–271, 2008.

[13] Cristiano Premebida, Oswaldo Ludwig, and Urbano Nunes, "Lidar and vision-based pedestrian detection system," Journal of Field Robotics, vol. 26, no. 9, pp. 696–711, 2009.

[14] Hyunggi Cho, Young Woo Seo, B. V. K. Vijaya Kumar, and Ragunathan Raj Rajkumar, "A multi-sensor fusion system for moving object detection and tracking in urban driving environments," in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1836–1843, 2014.

[15] C. Lawrence Zitnick and Piotr Dollár, "Edge Boxes: Locating object proposals from edges," in European Conference on Computer Vision, Springer, 2014, pp. 391–405.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016.

[17] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, "Aggregated residual transformations for deep neural networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017.

[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," Lecture Notes in Computer Science, pp. 740–755, 2014.