
3D bounding box regression using two variants of Point-
Net. The segmentation network predicts the 3D mask of
the object of interest (i.e. instance segmentation); and the
regression network estimates the amodal 3D bounding box
(covering the entire object even if only part of it is visible).
In contrast to previous work that treats RGB-D data as 2D maps for CNNs, our method is more 3D-centric: we lift depth maps to 3D point clouds and process them using 3D tools. This 3D-centric view lets us exploit the 3D data more effectively. First, in our pipeline, a sequence of transformations is applied successively to the 3D coordinates, aligning point clouds into progressively more constrained and canonical frames. These alignments factor out pose variation in the data, making 3D geometry patterns more evident and the job of the 3D learners easier. Second, learning in 3D space can better exploit the geometric and topological structure of 3D space.
In principle, all objects live in 3D space; therefore, we be-
lieve that many geometric structures, such as repetition, pla-
narity, and symmetry, are more naturally parameterized and
captured by learners that directly operate in 3D space. The
usefulness of this 3D-centric network design philosophy has
been supported by much recent experimental evidence.
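As a concrete illustration of the lifting step, the sketch below back-projects a depth map into a point cloud in the camera frame under a pinhole camera model. It is an illustration only, not the implementation used in this paper; the function name and the intrinsics fx, fy, cx, cy are assumptions for the example.

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    # Lift an H x W metric depth map to an (M, 3) point cloud in camera
    # coordinates, assuming a pinhole camera with the given intrinsics.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid (column, row)
    valid = depth > 0                                # drop pixels with no depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx                     # X = (u - cx) * Z / fx
    y = (v[valid] - cy) * z / fy                     # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1)              # (M, 3) points in the camera frame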
Our method achieves leading positions on the KITTI 3D object detection [1] and bird's eye view detection [2] benchmarks. Compared with the previous state of the art [5], our method is 8.04% better on 3D car AP with high efficiency (running at 5 fps). Our method also fits well to indoor RGB-D data, where we achieve 8.9% and 6.4% better 3D mAP than [13] and [24] on SUN-RGBD while running one to three orders of magnitude faster.
The key contributions of our work are as follows:
• We propose a novel framework for RGB-D data based
3D object detection called Frustum PointNets.
• We show how we can train 3D object detectors un-
der our framework and achieve state-of-the-art perfor-
mance on standard 3D object detection benchmarks.
• We provide extensive quantitative evaluations to vali-
date our design choices as well as rich qualitative re-
sults for understanding the strengths and limitations of
our method.
2. Related Work
3D Object Detection from RGB-D Data Researchers have approached the 3D detection problem using various representations of RGB-D data.
Front view image based methods: [3, 19, 34] take monocular RGB images and shape priors or occlusion patterns to infer 3D bounding boxes. [15, 6] represent depth data as 2D maps and apply CNNs to localize objects in the 2D image. In comparison, we represent depth as a point cloud and use advanced 3D deep networks (PointNets) that can exploit 3D geometry more effectively.
Bird’s eye view based methods: MV3D [5] projects the LiDAR point cloud to a bird’s eye view and trains a region proposal network (RPN [23]) for 3D bounding box proposal. However, the method lags behind in detecting small objects such as pedestrians and cyclists, and cannot easily adapt to scenes with multiple objects in the vertical direction.
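For intuition about this representation, the sketch below rasterizes a LiDAR point cloud (x forward, y left, z up) into a binary bird's eye view occupancy grid; the ranges, resolution, and function name are illustrative assumptions and are not taken from MV3D [5].

import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), res=0.1):
    # Rasterize an (N, 3) point cloud into a binary occupancy grid on the ground
    # plane. All points sharing a ground-plane cell collapse into one pixel, which
    # is why vertically stacked objects are hard to separate in this view.
    in_range = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
                (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[in_range]
    cols = ((pts[:, 0] - x_range[0]) / res).astype(int)   # discretize x into columns
    rows = ((pts[:, 1] - y_range[0]) / res).astype(int)   # discretize y into rows
    h = int((y_range[1] - y_range[0]) / res)
    w = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.float32)
    bev[rows, cols] = 1.0                                  # mark occupied cells
    return bev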
3D based methods: [31, 28] train 3D object classifiers with SVMs on hand-designed geometric features extracted from the point cloud and then localize objects with sliding-window search. [7] extends [31] by replacing the SVM with a 3D CNN on voxelized 3D grids. [24] designs new geometric features for 3D object detection in a point cloud. [29, 14] convert a point cloud of the entire scene into a volumetric grid and use a 3D volumetric CNN for object proposal and classification. The computational cost of these methods is usually quite high due to the expense of 3D convolutions and the large 3D search space. Recently, [13] proposes a 2D-driven 3D object detection method that is similar to ours in spirit. However, they use hand-crafted features (based on histograms of point coordinates) with simple fully connected networks to regress 3D box location and pose, which is sub-optimal in both speed and performance. In contrast, we propose a more flexible and effective solution with deep 3D feature learning (PointNets).
Deep Learning on Point Clouds Most existing works convert point clouds to images or volumetric forms before feature learning. [33, 18, 21] voxelize point clouds into volumetric grids and generalize image CNNs to 3D CNNs. [16, 25, 32, 7] design more efficient 3D CNN or neural network architectures that exploit sparsity in point clouds. However, these CNN based methods still require quantization of point clouds at a certain voxel resolution. Recently, a few works [20, 22] proposed a novel type of network architecture (PointNets) that directly consumes raw point clouds without converting them to other formats. While PointNets have been applied to single object classification and semantic segmentation, our work explores how to extend the architecture for the purpose of 3D object detection.
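To make concrete what directly consuming raw point clouds means, the toy sketch below shows the core PointNet idea from [20]: an MLP shared across all points followed by a symmetric max-pooling, so the output is invariant to point order. It is a simplified illustration, not the PointNet variants used in this paper; the class name and layer sizes are assumptions.

import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    # Toy PointNet-style classifier: shared per-point MLP + max-pooling.
    def __init__(self, num_classes=10):
        super().__init__()
        # 1x1 convolutions implement an MLP applied identically to every point.
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, points):                            # points: (B, N, 3)
        feat = self.shared_mlp(points.transpose(1, 2))    # (B, 1024, N)
        global_feat = feat.max(dim=2).values              # order-invariant pooling
        return self.classifier(global_feat)               # (B, num_classes) logits

logits = TinyPointNet()(torch.randn(4, 1024, 3))          # 4 clouds of 1024 points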
3. Problem Definition
Given RGB-D data as input, our goal is to classify and localize objects in 3D space. The depth data, obtained from LiDAR or indoor depth sensors, is represented as a point cloud in RGB camera coordinates. The projection matrix is also known, so that a 3D frustum can be obtained from a 2D image region (a minimal sketch of this step is given below).
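As a sketch of that step, the function below keeps the points whose image projection falls inside a 2D region, i.e. the points lying within the frustum extruded from that region; the function name and box format are assumptions for this illustration.

import numpy as np

def points_in_frustum(points, box2d, P):
    # Select the (N, 3) camera-frame points that project inside the 2D box
    # (xmin, ymin, xmax, ymax) under the known 3x4 projection matrix P.
    xmin, ymin, xmax, ymax = box2d
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    uvw = pts_h @ P.T                                        # project onto the image plane
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    in_front = points[:, 2] > 0                              # only points in front of the camera
    return points[in_box & in_front]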
Each object is represented by a class (one among k predefined classes) and an amodal 3D bounding box. The amodal box bounds the complete object even if part of the object is occluded or truncated. The 3D box is