
Region-Based Convolutional Networks for Accurate Object Detection and Segmentation
Ross Girshick, Jeff Donahue, Student Member, IEEE,
Trevor Darrell, Member, IEEE, and Jitendra Malik, Fellow, IEEE

Abstract—Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final
years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level
image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean
average precision (mAP) by more than 50 percent relative to the previous best result on VOC 2012—achieving a mAP of 62.4 percent.
Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in
order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task,
followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call
the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available
at http://www.cs.berkeley.edu/~rbg/rcnn.

Index Terms—Object recognition, detection, semantic segmentation, convolutional networks, deep learning, transfer learning

1 INTRODUCTION

RECOGNIZING objects and localizing them in images is one of the most fundamental and challenging problems in computer vision. There has been significant progress on this problem over the last decade due largely to the use of low-level image features, such as SIFT [1] and HOG [2], in sophisticated machine learning frameworks. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [3], it is generally acknowledged that progress slowed from 2010 onward, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are semi-local orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

In this paper, we describe an object detection and segmentation system that uses multi-layer convolutional networks to compute highly discriminative, yet invariant, features. We use these features to classify image regions, which can then be output as detected bounding boxes or pixel-level segmentation masks. On the PASCAL detection benchmark, our system achieves a relative improvement of more than 50 percent mean average precision (mAP) compared to the best methods based on low-level image features. Our approach also scales well with the number of object categories, which is a long-standing challenge for existing methods.

We trace the roots of our approach to Fukushima's "neocognitron" [4], a hierarchical and shift-invariant model for pattern recognition. While the basic architecture of the neocognitron is used widely today, Fukushima's method had limited empirical success in part because it lacked a supervised training algorithm. Rumelhart et al. [5] showed that a similar architecture could be trained with supervised error backpropagation to classify synthetic renderings of the characters 'T' and 'C'. Building on this work, LeCun et al. demonstrated in an influential sequence of papers (from [6] to [7]) that stochastic gradient descent (SGD) via backpropagation was effective for training deeper networks for challenging real-world handwritten character recognition problems. These models are now known as convolutional (neural) networks, CNNs, or ConvNets.

CNNs saw heavy use in the 1990s, but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [8] rekindled interest in CNNs by showing a substantial improvement in image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9], [10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on CNNs from the 1990s (e.g., max(x, 0) "ReLU" non-linearities, "dropout" regularization, and a fast GPU implementation).

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

R. Girshick is with Microsoft Research, Redmond, WA. J. Donahue, T. Darrell, and J. Malik are with the Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA. E-mail: {jdonahue, malik}@eecs.berkeley.edu.
Manuscript received 25 Nov. 2014; revised 5 May 2015; accepted 18 May 2015. Date of publication 24 May 2015; date of current version 9 Dec. 2015. Recommended for acceptance by A. Vedaldi. Digital Object Identifier no. 10.1109/TPAMI.2015.2437384.

Fig. 1. Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large CNN, and then (4) classifies each region using class-specific linear SVMs. We trained an R-CNN that achieves a mean average precision of 62.9 percent on PASCAL VOC 2010. For comparison, [21] reports 35.1 percent mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4 percent. On the 200-class ILSVRC2013 detection dataset, we trained an R-CNN with a mAP of 31.4 percent, a large improvement over OverFeat [19], which had the previous best result at 24.3 percent mAP.

We answered this question in a conference version of this paper [11] by showing that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we bridged the gap between image classification and object detection by developing solutions to two problems: (1) How can we localize objects with a deep network? and (2) How can we train a high-capacity model with only a small quantity of annotated detection data?

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach is to frame detection as a regression problem. This formulation can work well for localizing a single object, but detecting multiple objects requires complex workarounds [12] or an ad hoc assumption about the number of objects per image [13]. An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [14], [15], hands [16], and pedestrians [17]. This approach is attractive in terms of computational efficiency; however, its straightforward application requires all objects to share a common aspect ratio. The aspect ratio problem can be addressed with mixture models (e.g., [18]), where each component specializes in a narrow band of aspect ratios, or with bounding-box regression (e.g., [18], [19]).

Instead, we solve the localization problem by operating within the "recognition using regions" paradigm [20], which has been successful for both object detection [21] and semantic segmentation [22]. At test time, our method generates around 2,000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple warping technique (anisotropic image scaling) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape. Fig. 1 shows an overview of a region-based convolutional network (R-CNN) and highlights some of our results.

A second challenge faced in detection is that labeled data are scarce and the amount currently available is insufficient for training large CNNs from random initializations. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (FT) (e.g., [17]). The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data are scarce. In our experiments, fine-tuning for detection can improve mAP by as much as 8 percentage points. After fine-tuning, our system achieves a mAP of 63 percent on VOC 2010 compared to 33 percent for the highly-tuned, HOG-based deformable part model (DPM) [18], [23].

Our original motivation for using regions was born out of a pragmatic research methodology: move from image classification to object detection as simply as possible. Since then, this design choice has proved valuable because R-CNNs are straightforward to implement and train (compared to sliding-window CNNs) and because the region-based framework provides a unified solution to object detection and segmentation.

This journal paper extends our earlier work [11] in a number of ways. First, we provide more implementation details, rationales for design decisions, and ablation studies. Second, we present new results on PASCAL detection using deeper networks. Our approach is agnostic to the particular choice of network architecture used, and we show that recent work on deeper networks (e.g., [24]) translates into large improvements in object detection. Finally, we give a head-to-head comparison of R-CNNs with the recently proposed OverFeat [19] detection system. OverFeat uses a sliding-window CNN for detection and was a top-performing method on the ILSVRC 2013 detection challenge. We train an R-CNN that significantly outperforms OverFeat, with a mAP of 31.4 percent versus 24.3 percent on the 200-class ILSVRC2013 detection dataset.

2 RELATED WORK

Deep CNNs for object detection. There were several efforts [12], [13], [19] to use convolutional networks for PASCAL-style object detection concurrent with the development of R-CNNs. Szegedy et al. [12] model object detection as a regression problem. Given an image window, they use a CNN to predict foreground pixels over a coarse grid for the whole object as well as the object's top, bottom, left and right halves. A grouping process then converts the predicted masks into detected bounding boxes. Szegedy et al. train their model from a random initialization on VOC 2012 trainval and get a mAP of 30.5 percent on VOC 2007 test. In comparison, an R-CNN using the same network architecture gets a mAP of 58.5 percent, but uses supervised ImageNet pre-training. One hypothesis is that [12] performs worse because it does not use ImageNet pre-training. Recent work from Agrawal et al. [25] shows that this is not the case; they find that an R-CNN trained from a random initialization on VOC 2007 trainval (using the same network architecture as [12]) achieves a mAP of 40.7 percent on VOC 2007 test despite using half the amount of training data as [12].

Scalability and speed. In addition to being accurate, it's important for object detection systems to scale well as the number of object categories increases. Significant effort has gone into making methods like DPM [18] scale to thousands of object categories.
For example, Dean et al. [26] replace exact filter convolutions in DPM with hashtable lookups. They show that with this technique it's possible to run 10k DPM detectors in about 5 minutes per image on a desktop workstation. However, there is an unfortunate tradeoff. When a large number of DPM detectors compete, the approximate hashing approach causes a substantial loss in detection accuracy. R-CNNs, in contrast, scale very well with the number of object classes to detect because nearly all computation is shared between all object categories. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. Although these computations scale linearly with the number of categories, the scale factor is small. Measured empirically, it takes only 30 ms longer to detect 200 classes than 20 classes on a CPU, without any approximations. This makes it feasible to rapidly detect tens of thousands of object categories without any modifications to the core algorithm.

Despite this graceful scaling behavior, an R-CNN can take 10 to 45 seconds per image on a GPU, depending on the network used, since each region is passed through the network independently. Recent work from He et al. [27] ("SPPnet") improves R-CNN efficiency by sharing computation through a feature pyramid, allowing for detection at a few frames per second. Building on SPPnet, Girshick [28] shows that it's possible to further reduce training and testing times, while improving detection accuracy and simplifying the training process, using an approach called "Fast R-CNN." Fast R-CNN reduces detection times (excluding region proposal computation) to 50 to 300 ms per image, depending on network architecture.

Localization methods. The dominant approach to object detection has been based on sliding-window detectors. This approach goes back (at least) to early face detectors [15], and continued with HOG-based pedestrian detection [2] and part-based generic object detection [18]. An alternative is to first compute a pool of (likely overlapping) image regions, each one serving as a candidate object, and then to filter these candidates in a way that aims to retain only the true objects. Multiple segmentation hypotheses were used by Hoiem et al. [29] to estimate the rough geometric scene structure and by Russell et al. [30] to automatically discover object classes in a set of images. The "selective search" algorithm of van de Sande et al. [21] popularized the multiple segmentation approach for object detection by showing strong results on PASCAL object detection. Our approach was inspired by the success of selective search.

Object proposal generation is now an active research area. EdgeBoxes [31] outputs high-quality rectangular (box) proposals quickly (0.3 s per image). BING [32] generates box proposals at 3 ms per image, however it has subsequently been shown that the proposal quality is too poor to be useful in R-CNNs [33]. Other methods focus on pixel-wise segmentation, producing regions instead of boxes. These approaches include RIGOR [34] and MCG [35], which take 10 to 30 s per image, and GOP [36], a faster method that takes 1 s per image. For a more in-depth survey of proposal algorithms, Hosang et al. [33] provide an insightful meta-evaluation of recent methods.

Transfer learning. R-CNN training is based on inductive transfer learning, using the taxonomy of Pan and Yang [37]. To train an R-CNN, we typically start with ImageNet classification as a source task and dataset, train a network using supervision, and then transfer that network to the target task and dataset using supervised fine-tuning. This method is related to traditional multi-task learning [38], [39], except that we train for the tasks sequentially and are ultimately only interested in performing well on the target task.

This strategy is different from the dominant paradigm in recent neural network literature of unsupervised transfer learning (see [40] for a survey covering unsupervised pre-training and representation learning more generally). Supervised transfer learning using CNNs, but without fine-tuning, was also investigated in concurrent work by Donahue et al. [41]. They show that Krizhevsky et al.'s CNN, once trained on ImageNet, can be used as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation. Hoffman et al. [42] show how transfer learning can be used to train R-CNNs for classes that have image-level labels, but no bounding-box training data. Their approach is based on modeling the task shift from image classification to object detection and then transferring that knowledge to classes that have no detection training data.

R-CNN extensions. Since their introduction, R-CNNs have been extended to a variety of new tasks and datasets. Karpathy et al. [43] learn a model for bi-directional image and sentence retrieval. Their image representation is derived from an R-CNN trained to detect 200 classes on the ILSVRC2013 detection dataset. Gkioxari et al. [44] use multi-task learning to train R-CNNs for person detection, 2D pose estimation, and action recognition. Hariharan et al. [45] propose a unification of the object detection and semantic segmentation tasks, termed "simultaneous detection and segmentation" (SDS), and train a two-column R-CNN for this task. They show that a single region proposal algorithm (MCG [35]) can be used effectively for traditional bounding-box detection as well as semantic segmentation. Their PASCAL segmentation results improve significantly on the ones reported in this paper. Gupta et al. [46] extend R-CNNs to object detection in depth images. They show that a well-designed input signal, where the depth map is augmented with height above ground and local surface orientation with respect to gravity, allows training an R-CNN that outperforms existing RGB-D object detection baselines. Song et al. [47] train an R-CNN using weak, image-level supervision by mining for positive training examples using a submodular cover algorithm and then training a latent SVM.

Many systems based on, or implementing, R-CNNs were used in the recent ILSVRC2014 object detection challenge [48], resulting in substantial improvements in detection accuracy. In particular, the winning method, GoogLeNet [49], [50], uses an innovative network design in an R-CNN. With a single network (and a slightly simpler pipeline that excludes SVM training and bounding-box regression), they improve R-CNN performance to 38.0 percent mAP from a baseline of 34.5 percent. They also show that an ensemble of six networks improves their result to 43.9 percent mAP.

3 OBJECT DETECTION WITH AN R-CNN


Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a convolutional network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and ILSVRC2013.

3.1 Module Design

3.1.1 Region Proposals
A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [51], selective search [21], category-independent object proposals [52], constrained parametric min-cuts (CPMC) [22], multi-scale combinatorial grouping [35], and Cireşan et al. [53], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [21], [54]).

3.1.2 Feature Extraction
We extract a fixed-length feature vector from each region proposal using a CNN. The particular CNN architecture used is a system hyperparameter. Most of our experiments use the Caffe [55] implementation of the CNN described by Krizhevsky et al. [8] (TorontoNet), however we have also experimented with the 16-layer deep network from Simonyan and Zisserman [24] (OxfordNet). In both cases, the feature vectors are 4,096-dimensional. Features are computed by forward propagating a mean-subtracted S × S RGB image through the network and reading off the values output by the penultimate layer (the layer just before the softmax classifier). For TorontoNet, S = 227 and for OxfordNet S = 224. We refer readers to [8], [24], [55] for more network architecture details.

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed S × S pixel size).1 Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Fig. 2 shows a random sampling of warped training regions. Alternatives to warping are discussed in Section 7.1.

1. Of course the entire network can be run convolutionally, which enables handling arbitrary input sizes, however then the output size is no longer a fixed-length vector. The output can be converted to a fixed-length vector through another transformation, such as in [27].
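To make the warping step concrete, here is a minimal sketch of context-padded, anisotropic warping. It is not the released implementation: the helper name warp_region, the (x1, y1, x2, y2) box convention, and the use of OpenCV for resampling are all assumptions made for illustration.

import numpy as np
import cv2  # used only for resizing; any resampling routine would do

def warp_region(image, box, S=227, p=16):
    """Crop `box` plus enough surrounding context that, after anisotropic
    scaling to S x S, exactly p warped pixels of context surround the
    original box. `box` is assumed to be (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Context needed in image coordinates so that it maps to p warped pixels.
    pad_x = p * w / float(S - 2 * p)
    pad_y = p * h / float(S - 2 * p)
    X1 = int(max(0, np.floor(x1 - pad_x)))
    Y1 = int(max(0, np.floor(y1 - pad_y)))
    X2 = int(min(image.shape[1], np.ceil(x2 + pad_x)))
    Y2 = int(min(image.shape[0], np.ceil(y2 + pad_y)))
    crop = image[Y1:Y2, X1:X2]
    # Anisotropic ("squash") scaling to the fixed network input size.
    return cv2.resize(crop, (S, S), interpolation=cv2.INTER_LINEAR)

Near image borders the dilated box is clipped in this sketch, so fewer than p context pixels may remain; a fuller implementation would pad the crop (for example with the dataset mean) instead of simply clipping, and would subtract the mean image before feeding the warp to the network.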
techniques, such as hashing. Even if there were 100 k clas-
In order to compute features for a region proposal, we
ses, the resulting matrix multiplication takes only 10 sec-
must first convert the image data in that region into a form
onds on a modern multi-core CPU. This efficiency is not
that is compatible with the CNN (its architecture requires
merely the result of using region proposals and shared fea-
inputs of a fixed S  S pixel size).1 Of the many possible
tures. The UVA system, due to its high-dimensional fea-
transformations of our arbitrary-shaped regions, we opt for
tures, would be two orders of magnitude slower while
the simplest. Regardless of the size or aspect ratio of the can-
requiring 134 GB of memory just to store 100 k linear pre-
didate region, we warp all pixels in a tight bounding box
dictors, compared to just 1.5 GB for our lower-dimensional
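The per-class greedy NMS just described can be sketched in a few lines of NumPy; the function names and the (x1, y1, x2, y2) box convention are ours, not the released code.

import numpy as np

def iou(box, boxes):
    """Intersection-over-union between one box and an array of boxes,
    all in (x1, y1, x2, y2) format."""
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, thresh):
    """Keep the highest-scoring box, reject remaining boxes whose IoU with
    it exceeds `thresh`, and repeat; returns kept indices, best first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= thresh]
    return keep

At detection time this routine is run once per class on that class's scored proposals, with the overlap threshold learned on a validation set.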
3.2.1 Run-Time Analysis
Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [21], for example, are two orders of magnitude larger than ours (360k versus 4k-dimensional).

The result of such sharing is that the time spent computing region proposals and features (10 s/image on an NVIDIA Titan Black GPU or 53 s/image on a CPU, using TorontoNet) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2,000 × 4,096 and the SVM weight matrix is 4,096 × N, where N is the number of classes.

This analysis shows that R-CNNs can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134 GB of memory just to store 100k linear predictors, compared to just 1.5 GB for our lower-dimensional features.

It is also interesting to contrast R-CNNs with the recent work from Dean et al. on scalable detection using DPMs and hashing [56]. They report a mAP of around 16 percent on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made, mAP would remain at 59 percent with TorontoNet and 66 percent with OxfordNet (Section 4.2).
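In code, the class-specific stage reduces to one matrix product with the shapes quoted above; the random arrays below merely stand in for real CNN features and learned SVM weights.

import numpy as np

# ~2,000 proposals x 4,096-dim CNN features for one image (random stand-ins).
features = np.random.randn(2000, 4096).astype(np.float32)
W = np.random.randn(4096, 20).astype(np.float32)  # one linear SVM per class (N = 20 for VOC)
b = np.random.randn(20).astype(np.float32)        # per-class biases

scores = features @ W + b   # (2000, N): every proposal scored for every class at once
# Per-class greedy NMS (sketched above) is then run on each column of `scores`.

Because the features are shared, adding object classes only widens W; the per-image cost grows by one extra column per class.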

TABLE 1
Detection Average Precision (Percent) on VOC 2010 Test

VOC 2010 test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
DPM v5 [23] 49.2 53.8 13.1 15.3 35.5 53.4 49.7 27.0 17.2 28.8 14.7 17.8 46.4 51.2 47.7 10.8 34.2 20.7 43.8 38.3 33.4
UVA [21] 56.2 42.4 15.3 12.6 21.8 49.3 36.8 46.1 12.9 32.1 30.0 36.5 43.5 52.9 32.9 15.3 41.1 31.8 47.0 44.8 35.1
Regionlets [54] 65.0 48.9 25.9 24.6 24.5 56.1 54.5 51.2 17.0 28.9 30.2 35.8 40.2 55.7 43.5 14.3 43.9 32.6 54.0 45.9 39.7
SegDPM [57] 61.4 53.4 25.6 25.2 35.5 51.7 50.6 50.8 19.3 33.8 26.8 40.4 48.3 54.4 47.1 14.8 38.7 35.0 52.8 43.1 40.4
R-CNN T-Net 67.1 64.1 46.7 32.0 30.5 56.4 57.2 65.9 27.0 47.3 40.9 66.6 57.8 65.9 53.6 26.7 56.5 38.1 52.8 50.2 50.2
R-CNN T-Net BB 71.8 65.8 53.0 36.8 35.9 59.7 60.0 69.9 27.9 50.6 41.4 70.0 62.0 69.0 58.1 29.5 59.4 39.3 61.2 52.4 53.7
R-CNN O-Net 76.5 70.4 58.0 40.2 39.6 61.8 63.7 81.0 36.2 64.5 45.7 80.5 71.9 74.3 60.6 31.5 64.7 52.5 64.6 57.2 59.8
R-CNN O-Net BB 79.3 72.4 63.1 44.0 44.4 64.6 66.3 84.9 38.8 67.3 48.4 82.3 75.0 76.7 65.7 35.8 66.2 54.8 69.1 58.8 62.9

T-Net stands for TorontoNet and O-Net for OxfordNet (Section 3.1.2). R-CNNs are most directly comparable to UVA and Regionlets since all methods use selec-
tive search region proposals. Bounding-box regression is described in Section 7.3. At publication time, SegDPM was the top-performer on the PASCAL VOC
leaderboard. DPM and SegDPM use context rescoring not used by the other methods. SegDPM and all R-CNNs use additional training data.

3.3 Training

3.3.1 Supervised Pre-Training
We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data). Pre-training was performed using the open source Caffe CNN library [55].

3.3.2 Domain-Specific Fine-Tuning
To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent training of the CNN parameters using only warped region proposals. Aside from replacing the CNN's ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged. For VOC, N = 20 and for ILSVRC2013, N = 200. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box's class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a minibatch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background. OxfordNet requires more memory than TorontoNet, making it necessary to decrease the minibatch size in order to fit on a single GPU. We decreased the batch size from 128 to just 24 while maintaining the same biased sampling scheme.
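A minimal sketch of this biased minibatch sampling follows. The index arrays fg_idx and bg_idx, which hold proposals labeled positive and background under the 0.5 IoU rule, are hypothetical inputs, and the helper is ours, not the released code.

import numpy as np

def sample_minibatch(fg_idx, bg_idx, rng, n_fg=32, n_bg=96):
    """fg_idx: indices of warped proposals with >= 0.5 IoU against a
    ground-truth box (labeled with that box's class); bg_idx: all other
    proposals. Returns indices for one minibatch of n_fg + n_bg windows
    (assumes bg_idx is large enough to top up the batch, as it is in practice)."""
    fg = rng.choice(fg_idx, size=min(n_fg, len(fg_idx)), replace=False)
    bg = rng.choice(bg_idx, size=n_fg + n_bg - len(fg), replace=False)
    return np.concatenate([fg, bg])

rng = np.random.default_rng(0)
batch = sample_minibatch(fg_idx=np.arange(50), bg_idx=np.arange(50, 3000), rng=rng)

The same 1:3 positive-to-background ratio can be kept when the batch is shrunk to 24 examples for OxfordNet.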
3.3.3 Object Category Classifiers
Consider training a binary classifier to detect cars. It's clear that an image region tightly enclosing a car should be a positive example. Similarly, it's clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [21], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by four points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data are too large to fit in memory, we adopt the standard hard negative mining method [18], [58]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.

In Section 7.2 we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss the trade-offs involved in training detection SVMs rather than simply using the outputs from the final softmax layer of the fine-tuned CNN.
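The labeling rule and one round of hard negative mining can be sketched as follows. Here iou is the helper sketched in Section 3.2, the margin value and function names are our assumptions, and the loop shown is a simplification of the standard procedure [18], [58], not the released training code.

import numpy as np

def svm_training_labels(proposals, gt_boxes, neg_thresh=0.3):
    """Per-class labels for SVM training: the ground-truth boxes themselves
    are the positives (handled separately); proposals whose max IoU with any
    ground-truth box of the class is below neg_thresh are negatives (-1);
    everything in between is ignored (0)."""
    labels = np.zeros(len(proposals), dtype=int)
    for i, p in enumerate(proposals):
        max_iou = iou(p, gt_boxes).max() if len(gt_boxes) else 0.0
        if max_iou < neg_thresh:
            labels[i] = -1
    return labels

def hard_negatives(w, b, neg_features, margin=-1.0):
    """One mining step: return negatives that the current SVM (weights w,
    bias b) scores above the margin; these are added to the working set
    before the SVM is retrained."""
    scores = neg_features @ w + b
    return neg_features[scores > margin]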
3.4 Results on PASCAL VOC 2010-12
Following the PASCAL VOC best practices [3], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 4.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding-box regression).

Table 1 shows complete results on VOC 2010.2 We compare our method against four strong baselines, including SegDPM [57], which combines DPM detectors with the output of a semantic segmentation system [59] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [21], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGB-SIFT descriptors, each vector quantized with 4,000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1 percent to 53.7 percent mAP with TorontoNet and 62.9 percent with OxfordNet, while also being much faster. R-CNNs achieve similar performance (53.3 percent / 62.4 percent mAP) on VOC 2012 test.

2. We use VOC 2010 because there are more published results compared to 2012. Additionally, VOC 2010, 2011, 2012 are very similar datasets, with 2011 and 2012 being identical (for the detection task).

Fig. 3. (Left) Mean average precision on the ILSVRC2013 detection test set. Methods preceded by * use outside training data (images and labels from the ILSVRC classification dataset in all cases). (Right) Box plots for the 200 average precision values per method. A box plot for the post-competition OverFeat result is not shown because per-class APs are not yet available. The red line marks the median AP, the box bottom and top are the 25th and 75th percentiles. The whiskers extend to the min and max AP of each method. Each AP is plotted as a green dot over the whiskers (best viewed digitally with zoom).

3.5 Results on ILSVRC2013 Detection
We ran an R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.

Fig. 3 compares our R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result [19]. Using TorontoNet, our R-CNN achieves a mAP of 31.4 percent, which is significantly ahead of the second-best result of 24.3 percent from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented. Most of the competing submissions (OverFeat, NEC-MU, Toronto A, and UIUC-IFP) used convolutional networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes. Notably, UvA-Euvision's entry did not use CNNs and was based on a fast VLAD encoding [60].

In Section 5, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when training R-CNNs on it.

4 ANALYSIS

4.1 Visualizing Learned Features
First-layer filters can be visualized directly and are easy to understand [8]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [63]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit's activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit "speak for itself" by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

We visualize units from layer pool5 of a TorontoNet, which is the max-pooled output of the network's fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9,216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195 × 195 pixels in the original 227 × 227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.

Each row in Fig. 4 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized. These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features. Agrawal et al. [25] provide a more in-depth analysis of the learned features.
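The procedure amounts to treating one pool5 unit as a detector. A sketch follows; greedy_nms is the routine sketched in Section 3.2, and both the NMS threshold and the way per-proposal activations are obtained are assumptions for illustration, not values taken from the paper.

def top_regions_for_unit(unit_activations, boxes, k=16, nms_thresh=0.3):
    """unit_activations: one pool5 unit's response on each held-out proposal;
    boxes: the corresponding proposal boxes. Rank proposals by activation,
    suppress near-duplicates, and keep the top k for display."""
    keep = greedy_nms(boxes, unit_activations, nms_thresh)  # highest activation first
    return keep[:k]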
4.2 Ablation Studies

4.2.1 Performance Layer-by-Layer, without Fine-Tuning
To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the TorontoNet's last three layers. Layer pool5 was briefly described in Section 4.1. The final two layers are summarized below.

Fig. 4. Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people
(row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).

Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4,096 × 9,216 weight matrix by the pool5 feature map (reshaped as a 9,216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4,096 × 4,096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.
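In code, the two layers are simply matrix products followed by the rectification above; random arrays stand in for the learned weights.

import numpy as np

relu = lambda x: np.maximum(0, x)            # half-wave rectification

pool5 = np.random.randn(6 * 6 * 256)         # 9,216-dim pool5 map, flattened
W6, b6 = np.random.randn(4096, 9216), np.random.randn(4096)
W7, b7 = np.random.randn(4096, 4096), np.random.randn(4096)

fc6 = relu(W6 @ pool5 + b6)                  # 4,096-dim feature vector
fc7 = relu(W7 @ fc6 + b7)                    # 4,096-dim feature vector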
We start by looking at results from the CNN without fine-tuning on PASCAL, i.e., all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29 percent, or about 16.8 million, of the CNN's parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6 percent of the CNN's parameters. Much of the CNN's representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.

4.2.2 Performance Layer-by-Layer, with Fine-Tuning
We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2 percent. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

4.2.3 Comparison to Recent Feature Learning Methods
Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [23].

The first DPM feature learning method, DPM ST [61], augments HOG features with histograms of "sketch token" probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch.

TABLE 2
Detection Average Precision (Percent) on VOC 2007 Test

VOC 2007 test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
R-CNN pool5 51.8 60.2 36.4 27.8 23.2 52.8 60.6 49.2 18.3 47.8 44.3 40.8 56.6 58.7 42.4 23.4 46.1 36.7 51.3 55.7 44.2
R-CNN fc6 59.3 61.8 43.1 34.0 25.1 53.1 60.6 52.8 21.7 47.8 42.7 47.8 52.5 58.5 44.6 25.6 48.3 34.0 53.1 58.0 46.2
R-CNN fc7 57.6 57.9 38.5 31.8 23.7 51.2 58.9 51.4 20.0 50.5 40.9 46.0 51.6 55.9 43.3 23.3 48.1 35.3 51.0 57.4 44.7
R-CNN FT pool5 58.2 63.3 37.9 27.6 26.1 54.1 66.9 51.4 26.7 55.5 43.4 43.1 57.7 59.0 45.8 28.1 50.8 40.6 53.1 56.4 47.3
R-CNN FT fc6 63.5 66.0 47.9 37.7 29.9 62.5 70.2 60.2 32.0 57.9 47.0 53.5 60.1 64.2 52.2 31.3 55.0 50.0 57.7 63.0 53.1
R-CNN FT fc7 64.2 69.7 50.0 41.9 32.0 62.6 71.0 60.7 32.7 58.5 46.5 56.1 60.6 66.8 54.2 31.5 52.8 48.9 57.9 64.7 54.2
R-CNN FT fc7 BB 68.1 72.8 56.8 43.0 36.8 66.3 74.2 67.6 34.4 63.5 54.5 61.2 69.1 68.6 58.7 33.4 62.9 51.1 62.5 64.8 58.5
DPM v5 [23] 33.2 60.3 10.2 16.1 27.3 54.3 58.2 23.0 20.0 24.1 26.7 12.7 58.1 48.2 43.2 12.0 21.1 36.1 46.0 43.5 33.7
DPM ST [61] 23.8 58.2 10.5 8.5 27.1 50.4 52.0 7.3 19.2 22.8 18.1 8.0 55.9 44.8 32.4 13.3 15.9 22.8 46.2 44.9 29.1
DPM HSC [62] 32.2 58.3 11.5 16.3 30.6 49.9 54.8 23.5 21.5 27.7 34.0 13.7 58.1 51.6 39.9 12.4 23.5 34.4 47.4 45.2 34.3

Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC
2007 trainval. Row 7 includes a simple bounding-box regression stage that reduces localization errors (Section 7.3). Rows 8-10 present DPM methods as a strong
baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG. All R-CNN results use TorontoNet.

TABLE 3
Detection Average Precision (Percent) on VOC 2007 Test for Two Different CNN Architectures

VOC 2007 test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
R-CNN T-Net 64.2 69.7 50.0 41.9 32.0 62.6 71.0 60.7 32.7 58.5 46.5 56.1 60.6 66.8 54.2 31.5 52.8 48.9 57.9 64.7 54.2
R-CNN T-Net BB 68.1 72.8 56.8 43.0 36.8 66.3 74.2 67.6 34.4 63.5 54.5 61.2 69.1 68.6 58.7 33.4 62.9 51.1 62.5 64.8 58.5
R-CNN O-Net 71.6 73.5 58.1 42.2 39.4 70.7 76.0 74.5 38.7 71.0 56.9 74.5 67.9 69.6 59.3 35.7 62.1 64.0 66.5 71.2 62.2
R-CNN O-Net BB 73.4 77.0 63.4 45.4 44.6 75.1 78.1 79.8 40.5 73.7 62.2 79.4 78.1 73.1 64.2 35.6 66.8 67.2 70.4 71.1 66.0

The first two rows are results from Table 2 using Krizhevsky et al.’s TorontoNet architecture (T-Net). Rows three and four use the recently proposed 16-layer
OxfordNet architecture (O-Net) from Simonyan and Zisserman [24].

Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35 × 35 pixel patches into one of 150 sketch tokens or background.

The second method, DPM HSC [62], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2 percent vs. 33.7 percent—a 61 percent relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by four mAP points (when compared internally to their private DPM baselines—both use non-public implementations of DPM that underperform the open source version [23]). These methods achieve mAPs of 29.1 percent and 34.3 percent, respectively.

4.3 Network Architectures
Most results in this paper use the TorontoNet network architecture from Krizhevsky et al. [8]. However, we have found that the choice of architecture has a large effect on R-CNN detection performance. In Table 3, we show results on VOC 2007 test using the 16-layer deep OxfordNet recently proposed by Simonyan and Zisserman [24]. This network was one of the top performers in the recent ILSVRC 2014 classification challenge. The network has a homogeneous structure consisting of 13 layers of 3 × 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers.

To use OxfordNet in an R-CNN, we downloaded the publicly available pre-trained network weights for the VGG_ILSVRC_16_layers model from the Caffe Model Zoo.3 We then fine-tuned the network using the same protocol as we used for TorontoNet. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory. The results in Table 3 show that an R-CNN with OxfordNet substantially outperforms an R-CNN with TorontoNet, increasing mAP from 58.5 percent to 66.0 percent. However, there is a considerable drawback in terms of compute time, with the forward pass of OxfordNet taking roughly seven times longer than TorontoNet.

3. https://github.com/BVLC/caffe/wiki/Model-Zoo

From a transfer learning point of view, it is very encouraging that large improvements in image classification translate directly into large improvements in object detection.

4.4 Detection Error Analysis
We applied the excellent detection analysis tool from Hoiem et al. [64] in order to reveal our method's error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [64] to understand some finer details (such as "normalized AP"). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figs. 5 and 6.

Fig. 5. Distribution of top-ranked false positive (FP) types for R-CNNs with TorontoNet. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc—poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim—confusion with a similar category; Oth—confusion with a dissimilar object category; BG—a FP that fired on background. Compared with DPM (see [64]), significantly more of our errors result from poor localization, rather than confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification. Column three shows how our simple bounding-box regression method fixes many localization errors.

Fig. 6. Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [64]) for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding-box area, aspect ratio, viewpoint, part visibility). For example, bounding-box area comprises the subsets extra-small, small, ..., extra-large. We show plots for our method (R-CNN) with and without fine-tuning and bounding-box regression as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and min), but does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does more than simply improve the lowest performing subsets for aspect ratio and bounding-box area, as one might conjecture based on how we warp network inputs. Instead, fine-tuning improves robustness for all characteristics including occlusion, truncation, viewpoint, and part visibility.

4.5 Bounding-Box Regression
Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [18], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in Section 7.3. Results in Tables 1, 2, and Fig. 5 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by three to four points.
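Section 7.3 (not reproduced in this excerpt) gives the exact formulation; the sketch below assumes the commonly used R-CNN parameterization, in which class-specific ridge regressors map a proposal's pool5 features to offsets that move the proposal toward its matched ground-truth box. The regularization strength and the helper names are placeholders, not values confirmed by the text above.

import numpy as np

def regression_targets(P, G):
    """Targets for one (proposal, ground-truth) pair; boxes are (x, y, w, h)
    with (x, y) the box center. Assumed standard R-CNN parameterization."""
    tx = (G[0] - P[0]) / P[2]
    ty = (G[1] - P[1]) / P[3]
    tw = np.log(G[2] / P[2])
    th = np.log(G[3] / P[3])
    return np.array([tx, ty, tw, th])

def fit_bbox_regressor(pool5_feats, targets, lam=1000.0):
    """Ridge regression from pool5 features (n, 9216) to targets (n, 4)."""
    X = pool5_feats
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)   # (9216, 4) weight matrix

At test time the predicted offsets are inverted to produce the new window: the proposal center is shifted by (tx·w, ty·h) and its width and height are scaled by exp(tw) and exp(th).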

4.6 Qualitative Results
Qualitative detection results on ILSVRC2013 are presented in Fig. 8. Each image was sampled randomly from the val2 set and all detections from all detectors with a precision greater than 0.5 are shown. Note that these are not curated and give a realistic impression of the detectors in action.

5 THE ILSVRC2013 DETECTION DATASET

In Section 3 we presented results on the ILSVRC2013 detection dataset. This dataset is less homogeneous than PASCAL VOC, requiring choices about how to use it. Since these decisions are non-trivial, we cover them in this section. The methodology and "val1" and "val2" data splits introduced in this section were used widely by participants in the ILSVRC2014 detection challenge.

5.1 Dataset Overview
The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), where the number of images in each set is in parentheses. The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images. The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. The train set, in contrast, is drawn from the ILSVRC2013 classification image distribution. These images have more variable complexity with a skew towards images of a single centered object. Unlike val and test, the train images (due to their large number) are not exhaustively annotated. In any given train image, instances from the 200 classes may or may not be labeled. In addition to these image sets, each class has an extra set of negative images. Negative images are manually checked to validate that they do not contain any instances of their associated class. The negative image sets were not used in this work. More information on how ILSVRC was collected and annotated can be found in [65], [66].

The nature of these splits presents a number of choices for training an R-CNN. The train images cannot be used for hard negative mining, because annotations are not exhaustive. Where should negative examples come from? Also, the train images have different statistics than val and test. Should the train images be used at all, and if so, to what extent? While we have not thoroughly evaluated a large number of choices, we present what seemed like the most obvious path based on previous experience.

Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples. To use val for both training and validation, we split it into roughly equally sized "val1" and "val2" sets. Since some classes have very few examples in val (the smallest has only 31 and half have fewer than 110), it is important to produce an approximately class-balanced partition. To do this, a large number of candidate splits were generated and the one with the smallest maximum relative class imbalance was selected.4 Each candidate split was generated by clustering val images using their class counts as features, followed by a randomized local search that may improve the split balance. The particular split used here has a maximum relative imbalance of about 11 percent and a median relative imbalance of 4 percent. The val1/val2 split and the code used to produce them are publicly available in the R-CNN code repository, allowing other researchers to compare their methods on the val splits used in this report.

4. Relative imbalance is measured as |a − b| / (a + b) where a and b are class counts in each half of the split.
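The selection criterion from footnote 4 is easy to express. The sketch below covers only the scoring and selection of candidate splits; the clustering and randomized local search that generate the candidates are omitted, and all names are ours.

import numpy as np

def max_relative_imbalance(counts_val1, counts_val2):
    """counts_val1/counts_val2: per-class image counts (length-200 arrays for
    ILSVRC2013) in the two halves of a candidate split; returns the
    worst-case |a - b| / (a + b) over classes."""
    a = np.asarray(counts_val1, dtype=float)
    b = np.asarray(counts_val2, dtype=float)
    return np.max(np.abs(a - b) / (a + b))

def pick_split(candidate_splits):
    """Choose the candidate with the smallest maximum relative class
    imbalance; each candidate is a (counts_val1, counts_val2) pair."""
    return min(candidate_splits, key=lambda s: max_relative_imbalance(*s))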
5.2 Region Proposals
We followed the same region proposal approach that was used for detection on PASCAL. Selective search [21] was run in "fast mode" on each image in val1, val2, and test (but not on images in train). One minor modification was required to deal with the fact that selective search is not scale invariant and so the number of regions produced depends on the image resolution. ILSVRC image sizes range from very small to a few that are several mega-pixels, and so we resized each image to a fixed width (500 pixels) before running selective search. On val, selective search resulted in an average of 2,403 region proposals per image with a 91.6 percent recall of all ground-truth bounding boxes (at 0.5 IoU threshold). This recall is notably lower than in PASCAL, where it is approximately 98 percent, indicating significant room for improvement in the region proposal stage.

TABLE 4
ILSVRC2013 Ablation Study of Data Usage Choices, Fine-Tuning, and Bounding-Box Regression

test set val2 val2 val2 val2 val2 val2 test test
SVM training set val1 val1 + train.5k val1 + train1k val1 + train1k val1 + train1k val1 + train1k val + train1k val + train1k
CNN fine-tuning set n/a n/a n/a val1 val1 + train1k val1 + train1k val1 + train1k val1 + train1k
bbox reg set n/a n/a n/a n/a n/a val1 n/a val
CNN feature layer fc6 fc6 fc6 fc7 fc7 fc7 fc7 fc7
mAP 20.9 24.1 24.1 26.5 29.7 31.0 30.2 31.4
median AP 17.7 21.0 21.4 24.8 29.2 29.6 29.0 30.3

All experiments use TorontoNet.

before running selective search. On val, selective search resulted in an average of 2,403 region proposals per image with a 91.6 percent recall of all ground-truth bounding boxes (at 0.5 IoU threshold). This recall is notably lower than in PASCAL, where it is approximately 98 percent, indicating significant room for improvement in the region proposal stage.

5.3 Training Data
For training data, we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, then we take all of them). We'll call this dataset of images and boxes val1+trainN. In an ablation study, we show mAP on val2 for N ∈ {0, 500, 1000} (Section 5.5).

Training data are required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training. CNN fine-tuning was run for 50k SGD iterations on val1+trainN using the exact same settings as were used for PASCAL. Fine-tuning on a single NVIDIA Tesla K20 took 13 hours using Caffe. For SVM training, all ground-truth boxes from val1+trainN were used as positive examples for their respective classes. Hard negative mining was performed on a randomly selected subset of 5,000 images from val1. An initial experiment indicated that mining negatives from all of val1, versus a 5,000 image subset (roughly half of it), resulted in only a 0.5 percentage point drop in mAP, while cutting SVM training time in half. No negative examples were taken from train because the annotations are not exhaustive. The extra sets of verified negative images were not used. The bounding-box regressors were trained on val1.
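As a rough illustration of the hard negative mining procedure just described, the sketch below alternates between training a linear SVM and adding margin-violating negatives to the cache. The features, SVM C value, class weights, margin threshold, and number of rounds are illustrative assumptions, not the settings used in the paper (where the features would be 4,096-d fc6/fc7 activations of warped proposals).

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-ins for CNN features of ground-truth boxes and of proposals
# drawn from the 5,000-image negative-mining subset.
rng = np.random.default_rng(0)
pos_feats = rng.normal(1.0, 1.0, size=(200, 128))
neg_pool = rng.normal(-1.0, 1.0, size=(5000, 128))

def train_svm(pos, neg):
    X = np.vstack([pos, neg])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
    # C and class weights are placeholders for whatever values were cross-validated.
    return LinearSVC(C=1e-3, class_weight={1: 2, -1: 1}).fit(X, y)

# Start from a small random negative set, then repeatedly add "hard" negatives:
# proposals that the current SVM scores above the margin.
neg_cache = neg_pool[rng.choice(len(neg_pool), 500, replace=False)]
for round_ in range(3):
    svm = train_svm(pos_feats, neg_cache)
    scores = svm.decision_function(neg_pool)
    hard = neg_pool[scores > -1.0]                      # margin violators
    neg_cache = np.unique(np.vstack([neg_cache, hard]), axis=0)
    print(f"round {round_}: negative cache size {len(neg_cache)}")
```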
5.4 Validation and Evaluation
Before submitting results to the evaluation server, we validated data usage choices and the effect of fine-tuning and bounding-box regression on the val2 set using the training data described above. All system hyperparameters (e.g., SVM C hyperparameters, padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were fixed at the same values used for PASCAL. Undoubtedly some of these hyperparameter choices are slightly suboptimal for ILSVRC; however, the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning. After selecting the best choices on val2, we submitted exactly two result files to the ILSVRC2013 evaluation server. The first submission was without bounding-box regression and the second submission was with bounding-box regression. For these submissions, we expanded the SVM and bounding-box regressor training sets to use val+train1k and val, respectively. We used the CNN that was fine-tuned on val1+train1k to avoid re-running fine-tuning and feature computation.

5.5 Ablation Study
Table 4 shows an ablation study of the effects of different amounts of training data, fine-tuning, and bounding-box regression. A first observation is that mAP on val2 matches mAP on test very closely. This gives us confidence that mAP on val2 is a good indicator of test set performance. The first result, 20.9 percent, is what R-CNN achieves using a CNN pre-trained on the ILSVRC2012 classification dataset (no fine-tuning) and given access to the small amount of training data in val1 (recall that half of the classes in val1 have between 15 and 55 examples). Expanding the training set to val1+trainN improves performance to 24.1 percent, with essentially no difference between N = 500 and N = 1000. Fine-tuning the CNN using examples from just val1 gives a modest improvement to 26.5 percent; however, there is likely significant overfitting due to the small number of positive training examples. Expanding the fine-tuning set to val1+train1k, which adds up to 1000 positive examples per class from the train set, helps significantly, boosting mAP to 29.7 percent. Bounding-box regression improves results to 31.0 percent, which is a smaller relative gain than what was observed in PASCAL.

5.6 Relationship to OverFeat
There is an interesting relationship between R-CNN and OverFeat: OverFeat can be seen (roughly) as a special case of an R-CNN. If one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9× faster, based on a figure of 2 seconds per image quoted from [19]. This speed comes from the fact that OverFeat's sliding windows (i.e., region proposals) are not warped at the image level and therefore computation can be easily shared between overlapping windows. Sharing is implemented by running the entire network in a convolutional fashion over arbitrary-sized inputs. OverFeat is slower than the pyramid-based version of R-CNN from He et al. [27].
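To make the analogy concrete, the snippet below sketches how a multi-scale pyramid of regular square windows could be enumerated in place of selective search proposals. The scales and stride are illustrative assumptions and do not correspond to the configuration used by OverFeat or by R-CNN.

```python
from typing import List, Tuple

def square_window_pyramid(img_w: int, img_h: int,
                          scales: Tuple[int, ...] = (64, 128, 256, 512),
                          stride_frac: float = 0.5) -> List[Tuple[int, int, int, int]]:
    """Enumerate square boxes (x1, y1, x2, y2) on a regular grid at several scales."""
    boxes = []
    for s in scales:
        if s > min(img_w, img_h):
            continue
        step = max(1, int(s * stride_frac))
        for y in range(0, img_h - s + 1, step):
            for x in range(0, img_w - s + 1, step):
                boxes.append((x, y, x + s - 1, y + s - 1))
    return boxes

# Example: a 500x375 image yields a few hundred square "proposals".
print(len(square_window_pyramid(500, 375)))
```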
TABLE 5
Segmentation Mean Accuracy (Percent) on VOC 2011 Validation

            O2P [59]  full R-CNN fc6  full R-CNN fc7  fg R-CNN fc6  fg R-CNN fc7  full+fg R-CNN fc6  full+fg R-CNN fc7
mean acc.   46.4      43.0            42.5            43.7          42.1          47.9               45.8

Column 1 presents O2P; columns 2-7 use our CNN pre-trained on ILSVRC 2012.

6 SEMANTIC SEGMENTATION
Region classification is a standard technique for semantic segmentation, allowing us to easily apply R-CNNs to the PASCAL VOC segmentation challenge. To facilitate a direct comparison with the current leading semantic segmentation system (called O2P for "second-order pooling") [59], we work within their open source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region, for each class, using support vector regression (SVR). The high performance of their approach is due to the quality of the CPMC regions and the powerful second-order pooling of multiple feature types (enriched variants of SIFT and LBP). We also note that Farabet et al. [67] recently demonstrated good results on several dense scene labeling datasets (not including PASCAL) using a CNN as a multi-scale per-pixel classifier.

We follow [59], [68] and extend the PASCAL segmentation training set to include the extra annotations made available by Hariharan et al. [69]. Design decisions and hyperparameters were cross-validated on the VOC 2011 validation set. Final test results were evaluated only once.

6.1 CNN Features for Segmentation
We evaluate three strategies for computing features on CPMC regions, all of which begin by warping the rectangular window around the region to 227 × 227 (we use TorontoNet for these experiments). The first strategy (full) ignores the region's shape and computes CNN features directly on the warped window, exactly as we did for detection. However, these features ignore the non-rectangular shape of the region. Two regions might have very similar bounding boxes while having very little overlap. Therefore, the second strategy (fg) computes CNN features only on a region's foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. The third strategy (full+fg) simply concatenates the full and fg features; our experiments validate their complementarity.

6.2 Results on VOC 2011
Table 5 shows a summary of our results on the VOC 2011 validation set compared with O2P. Within each feature computation strategy, layer fc6 always outperforms fc7, and the following discussion refers to the fc6 features. The fg strategy slightly outperforms full, indicating that the masked region shape provides a stronger signal, matching our intuition. However, full+fg achieves an average accuracy of 47.9 percent, our best result by a margin of 4.2 percent (also modestly outperforming O2P), indicating that the context provided by the full features is highly informative even given the fg features. Notably, training the 20 SVRs on our full+fg features takes an hour on a single core, compared to 10+ hours for training on O2P features.

Table 6 shows the per-category segmentation accuracy on VOC 2011 val for each of our six segmentation methods in addition to the O2P method [59]. These results show which methods are strongest across each of the 20 PASCAL classes, plus the background class.

In Table 7 we present results on the VOC 2011 test set, comparing our best-performing method, fc6 (full+fg), against two strong baselines. Our method achieves the highest segmentation accuracy for 11 out of 21 categories, and the highest overall segmentation accuracy of 47.9 percent, averaged across categories (but likely ties with the O2P result under any reasonable margin of error). Still better performance could likely be achieved by fine-tuning.

More recently, a number of semantic segmentation approaches based on deep CNNs have led to dramatic improvements, pushing segmentation mean accuracy over 70 percent [70], [71], [72], [73]. The highest performing of these methods combine fully convolutional networks (fine-tuned for segmentation) with efficient fully-connected Gaussian CRFs [74].
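As a rough illustration of the fg strategy described in Section 6.1 (replace the background with the mean input before warping, then concatenate full and fg features), consider the following sketch. The feature extractor cnn_features is a hypothetical placeholder standing in for a TorontoNet fc6 forward pass, and the nearest-neighbor warp is a simplification.

```python
import numpy as np

def warp(img: np.ndarray, size: int = 227) -> np.ndarray:
    """Anisotropically resize to the fixed CNN input size (nearest-neighbor for brevity)."""
    h, w = img.shape[:2]
    ys = (np.arange(size) * h // size).clip(0, h - 1)
    xs = (np.arange(size) * w // size).clip(0, w - 1)
    return img[ys][:, xs]

def cnn_features(x: np.ndarray) -> np.ndarray:
    # Placeholder for a real fc6 forward pass (4096-d in the paper); here a tiny fake feature.
    return x.reshape(-1, x.shape[-1]).mean(axis=0).repeat(3)[:8]

def region_features(window: np.ndarray, mask: np.ndarray, mean_img: np.ndarray) -> np.ndarray:
    full = cnn_features(warp(window) - warp(mean_img))        # "full": ignores region shape
    fg_window = np.where(mask[..., None], window, mean_img)   # background -> mean input
    fg = cnn_features(warp(fg_window) - warp(mean_img))       # "fg": zero outside the region
    return np.concatenate([full, fg])                          # "full+fg"

# Toy usage with random data standing in for a CPMC region crop and its mask.
win = np.random.rand(60, 80, 3)
msk = np.zeros((60, 80), bool); msk[10:50, 20:70] = True
print(region_features(win, msk, np.full_like(win, win.mean())).shape)
```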
TABLE 6
Per-Category Segmentation Accuracy (Percent) on the VOC 2011 Validation Set

VOC 2011 val        bg   aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv   mean
O2P [59]            84.0 69.0 21.7 47.7 42.2 42.4  64.7 65.8 57.4 12.9  37.4 20.5  43.7 35.7  52.7  51.0   35.8  51.0  28.4 59.8  49.7 46.4
full R-CNN fc6      81.3 56.2 23.9 42.9 40.7 38.8  59.2 56.5 53.2 11.4  34.6 16.7  48.1 37.0  51.4  46.0   31.5  44.0  24.3 53.7  51.1 43.0
full R-CNN fc7      81.0 52.8 25.1 43.8 40.5 42.7  55.4 57.7 51.3 8.7   32.5 11.5  48.1 37.0  50.5  46.4   30.2  42.1  21.2 57.7  56.0 42.5
fg R-CNN fc6        81.4 54.1 21.1 40.6 38.7 53.6  59.9 57.2 52.5 9.1   36.5 23.6  46.4 38.1  53.2  51.3   32.2  38.7  29.0 53.0  47.5 43.7
fg R-CNN fc7        80.9 50.1 20.0 40.2 34.1 40.9  59.7 59.8 52.7 7.3   32.1 14.3  48.8 42.9  54.0  48.6   28.9  42.6  24.9 52.2  48.8 42.1
full+fg R-CNN fc6   83.1 60.4 23.2 48.4 47.3 52.6  61.6 60.6 59.1 10.8  45.8 20.9  57.7 43.3  57.4  52.9   34.7  48.7  28.1 60.0  48.6 47.9
full+fg R-CNN fc7   82.3 56.7 20.6 49.9 44.2 43.6  59.3 61.3 57.8 7.7   38.4 15.1  53.4 43.7  50.8  52.0   34.1  47.8  24.7 60.1  55.2 45.7

These experiments use TorontoNet without fine-tuning.
TABLE 7
Segmentation Accuracy (Percent) on VOC 2011 Test

VOC 2011 test              bg   aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv   mean
R&P [68]                   83.4 46.8 18.9 36.6 31.2 42.7  57.3 47.4 44.1 8.1   39.4 36.1  36.3 49.5  48.3  50.7   26.3  47.2  22.1 42.0  43.2 40.8
O2P [59]                   85.4 69.7 22.3 45.2 44.4 46.9  66.7 57.8 56.2 13.5  46.1 32.3  41.2 59.1  55.3  51.0   36.2  50.4  27.8 46.9  44.6 47.6
ours (full+fg R-CNN fc6)   84.2 66.9 23.7 58.3 37.4 55.4  73.3 58.7 56.5 9.7   45.5 29.5  49.3 40.1  57.8  53.9   33.8  60.7  22.7 47.1  41.3 47.9

We compare against two strong baselines: the "Regions and Parts" (R&P) method of [68] and the second-order pooling (O2P) method of [59]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P. These experiments use TorontoNet without fine-tuning.
7 IMPLEMENTATION AND DESIGN DETAILS

7.1 Object Proposal Transformations
The convolutional networks used in this work require a fixed-size input (e.g., 227 × 227 pixels) in order to produce a fixed-size output. For detection, we consider object proposals that are arbitrary image rectangles. We evaluated two approaches for transforming object proposals into valid CNN inputs.

The first method ("tightest square with context") encloses each object proposal inside the tightest square and then scales (isotropically) the image contained in that square to the CNN input size. Fig. 7 column (B) shows this transformation. A variant on this method ("tightest square without context") excludes the image content that surrounds the original object proposal. Fig. 7 column (C) shows this transformation. The second method ("warp") anisotropically scales each object proposal to the CNN input size. Fig. 7 column (D) shows the warp transformation.

For each of these transformations, we also consider including additional image context around the original object proposal. The amount of context padding (p) is defined as a border size around the original object proposal in the transformed input coordinate frame. Fig. 7 shows p = 0 pixels in the top row of each example and p = 16 pixels in the bottom row. In all methods, if the source rectangle extends beyond the image, the missing data are replaced with the image mean (which is then subtracted before inputting the image into the CNN). A pilot set of experiments showed that warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points). Obviously more alternatives are possible, including using replication instead of mean padding. Exhaustive evaluation of these alternatives is left as future work.

Fig. 7. Different object proposal transformations. (A) the original object proposal at its actual scale relative to the transformed CNN inputs; (B) tightest square with context; (C) tightest square without context; (D) warp. Within each column and example proposal, the top row corresponds to p = 0 pixels of context padding while the bottom row has p = 16 pixels of context padding.
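The winning transformation (warp with p = 16 pixels of context padding) can be sketched as follows. The cropping and padding arithmetic here is a simplified reconstruction under stated assumptions, not the released implementation.

```python
import numpy as np

def warp_with_context(img: np.ndarray, box, out_size: int = 227, pad: int = 16) -> np.ndarray:
    """Warp a proposal (x1, y1, x2, y2) to out_size x out_size with roughly `pad` warped
    pixels of context on each side; out-of-image pixels are filled with the image mean."""
    x1, y1, x2, y2 = box
    h, w = img.shape[:2]
    # Expand the box in the source image so the context maps to ~pad pixels after warping.
    scale_x = (x2 - x1 + 1) / (out_size - 2 * pad)
    scale_y = (y2 - y1 + 1) / (out_size - 2 * pad)
    ex1, ex2 = int(round(x1 - pad * scale_x)), int(round(x2 + pad * scale_x))
    ey1, ey2 = int(round(y1 - pad * scale_y)), int(round(y2 + pad * scale_y))
    # Crop, padding missing data with the per-channel image mean.
    mean = img.reshape(-1, img.shape[-1]).mean(axis=0)
    crop = np.tile(mean, (ey2 - ey1 + 1, ex2 - ex1 + 1, 1))
    sx1, sy1 = max(ex1, 0), max(ey1, 0)
    sx2, sy2 = min(ex2, w - 1), min(ey2, h - 1)
    crop[sy1 - ey1:sy2 - ey1 + 1, sx1 - ex1:sx2 - ex1 + 1] = img[sy1:sy2 + 1, sx1:sx2 + 1]
    # Anisotropic resize to the fixed CNN input size (nearest-neighbor for brevity).
    ys = (np.arange(out_size) * crop.shape[0] // out_size).clip(0, crop.shape[0] - 1)
    xs = (np.arange(out_size) * crop.shape[1] // out_size).clip(0, crop.shape[1] - 1)
    return crop[ys][:, xs]

# Toy usage: warp a 150x100 proposal from a random 640x480 image.
img = np.random.rand(480, 640, 3)
print(warp_with_context(img, (200, 100, 349, 199)).shape)   # -> (227, 227, 3)
```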
7.2 Positive Versus Negative Examples and Softmax
Two design choices warrant further discussion. The first is: Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs? To review the definitions briefly, for fine-tuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled "background" (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.

Historically speaking, we arrived at these definitions because we started by training SVMs on features computed by the ImageNet pre-trained CNN, and so fine-tuning was not a consideration at that point in time. In that setup, we found that our particular label definition for training SVMs was optimal within the set of options we evaluated (which included the setting we now use for fine-tuning). When we started using fine-tuning, we initially used the same positive and negative example definition as we were using for SVM training. However, we found that results were much worse than those obtained using our current definition of positives and negatives.

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data are limited. Our current scheme introduces many "jittered" examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2 to 50.9 percent mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of "hard negatives" used for SVM training.

This result shows that it's possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.
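The two labeling schemes reviewed in this section reduce to simple IoU thresholding. A minimal sketch (with a hand-written IoU helper, not the released code) is:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / float(area(a) + area(b) - inter)

def finetune_label(proposal, gt_boxes, gt_classes):
    """Fine-tuning: positive for the max-IoU ground truth if IoU >= 0.5, else background."""
    if not gt_boxes:
        return "background"
    overlaps = [iou(proposal, g) for g in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
    return gt_classes[best] if overlaps[best] >= 0.5 else "background"

def svm_label(proposal, gt_boxes_of_class):
    """SVM training: only ground truth is positive; IoU < 0.3 with all instances is
    negative; the grey zone in between is ignored."""
    overlaps = [iou(proposal, g) for g in gt_boxes_of_class] or [0.0]
    if max(overlaps) < 0.3:
        return "negative"
    return "positive" if proposal in gt_boxes_of_class else "ignore"

# Example: a proposal overlapping a ground-truth 'dog' box by roughly 0.68 IoU.
gt = [(10, 10, 109, 109)]
print(finetune_label((20, 20, 119, 119), gt, ["dog"]))   # -> dog
print(svm_label((20, 20, 119, 119), gt))                 # -> ignore
```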
Fig. 8. Example detections on the val2 set from the configuration that achieved 31.0 percent mAP on val2. Each image was sampled randomly (these are not curated). All detections at precision greater than 0.5 are shown. Each detection is labeled with the predicted class and the precision value of that detection from the detector's precision-recall curve. Viewing digitally with zoom is recommended.
7.3 Bounding-Box Regression
We use a simple bounding-box regression stage to improve localization performance. After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor. This is similar in spirit to the bounding-box regression used in deformable part models [18]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.

The input to our training algorithm is a set of $N$ training pairs $\{(P^i, G^i)\}_{i=1,\ldots,N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ specifies the pixel coordinates of the center of proposal $P^i$'s bounding box together with $P^i$'s width and height in pixels. Henceforth, we drop the superscript $i$ unless it is needed. Each ground-truth bounding box $G$ is specified in the same way: $G = (G_x, G_y, G_w, G_h)$. Our goal is to learn a transformation that maps a proposed box $P$ to a ground-truth box $G$.

We parameterize the transformation in terms of four functions $d_x(P)$, $d_y(P)$, $d_w(P)$, and $d_h(P)$. The first two specify a scale-invariant translation of the center of $P$'s bounding box, while the second two specify log-space translations of the width and height of $P$'s bounding box. After learning these functions, we can transform an input proposal $P$ into a predicted ground-truth box $\hat{G}$ by applying the transformation

$\hat{G}_x = P_w d_x(P) + P_x$   (1)
$\hat{G}_y = P_h d_y(P) + P_y$   (2)
$\hat{G}_w = P_w \exp(d_w(P))$   (3)
$\hat{G}_h = P_h \exp(d_h(P)).$   (4)

Each function $d_\star(P)$ (where $\star$ is one of $x, y, h, w$) is modeled as a linear function of the pool$_5$ features of proposal $P$, denoted by $\phi_5(P)$. (The dependence of $\phi_5(P)$ on the image data is implicitly assumed.) Thus we have $d_\star(P) = \mathbf{w}_\star^{\mathsf{T}} \phi_5(P)$, where $\mathbf{w}_\star$ is a vector of learnable model parameters. We learn $\mathbf{w}_\star$ by optimizing the regularized least squares objective (ridge regression):

$\mathbf{w}_\star = \arg\min_{\hat{\mathbf{w}}_\star} \sum_{i}^{N} \left( t^i_\star - \hat{\mathbf{w}}_\star^{\mathsf{T}} \phi_5(P^i) \right)^2 + \lambda \lVert \hat{\mathbf{w}}_\star \rVert^2.$   (5)

The regression targets $t_\star$ for the training pair $(P, G)$ are defined as

$t_x = (G_x - P_x)/P_w$   (6)
$t_y = (G_y - P_y)/P_h$   (7)
$t_w = \log(G_w/P_w)$   (8)
$t_h = \log(G_h/P_h).$   (9)

As a standard regularized least squares problem, this can be solved efficiently in closed form.

We found two subtle issues while implementing bounding-box regression. The first is that regularization is important: we set $\lambda = 1000$ based on a validation set. The second issue is that care must be taken when selecting which training pairs $(P, G)$ to use. Intuitively, if $P$ is far from all ground-truth boxes, then the task of transforming $P$ to a ground-truth box $G$ does not make sense. Using examples like $P$ would lead to a hopeless learning problem. Therefore, we only learn from a proposal $P$ if it is nearby at least one ground-truth box. We implement "nearness" by assigning $P$ to the ground-truth box $G$ with which it has maximum IoU overlap (in case it overlaps more than one) if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set). All unassigned proposals are discarded. We do this once for each object class in order to learn a set of class-specific bounding-box regressors.

At test time, we score each proposal and predict its new detection window only once. In principle, we could iterate this procedure (i.e., re-score the newly predicted bounding box, and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results.
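A compact way to see Eqs. (1)-(9) in action is the following sketch, which computes the regression targets, fits each $\mathbf{w}_\star$ by closed-form ridge regression, and applies the learned transformation. The synthetic features stand in for pool$_5$ activations; this is an illustration of the formulation, not the released training code.

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets (t_x, t_y, t_w, t_h) of Eqs. (6)-(9); boxes are (cx, cy, w, h)."""
    return np.array([(G[0] - P[0]) / P[2], (G[1] - P[1]) / P[3],
                     np.log(G[2] / P[2]), np.log(G[3] / P[3])])

def apply_transform(P, d):
    """Predicted box of Eqs. (1)-(4) from a proposal P and regressed offsets d."""
    return np.array([P[2] * d[0] + P[0], P[3] * d[1] + P[1],
                     P[2] * np.exp(d[2]), P[3] * np.exp(d[3])])

def ridge_fit(Phi, t, lam=1000.0):
    """Closed-form ridge regression for one target dimension (Eq. (5))."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# Synthetic data: Phi stands in for pool5 features of proposals assigned to ground truth.
rng = np.random.default_rng(0)
N, D = 500, 64
Phi = rng.normal(size=(N, D))
proposals = np.abs(rng.normal(100, 20, size=(N, 4))) + 10
gts = proposals * np.array([1.0, 1.0, 1.1, 1.1]) + np.array([3.0, -2.0, 0.0, 0.0])
T = np.stack([bbox_targets(P, G) for P, G in zip(proposals, gts)])

W = np.stack([ridge_fit(Phi, T[:, k]) for k in range(4)], axis=1)  # one w_star per target
pred = apply_transform(proposals[0], Phi[0] @ W)
# With random features this only demonstrates the mechanics, not a meaningful fit.
print(np.round(pred, 1))
```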
7.4 Analysis of Cross-Dataset Redundancy
One concern when training on an auxiliary dataset is that there might be redundancy between it and the test set. Even though the tasks of object detection and whole-image classification are substantially different, making such cross-set redundancy much less worrisome, we still conducted a thorough investigation that quantifies the extent to which PASCAL test images are contained within the ILSVRC 2012 training and validation sets. Our findings may be useful to researchers who are interested in using ILSVRC 2012 as training data for the PASCAL image classification task.

We performed two checks for duplicate (and near-duplicate) images. The first test is based on exact matches of flickr image IDs, which are included in the VOC 2007 test annotations (these IDs are intentionally kept secret for subsequent PASCAL test sets). All PASCAL images, and about half of ILSVRC, were collected from flickr.com. This check turned up 31 matches out of 4,952 (0.63 percent).

The second check uses GIST [75] descriptor matching, which was shown in [76] to have excellent performance at near-duplicate image detection in large (> 1 million) image collections. Following [76], we computed GIST descriptors on warped 32 × 32 pixel versions of all ILSVRC 2012 trainval and PASCAL 2007 test images.

Euclidean distance nearest-neighbor matching of GIST descriptors revealed 38 near-duplicate images (including all 31 found by flickr ID matching). The matches tend to vary slightly in JPEG compression level and resolution, and to a lesser extent cropping. These findings show that the overlap is small, less than 1 percent.
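The near-duplicate check can be sketched as a brute-force nearest-neighbor search over descriptors. The gist_descriptor function below is a hypothetical placeholder (it simply flattens the thumbnail so the sketch stays self-contained); the original experiments used real GIST descriptors as in [75], [76].

```python
import numpy as np

def gist_descriptor(thumb: np.ndarray) -> np.ndarray:
    # Placeholder: a real GIST descriptor summarizes oriented-filter energy over a grid.
    return thumb.astype(np.float32).ravel()

def near_duplicates(test_thumbs, train_thumbs, thresh=1.0):
    """Return (test_idx, train_idx) pairs whose descriptors are within `thresh` (L2)."""
    A = np.stack([gist_descriptor(t) for t in test_thumbs])
    B = np.stack([gist_descriptor(t) for t in train_thumbs])
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
    i, j = np.where(d2 <= thresh ** 2)
    return list(zip(i.tolist(), j.tolist()))

# Toy usage: one planted near-duplicate among random 32x32 thumbnails.
rng = np.random.default_rng(0)
test = [rng.random((32, 32)) for _ in range(5)]
train = [rng.random((32, 32)) for _ in range(8)] + [test[2] + 0.001]
print(near_duplicates(test, train))   # -> [(2, 8)]
```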
For VOC 2012, because flickr IDs are not available, we used the GIST matching method only. Based on GIST matches, 1.5 percent of VOC 2012 test images are in ILSVRC 2012 trainval. The slightly higher rate for VOC 2012 is likely due to the fact that the two datasets were collected closer together in time than VOC 2007 and ILSVRC 2012 were.

8 CONCLUSION
In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives more than a 50 percent relative improvement over the best previous results on PASCAL VOC 2012.

We achieved this performance through two insights. The first is to apply high-capacity convolutional networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data are scarce. We show that it is highly effective to pre-train the network—with supervision—for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the "supervised pre-training/domain-specific fine-tuning" paradigm will be highly effective for a variety of data-scarce vision problems.

We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.

ACKNOWLEDGMENTS
This research was supported in part by DARPA Mind's Eye and MSEE programs, by US National Science Foundation Awards IIS-0905647, IIS-1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota. The GPUs used in this research were generously donated by the NVIDIA Corporation. R. Girshick is with Microsoft Research and was with the Department of Electrical Engineering and Computer Science, UC Berkeley during the majority of this work.

REFERENCES
[1] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886–893.
[3] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 80, no. 2, pp. 303–338, 2010.
[4] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biol. Cybern., vol. 36, no. 4, pp. 193–202, 1980.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," Parallel Distrib. Process., vol. 1, pp. 318–362, 1986.
[6] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106–1114.
[9] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ImageNet large scale visual recognition competition 2012 (ILSVRC2012) [Online]. Available: http://www.image-net.org/challenges/LSVRC/2012/, 2012.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
[12] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2553–2561.
[13] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2155–2162.
[14] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 23–28, Jan. 1998.
[15] R. Vaillant, C. Monrocq, and Y. LeCun, "Original approach for the localisation of objects in images," IEE Proc. Vis., Image, Signal Process., vol. 141, no. 4, pp. 245–250, Aug. 1994.
[16] J. Platt and S. Nowlan, "A convolutional neural network hand tracker," in Proc. Adv. Neural Inf. Process. Syst., 1995, pp. 901–908.
[17] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3626–3633.
[18] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," in Proc. Int. Conf. Learn. Representations, 2014, p. 16.
[20] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik, "Recognition using regions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 1030–1037.
[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 3, pp. 154–171, 2013.
[22] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1312–1328, Jul. 2012.
[23] R. Girshick, P. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5 [Online]. Available: http://www.cs.berkeley.edu/~rbg/latent-v5/, 2012.
[24] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015.
[25] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 329–344.
[26] T. Dean, J. Yagnik, M. Ruzon, M. Segal, J. Shlens, and S. Vijayanarasimhan, "Fast, accurate detection of 100,000 object classes on a single machine," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1814–1821.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 346–361.
[28] R. Girshick, "Fast R-CNN," arXiv e-prints, vol. arXiv:1504.08083v1 [cs.CV], 2015.
[29] D. Hoiem, A. Efros, and M. Hebert, "Geometric context from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2005, pp. 654–661.
[30] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman, "Using multiple segmentations to discover objects and their extent in image collections," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 1605–1614.
[31] C. L. Zitnick and P. Dollar, "Edge boxes: Locating object proposals from edges," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 391–405.
[32] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 3286–3293.
[33] J. Hosang, R. Benenson, P. Dollar, and B. Schiele, "What makes for effective detection proposals?" arXiv e-prints, vol. arXiv:1502.05082v1 [cs.CV], 2015.
[34] A. Humayun, F. Li, and J. M. Rehg, "RIGOR: Reusing inference in graph cuts for generating object regions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 336–343.
[35] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 328–335.
[36] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 725–739.
[37] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[38] R. Caruana, "Multitask learning: A knowledge-based source of inductive bias," in Proc. 10th Int. Conf. Mach. Learn., 1993, pp. 41–48.
[39] S. Thrun, "Is learning the n-th thing any easier than learning the first?" in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 640–646.
[40] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[41] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. Int. Conf. Mach. Learn., 2014, pp. 647–655.
[42] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, "From large-scale object classifiers to large-scale object detectors: An adaptation approach," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3536–3544.
[43] A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1889–1897.
[44] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, "R-CNNs for pose estimation and action detection," arXiv e-prints, vol. arXiv:1406.5212v1 [cs.CV], 2014.
[45] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 297–312.
[46] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 345–360.
[47] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, "On learning to localize objects with minimal supervision," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1611–1619.
[48] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," arXiv e-prints, vol. arXiv:1409.0575v1 [cs.CV], 2014.
[49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv e-prints, vol. arXiv:1409.4842v1 [cs.CV], 2014.
[50] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, "Scalable, high-quality object detection," arXiv e-prints, vol. arXiv:1412.1441v2 [cs.CV], 2015.
[51] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189–2202, Nov. 2012.
[52] I. Endres and D. Hoiem, "Category independent object proposals," in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 575–588.
[53] D. Cireşan, A. Giusti, L. Gambardella, and J. Schmidhuber, "Mitosis detection in breast cancer histology images with deep neural networks," in Proc. Medical Image Comput. Comput.-Assisted Intervention, 2013, pp. 411–418.
[54] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 17–24.
[55] Y. Jia. (2013). Caffe: An open source convolutional architecture for fast feature embedding [Online]. Available: http://caffe.berkeleyvision.org/
[56] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, "Fast, accurate detection of 100,000 object classes on a single machine," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1814–1821.
[57] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun, "Bottom-up segmentation for top-down detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3294–3301.
[58] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," Artif. Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. A.I. Memo No. 1521, 1994.
[59] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 430–443.
[60] K. E. van de Sande, C. G. Snoek, and A. W. Smeulders, "Fisher and vlad with flair," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2377–2384.
[61] J. J. Lim, C. L. Zitnick, and P. Dollar, "Sketch tokens: A learned mid-level representation for contour and object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3158–3165.
[62] X. Ren and D. Ramanan, "Histograms of sparse codes for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3246–3253.
[63] M. Zeiler, G. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2011, pp. 2018–2025.
[64] D. Hoiem, Y. Chodpathumwan, and Q. Dai, "Diagnosing error in object detectors," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 340–353.
[65] J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. C. Berg, and L. Fei-Fei, "Scalable multi-label annotation," in Proc. SIGCHI Conf. Human Factors Comput. Syst., 2014, pp. 3099–3102.
[66] H. Su, J. Deng, and L. Fei-Fei, "Crowdsourcing annotations for visual object detection," in Proc. AAAI 4th Human Comput. Workshop, 2012.
[67] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Aug. 2013.
[68] P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik, "Semantic segmentation using regions and parts," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 3378–3385.
[69] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 991–998.
[70] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 447–456.
[71] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in Proc. Int. Conf. Learn. Representations, 2015.
[72] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," arXiv e-prints, vol. arXiv:1502.03240v2 [cs.CV], 2015.
[73] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3431–3440.
[74] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 109–117.
[75] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, pp. 145–175, 2001.
[76] M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, and C. Schmid, "Evaluation of gist descriptors for web-scale image search," in Proc. ACM Int. Conf. Image Video Retrieval, 2009, p. 19.

Ross Girshick received the MS and PhD degrees in computer science, both from the University of Chicago, where he studied under the supervision of Pedro Felzenszwalb. He is a researcher at Microsoft Research, Redmond. Prior to joining Microsoft Research, he was a postdoctoral fellow at the University of California, Berkeley, where he collaborated with Jitendra Malik and Trevor Darrell. During the course of the PASCAL VOC object detection challenge, he participated in multiple winning object detection entries and was awarded a "lifetime achievement" prize for his work on the widely used Deformable Part Models.
Jeff Donahue received the BS degree in computer science from UT Austin in 2011, completing an honors thesis with Kristen Grauman. He is a fourth year PhD student at UC Berkeley, supervised by Trevor Darrell. His research interests are in visual recognition and machine learning, most recently focusing on the application of deep learning to visual localization and sequencing problems. He is a student member of the IEEE.

Trevor Darrell received the BSE degree from the University of Pennsylvania in 1988, having started his career in computer vision as an undergraduate researcher in Ruzena Bajcsy's GRASP lab. He received the SM and PhD degrees from MIT in 1992 and 1996, respectively. His group is co-located at the University of California, Berkeley, and the UCB-affiliated International Computer Science Institute (ICSI), also located in Berkeley, CA. He is on the faculty of the CS Division of the EECS Department at UCB and is the vision group lead at ICSI. His group develops algorithms for large-scale perceptual learning, including object and activity recognition and detection, for a variety of applications including multimodal interaction with robots and mobile devices. His interests include computer vision, machine learning, computer graphics, and perception-based human computer interfaces. He was previously on the faculty of the MIT EECS department from 1999-2008, where he directed the Vision Interface Group. He was a member of the research staff at Interval Research Corporation from 1996-1999. He is a member of the IEEE.

Jitendra Malik received the BTech degree in EE from the Indian Institute of Technology, Kanpur, in 1980 and the PhD degree in CS from Stanford University in 1985. In January 1986, he joined UC Berkeley, where he is currently the Arthur J. Chick professor in the Department of EECS. He is also on the faculty of the Department of Bioengineering, and the Cognitive Science and Vision Science groups. During 2002-2004, he served as the chair of the Computer Science Division and during 2004-2006 as the Department Chair of EECS. His research group has worked on computer vision, computational modeling of human vision, computer graphics and the analysis of biological images. Several well-known concepts and algorithms arose in this research, such as anisotropic diffusion, normalized cuts, high dynamic range imaging, and shape contexts. He is one of the most highly cited researchers in computer vision, with 10 of his papers having received more than a thousand citations each. He has graduated 33 PhD students, many of whom are now prominent researchers in academia and industry. He received the Gold Medal for the best graduating student in electrical engineering from IIT Kanpur in 1980 and a Presidential Young Investigator Award in 1989. He was awarded the Longuet-Higgins Prize for a contribution that has stood the test of time twice, in 2007 and in 2008. He is a member of the National Academy of Engineering and a fellow of the American Academy of Arts and Sciences. In 2013, he received the IEEE PAMI-TC Distinguished Researcher in Computer Vision Award, and in 2014 the K. S. Fu Prize from the International Association of Pattern Recognition. He is a fellow of both the ACM and IEEE.