Region-Based Convolutional Networks for Accurate Object Detection and Segmentation
Abstract—Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final
years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level
image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean
average precision (mAP) by more than 50 percent relative to the previous best result on VOC 2012—achieving a mAP of 62.4 percent.
Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in
order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task,
followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call
the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available
at http://www.cs.berkeley.edu/~rbg/rcnn.
Index Terms—Object recognition, detection, semantic segmentation, convolutional networks, deep learning, transfer learning
1 INTRODUCTION
exact filter convolutions in DPM with hashtable lookups. They show that with this technique it's possible to run 10k DPM detectors in about 5 minutes per image on a desktop workstation. However, there is an unfortunate tradeoff. When a large number of DPM detectors compete, the approximate hashing approach causes a substantial loss in detection accuracy. R-CNNs, in contrast, scale very well with the number of object classes to detect because nearly all computation is shared between all object categories. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. Although these computations scale linearly with the number of categories, the scale factor is small. Measured empirically, it takes only 30 ms longer to detect 200 classes than 20 classes on a CPU, without any approximations. This makes it feasible to rapidly detect tens of thousands of object categories without any modifications to the core algorithm.

Despite this graceful scaling behavior, an R-CNN can take 10 to 45 seconds per image on a GPU, depending on the network used, since each region is passed through the network independently. Recent work from He et al. [27] ("SPPnet") improves R-CNN efficiency by sharing computation through a feature pyramid, allowing for detection at a few frames per second. Building on SPPnet, Girshick [28] shows that it's possible to further reduce training and testing times, while improving detection accuracy and simplifying the training process, using an approach called "Fast R-CNN." Fast R-CNN reduces detection times (excluding region proposal computation) to 50 to 300 ms per image, depending on network architecture.

Localization methods. The dominant approach to object detection has been based on sliding-window detectors. This approach goes back (at least) to early face detectors [15], and continued with HOG-based pedestrian detection [2], and part-based generic object detection [18]. An alternative is to first compute a pool of (likely overlapping) image regions, each one serving as a candidate object, and then to filter these candidates in a way that aims to retain only the true objects. Multiple segmentation hypotheses were used by Hoiem et al. [29] to estimate the rough geometric scene structure and by Russell et al. [30] to automatically discover object classes in a set of images. The "selective search" algorithm of van de Sande et al. [21] popularized the multiple segmentation approach for object detection by showing strong results on PASCAL object detection. Our approach was inspired by the success of selective search.

Object proposal generation is now an active research area. EdgeBoxes [31] outputs high-quality rectangular (box) proposals quickly (0.3 s per image). BING [32] generates box proposals at 3 ms per image; however, it has subsequently been shown that the proposal quality is too poor to be useful in R-CNNs [33]. Other methods focus on pixel-wise segmentation, producing regions instead of boxes. These approaches include RIGOR [34] and MCG [35], which take 10 to 30 s per image, and GOP [36], a faster method that takes 1 s per image. For a more in-depth survey of proposal algorithms, Hosang et al. [33] provide an insightful meta-evaluation of recent methods.

Transfer learning. R-CNN training is based on inductive transfer learning, using the taxonomy of Pan and Yang [37]. To train an R-CNN, we typically start with ImageNet classification as a source task and dataset, train a network using supervision, and then transfer that network to the target task and dataset using supervised fine-tuning. This method is related to traditional multi-task learning [38], [39], except that we train for the tasks sequentially and are ultimately only interested in performing well on the target task.

This strategy is different from the dominant paradigm in recent neural network literature of unsupervised transfer learning (see [40] for a survey covering unsupervised pre-training and representation learning more generally). Supervised transfer learning using CNNs, but without fine-tuning, was also investigated in concurrent work by Donahue et al. [41]. They show that Krizhevsky et al.'s CNN, once trained on ImageNet, can be used as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation. Hoffman et al. [42] show how transfer learning can be used to train R-CNNs for classes that have image-level labels, but no bounding-box training data. Their approach is based on modeling the task shift from image classification to object detection and then transferring that knowledge to classes that have no detection training data.

R-CNN extensions. Since their introduction, R-CNNs have been extended to a variety of new tasks and datasets. Karpathy et al. [43] learn a model for bi-directional image and sentence retrieval. Their image representation is derived from an R-CNN trained to detect 200 classes on the ILSVRC2013 detection dataset. Gkioxari et al. [44] use multi-task learning to train R-CNNs for person detection, 2D pose estimation, and action recognition. Hariharan et al. [45] propose a unification of the object detection and semantic segmentation tasks, termed "simultaneous detection and segmentation" (SDS), and train a two-column R-CNN for this task. They show that a single region proposal algorithm (MCG [35]) can be used effectively for traditional bounding-box detection as well as semantic segmentation. Their PASCAL segmentation results improve significantly on the ones reported in this paper. Gupta et al. [46] extend R-CNNs to object detection in depth images. They show that a well-designed input signal, where the depth map is augmented with height above ground and local surface orientation with respect to gravity, allows training an R-CNN that outperforms existing RGB-D object detection baselines. Song et al. [47] train an R-CNN using weak, image-level supervision by mining for positive training examples using a submodular cover algorithm and then training a latent SVM.

Many systems based on, or implementing, R-CNNs were used in the recent ILSVRC2014 object detection challenge [48], resulting in substantial improvements in detection accuracy. In particular, the winning method, GoogLeNet [49], [50], uses an innovative network design in an R-CNN. With a single network (and a slightly simpler pipeline that excludes SVM training and bounding-box regression), they improve R-CNN performance to 38.0 percent mAP from a baseline of 34.5 percent. They also show that an ensemble of six networks improves their result to 43.9 percent mAP.
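As noted above, the only class-specific computations at test time are a small matrix-vector product and greedy non-maximum suppression. For readers unfamiliar with the latter, a minimal sketch follows; it illustrates standard greedy NMS and is not the released R-CNN code, and the 0.3 IoU threshold is an illustrative choice.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) detection scores.
    Returns indices of the boxes that are kept, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the retained box with the remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes that overlap the retained box too much.
        order = order[1:][iou < iou_thresh]
    return keep
```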
TABLE 1
Detection Average Precision (Percent) on VOC 2010 Test
VOC 2010 test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
DPM v5 [23] 49.2 53.8 13.1 15.3 35.5 53.4 49.7 27.0 17.2 28.8 14.7 17.8 46.4 51.2 47.7 10.8 34.2 20.7 43.8 38.3 33.4
UVA [21] 56.2 42.4 15.3 12.6 21.8 49.3 36.8 46.1 12.9 32.1 30.0 36.5 43.5 52.9 32.9 15.3 41.1 31.8 47.0 44.8 35.1
Regionlets [54] 65.0 48.9 25.9 24.6 24.5 56.1 54.5 51.2 17.0 28.9 30.2 35.8 40.2 55.7 43.5 14.3 43.9 32.6 54.0 45.9 39.7
SegDPM [57] 61.4 53.4 25.6 25.2 35.5 51.7 50.6 50.8 19.3 33.8 26.8 40.4 48.3 54.4 47.1 14.8 38.7 35.0 52.8 43.1 40.4
R-CNN T-Net 67.1 64.1 46.7 32.0 30.5 56.4 57.2 65.9 27.0 47.3 40.9 66.6 57.8 65.9 53.6 26.7 56.5 38.1 52.8 50.2 50.2
R-CNN T-Net BB 71.8 65.8 53.0 36.8 35.9 59.7 60.0 69.9 27.9 50.6 41.4 70.0 62.0 69.0 58.1 29.5 59.4 39.3 61.2 52.4 53.7
R-CNN O-Net 76.5 70.4 58.0 40.2 39.6 61.8 63.7 81.0 36.2 64.5 45.7 80.5 71.9 74.3 60.6 31.5 64.7 52.5 64.6 57.2 59.8
R-CNN O-Net BB 79.3 72.4 63.1 44.0 44.4 64.6 66.3 84.9 38.8 67.3 48.4 82.3 75.0 76.7 65.7 35.8 66.2 54.8 69.1 58.8 62.9
T-Net stands for TorontoNet and O-Net for OxfordNet (Section 3.1.2). R-CNNs are most directly comparable to UVA and Regionlets since all methods use selec-
tive search region proposals. Bounding-box regression is described in Section 7.3. At publication time, SegDPM was the top-performer on the PASCAL VOC
leaderboard. DPM and SegDPM use context rescoring not used by the other methods. SegDPM and all R-CNNs use additional training data.
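Each entry in Table 1 is a per-class average precision (AP), and mAP is their mean over the 20 classes. As a reminder of how such numbers are produced, here is a minimal sketch of the area-under-the-precision-recall-curve computation; the matching of detections to ground truth and the details of the official PASCAL evaluation script are omitted, so treat it as an illustration only.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one class.

    scores: detection confidences; is_true_positive: 1/0 flags from matching
    detections to ground truth (matching logic omitted here); num_gt: number
    of ground-truth boxes for the class.
    """
    order = np.argsort(-scores)
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(1 - is_true_positive[order])
    recall = tp / float(num_gt)
    precision = tp / np.maximum(tp + fp, 1)
    # Make precision monotonically non-increasing, then integrate.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# mAP is then simply the mean of the per-class APs.
```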
Fig. 3. (Left) Mean average precision on the ILSVRC2013 detection test set. Methods preceded by * use outside training data (images and labels
from the ILSVRC classification dataset in all cases). (Right) Box plots for the 200 average precision values per method. A box plot for the post-com-
petition OverFeat result is not shown because per-class APs are not yet available. The red line marks the median AP, the box bottom and top are the
25th and 75th percentiles. The whiskers extend to the min and max AP of each method. Each AP is plotted as a green dot over the whiskers (best
viewed digitally with zoom).
3.5 Results on ILSVRC2013 Detection
We ran an R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.

Fig. 3 compares our R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result [19]. Using TorontoNet, our R-CNN achieves a mAP of 31.4 percent, which is significantly ahead of the second-best result of 24.3 percent from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented. Most of the competing submissions (OverFeat, NEC-MU, Toronto A, and UIUC-IFP) used convolutional networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes. Notably, UvA-Euvision's entry did not use CNNs and was based on a fast VLAD encoding [60].

In Section 5, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when training R-CNNs on it.

4 ANALYSIS
4.1 Visualizing Learned Features
First-layer filters can be visualized directly and are easy to understand [8]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [63]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit's activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit "speak for itself" by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

We visualize units from layer pool5 of a TorontoNet, which is the max-pooled output of the network's fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9,216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195 × 195 pixels in the original 227 × 227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.

Each row in Fig. 4 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized. These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features. Agrawal et al. [25] provide a more in-depth analysis of the learned features.

4.2 Ablation Studies
4.2.1 Performance Layer-by-Layer, without Fine-Tuning
To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the TorontoNet's last three layers. Layer pool5 was briefly described in Section 4.1. The final two layers are summarized below.
Fig. 4. Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people
(row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).
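The ranking procedure behind Fig. 4 (described in Section 4.1) can be paraphrased in a few lines. The sketch below assumes that pool5 features have already been computed for a large set of held-out proposals; the array names and shapes are illustrative, and the non-maximum suppression step is only indicated in a comment.

```python
import numpy as np

def top_activations_for_unit(pool5_features, boxes, unit_index, k=16):
    """Rank held-out region proposals by one pool5 unit's activation.

    pool5_features: (N, 9216) array, one row per proposal (assumed layout).
    boxes: (N, 4) proposal boxes in their source images.
    Returns the k highest-activation proposals for the chosen unit.
    """
    activations = pool5_features[:, unit_index]
    order = np.argsort(-activations)
    # The paper additionally applies non-maximum suppression over the ranked
    # proposals so near-duplicate regions do not dominate the display; that
    # step is omitted here for brevity.
    top = order[:k]
    return boxes[top], activations[top]
```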
Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4,096 × 9,216 weight matrix by the pool5 feature map (reshaped as a 9,216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4,096 × 4,096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

We start by looking at results from the CNN without fine-tuning on PASCAL, i.e., all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29 percent, or about 16.8 million, of the CNN's parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6 percent of the CNN's parameters. Much of the CNN's representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.

4.2.2 Performance Layer-by-Layer, with Fine-Tuning
We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2 percent. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

4.2.3 Comparison to Recent Feature Learning Methods
Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [23].
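As a concrete reading of the fc6 and fc7 descriptions in Section 4.2.1, the forward computation is two affine maps, each followed by half-wave rectification. The sketch below uses randomly initialized arrays as stand-ins for the learned weights; it is an illustration, not the network implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the learned parameters (shapes follow the text above).
W6, b6 = rng.standard_normal((4096, 9216)) * 0.01, np.zeros(4096)   # fc6
W7, b7 = rng.standard_normal((4096, 4096)) * 0.01, np.zeros(4096)   # fc7

def relu(x):
    # Component-wise half-wave rectification: x <- max(0, x).
    return np.maximum(0.0, x)

def fc_features(pool5_map):
    """pool5_map: (6, 6, 256) max-pooled conv5 output for one warped region."""
    x = pool5_map.reshape(-1)            # 9,216-dimensional vector
    fc6 = relu(W6 @ x + b6)              # 4,096 x 9,216 weight matrix, plus biases
    fc7 = relu(W7 @ fc6 + b7)            # 4,096 x 4,096 weight matrix, plus biases
    return fc6, fc7

fc6, fc7 = fc_features(rng.standard_normal((6, 6, 256)))
```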
TABLE 2
Detection Average Precision (Percent) on VOC 2007 Test
VOC 2007 test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
R-CNN pool5 51.8 60.2 36.4 27.8 23.2 52.8 60.6 49.2 18.3 47.8 44.3 40.8 56.6 58.7 42.4 23.4 46.1 36.7 51.3 55.7 44.2
R-CNN fc6 59.3 61.8 43.1 34.0 25.1 53.1 60.6 52.8 21.7 47.8 42.7 47.8 52.5 58.5 44.6 25.6 48.3 34.0 53.1 58.0 46.2
R-CNN fc7 57.6 57.9 38.5 31.8 23.7 51.2 58.9 51.4 20.0 50.5 40.9 46.0 51.6 55.9 43.3 23.3 48.1 35.3 51.0 57.4 44.7
R-CNN FT pool5 58.2 63.3 37.9 27.6 26.1 54.1 66.9 51.4 26.7 55.5 43.4 43.1 57.7 59.0 45.8 28.1 50.8 40.6 53.1 56.4 47.3
R-CNN FT fc6 63.5 66.0 47.9 37.7 29.9 62.5 70.2 60.2 32.0 57.9 47.0 53.5 60.1 64.2 52.2 31.3 55.0 50.0 57.7 63.0 53.1
R-CNN FT fc7 64.2 69.7 50.0 41.9 32.0 62.6 71.0 60.7 32.7 58.5 46.5 56.1 60.6 66.8 54.2 31.5 52.8 48.9 57.9 64.7 54.2
R-CNN FT fc7 BB 68.1 72.8 56.8 43.0 36.8 66.3 74.2 67.6 34.4 63.5 54.5 61.2 69.1 68.6 58.7 33.4 62.9 51.1 62.5 64.8 58.5
DPM v5 [23] 33.2 60.3 10.2 16.1 27.3 54.3 58.2 23.0 20.0 24.1 26.7 12.7 58.1 48.2 43.2 12.0 21.1 36.1 46.0 43.5 33.7
DPM ST [61] 23.8 58.2 10.5 8.5 27.1 50.4 52.0 7.3 19.2 22.8 18.1 8.0 55.9 44.8 32.4 13.3 15.9 22.8 46.2 44.9 29.1
DPM HSC [62] 32.2 58.3 11.5 16.3 30.6 49.9 54.8 23.5 21.5 27.7 34.0 13.7 58.1 51.6 39.9 12.4 23.5 34.4 47.4 45.2 34.3
Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC
2007 trainval. Row 7 includes a simple bounding-box regression stage that reduces localization errors (Section 7.3). Rows 8-10 present DPM methods as a strong
baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG. All R-CNN results use TorontoNet.
TABLE 3
Detection Average Precision (Percent) on VOC 2007 Test for Two Different CNN Architectures
VOC 2007 test aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
R-CNN T-Net 64.2 69.7 50.0 41.9 32.0 62.6 71.0 60.7 32.7 58.5 46.5 56.1 60.6 66.8 54.2 31.5 52.8 48.9 57.9 64.7 54.2
R-CNN T-Net BB 68.1 72.8 56.8 43.0 36.8 66.3 74.2 67.6 34.4 63.5 54.5 61.2 69.1 68.6 58.7 33.4 62.9 51.1 62.5 64.8 58.5
R-CNN O-Net 71.6 73.5 58.1 42.2 39.4 70.7 76.0 74.5 38.7 71.0 56.9 74.5 67.9 69.6 59.3 35.7 62.1 64.0 66.5 71.2 62.2
R-CNN O-Net BB 73.4 77.0 63.4 45.4 44.6 75.1 78.1 79.8 40.5 73.7 62.2 79.4 78.1 73.1 64.2 35.6 66.8 67.2 70.4 71.1 66.0
The first two rows are results from Table 2 using Krizhevsky et al.’s TorontoNet architecture (T-Net). Rows three and four use the recently proposed 16-layer
OxfordNet architecture (O-Net) from Simonyan and Zisserman [24].
The first DPM feature learning method, DPM ST [61], augments HOG features with histograms of "sketch token" probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35 × 35 pixel patches into one of 150 sketch tokens or background.

The second method, DPM HSC [62], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2 percent vs. 33.7 percent—a 61 percent relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by four mAP points (when compared internally to their private DPM baselines—both use non-public implementations of DPM that underperform the open source version [23]). These methods achieve mAPs of 29.1 percent and 34.3 percent, respectively.

From a transfer learning point of view, it is very encouraging that large improvements in image classification translate directly into large improvements in object detection.

4.4 Detection Error Analysis
We applied the excellent detection analysis tool from Hoiem et al. [64] in order to reveal our method's error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [64] to understand some finer details (such as "normalized AP"). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figs. 5 and 6.
Fig. 6. Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [64]) for the highest and lowest performing
subsets within six different object characteristics (occlusion, truncation, bounding-box area, aspect ratio, viewpoint, part visibility). For example,
bounding-box area comprises the subsets extra-small, small, ..., extra-large. We show plots for our method (R-CNN) with and without fine-tuning
and bounding-box regression as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and
min), but does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does
more than simply improve the lowest performing subsets for aspect ratio and bounding-box area, as one might conjecture based on how we warp net-
work inputs. Instead, fine-tuning improves robustness for all characteristics including occlusion, truncation, viewpoint, and part visibility.
4.5 Bounding-Box Regression
Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [18], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in Section 7.3. Results in Tables 1, 2, and Fig. 5 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by three to four points.

4.6 Qualitative Results
Qualitative detection results on ILSVRC2013 are presented in Fig. 8. Each image was sampled randomly from the val2 set and all detections from all detectors with a precision greater than 0.5 are shown. Note that these are not curated and give a realistic impression of the detectors in action.

5 THE ILSVRC2013 DETECTION DATASET
In Section 3 we presented results on the ILSVRC2013 detection dataset. This dataset is less homogeneous than PASCAL VOC, requiring choices about how to use it. Since these decisions are non-trivial, we cover them in this section. The methodology and "val1" and "val2" data splits introduced in this section were used widely by participants in the ILSVRC2014 detection challenge.

5.1 Dataset Overview
The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), where the number of images in each set is in parentheses. The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images. The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. The train set, in contrast, is drawn from the ILSVRC2013 classification image distribution. These images have more variable complexity with a skew towards images of a single centered object. Unlike val and test, the train images (due to their large number) are not exhaustively annotated. In any given train image, instances from the 200 classes may or may not be labeled. In addition to these image sets, each class has an extra set of negative images. Negative images are manually checked to validate that they do not contain any instances of their associated class. The negative image sets were not used in this work. More information on how ILSVRC was collected and annotated can be found in [65], [66].

The nature of these splits presents a number of choices for training an R-CNN. The train images cannot be used for hard negative mining, because annotations are not exhaustive. Where should negative examples come from? Also, the train images have different statistics than val and test. Should the train images be used at all, and if so, to what extent? While we have not thoroughly evaluated a large number of choices, we present what seemed like the most obvious path based on previous experience.

Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples. To use val for both training and validation, we split it into roughly equally sized "val1" and "val2" sets. Since some classes have very few examples in val (the smallest has only 31 and half have fewer than 110), it is important to produce an approximately class-balanced partition. To do this, a large number of candidate splits were generated and the one with the smallest maximum relative class imbalance was selected.4 Each candidate split was generated by clustering val images using their class counts as features, followed by a randomized local search that may improve the split balance. The particular split used here has a maximum relative imbalance of about 11 percent and a median relative imbalance of 4 percent. The val1/val2 split and code used to produce them are publicly available in the R-CNN code repository, allowing other researchers to compare their methods on the val splits used in this report.

5.2 Region Proposals
We followed the same region proposal approach that was used for detection on PASCAL. Selective search [21] was run in "fast mode" on each image in val1, val2, and test (but not on images in train). One minor modification was required to deal with the fact that selective search is not scale invariant and so the number of regions produced depends on the image resolution. ILSVRC image sizes range from very small to a few that are several mega-pixels, and so we resized each image to a fixed width (500 pixels) before running selective search. On val, selective search resulted in an average of 2,403 region proposals per image with a 91.6 percent recall of all ground-truth bounding boxes (at 0.5 IoU threshold).

4. Relative imbalance is measured as |a − b| / (a + b) where a and b are class counts in each half of the split.
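The split-balancing criterion described in Section 5.1 can be sketched as follows: generate candidate val1/val2 partitions and keep the one with the smallest maximum relative class imbalance (footnote 4). For brevity, the clustering and randomized local search used in the paper are replaced here by plain random candidates, so this is only an approximation of the actual procedure.

```python
import numpy as np

def relative_imbalance(a, b):
    # Footnote 4: |a - b| / (a + b) for per-class counts a, b in the two halves.
    return abs(a - b) / float(a + b) if (a + b) > 0 else 0.0

def pick_balanced_split(class_counts, num_candidates=1000, seed=0):
    """class_counts: (num_images, num_classes) matrix of object counts per image.

    Returns a boolean mask assigning each image to val1 (True) or val2 (False),
    chosen to minimize the maximum relative class imbalance.
    """
    rng = np.random.default_rng(seed)
    n = class_counts.shape[0]
    best_mask, best_score = None, np.inf
    for _ in range(num_candidates):
        mask = rng.random(n) < 0.5
        a = class_counts[mask].sum(axis=0)
        b = class_counts[~mask].sum(axis=0)
        score = max(relative_imbalance(x, y) for x, y in zip(a, b))
        if score < best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score
```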
TABLE 4
ILSVRC2013 Ablation Study of Data Usage Choices, Fine-Tuning, and Bounding-Box Regression
test set val2 val2 val2 val2 val2 val2 test test
SVM training set val1 val1+train.5k val1+train1k val1+train1k val1+train1k val1+train1k val+train1k val+train1k
CNN fine-tuning set n/a n/a n/a val1 val1+train1k val1+train1k val1+train1k val1+train1k
bbox reg set n/a n/a n/a n/a n/a val1 n/a val
CNN feature layer fc6 fc6 fc6 fc7 fc7 fc7 fc7 fc7
mAP 20.9 24.1 24.1 26.5 29.7 31.0 30.2 31.4
median AP 17.7 21.0 21.4 24.8 29.2 29.6 29.0 30.3
This recall is notably lower than in PASCAL, where it is approximately 98 percent, indicating significant room for improvement in the region proposal stage.

5.3 Training Data
For training data, we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, then we take all of them). We'll call this dataset of images and boxes val1+trainN. In an ablation study, we show mAP on val2 for N ∈ {0, 500, 1000} (Section 5.5).

Training data are required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training. CNN fine-tuning was run for 50k SGD iterations on val1+trainN using the exact same settings as were used for PASCAL. Fine-tuning on a single NVIDIA Tesla K20 took 13 hours using Caffe. For SVM training, all ground-truth boxes from val1+trainN were used as positive examples for their respective classes. Hard negative mining was performed on a randomly selected subset of 5,000 images from val1. An initial experiment indicated that mining negatives from all of val1, versus a 5,000 image subset (roughly half of it), resulted in only a 0.5 percentage point drop in mAP, while cutting SVM training time in half. No negative examples were taken from train because the annotations are not exhaustive. The extra sets of verified negative images were not used. The bounding-box regressors were trained on val1.

5.4 Validation and Evaluation
Before submitting results to the evaluation server, we validated data usage choices and the effect of fine-tuning and bounding-box regression on the val2 set using the training data described above. All system hyperparameters (e.g., SVM C hyperparameters, padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were fixed at the same values used for PASCAL. Undoubtedly some of these hyperparameter choices are slightly suboptimal for ILSVRC, however the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning. After selecting the best choices on val2, we submitted exactly two result files to the ILSVRC2013 evaluation server. The first submission was without bounding-box regression and the second submission was with bounding-box regression. For these submissions, we expanded the SVM and bounding-box regressor training sets to use val+train1k and val, respectively. We used the CNN that was fine-tuned on val1+train1k to avoid re-running fine-tuning and feature computation.

5.5 Ablation Study
Table 4 shows an ablation study of the effects of different amounts of training data, fine-tuning, and bounding-box regression. A first observation is that mAP on val2 matches mAP on test very closely. This gives us confidence that mAP on val2 is a good indicator of test set performance. The first result, 20.9 percent, is what R-CNN achieves using a CNN pre-trained on the ILSVRC2012 classification dataset (no fine-tuning) and given access to the small amount of training data in val1 (recall that half of the classes in val1 have between 15 and 55 examples). Expanding the training set to val1+trainN improves performance to 24.1 percent, with essentially no difference between N = 500 and N = 1000. Fine-tuning the CNN using examples from just val1 gives a modest improvement to 26.5 percent, however there is likely significant overfitting due to the small number of positive training examples. Expanding the fine-tuning set to val1+train1k, which adds up to 1000 positive examples per class from the train set, helps significantly, boosting mAP to 29.7 percent. Bounding-box regression improves results to 31.0 percent, which is a smaller relative gain than what was observed in PASCAL.

5.6 Relationship to OverFeat
There is an interesting relationship between R-CNN and OverFeat: OverFeat can be seen (roughly) as a special case of an R-CNN. If one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9× faster, based on a figure of 2 seconds per image quoted from [19]. This speed comes from the fact that OverFeat's sliding windows (i.e., region proposals) are not warped at the image level and therefore computation can be easily shared between overlapping windows. Sharing is implemented by running the entire network in a convolutional fashion over arbitrary-sized inputs. OverFeat is slower than the pyramid-based version of R-CNN from He et al. [27].
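To make the "OverFeat as a special case of R-CNN" observation concrete, the sketch below generates a multi-scale pyramid of regular square windows of the kind that would replace selective search proposals; the scales and stride fraction are illustrative choices, not values taken from either system.

```python
import numpy as np

def regular_square_windows(im_h, im_w, scales=(64, 128, 256), stride_frac=0.5):
    """Multi-scale grid of square windows covering an im_h x im_w image.

    Returns an (N, 4) array of [x1, y1, x2, y2] boxes. Replacing selective
    search proposals with these windows (and using a single bounding-box
    regressor) is the sense in which OverFeat resembles a special case of R-CNN.
    """
    boxes = []
    for s in scales:
        if s > im_h or s > im_w:
            continue                      # skip scales larger than the image
        step = max(1, int(s * stride_frac))
        for y in range(0, im_h - s + 1, step):
            for x in range(0, im_w - s + 1, step):
                boxes.append([x, y, x + s - 1, y + s - 1])
    return np.array(boxes)

windows = regular_square_windows(500, 375)   # e.g., an image resized to 500 px width
```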
TABLE 6
Per-Category Segmentation Accuracy (Percent) on the VOC 2011 Validation Set
VOC 2011 val bg aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mean
O2 P [59] 84.0 69.0 21.7 47.7 42.2 42.4 64.7 65.8 57.4 12.9 37.4 20.5 43.7 35.7 52.7 51.0 35.8 51.0 28.4 59.8 49.7 46.4
full R-CNN fc6 81.3 56.2 23.9 42.9 40.7 38.8 59.2 56.5 53.2 11.4 34.6 16.7 48.1 37.0 51.4 46.0 31.5 44.0 24.3 53.7 51.1 43.0
full R-CNN fc7 81.0 52.8 25.1 43.8 40.5 42.7 55.4 57.7 51.3 8.7 32.5 11.5 48.1 37.0 50.5 46.4 30.2 42.1 21.2 57.7 56.0 42.5
fg R-CNN fc6 81.4 54.1 21.1 40.6 38.7 53.6 59.9 57.2 52.5 9.1 36.5 23.6 46.4 38.1 53.2 51.3 32.2 38.7 29.0 53.0 47.5 43.7
fg R-CNN fc7 80.9 50.1 20.0 40.2 34.1 40.9 59.7 59.8 52.7 7.3 32.1 14.3 48.8 42.9 54.0 48.6 28.9 42.6 24.9 52.2 48.8 42.1
full+fg R-CNN fc6 83.1 60.4 23.2 48.4 47.3 52.6 61.6 60.6 59.1 10.8 45.8 20.9 57.7 43.3 57.4 52.9 34.7 48.7 28.1 60.0 48.6 47.9
full+fg R-CNN fc7 82.3 56.7 20.6 49.9 44.2 43.6 59.3 61.3 57.8 7.7 38.4 15.1 53.4 43.7 50.8 52.0 34.1 47.8 24.7 60.1 55.2 45.7
TABLE 7
Segmentation Accuracy (Percent) on VOC 2011 Test
VOC 2011 test bg aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mean
R&P [68] 83.4 46.8 18.9 36.6 31.2 42.7 57.3 47.4 44.1 8.1 39.4 36.1 36.3 49.5 48.3 50.7 26.3 47.2 22.1 42.0 43.2 40.8
O2 P [59] 85.4 69.7 22.3 45.2 44.4 46.9 66.7 57.8 56.2 13.5 46.1 32.3 41.2 59.1 55.3 51.0 36.2 50.4 27.8 46.9 44.6 47.6
ours (full+fg R-CNN fc6) 84.2 66.9 23.7 58.3 37.4 55.4 73.3 58.7 56.5 9.7 45.5 29.5 49.3 40.1 57.8 53.9 33.8 60.7 22.7 47.1 41.3 47.9
We compare against two strong baselines: the “Regions and Parts” (R&P) method of [68] and the second-order pooling (O2 P) method of [59]. Without any
fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2 P. These experiments use TorontoNet without
fine-tuning.
The first method ("tightest square with context") encloses each object proposal inside the tightest square and then scales (isotropically) the image contained in that square to the CNN input size. Fig. 7 column (B) shows this transformation. A variant on this method ("tightest square without context") excludes the image content that surrounds the original object proposal. Fig. 7 column (C) shows this transformation. The second method ("warp") anisotropically scales each object proposal to the CNN input size. Fig. 7 column (D) shows the warp transformation.

For each of these transformations, we also consider including additional image context around the original object proposal. The amount of context padding (p) is defined as a border size around the original object proposal in the transformed input coordinate frame. Fig. 7 shows p = 0 pixels in the top row of each example and p = 16 pixels in the bottom row. In all methods, if the source rectangle extends beyond the image, the missing data are replaced with the image mean (which is then subtracted before inputting the image into the CNN). A pilot set of experiments showed that warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points). Obviously more alternatives are possible, including using replication instead of mean padding. Exhaustive evaluation of these alternatives is left as future work.

Fig. 7. Different object proposal transformations. (A) the original object proposal at its actual scale relative to the transformed CNN inputs; (B) tightest square with context; (C) tightest square without context; (D) warp. Within each column and example proposal, the top row corresponds to p = 0 pixels of context padding while the bottom row has p = 16 pixels of context padding.

7.2 Positive Versus Negative Examples and Softmax
Two design choices warrant further discussion. The first is: Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs? To review the definitions briefly, for fine-tuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled "background" (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.

Historically speaking, we arrived at these definitions because we started by training SVMs on features computed by the ImageNet pre-trained CNN, and so fine-tuning was not a consideration at that point in time. In that setup, we found that our particular label definition for training SVMs was optimal within the set of options we evaluated (which included the setting we now use for fine-tuning). When we started using fine-tuning, we initially used the same positive and negative example definition as we were using for SVM training. However, we found that results were much worse than those obtained using our current definition of positives and negatives.

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data are limited. Our current scheme introduces many "jittered" examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2 to 50.9 percent mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of "hard negatives" used for SVM training.
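The two labeling schemes contrasted in Section 7.2 can be written side by side. The sketch below assumes that each proposal's maximum IoU overlap with the relevant ground-truth boxes has been computed elsewhere; the 0.5 and 0.3 thresholds follow the text, while the function names are ours.

```python
def label_for_finetuning(max_iou, matched_class):
    """Fine-tuning: a proposal with IoU >= 0.5 to some ground-truth box is a
    positive for that box's class; everything else is background."""
    return matched_class if max_iou >= 0.5 else "background"

def label_for_svm(is_ground_truth, gt_class, max_iou_with_class):
    """SVM training: only ground-truth boxes are positives; a proposal with
    IoU < 0.3 against every instance of a class is a negative for that class;
    the grey zone in between is ignored."""
    if is_ground_truth:
        return gt_class              # positive example for its class
    if max_iou_with_class < 0.3:
        return "negative"
    return None                      # ignored (grey zone)
```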
Fig. 8. Example detections on the val2 set from the configuration that achieved 31.0 percent mAP on val2 . Each image was sampled randomly (these
are not curated). All detections at precision greater than 0.5 are shown. Each detection is labeled with the predicted class and the precision value of
that detection from the detector’s precision-recall curve. Viewing digitally with zoom is recommended.
This result shows that it's possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.

7.3 Bounding-Box Regression
We use a simple bounding-box regression stage to improve localization performance. After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor. This is similar in spirit to the bounding-box regression used in deformable part models [18]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.

The input to our training algorithm is a set of N training pairs {(P^i, G^i)} for i = 1, ..., N, where P^i = (P^i_x, P^i_y, P^i_w, P^i_h) specifies the pixel coordinates of the center of proposal P^i's bounding box together with P^i's width and height in pixels. Henceforth, we drop the superscript i unless it is needed. Each ground-truth bounding box G is specified in the same way: G = (G_x, G_y, G_w, G_h). Our goal is to learn a transformation that maps a proposed box P to a ground-truth box G.

We parameterize the transformation in terms of four functions d_x(P), d_y(P), d_w(P), and d_h(P). The first two specify a scale-invariant translation of the center of P's bounding box, while the second two specify log-space translations of the width and height of P's bounding box. After learning these functions, we can transform an input proposal P into a predicted ground-truth box Ĝ by applying the transformation

Ĝ_x = P_w d_x(P) + P_x    (1)
Ĝ_y = P_h d_y(P) + P_y    (2)
Ĝ_w = P_w exp(d_w(P))    (3)
Ĝ_h = P_h exp(d_h(P)).    (4)

Each function d_⋆(P) (where ⋆ is one of x, y, h, w) is modeled as a linear function of the pool5 features of proposal P, denoted by φ_5(P). (The dependence of φ_5(P) on the image data is implicitly assumed.) Thus we have d_⋆(P) = w_⋆^T φ_5(P), where w_⋆ is a vector of learnable model parameters. We learn w_⋆ by optimizing the regularized least squares objective (ridge regression):

w_⋆ = argmin_{ŵ_⋆} Σ_{i=1}^{N} (t^i_⋆ − ŵ_⋆^T φ_5(P^i))^2 + λ ||ŵ_⋆||^2.    (5)

The regression targets t_⋆ for the training pair (P, G) are defined as

t_x = (G_x − P_x) / P_w    (6)
t_y = (G_y − P_y) / P_h    (7)
t_w = log(G_w / P_w)    (8)
t_h = log(G_h / P_h).    (9)

As a standard regularized least squares problem, this can be solved efficiently in closed form.

We found two subtle issues while implementing bounding-box regression. The first is that regularization is important: we set λ = 1000 based on a validation set. The second issue is that care must be taken when selecting which training pairs (P, G) to use. Intuitively, if P is far from all ground-truth boxes, then the task of transforming P to a ground-truth box G does not make sense. Using examples like P would lead to a hopeless learning problem. Therefore, we only learn from a proposal P if it is nearby at least one ground-truth box. We implement "nearness" by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one) if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set). All unassigned proposals are discarded. We do this once for each object class in order to learn a set of class-specific bounding-box regressors.

At test time, we score each proposal and predict its new detection window only once. In principle, we could iterate this procedure (i.e., re-score the newly predicted bounding box, and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results.
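A minimal sketch of the regression targets in Eqs. (6)-(9) and the inverse transform in Eqs. (1)-(4), for boxes given as [x1, y1, x2, y2]. The ridge-regression fit of Eq. (5) is omitted; this is an illustration of the parameterization, not the released implementation.

```python
import numpy as np

def box_to_cxcywh(box):
    x1, y1, x2, y2 = box
    w, h = x2 - x1 + 1.0, y2 - y1 + 1.0
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h

def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) of Eqs. (6)-(9) for proposal P and ground truth G."""
    Px, Py, Pw, Ph = box_to_cxcywh(P)
    Gx, Gy, Gw, Gh = box_to_cxcywh(G)
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def apply_regression(P, d):
    """Inverse transform of Eqs. (1)-(4): map proposal P and predictions
    d = (d_x, d_y, d_w, d_h) to a predicted box."""
    Px, Py, Pw, Ph = box_to_cxcywh(P)
    Gx = Pw * d[0] + Px
    Gy = Ph * d[1] + Py
    Gw = Pw * np.exp(d[2])
    Gh = Ph * np.exp(d[3])
    return np.array([Gx - 0.5 * Gw, Gy - 0.5 * Gh,
                     Gx + 0.5 * Gw - 1, Gy + 0.5 * Gh - 1])
```

With these helpers, fitting w_⋆ reduces to ridge regression from φ_5(P) to the four targets, and the learned predictions are applied with the inverse transform at test time.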
7.4 Analysis of Cross-Dataset Redundancy
One concern when training on an auxiliary dataset is that there might be redundancy between it and the test set. Even though the tasks of object detection and whole-image classification are substantially different, making such cross-set redundancy much less worrisome, we still conducted a thorough investigation that quantifies the extent to which PASCAL test images are contained within the ILSVRC 2012 training and validation sets. Our findings may be useful to researchers who are interested in using ILSVRC 2012 as training data for the PASCAL image classification task.

We performed two checks for duplicate (and near-duplicate) images. The first test is based on exact matches of flickr image IDs, which are included in the VOC 2007 test annotations (these IDs are intentionally kept secret for subsequent PASCAL test sets). All PASCAL images, and about half of ILSVRC, were collected from flickr.com. This check turned up 31 matches out of 4,952 (0.63 percent).

The second check uses GIST [75] descriptor matching, which was shown in [76] to have excellent performance at near-duplicate image detection in large (> 1 million) image collections. Following [76], we computed GIST descriptors on warped 32 × 32 pixel versions of all ILSVRC 2012 trainval and PASCAL 2007 test images.

Euclidean distance nearest-neighbor matching of GIST descriptors revealed 38 near-duplicate images (including all 31 found by flickr ID matching). The matches tend to vary slightly in JPEG compression level and resolution, and to a lesser extent cropping. These findings show that the overlap is small, less than 1 percent. For VOC 2012, because flickr IDs are not available, we used the GIST matching method only. Based on GIST matches, 1.5 percent of VOC 2012 test images are in ILSVRC 2012 trainval. The slightly higher rate for VOC 2012 is likely due to the fact that the two datasets were collected closer together in time than VOC 2007 and ILSVRC 2012 were.

8 CONCLUSION
In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives more than a 50 percent relative improvement over the best previous results on PASCAL VOC 2012.

We achieved this performance through two insights. The first is to apply high-capacity convolutional networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data are scarce. We show that it is highly effective to pre-train the network—with supervision—for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the "supervised pre-training/domain-specific fine-tuning" paradigm will be highly effective for a variety of data-scarce vision problems.

We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.

ACKNOWLEDGMENTS
This research was supported in part by DARPA Mind's Eye and MSEE programs, by US National Science Foundation Awards IIS-0905647, IIS-1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota. The GPUs used in this research were generously donated by the NVIDIA Corporation. R. Girshick is with Microsoft Research and was with the Department of Electrical Engineering and Computer Science, UC Berkeley during the majority of this work.

REFERENCES
[1] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886–893.
[3] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 80, no. 2, pp. 303–338, 2010.
[4] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biol. Cybern., vol. 36, no. 4, pp. 193–202, 1980.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," Parallel Distrib. Process., vol. 1, pp. 318–362, 1986.
[6] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106–1114.
[9] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ImageNet large scale visual recognition competition 2012 (ILSVRC2012) [Online]. Available: http://www.image-net.org/challenges/LSVRC/2012/, 2012.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
[12] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2553–2561.
[13] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2155–2162.
[14] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 23–28, Jan. 1998.
[15] R. Vaillant, C. Monrocq, and Y. LeCun, "Original approach for the localisation of objects in images," IEE Proc. Vis., Image, Signal Process., vol. 141, no. 4, pp. 245–250, Aug. 1994.
[16] J. Platt and S. Nowlan, "A convolutional neural network hand tracker," in Proc. Adv. Neural Inf. Process. Syst., 1995, pp. 901–908.
[17] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3626–3633.
[18] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," in Proc. Int. Conf. Learn. Representations, 2014, p. 16.
[20] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik, "Recognition using regions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 1030–1037.
[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 3, pp. 154–171, 2013.
[22] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1312–1328, Jul. 2012.
[23] R. Girshick, P. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5 [Online]. Available: http://www.cs.berkeley.edu/~rbg/latent-v5/, 2012.
[24] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015.
[25] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 329–344.
[26] T. Dean, J. Yagnik, M. Ruzon, M. Segal, J. Shlens, and S. Vijayanarasimhan, "Fast, accurate detection of 100,000 object classes on a single machine," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1814–1821.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 346–361.
[28] R. Girshick, "Fast R-CNN," arXiv e-prints, vol. arXiv:1504.08083v1 [cs.CV], 2015.
[29] D. Hoiem, A. Efros, and M. Hebert, "Geometric context from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2005, pp. 654–661.
[30] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman, "Using multiple segmentations to discover objects and their extent in image collections," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 1605–1614.
[31] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 391–405.
[32] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 3286–3293.
[33] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?" arXiv e-prints, vol. arXiv:1502.05082v1 [cs.CV], 2015.
[34] A. Humayun, F. Li, and J. M. Rehg, "RIGOR: Reusing inference in graph cuts for generating object regions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 336–343.
[35] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 328–335.
[36] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 725–739.
[37] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[38] R. Caruana, "Multitask learning: A knowledge-based source of inductive bias," in Proc. 10th Int. Conf. Mach. Learn., 1993, pp. 41–48.
[39] S. Thrun, "Is learning the n-th thing any easier than learning the first?" in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 640–646.
[40] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[41] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. Int. Conf. Mach. Learn., 2014, pp. 647–655.
[42] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, "From large-scale object classifiers to large-scale object detectors: An adaptation approach," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3536–3544.
[43] A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1889–1897.
[44] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, "R-CNNs for pose estimation and action detection," arXiv e-prints, vol. arXiv:1406.5212v1 [cs.CV], 2014.
[45] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 297–312.
[46] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 345–360.
[47] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, "On learning to localize objects with minimal supervision," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1611–1619.
[48] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," arXiv e-prints, vol. arXiv:1409.0575v1 [cs.CV], 2014.
[49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv e-prints, vol. arXiv:1409.4842v1 [cs.CV], 2014.
[50] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, "Scalable, high-quality object detection," arXiv e-prints, vol. arXiv:1412.1441v2 [cs.CV], 2015.
[51] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189–2202, Nov. 2012.
[52] I. Endres and D. Hoiem, "Category independent object proposals," in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 575–588.
[53] D. Cireşan, A. Giusti, L. Gambardella, and J. Schmidhuber, "Mitosis detection in breast cancer histology images with deep neural networks," in Proc. Medical Image Comput. Comput.-Assisted Intervention, 2013, pp. 411–418.
[54] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 17–24.
[55] Y. Jia. (2013). Caffe: An open source convolutional architecture for fast feature embedding [Online]. Available: http://caffe.berkeleyvision.org/
[56] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, "Fast, accurate detection of 100,000 object classes on a single machine," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1814–1821.
[57] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun, "Bottom-up segmentation for top-down detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3294–3301.
[58] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," Artif. Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. A.I. Memo No. 1521, 1994.
[59] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 430–443.
[60] K. E. van de Sande, C. G. Snoek, and A. W. Smeulders, "Fisher and VLAD with FLAIR," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2377–2384.
[61] J. J. Lim, C. L. Zitnick, and P. Dollár, "Sketch tokens: A learned mid-level representation for contour and object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3158–3165.
[62] X. Ren and D. Ramanan, "Histograms of sparse codes for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3246–3253.
[63] M. Zeiler, G. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2018–2025.
[64] D. Hoiem, Y. Chodpathumwan, and Q. Dai, "Diagnosing error in object detectors," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 340–353.
[65] J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. C. Berg, and L. Fei-Fei, "Scalable multi-label annotation," in Proc. SIGCHI Conf. Human Factors Comput. Syst., 2014, pp. 3099–3102.
[66] H. Su, J. Deng, and L. Fei-Fei, "Crowdsourcing annotations for visual object detection," in Proc. AAAI 4th Human Comput. Workshop, 2012.
[67] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Aug. 2013.
[68] P. Arbeláez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik, "Semantic segmentation using regions and parts," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 3378–3385.
[69] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 991–998.
[70] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 447–456.
[71] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in Proc. Int. Conf. Learn. Representations, 2015.
[72] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," arXiv e-prints, vol. arXiv:1502.03240v2 [cs.CV], 2015.
[73] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3431–3440.
[74] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 109–117.
[75] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, pp. 145–175, 2001.
[76] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, "Evaluation of gist descriptors for web-scale image search," in Proc. ACM Int. Conf. Image Video Retrieval, 2009, p. 19.

Ross Girshick received the MS and PhD degrees in computer science, both from the University of Chicago, where he studied under the supervision of Pedro Felzenszwalb. He is a researcher at Microsoft Research, Redmond. Prior to joining Microsoft Research, he was a postdoctoral fellow at the University of California, Berkeley, where he collaborated with Jitendra Malik and Trevor Darrell. During the course of the PASCAL VOC object detection challenge, he participated in multiple winning object detection entries and was awarded a "lifetime achievement" prize for his work on the widely used Deformable Part Models.
Jeff Donahue received the BS degree in computer science from UT Austin in 2011, completing an honors thesis with Kristen Grauman. He is a fourth-year PhD student at UC Berkeley, supervised by Trevor Darrell. His research interests are in visual recognition and machine learning, most recently focusing on the application of deep learning to visual localization and sequencing problems. He is a student member of the IEEE.

Trevor Darrell received the BSE degree from the University of Pennsylvania in 1988, having started his career in computer vision as an undergraduate researcher in Ruzena Bajcsy's GRASP lab. He received the SM and PhD degrees from MIT in 1992 and 1996, respectively. His group is co-located at the University of California, Berkeley, and the UCB-affiliated International Computer Science Institute (ICSI), also located in Berkeley, CA. He is on the faculty of the CS Division of the EECS Department at UCB and is the vision group lead at ICSI. His group develops algorithms for large-scale perceptual learning, including object and activity recognition and detection, for a variety of applications including multimodal interaction with robots and mobile devices. His interests include computer vision, machine learning, computer graphics, and perception-based human-computer interfaces. He was previously on the faculty of the MIT EECS department from 1999 to 2008, where he directed the Vision Interface Group. He was a member of the research staff at Interval Research Corporation from 1996 to 1999. He is a member of the IEEE.

Jitendra Malik received the BTech degree in EE from the Indian Institute of Technology, Kanpur, in 1980 and the PhD degree in CS from Stanford University in 1985. In January 1986, he joined UC Berkeley, where he is currently the Arthur J. Chick Professor in the Department of EECS. He is also on the faculty of the Department of Bioengineering and the Cognitive Science and Vision Science groups. During 2002–2004, he served as the chair of the Computer Science Division, and during 2004–2006 as the department chair of EECS. His research group has worked on computer vision, computational modeling of human vision, computer graphics, and the analysis of biological images. Several well-known concepts and algorithms arose in this research, such as anisotropic diffusion, normalized cuts, high dynamic range imaging, and shape contexts. He is one of the most highly cited researchers in computer vision, with 10 of his papers having received more than a thousand citations each. He has graduated 33 PhD students, many of whom are now prominent researchers in academia and industry. He received the Gold Medal for the best graduating student in electrical engineering from IIT Kanpur in 1980 and a Presidential Young Investigator Award in 1989. He was twice awarded the Longuet-Higgins Prize for a contribution that has stood the test of time, in 2007 and in 2008. He is a member of the National Academy of Engineering and a fellow of the American Academy of Arts and Sciences. In 2013 he received the IEEE PAMI-TC Distinguished Researcher in Computer Vision Award, and in 2014 the K. S. Fu Prize from the International Association for Pattern Recognition. He is a fellow of both the ACM and the IEEE.