
DSSD: Deconvolutional Single Shot Detector

Cheng-Yang Fu¹*, Wei Liu¹*, Ananth Ranga², Ambrish Tyagi², Alexander C. Berg¹
¹UNC Chapel Hill, ²Amazon Inc.
*Equal contribution
Abstract
The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with 513 × 513 input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method R-FCN [3] on each dataset.
1. Introduction
The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. The end result achieves the current highest accuracy for detection with a single network on PASCAL VOC [6] while also maintaining speed comparable to a previous state-of-the-art detector [3]. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research.
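To make the deconvolution idea concrete before the architecture is described in detail, the sketch below shows one way such a stage could be wired in PyTorch: a coarse feature map from deeper in the network is upsampled with a learned deconvolution (transposed convolution) and fused with a feed-forward (lateral) feature map from an earlier, higher-resolution SSD layer. The module name, layer widths, and the choice of element-wise combination here are illustrative assumptions, not the exact configuration proposed later in the paper.

```python
import torch.nn as nn


class DeconvFusionSketch(nn.Module):
    """Illustrative sketch (not the paper's exact module): fuse a coarse,
    deep feature map with a finer feed-forward map from an earlier layer."""

    def __init__(self, deep_channels, lateral_channels, out_channels, combine="prod"):
        super().__init__()
        self.combine = combine
        # Deconvolution path: learned 2x upsampling of the deep feature map.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(deep_channels, out_channels, kernel_size=2, stride=2),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # Feed-forward (lateral) path from the earlier, higher-resolution layer.
        self.lateral = nn.Sequential(
            nn.Conv2d(lateral_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, deep_feat, lateral_feat):
        up = self.deconv(deep_feat)       # coarse map upsampled 2x
        lat = self.lateral(lateral_feat)  # earlier map, matched in channels
        # Element-wise combination of the two paths (product or sum).
        fused = up * lat if self.combine == "prod" else up + lat
        return self.relu(fused)
```

For example, with 19 × 19 and 38 × 38 feature maps from consecutive scales, the 19 × 19 map would be the deep input and the 38 × 38 map the lateral input, yielding a 38 × 38 fused map.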
Putting this work in context, there has been a move in object detection back toward sliding-window techniques over the last two years. The idea is that instead of first proposing potential bounding boxes for objects in an image and then classifying them, as exemplified in selective search [27] and R-CNN [12] derived methods, a classifier is applied to a fixed set of possible bounding boxes in an image. While sliding-window approaches never completely disappeared, they had gone out of favor after the heyday of HOG [4] and DPM [7] due to the increasingly large number of box locations that had to be considered to keep up with the state of the art. They are coming back as more powerful machine learning frameworks integrating deep learning are developed. These allow fewer potential bounding boxes to be considered, but in addition to a classification score for each box, require predicting an offset to the actual location of the object, snapping to its spatial extent. Recently these approaches have been shown to be effective for bounding box proposals [5, 24] in place of bottom-up grouping of segments [27, 12]. Even more recently, these approaches have been used not only to score bounding boxes as potential object locations, but to simultaneously predict scores for object categories, effectively combining the steps of region proposal and classification. This is the approach taken by You Only Look Once (YOLO) [23], which computes a global feature map and uses a fully-connected layer to predict detections in a fixed set of regions. Taking this single-shot approach further by adding layers of feature maps for each scale and using a convolutional filter for prediction, the Single Shot MultiBox Detector (SSD) [18] is significantly more accurate and is currently the best detector with respect to the speed-vs-accuracy trade-off.
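As a rough sketch of this convolutional, multi-scale prediction (not SSD's released head design; the class and parameter names below are our own), each feature map of a different resolution gets small convolutional filters that output, for every default box at every location, a set of class scores plus four box offsets:

```python
import torch
import torch.nn as nn


class MultiScaleHeadsSketch(nn.Module):
    """Illustrative sketch of SSD-style prediction: a 3x3 conv per feature map
    predicts class scores and box offsets for k default boxes per location."""

    def __init__(self, channels_per_scale, num_classes, boxes_per_location):
        super().__init__()
        self.num_classes = num_classes
        self.k = boxes_per_location
        # One pair of convolutional heads per feature-map scale.
        self.cls_heads = nn.ModuleList(
            [nn.Conv2d(c, self.k * num_classes, kernel_size=3, padding=1)
             for c in channels_per_scale]
        )
        self.loc_heads = nn.ModuleList(
            [nn.Conv2d(c, self.k * 4, kernel_size=3, padding=1)
             for c in channels_per_scale]
        )

    def forward(self, feature_maps):
        scores, offsets = [], []
        for feat, cls_head, loc_head in zip(feature_maps, self.cls_heads, self.loc_heads):
            n = feat.size(0)
            # (N, k*C, H, W) -> (N, H*W*k, C): one score vector per default box.
            scores.append(cls_head(feat).permute(0, 2, 3, 1).reshape(n, -1, self.num_classes))
            # (N, k*4, H, W) -> (N, H*W*k, 4): one (dx, dy, dw, dh) offset per box.
            offsets.append(loc_head(feat).permute(0, 2, 3, 1).reshape(n, -1, 4))
        # Concatenate predictions from all scales into one set of boxes.
        return torch.cat(scores, dim=1), torch.cat(offsets, dim=1)
```

A 38 × 38 feature map with k = 4 default boxes per location, for instance, contributes 38 · 38 · 4 = 5776 box predictions to the concatenated output, while coarser maps contribute progressively fewer, larger boxes.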
When looking for ways to further improve the accuracy of detection, obvious targets are better feature networks and adding more context, especially for small objects, in addition to improving the spatial resolution of the bounding box