
[17, 31], keypoint detection [3] and counting [2]. As one of the high-level vision tasks, object detection might be the only one deviating from the neat fully convolutional per-pixel prediction framework, mainly due to the use of anchor boxes. It is natural to ask: can we solve object detection in the neat per-pixel prediction fashion, analogous to FCN for semantic segmentation, for example? If so, those fundamental vision tasks could be unified in (almost) one single framework. We show that the answer is affirmative. Moreover, we demonstrate that, for the first time, the much simpler FCN-based detector achieves even better performance than its anchor-based counterparts.
In the literature, some works attempted to leverage the FCN-based framework for object detection, such as DenseBox [12]. Specifically, these FCN-based frameworks directly predict a 4D vector plus a class category at each spatial location on a level of feature maps. As shown in Fig. 1 (left), the 4D vector depicts the relative offsets from the four sides of a bounding box to the location. These frameworks are similar to the FCNs for semantic segmentation, except that each location is required to regress a 4D continuous vector. However, to handle bounding boxes of different sizes, DenseBox [12] crops and resizes training images to a fixed scale. Thus DenseBox has to perform detection on image pyramids, which is against FCN's philosophy of computing all convolutions once. More significantly, these methods are mainly used in special-domain object detection such as scene text detection [33, 10] or face detection [32, 12], since it is believed that they do not work well when applied to generic object detection with highly overlapped bounding boxes. As shown in Fig. 1 (right), highly overlapped bounding boxes result in an intractable ambiguity: it is not clear which bounding box the pixels in the overlapping regions should regress to.
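For concreteness, below is a minimal sketch of this per-location regression target in PyTorch; it is our own illustrative code (the helper name `ltrb_targets` is ours), not the implementation of DenseBox or of this paper. It also makes the ambiguity explicit: a location inside two overlapping boxes yields two equally valid 4D targets.

```python
import torch

def ltrb_targets(locations, box):
    """Per-location 4D regression targets (l, t, r, b): distances from each
    location (x, y) to the left, top, right and bottom sides of a box given
    as (x0, y0, x1, y1)."""
    xs, ys = locations[:, 0], locations[:, 1]
    x0, y0, x1, y1 = box
    l = xs - x0
    t = ys - y0
    r = x1 - xs
    b = y1 - ys
    return torch.stack([l, t, r, b], dim=1)

# Two overlapping ground-truth boxes and one location inside both of them:
boxes = torch.tensor([[10., 10., 90., 90.],
                      [30., 30., 70., 70.]])
loc = torch.tensor([[50., 50.]])              # falls in the overlapping region
targets = torch.stack([ltrb_targets(loc, b)[0] for b in boxes])
# Both rows are valid targets (all four distances are positive), so without
# extra rules it is ambiguous which box this location should regress to,
# which is the issue illustrated in Fig. 1 (right).
print(targets)
```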
In the sequel, we take a closer look at the issue and show
that with FPN this ambiguity can be largely eliminated. As
a result, our method can already obtain detection accuracy comparable to that of traditional anchor-based detectors. Furthermore, we observe that our method may produce a number of low-quality predicted bounding boxes at locations far from the center of a target object. In order to suppress these low-quality detections, we introduce a novel “center-ness” branch (only one layer) to predict the deviation of a location from the center of its corresponding bounding box, as defined in Eq. (3). This score is then used to down-weight low-quality detected bounding boxes when merging the detection results in NMS. The simple yet effective center-ness branch allows the FCN-based detector to outperform its anchor-based counterparts under exactly the same training and testing settings.
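As a reference, the following is a minimal sketch of the center-ness target of Eq. (3), computed from the (l, t, r, b) regression targets, together with how such a score can down-weight off-center detections before NMS; the code and names are illustrative and not the released implementation (at test time the score comes from the predicted center-ness branch, for which the target formula serves as a stand-in here).

```python
import torch

def centerness_target(ltrb):
    """Center-ness of a location given its (l, t, r, b) regression targets:
    sqrt( min(l, r)/max(l, r) * min(t, b)/max(t, b) ).
    It equals 1 at the box center and decays towards 0 near the box borders."""
    l, t, r, b = ltrb.unbind(dim=-1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)

# Ranking score for NMS = classification score * center-ness, so that
# predictions from off-center locations are suppressed:
cls_score = torch.tensor([0.9, 0.9])
ltrb = torch.tensor([[40., 40., 40., 40.],   # location at the box center
                     [ 5., 40., 75., 40.]])  # location far off-center
final_score = cls_score * centerness_target(ltrb)
print(final_score)  # the off-center prediction gets a much lower score
```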
This new detection framework enjoys the following ad-
vantages.
• Detection is now unified with many other FCN-
solvable tasks such as semantic segmentation, making
it easier to re-use ideas from those tasks.
• Detection becomes proposal free and anchor free, which significantly reduces the number of design parameters. These design parameters typically need heuristic tuning, and many tricks are involved in order to achieve good performance. Therefore, our new detection framework makes the detector, particularly its training, considerably simpler.
• By eliminating the anchor boxes, our new detector completely avoids the complicated computation related to anchor boxes, such as the IoU computation and matching between anchor boxes and ground-truth boxes during training, resulting in faster training and testing as well as a smaller training memory footprint than its anchor-based counterpart.
• Without bells and whistles, we achieve state-of-the-art results among one-stage detectors. We also show that the proposed FCOS can be used as the Region Proposal Network (RPN) in two-stage detectors, achieving significantly better performance than its anchor-based RPN counterpart. Given the even better performance of the much simpler anchor-free detector, we encourage the community to rethink the necessity of anchor boxes in object detection, which are currently considered the de facto standard for detection.
• The proposed detector can be immediately extended to solve other vision tasks with minimal modification, including instance segmentation and keypoint detection. We believe that this new method can serve as a new baseline for many instance-wise prediction problems.
2. Related Work
Anchor-based Detectors. Anchor-based detectors inherit the ideas from traditional sliding-window and proposal-based detectors such as Fast R-CNN [6]. In anchor-based detectors, the anchor boxes can be viewed as pre-defined sliding windows or proposals, which are classified as positive or negative patches, with an extra offset regression to refine the prediction of bounding-box locations. Therefore, the anchor boxes in these detectors may be viewed as training samples. Unlike earlier detectors such as Fast R-CNN, which compute image features for each sliding window/proposal repeatedly, anchor boxes make use of the feature maps of CNNs and avoid repeated feature computation, speeding up the detection process dramatically. The design of anchor boxes was popularized by Faster R-CNN in its RPN [24], SSD [18] and YOLOv2 [22], and has become the convention in modern detectors.
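For contrast with the anchor-free formulation, below is a rough sketch of the IoU-based anchor assignment that anchor-based detectors typically perform during training; the thresholds (0.5/0.4) and the helper name are illustrative and not tied to any particular detector.

```python
import torch
from torchvision.ops import box_iou

def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor by its best IoU with the ground-truth boxes:
    >= pos_thr -> index of the matched ground-truth box (positive sample),
    <  neg_thr -> -1 (background), otherwise -2 (ignored).
    The thresholds are illustrative hyper-parameters that anchor-based
    detectors typically have to tune."""
    iou = box_iou(anchors, gt_boxes)        # (num_anchors, num_gt) IoU matrix
    best_iou, best_gt = iou.max(dim=1)      # best ground truth for each anchor
    labels = torch.full((anchors.size(0),), -2, dtype=torch.long)
    labels[best_iou < neg_thr] = -1
    pos = best_iou >= pos_thr
    labels[pos] = best_gt[pos]
    return labels

# In practice this runs over many anchors per location at every feature-map
# position; here just two anchors against one ground-truth box.
anchors = torch.tensor([[0., 0., 50., 50.], [60., 60., 100., 100.]])
gt = torch.tensor([[5., 5., 55., 55.]])
print(assign_anchors(anchors, gt))  # -> tensor([ 0, -1])
```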
However, as described above, anchor boxes result in
excessively many hyper-parameters, which typically need