
DSSD: Deconvolutional Single Shot Detector

Cheng-Yang Fu¹*, Wei Liu¹*, Ananth Ranga², Ambrish Tyagi², Alexander C. Berg¹
¹UNC Chapel Hill, ²Amazon Inc.
*Equal contribution
Abstract
The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with 513 × 513 input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method R-FCN [3] on each dataset.
1. Introduction
The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. The end result achieves the current highest accuracy for detection with a single network on PASCAL VOC [6] while also maintaining speed comparable to a previous state-of-the-art detector [3]. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research.
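To make the deconvolution idea concrete before the architecture is described in detail, the sketch below shows one way such a stage could be wired in PyTorch: a coarse feature map from deeper in the network is upsampled with a learned deconvolution (transposed convolution) and fused with a feed-forward (lateral) feature map from an earlier, higher-resolution SSD layer. The module name, layer widths, and the choice of element-wise combination here are illustrative assumptions, not the exact configuration proposed later in the paper.

```python
import torch.nn as nn


class DeconvFusionSketch(nn.Module):
    """Illustrative sketch (not the paper's exact module): fuse a coarse,
    deep feature map with a finer feed-forward map from an earlier layer."""

    def __init__(self, deep_channels, lateral_channels, out_channels, combine="prod"):
        super().__init__()
        self.combine = combine
        # Deconvolution path: learned 2x upsampling of the deep feature map.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(deep_channels, out_channels, kernel_size=2, stride=2),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # Feed-forward (lateral) path from the earlier, higher-resolution layer.
        self.lateral = nn.Sequential(
            nn.Conv2d(lateral_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, deep_feat, lateral_feat):
        up = self.deconv(deep_feat)       # coarse map upsampled 2x
        lat = self.lateral(lateral_feat)  # earlier map, matched in channels
        # Element-wise combination of the two paths (product or sum).
        fused = up * lat if self.combine == "prod" else up + lat
        return self.relu(fused)
```

For example, with 19 × 19 and 38 × 38 feature maps from consecutive scales, the 19 × 19 map would be the deep input and the 38 × 38 map the lateral input, yielding a 38 × 38 fused map.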
Putting this work in context, there has been a move in object detection back toward sliding-window techniques over the last two years. The idea is that instead of first proposing potential bounding boxes for objects in an image and then classifying them, as exemplified in selective search [27] and R-CNN [12] derived methods, a classifier is applied to a fixed set of possible bounding boxes in an image. While sliding-window approaches never completely disappeared, they had gone out of favor after the heyday of HOG [4] and DPM [7] due to the increasingly large number of box locations that had to be considered to keep up with the state of the art. They are coming back as more powerful machine learning frameworks integrating deep learning are developed. These allow fewer potential bounding boxes to be considered, but in addition to a classification score for each box, require predicting an offset to the actual location of the object, snapping to its spatial extent. Recently these approaches have been shown to be effective for bounding box proposals [5, 24] in place of bottom-up grouping of segments [27, 12]. Even more recently, these approaches have been used not only to score bounding boxes as potential object locations, but to simultaneously predict scores for object categories, effectively combining the steps of region proposal and classification. This is the approach taken by You Only Look Once (YOLO) [23], which computes a global feature map and uses a fully-connected layer to predict detections in a fixed set of regions. Taking this single-shot approach further by adding layers of feature maps for each scale and using a convolutional filter for prediction, the Single Shot MultiBox Detector (SSD) [18] is significantly more accurate and is currently the best detector with respect to the speed-vs-accuracy trade-off.
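As a rough sketch of this convolutional, multi-scale prediction (not SSD's released head design; the class and parameter names below are our own), each feature map of a different resolution gets small convolutional filters that output, for every default box at every location, a set of class scores plus four box offsets:

```python
import torch
import torch.nn as nn


class MultiScaleHeadsSketch(nn.Module):
    """Illustrative sketch of SSD-style prediction: a 3x3 conv per feature map
    predicts class scores and box offsets for k default boxes per location."""

    def __init__(self, channels_per_scale, num_classes, boxes_per_location):
        super().__init__()
        self.num_classes = num_classes
        self.k = boxes_per_location
        # One pair of convolutional heads per feature-map scale.
        self.cls_heads = nn.ModuleList(
            [nn.Conv2d(c, self.k * num_classes, kernel_size=3, padding=1)
             for c in channels_per_scale]
        )
        self.loc_heads = nn.ModuleList(
            [nn.Conv2d(c, self.k * 4, kernel_size=3, padding=1)
             for c in channels_per_scale]
        )

    def forward(self, feature_maps):
        scores, offsets = [], []
        for feat, cls_head, loc_head in zip(feature_maps, self.cls_heads, self.loc_heads):
            n = feat.size(0)
            # (N, k*C, H, W) -> (N, H*W*k, C): one score vector per default box.
            scores.append(cls_head(feat).permute(0, 2, 3, 1).reshape(n, -1, self.num_classes))
            # (N, k*4, H, W) -> (N, H*W*k, 4): one (dx, dy, dw, dh) offset per box.
            offsets.append(loc_head(feat).permute(0, 2, 3, 1).reshape(n, -1, 4))
        # Concatenate predictions from all scales into one set of boxes.
        return torch.cat(scores, dim=1), torch.cat(offsets, dim=1)
```

A 38 × 38 feature map with k = 4 default boxes per location, for instance, contributes 38 · 38 · 4 = 5776 box predictions to the concatenated output, while coarser maps contribute progressively fewer, larger boxes.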
When looking for ways to further improve the accuracy of detection, obvious targets are better feature networks and adding more context, especially for small objects, in addition to improving the spatial resolution of the bounding box