Systematic Evaluation of Convolution Neural Network Advances on the ImageNet
Abstract

The paper systematically studies the impact of a range of recent advances in convolution neural network (CNN) architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following architecture choices: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP), image pre-processing, and of learning parameters: learning rate, batch size, cleanliness of the data, etc.

The performance gains of the proposed modifications are first tested individually and then in combination. The sum of individual gains is greater than the observed improvement when all modifications are introduced, but the "deficit" is small, suggesting independence of their benefits.

We show that the use of 128 × 128 pixel images is sufficient to make qualitative conclusions about optimal network structure that hold for the full size Caffe and VGG nets. The results are obtained an order of magnitude faster than with the standard 224 pixel images.

Keywords: CNN, Benchmark, Non-linearity, Pooling, ImageNet
1. Introduction

Deep convolution networks have become the mainstream method for solving various computer vision tasks, such as image classification (Russakovsky et al., 2015), object detection (Everingham et al., 2010; Russakovsky et al., 2015), semantic segmentation (Dai et al., 2016), image retrieval (Tolias et al., 2016), tracking (Nam and Han, 2015), text detection (Jaderberg et al., 2014), stereo matching (Žbontar and LeCun, 2014), and many others. Besides two classic works on training neural networks – LeCun et al. (1998b) and Bengio (2012) – which are still highly relevant, there is very little guidance or theory on the plethora of design choices and hyper-parameter settings of CNNs, with the consequence that researchers proceed by trial-and-error experimentation and architecture copying, sticking to established net types. With good results in the ImageNet competition, AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2015) and GoogLeNet (Inception) (Szegedy et al., 2015) have become the de-facto standard.

Theory-grounded recommendations for the selection of the number of neurons (Ithapu et al., 2017; Schmidhuber, 1997), network depth (Gao and Jojic, 2016), effective receptive field size (Luo et al., 2016), etc. have been published. The topic of local minima in deep network optimization is well covered by Choromanska et al. (2014) and by Soudry and Carmon (2016). However, the latest state-of-the-art results have been achieved by hand-crafted architectures (Zagoruyko and Komodakis, 2016) or by large-scale "trial-and-error" reinforcement learning search (Zoph and Le, 2016).

Improvements of many components of the CNN architecture, like the non-linearity type, pooling, structure and learning, have been proposed recently. First applied in the ILSVRC (Russakovsky et al., 2015) competition, they have been adopted in different research areas.

The contributions of the recent CNN improvements and their interaction have not been systematically evaluated. We survey the recent developments and perform a large-scale experimental study that considers the choice of non-linearity, pooling, learning rate policy, classifier design, network width, and batch normalization (Ioffe and Szegedy, 2015). We did not include ResNets (He et al., 2016a) – a recent development achieving excellent results – since they have been well covered in papers (He et al., 2016b; Larsson et al., 2016; Szegedy et al., 2016; Zagoruyko and Komodakis, 2016).
There are three main contributions of the paper. First, we survey and present baseline results for a wide variety of architectures and design choices, both individually and in combination. Based on a large-scale evaluation, we provide novel recommendations and insights into deep convolutional network structure. Second, we show that for popular architectures – AlexNet, GoogLeNet, VGGNet – the qualitative conclusions obtained at the reduced 128 × 128 px resolution hold at the full image size.
Table 1
List of hyper-parameters tested.

Non-linearity: linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout, APL, combination
Batch Normalization (BN): before non-linearity, after non-linearity
BN + non-linearity: linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout
Pooling: max, average, stochastic, max+average, strided convolution
Pooling window size: 3 × 3, 2 × 2, 3 × 3 with zero-padding
Learning rate decay policy: step, square, square root, linear
Colorspace & pre-processing: RGB, HSV, YCrCb, grayscale, learned, CLAHE, histogram equalized
Classifier design: pooling-FC-FC-clf, SPP-FC-FC-clf, pooling-conv-conv-clf-avepool, pooling-conv-conv-avepool-clf
Network width: 1/4, 1/(2√2), 1/2, 1/√2, 1, √2, 2, 2√2, 4, 4√2
Input image size: 64, 96, 128, 180, 224
Dataset size: 200k, 400k, 600k, 800k, 1200k (full)
Batch size: 1, 32, 64, 128, 256, 512, 1024
Percentage of noisy data: 0%, 5%, 10%, 15%, 32%
Using bias: yes, no

Table 2
The basic CaffeNet architecture used in most experiments. Pad 1 – zero-padding on the image boundary with 1 pixel. Group 2 convolution – filters are split into 2 separate groups. The architecture is denoted in shorthand as 96C11/4 → MP3/2 → 192G2C5/2 → MP3/2 → 384G2C3 → 384C3 → 256G2C3 → MP3/2 → 2048C3 → 2048C1 → 1000C1.

input image: 128 × 128 px, random crop from 144 × N, random mirror
pre-process: out = 0.04 · (BGR − (104, 117, 124))
conv1: conv 11 × 11 × 96, stride 4; ReLU
pool1: max pool 3 × 3, stride 2
conv2: conv 5 × 5 × 192, stride 2, pad 1, group 2; ReLU
pool2: max pool 3 × 3, stride 2
conv3: conv 3 × 3 × 384, pad 1; ReLU
conv4: conv 3 × 3 × 384, pad 1, group 2; ReLU
conv5: conv 3 × 3 × 256, pad 1, group 2; ReLU
pool5: max pool 3 × 3, stride 2
fc6: fully-connected 4096; ReLU
drop6: dropout ratio 0.5
fc7: fully-connected 4096; ReLU
drop7: dropout ratio 0.5
fc8-clf: softmax-1000
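For readers who prefer code to shorthand, the following is a minimal PyTorch sketch of the CaffeNet128-FC2048 variant (fc6/fc7 thinned from 4096 to 2048 units, as used in the experiments). The authors' actual implementation is a Caffe prototxt in the benchmark repository (https://github.com/ducha-aiki/caffenet-benchmark); Caffe-style ceil-mode pooling and the lazily-sized fc6 layer are our paraphrase of Table 2, not the original code.

```python
import torch
import torch.nn as nn

# A sketch of CaffeNet128-FC2048 (Table 2, fc6/fc7 reduced to 2048).
# ceil_mode=True mimics Caffe's pooling output-size arithmetic.
class CaffeNet128FC2048(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True),              # conv1
            nn.MaxPool2d(3, stride=2, ceil_mode=True),                           # pool1
            nn.Conv2d(96, 192, 5, stride=2, padding=1, groups=2),                # conv2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, ceil_mode=True),                           # pool2
            nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(inplace=True),            # conv3
            nn.Conv2d(384, 384, 3, padding=1, groups=2), nn.ReLU(inplace=True),  # conv4
            nn.Conv2d(384, 256, 3, padding=1, groups=2), nn.ReLU(inplace=True),  # conv5
            nn.MaxPool2d(3, stride=2, ceil_mode=True),                           # pool5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048), nn.ReLU(inplace=True), nn.Dropout(0.5),   # fc6, drop6
            nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5), # fc7, drop7
            nn.Linear(2048, num_classes),                                  # fc8-clf
        )

    def forward(self, x):  # x: (B, 3, 128, 128), mean-subtracted and scaled
        return self.classifier(self.features(x))
```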
2. Evaluation
All tested networks were trained on the 1000 object category classification problem on the ImageNet dataset (Russakovsky et al., 2015). The set consists of a 1.2M image training set, a 50k image validation set and a 100k image test set. The test set is not used in the experiments. The commonly used pre-processing includes image rescaling to 256 × N, where N ≥ 256, and then cropping a random 224 × 224 square (Howard, 2013; Krizhevsky et al., 2012). The setup achieves good results in classification, but training a network of this size takes several days even on modern GPUs. We thus propose to limit the image size to 144 × N where N ≥ 128 (denoted as ImageNet-128px). For example, the CaffeNet (Jia et al., 2014) is trained within 24 h using an NVIDIA GTX980 on ImageNet-128px.¹

¹ https://www.github.com/ducha-aiki/caffenet-benchmark

2.1.1. Architectures

The input size reduction is validated by training CaffeNet, GoogLeNet and VGGNet on both the reduced and standard image sizes. The results are shown in Fig. 1. The reduction of the input image size leads to a consistent drop of around 6% in top-1 accuracy for all three popular architectures and does not change their relative order (VGGNet > GoogLeNet > CaffeNet) or accuracy difference.

In order to decrease the probability of overfitting and to make the experiments less demanding in memory, another change to CaffeNet is made: the number of filters in fully-connected layers 6 and 7 was reduced by a factor of two, from 4096 to 2048. The results validating the resolution reduction are presented in Fig. 1.

The parameters and architecture of the standard CaffeNet are shown in Table 2. For the experiments we used CaffeNet with 2× thinner fully-connected layers, named CaffeNet128-FC2048. The architecture can be denoted as 96C11/4 → MP3/2 → 192G2C5/2 → MP3/2 → 384G2C3 → 384C3 → 256G2C3 → MP3/2 → 2048C3 → 2048C1 → 1000C1. Here we use fully-convolutional notation for the fully-connected layers, which are equivalent when the image input size is fixed to 128 × 128 px. The default activation function is ReLU; it is placed after every convolution layer except the last 1000-way softmax classifier.

2.1.2. Learning

SGD with momentum 0.9 is used for learning; the initial learning rate is set to 0.01 and decreased by a factor of ten after every 100k iterations until learning stops after 320k iterations. The L2 weight decay for convolutional weights is set to 0.0005 and is not applied to biases. Dropout (Srivastava et al., 2014) with probability 0.5 is used before the two last layers. All the networks were initialized with LSUV (Mishkin and Matas, 2016).
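A PyTorch paraphrase of this solver configuration (the original is a Caffe solver file; the dummy forward materializes the lazy fc6 layer of the Section 2 sketch before the optimizer is built):

```python
import torch

# SGD, momentum 0.9, lr 0.01 dropped 10x every 100k iterations, 320k total;
# weight decay 0.0005 applied to weights only, not to biases.
model = CaffeNet128FC2048()          # defined in the Section 2 sketch
model(torch.zeros(1, 3, 128, 128))   # materialize the lazy fc6 layer

decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(p)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 5e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.1)

for it in range(320_000):
    ...  # forward, loss.backward(), optimizer.step(), optimizer.zero_grad()
    scheduler.step()  # one scheduler step per training iteration
```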
Table 3
Non-linearities tested (Name, Formula, Year).

For ELU, the network architecture used in the evaluation was designed specifically for the experiment and is not commonly used.

Fig. 2. Top-1 accuracy gain over ReLU in the CaffeNet-128 architecture. MaxS stands for "maxout, same complexity", MaxW for "maxout, same width", CSoftplus for centered softplus. The baseline, i.e. ReLU, accuracy is 47.1%.
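For concreteness, here are definitions of several of the tested non-linearities as they appear in Fig. 2. This is a sketch; the 0.33 slope for VLReLU and k = 2 pieces for maxout are our assumptions, not values taken from the paper:

```python
import torch

def relu(x):       return torch.clamp(x, min=0)            # max(0, x)
def vlrelu(x):     return torch.where(x > 0, x, 0.33 * x)  # "very leaky" ReLU
def elu(x, a=1.0): return torch.where(x > 0, x, a * (torch.exp(x) - 1))

def maxout(x, k=2):
    # Maxout over k linear pieces: split the channels of the preceding layer
    # into groups of k and take the element-wise maximum within each group.
    b, c, h, w = x.shape
    return x.view(b, c // k, k, h, w).max(dim=2).values
```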
Fig. 3. Top-1 accuracy gain over max pooling for the CaffeNet-128 architecture. Left – different pooling methods, right – different receptive field sizes. Stoch stands for stochastic pooling, "stoch no dropout" for a network with stochastic pooling and the drop6 and drop7 layers turned off.
Table 4
Poolings tested.

max: $y = \max_{i,j=1}^{h,w} x_{i,j}$ (1989)
average: $y = \frac{1}{hw}\sum_{i,j=1}^{h,w} x_{i,j}$ (1989)
stochastic: $y = x_{i,j}$ with prob. $x_{i,j} / \sum_{i,j=1}^{h,w} x_{i,j}$ (2013)
strided convolution: – (2014)
max + average: $y = \max_{i,j=1}^{h,w} x_{i,j} + \frac{1}{hw}\sum_{i,j=1}^{h,w} x_{i,j}$ (2015)

Table 5
Learning rate decay policies tested in the paper. $L_0$ – initial learning rate, $M$ – number of learning iterations, $i$ – current iteration, $S$ – step iteration, $\gamma$ – decay coefficient.

step: $lr = L_0\,\gamma^{\lfloor i/S \rfloor}$; S = 100k, γ = 0.1, M = 320k; accuracy 0.471
square: $lr = L_0\,(1 - i/M)^2$; M = 320k; accuracy 0.483
square root: $lr = L_0\,\sqrt{1 - i/M}$; M = 320k; accuracy 0.483
linear: $lr = L_0\,(1 - i/M)$; M = 320k; accuracy 0.493
linear: $lr = L_0\,(1 - i/M)$; M = 160k; accuracy 0.466
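The four policies of Table 5 as a single Python helper (parameter defaults follow the table):

```python
def learning_rate(policy, i, L0=0.01, M=320_000, S=100_000, gamma=0.1):
    """Learning rate at iteration i for the decay policies of Table 5."""
    if policy == "step":
        return L0 * gamma ** (i // S)
    if policy == "square":
        return L0 * (1 - i / M) ** 2
    if policy == "square root":
        return L0 * (1 - i / M) ** 0.5
    if policy == "linear":
        return L0 * (1 - i / M)
    raise ValueError(f"unknown policy: {policy}")
```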
Table 6
Top-1 accuracy on ImageNet-128px for different batch normalization placements. ReLU activation is used.

Network: No BN / BN before / BN after
CaffeNet128-FC2048: 0.471 / 0.478 / 0.499
GoogLeNet128: 0.619 / 0.603 / 0.596

... of feature maps. The most popular options are max pooling and average pooling. Among the recent advances are stochastic pooling (Zeiler and Fergus, 2013), LP-norm pooling (Gulcehre et al., 2013) and tree-gated pooling (Lee et al., 2015). Only the authors of the last paper have tested their pooling on ImageNet. The pooling receptive field is another design choice: Krizhevsky et al. (2012) claimed superiority of overlapping pooling with a 3 × 3 window and stride 2.
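The max+average variant from Table 4, which the conclusions recommend, as a small PyTorch module (the 3 × 3 kernel and stride 2 defaults mirror CaffeNet's pool layers; ceil_mode mimics Caffe):

```python
import torch.nn as nn

class MaxPlusAvgPool2d(nn.Module):
    """Sum of max and average pooling over the same window (Table 4, last row)."""
    def __init__(self, kernel_size=3, stride=2):
        super().__init__()
        self.mp = nn.MaxPool2d(kernel_size, stride, ceil_mode=True)
        self.ap = nn.AvgPool2d(kernel_size, stride, ceil_mode=True)

    def forward(self, x):
        return self.mp(x) + self.ap(x)
```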
Fig. 4. Left: learning rate decay policy, right: validation accuracy. The formulae for each policy are given in Table 5.

Fig. 5. Top-1 accuracy gain over ReLU without batch normalization (BN) in the CaffeNet-128 architecture. The baseline – ReLU – accuracy is 47.1%.

Fig. 6. Classifier design: Top-1 accuracy gain over the standard CaffeNet-128 architecture.
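The two BN placements compared in Table 6, sketched on a single convolution block (channel sizes are illustrative only, not taken from the paper):

```python
import torch.nn as nn

# BN before the non-linearity, as proposed by Ioffe and Szegedy (2015):
bn_before = nn.Sequential(nn.Conv2d(96, 192, 3, padding=1),
                          nn.BatchNorm2d(192),
                          nn.ReLU(inplace=True))

# BN after the non-linearity, the better-performing placement for
# CaffeNet128-FC2048 according to Table 6:
bn_after = nn.Sequential(nn.Conv2d(96, 192, 3, padding=1),
                         nn.ReLU(inplace=True),
                         nn.BatchNorm2d(192))
```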
Fig. 7. Batch size and initial learning rate impact on the accuracy.

Fig. 9. Input image size impact on the accuracy.
Fig. 10. Training dataset size and cleanliness impact on the accuracy.

The results are shown in Fig. 10, which clearly shows that a bigger (and more diverse) dataset brings an improvement. There is a minimum size below which performance quickly degrades. Less clean data outperforms more noisy data: a clean dataset with 400k images performs on par with a 1.2M image dataset of which 800k images are correctly labeled.

3.10. Bias in convolution layers

We conducted a simple experiment on the importance of the bias in the convolution and fully-connected layers. The first network is trained as usual; for the second, biases are initialized with zeros and the bias learning rate is set to zero. The network without biases shows 2.6% less accuracy than the default – see Table 7.

Table 7
Influence of the bias in convolution and fully-connected layers. Top-1 accuracy on ImageNet-128px.

With bias: 0.471
Without bias: 0.445

4. Best-of-all experiments

Fig. 11. Applying all improvements that do not change feature map size: linear learning rate decay policy, colorspace transformation "F", ELU non-linearity in convolution layers, maxout non-linearity in fully-connected layers, and a sum of average and max pooling.

5. Conclusions

We have systematically compared a set of recent CNN advances on a large-scale dataset. We have shown that benchmarking can be done at an affordable time and computation cost. A summary of our recommendations:

• use ELU non-linearity without batchnorm, or ReLU with it.
• apply a learned colorspace transformation of RGB.
• use the linear learning rate decay policy.
• use a sum of the average and max pooling layers.
• use a mini-batch size of around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
• use fully-connected layers as convolutional and average the predictions for the final decision (a sketch follows this list).
• when investing in increasing the training set size, check whether a plateau has not been reached.
• cleanliness of the data is more important than its size.
• if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
• if your network has a complex and highly optimized architecture, like e.g. GoogLeNet, be careful with modifications.
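A sketch of the fully-convolutional classifier recommendation under illustrative sizes: the classifier head is written as convolutions, so on a larger input it yields a grid of class-score maps whose spatial average is the final prediction. The 5 × 5 feature map stands in for a pool5 output on an enlarged image; channel counts follow the CaffeNet head:

```python
import torch
import torch.nn as nn

# fc6/fc7/fc8 rewritten as convolutions (2048C3 -> 2048C1 -> 1000C1).
head = nn.Sequential(
    nn.Conv2d(256, 2048, 3), nn.ReLU(inplace=True),   # fc6 as a 3x3 convolution
    nn.Conv2d(2048, 2048, 1), nn.ReLU(inplace=True),  # fc7 as a 1x1 convolution
    nn.Conv2d(2048, 1000, 1),                         # classifier as 1x1 convolution
)

feature_map = torch.randn(1, 256, 5, 5)  # pool5 output for a larger input (illustrative)
scores = head(feature_map)               # (1, 1000, 3, 3): a grid of predictions
logits = scores.mean(dim=(2, 3))         # average the predictions for the final decision
```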
Acknowledgements

Appendix A

A1. Image pre-processing

The commonly used input to a CNN is raw RGB pixels, and the commonly adopted recommendation is not to use any pre-processing. There has not been much research on the optimal colorspace or pre-processing techniques for CNNs. Fuad Rachmadi and Ketut Eddy Purnama (2015) explored different colorspaces for vehicle color identification, Dong et al. (2015) compared YCrCb and RGB channels for image super-resolution, and Graham (2015) extracted the local average color from retina images in the winning solution to the Kaggle competition.

The pre-processing experiment is divided into two parts. First, we tested popular handcrafted image pre-processing methods and colorspaces. Since all transformations were done on-the-fly, we first tested whether the calculation of the mean pixel and variance over the training set can be replaced with applying batch normalization to the input images. It decreases final accuracy by 0.3% and can be seen as the baseline for all other methods. We have tested HSV, YCrCb, Lab, RGB and single-channel grayscale colorspaces. Results are shown in Fig. A.12. The experiment confirms that RGB is the best-performing colorspace.
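A sketch of that batch-normalization baseline (our PyTorch paraphrase; CaffeNet128FC2048 is the class sketched in Section 2):

```python
import torch.nn as nn

# Input-BN baseline: per-channel normalization of the raw image computed
# on-the-fly, replacing fixed mean/variance pre-processing.
net = nn.Sequential(
    nn.BatchNorm2d(3),      # normalizes the three input channels per batch
    CaffeNet128FC2048(),    # main network, sketched in Section 2
)
```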
Fig. A.12. Left: performance of various colorspaces and pre-processing methods. Right: learned colorspace transformations. Parameters are given in Table A.9.
Table A.9
Mini-networks for learned colorspace transformations, placed after the image and before the conv1 layer. In all cases RGB means the scaled and centered input 0.04 · (Img − (104, 117, 124)).

Name: Architecture; Non-linearity; Accuracy
A: RGB → conv1×1×10 → conv1×1×3; tanh; 0.463
RGB: RGB; –; 0.471
B: RGB → conv1×1×3 → conv1×1×3; VLReLU; 0.480
C: RGB → conv1×1×10 → conv1×1×3 + RGB; VLReLU; 0.482
D: [RGB; log(RGB)] → conv1×1×10 → conv1×1×3; VLReLU; 0.482
E: RGB → conv1×1×16 → conv1×1×3; VLReLU; 0.483
F: RGB → conv1×1×10 → conv1×1×3; VLReLU; 0.485

The mini-network architectures are described in Table A.9. The learning process is joint with the main network and can be seen as extending the CaffeNet architecture with several 1×1 convolutions at the input. The best performing network gave a 1.4% absolute accuracy gain without a significant computational cost.
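A sketch of the best-performing variant "F" of Table A.9 as a PyTorch module, jointly trainable with the main network. The exact placement of the non-linearity (here after each 1×1 convolution) and the 0.33 slope used for VLReLU are our assumptions:

```python
import torch.nn as nn

# Variant "F": RGB -> conv1x1x10 -> conv1x1x3, VLReLU non-linearity,
# inserted between the input image and conv1 of the main network.
colorspace_f = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=1),   # conv1x1x10
    nn.LeakyReLU(0.33),                # VLReLU (slope is our assumption)
    nn.Conv2d(10, 3, kernel_size=1),   # conv1x1x3, back to 3 channels
    nn.LeakyReLU(0.33),
)

net = nn.Sequential(colorspace_f, CaffeNet128FC2048())  # trained end-to-end
```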
References

Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P. Learning activation functions to improve deep neural networks. arXiv:1412.6830.
Bengio, Y., 2012. Practical recommendations for gradient-based training of deep architectures. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 437–478.
Canziani, A., Paszke, A., Culurciello, E. An analysis of deep neural network models for practical applications. arXiv:1605.07678.
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., LeCun, Y. The loss surfaces of multilayer networks. arXiv:1412.0233.
Clevert, D.-A., Unterthiner, T., Hochreiter, S. Fast and accurate deep network learning by Exponential Linear Units (ELUs). arXiv:1511.07289.
Dai, J., He, K., Sun, J. Instance-aware semantic segmentation via multi-task network cascades. arXiv:1512.04412.
Dong, C., Change Loy, C., He, K., Tang, X. Image super-resolution using deep convolutional networks. arXiv:1501.00092.
Everingham, M., Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vision (IJCV) 88 (2), 303–338. doi:10.1007/s11263-009-0275-4.
Fuad Rachmadi, R., Ketut Eddy Purnama, I. Vehicle color recognition using convolutional neural network. arXiv:1510.07391.
Gao, T., Jojic, V., 2016. Degrees of freedom in deep neural networks. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16. AUAI Press, Arlington, Virginia, United States, pp. 232–241.
Glorot, X., Bordes, A., Bengio, Y., 2011. Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), Vol. 15, pp. 315–323.
Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A.C., Bengio, Y., 2013. Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, pp. 1319–1327.
Graham, B., 2014. Train you very own deep convolutional network. https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network.
Graham, B., 2015. Kaggle Diabetic Retinopathy Detection Competition Report. Tech. rep.
Gulcehre, C., Cho, K., Pascanu, R., Bengio, Y. Learned-norm pooling for deep feedforward and recurrent neural networks. arXiv:1311.1780.
He, K., Zhang, X., Ren, S., Sun, J., 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. ECCV. arXiv:1406.4729.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: surpassing human-level performance on imagenet classification. CVPR. arXiv:1502.01852.
He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition. CVPR. arXiv:1512.03385.
He, K., Zhang, X., Ren, S., Sun, J., 2016b. Identity mappings in deep residual networks. ECCV. arXiv:1603.05027.
Howard, A.G. Some improvements on deep convolutional neural network based image classification. arXiv:1312.5402.
Hummel, R., 1977. Image enhancement by histogram transformation. Comput. Graphics Image Process. 6 (2), 184–195. doi:10.1016/S0146-664X(77)80011-7.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). JMLR Workshop and Conference Proceedings, pp. 448–456.
Ithapu, V.K., Ravi, S.N., Singh, V. On architectural choices in deep learning: from network structure to gradient convergence and parameter estimation. arXiv:1702.08670.
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. Caffe: convolutional architecture for fast feature embedding. arXiv:1408.5093.
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, Vol. 25. Curran Associates, Inc., pp. 1097–1105.
Larsson, G., Maire, M., Shakhnarovich, G. Fractalnet: ultra-deep neural networks without residuals. arXiv:1605.07648.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998a. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, Vol. 86, pp. 2278–2324.
LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R., 1998b. Efficient backprop. In: Neural Networks: Tricks of the Trade. Springer-Verlag, London, UK, pp. 9–50.
Lee, C.-Y., Gallagher, P.W., Tu, Z. Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. arXiv:1509.08985.
Li, H., Ouyang, W., Wang, X. Multi-bias non-linear activation in deep neural networks. arXiv:1604.00676.
Lin, M., Chen, Q., Yan, S. Network in network. arXiv:1312.4400.
Luo, W., Li, Y., Urtasun, R., Zemel, R., 2016. Understanding the effective receptive field in deep convolutional neural networks. In: Advances in Neural Information Processing Systems 29. Curran Associates, Inc., pp. 4898–4906.
Maas, A.L., Hannun, A.Y., Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML, Vol. 30.
Mishkin, D., Matas, J., 2016. All you need is a good init. In: Proceedings of ICLR.
Nam, H., Han, B. Learning multi-domain convolutional neural networks for visual tracking. arXiv:1510.07945.
Radford, A., Metz, L., Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.
Ren, S., He, K., Girshick, R., Zhang, X., Sun, J. Object detection networks on convolutional feature maps. arXiv:1504.06066.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning internal representations by error propagation. Vol. 1. MIT Press, Cambridge, MA, USA, pp. 318–362.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision (IJCV) 1–42. doi:10.1007/s11263-015-0816-y.
Schmidhuber, J., 1997. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Netw. 10 (5), 857–873. doi:10.1016/S0893-6080(96)00127-X.
Schroff, F., Kalenichenko, D., Philbin, J., 2015. Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale visual recognition. In: Proceedings of ICLR.
Soudry, D., Carmon, Y. No bad local minima: data independent training error guarantees for multilayer neural networks. arXiv:1605.08361.
Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M., 2014. Striving for simplicity: the all convolutional net. In: Proceedings of ICLR Workshop. arXiv:1412.6806.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Swietojanski, P., Li, J., Huang, J.-T., 2014. Investigation of maxout networks for speech recognition. In: Proceedings of ICASSP.
Szegedy, C., Ioffe, S., Vanhoucke, V. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: Proceedings of CVPR 2015.
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
Tolias, G., Sicre, R., Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. arXiv:1511.05879.
Wilson, D.R., Martinez, T.R., 2003. The general inefficiency of batch training for gradient descent learning. Neural Netw. 16 (10), 1429–1451. doi:10.1016/S0893-6080(03)00138-2.
Xie, L., Wang, J., Wei, Z., Wang, M., Tian, Q. Disturblabel: regularizing CNN on the loss layer. arXiv:1605.00055.
Xu, B., Wang, N., Chen, T., Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853.
Zagoruyko, S., Komodakis, N. Wide residual networks. arXiv:1605.07146.
Žbontar, J., LeCun, Y. Computing the stereo matching cost with a convolutional neural network. arXiv:1409.4326.
Zeiler, M.D., Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. arXiv:1301.3557.
Zoph, B., Le, Q.V. Neural architecture search with reinforcement learning. arXiv:1611.01578.
Zuiderveld, K., 1994. Contrast limited adaptive histogram equalization. In: Graphics Gems IV. Academic Press Professional, Inc., San Diego, CA, USA, pp. 474–485.