
Computer Vision and Image Understanding 000 (2017) 1–9


Systematic evaluation of convolution neural network advances on the Imagenet

Dmytro Mishkin a,∗, Nikolay Sergievskiy b, Jiri Matas a

a Center for Machine Perception, Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo namesti 13, Prague 2, 12135, Czech Republic
b ELVEES NeoTek, Proyezd 4922, 4 build. 2, Zelenograd, Moscow, 124498, Russian Federation

Article info

Article history: Received 7 June 2016; Revised 11 March 2017; Accepted 11 May 2017; Available online xxx

Keywords: CNN, Benchmark, Non-linearity, Pooling, ImageNet

Abstract

The paper systematically studies the impact of a range of recent advances in convolution neural network (CNN) architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following architectural choices: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP), image pre-processing, and of learning parameters: learning rate, batch size, cleanliness of the data, etc.

The performance gains of the proposed modifications are first tested individually and then in combination. The sum of individual gains is greater than the observed improvement when all modifications are introduced, but the "deficit" is small, suggesting independence of their benefits.

We show that the use of 128 × 128 pixel images is sufficient to make qualitative conclusions about optimal network structure that hold for the full size Caffe and VGG nets. The results are obtained an order of magnitude faster than with the standard 224 pixel images.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

Deep convolution networks have become the mainstream method for solving various computer vision tasks, such as image classification (Russakovsky et al., 2015), object detection (Everingham et al., 2010; Russakovsky et al., 2015), semantic segmentation (Dai et al., 2016), image retrieval (Tolias et al., 2016), tracking (Nam and Han, 2015), text detection (Jaderberg et al., 2014), stereo matching (Žbontar and LeCun, 2014), and many others. Besides two classic works on training neural networks – LeCun et al. (1998b) and Bengio (2012) – which are still highly relevant, there is very little guidance or theory on the plethora of design choices and hyper-parameter settings of CNNs, with the consequence that researchers proceed by trial-and-error experimentation and architecture copying, sticking to established net types. With good results in the ImageNet competition, AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2015) and GoogLeNet (Inception) (Szegedy et al., 2015) have become the de-facto standard.

Theory-grounded recommendations for the selection of the number of neurons (Ithapu et al., 2017; Schmidhuber, 1997), network depth (Gao and Jojic, 2016), effective receptive field size (Luo et al., 2016), etc. have been published. The topic of local minima in deep network optimization is well covered by Choromanska et al. (2014) and by Soudry and Carmon (2016). However, the latest state-of-the-art results have been achieved by hand-crafted architectures (Zagoruyko and Komodakis, 2016) or by large-scale "trial-and-error" reinforcement learning search (Zoph and Le, 2016).

Improvements of many components of the CNN architecture, like the non-linearity type, pooling, structure and learning, have been recently proposed. First applied in the ILSVRC (Russakovsky et al., 2015) competition, they have been adopted in different research areas.

The contributions of the recent CNN improvements and their interaction have not been systematically evaluated. We survey the recent developments and perform a large scale experimental study that considers the choice of non-linearity, pooling, learning rate policy, classifier design, network width and batch normalization (Ioffe and Szegedy, 2015). We did not include ResNets (He et al., 2016a) – a recent development achieving excellent results – since they have been well covered in papers (He et al., 2016b; Larsson et al., 2016; Szegedy et al., 2016; Zagoruyko and Komodakis, 2016).

There are three main contributions of the paper. First, we survey and present baseline results for a wide variety of architectures and design choices, both individually and in combination. Based on a large-scale evaluation, we provide novel recommendations and insights into deep convolutional network structure. Second, we show that for popular architectures – AlexNet, GoogLeNet, VGGNet – the recommendations based on results obtained on small images hold for the common image size of 224 × 224 or even 300 × 300 pixels, which allows very fast testing. Last, but not least, the benchmark is fully reproducible and all scripts and data are available online at https://www.github.com/ducha-aiki/caffenet-benchmark.

∗ Corresponding author. E-mail address: ducha.aiki@gmail.com (D. Mishkin).

http://dx.doi.org/10.1016/j.cviu.2017.05.007
1077-3142/© 2017 Elsevier Inc. All rights reserved.


The paper is structured as follows. In Section 2.1, we explain and validate the experiment design. In Section 3, the influence of a range of hyper-parameters is evaluated in isolation. The related literature is reviewed in the corresponding experiment sections. Section 4 is devoted to the combination of the best hyper-parameter settings and to "squeezing-the-last-percentage-points" for a given architecture recommendation. The paper is concluded in Section 5.

2. Evaluation

Standard CaffeNet parameters and architecture are shown in Table 2. The full list of tested attributes is given in Table 1.

Table 1
List of hyper-parameters tested.

Non-linearity: linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout, APL, combination
Batch Normalization (BN): before non-linearity, after non-linearity
BN + non-linearity: linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout
Pooling: max, average, stochastic, max+average, strided convolution
Pooling window size: 3 × 3, 2 × 2, 3 × 3 with zero-padding
Learning rate decay policy: step, square, square root, linear
Colorspace & pre-processing: RGB, HSV, YCrCb, grayscale, learned, CLAHE, histogram equalized
Classifier design: pooling-FC-FC-clf, SPP-FC-FC-clf, pooling-conv-conv-clf-avepool, pooling-conv-conv-avepool-clf
Network width: 1/4, 1/(2√2), 1/2, 1/√2, 1, √2, 2, 2√2, 4, 4√2
Input image size: 64, 96, 128, 180, 224
Dataset size: 200k, 400k, 600k, 800k, 1200k (full)
Batch size: 1, 32, 64, 128, 256, 512, 1024
Percentage of noisy data: 0%, 5%, 10%, 15%, 32%
Using bias: yes/no

Table 2
The basic CaffeNet architecture used in most experiments. Pad 1 – zero-padding on the image boundary with 1 pixel. Group 2 convolution – filters are split into 2 separate groups. The architecture is denoted in "shorthand" as 96C11/4 → MP3/2 → 192G2C5/2 → MP3/2 → 384G2C3 → 384C3 → 256G2C3 → MP3/2 → 2048C3 → 2048C1 → 1000C1.

input image: 128 × 128 px, random crop from 144 × N, random mirror
pre-process: out = 0.04 * (BGR - (104, 117, 124))
conv1: conv 11 × 11 × 96, stride 4; ReLU
pool1: max pool 3 × 3, stride 2
conv2: conv 5 × 5 × 192, stride 2, pad 1, group 2; ReLU
pool2: max pool 3 × 3, stride 2
conv3: conv 3 × 3 × 384, pad 1; ReLU
conv4: conv 3 × 3 × 384, pad 1, group 2; ReLU
conv5: conv 3 × 3 × 256, pad 1, group 2; ReLU
pool5: max pool 3 × 3, stride 2
fc6: fully-connected 4096; ReLU
drop6: dropout ratio 0.5
fc7: fully-connected 4096; ReLU
drop7: dropout ratio 0.5
fc8-clf: softmax-1000

2.1. Evaluation framework

All tested networks were trained on the 1000 object category classification problem on the ImageNet dataset (Russakovsky et al., 2015). The set consists of a 1.2M image training set, a 50k image validation set and a 100k image test set. The test set is not used in the experiments. The commonly used pre-processing includes image rescaling to 256 × N, where N ≥ 256, and then cropping a random 224 × 224 square (Howard, 2013; Krizhevsky et al., 2012). The setup achieves good results in classification, but training a network of this size takes several days even on modern GPUs. We thus propose to limit the image size to 144 × N where N ≥ 128 (denoted as ImageNet-128px). For example, the CaffeNet (Jia et al., 2014) is trained within 24 h using an NVIDIA GTX980 on ImageNet-128px.

Fig. 1. Impact of image and network size on top-1 accuracy.

2.1.1. Architectures
The input size reduction is validated by training CaffeNet, GoogLeNet and VGGNet on both the reduced and standard image sizes. The results are shown in Fig. 1. The reduction of the input image size leads to a consistent drop of around 6% in top-1 accuracy for all three popular architectures and does not change their relative order (VGGNet > GoogLeNet > CaffeNet) or accuracy difference.

In order to decrease the probability of overfitting and to make the experiments less demanding in memory, another change of CaffeNet is made. The number of filters in fully-connected layers 6 and 7 was reduced by a factor of two, from 4096 to 2048. The results validating the resolution reduction are presented in Fig. 1.

The parameters and architecture of the standard CaffeNet are shown in Table 2. For the experiments we used CaffeNet with 2× thinner fully-connected layers, named CaffeNet128-FC2048. The architecture can be denoted as 96C11/4 → MP3/2 → 192G2C5/2 → MP3/2 → 384G2C3 → 384C3 → 256G2C3 → MP3/2 → 2048C3 → 2048C1 → 1000C1. Here we used a fully-convolutional notation for the fully-connected layers, which is equivalent when the image input size is fixed to 128 × 128 px. The default activation function is ReLU and it is put after every convolution layer, except the last 1000-way softmax classifier.
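The shorthand can be written out explicitly. Below is a rough PyTorch sketch of CaffeNet128-FC2048 – an assumption on our part, since the benchmark itself is implemented in Caffe. Layer hyper-parameters follow Table 2, pooling uses ceil rounding to mimic Caffe, and LazyLinear lets the classifier adapt to whatever spatial size pool5 produces under these conventions.

```python
# Hypothetical PyTorch re-implementation of CaffeNet128-FC2048 (illustration of Table 2 only).
import torch
import torch.nn as nn

class CaffeNet128FC2048(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),              # conv1: 96C11/4
            nn.MaxPool2d(3, stride=2, ceil_mode=True),                                      # pool1: MP3/2
            nn.Conv2d(96, 192, kernel_size=5, stride=2, padding=1, groups=2),               # conv2: 192G2C5/2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, ceil_mode=True),                                      # pool2: MP3/2
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),           # conv3: 384C3
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2), nn.ReLU(inplace=True), # conv4: 384G2C3
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2), nn.ReLU(inplace=True), # conv5: 256G2C3
            nn.MaxPool2d(3, stride=2, ceil_mode=True),                                      # pool5: MP3/2
        )
        # fc6/fc7 thinned to 2048 units; LazyLinear infers the flattened pool5 size at the first call.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048), nn.ReLU(inplace=True), nn.Dropout(0.5),    # fc6 + drop6
            nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc7 + drop7
            nn.Linear(2048, num_classes),                                   # fc8-clf (softmax folded into the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

if __name__ == "__main__":
    net = CaffeNet128FC2048()
    print(net(torch.randn(2, 3, 128, 128)).shape)  # torch.Size([2, 1000])
```

Using LazyLinear sidesteps the exact pool5 spatial size, which depends on the padding and rounding conventions of the framework at hand.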


2.1.2. Learning
SGD with momentum 0.9 is used for learning; the initial learning rate is set to 0.01, decreased by a factor of ten after every 100k iterations until learning stops after 320k iterations. The L2 weight decay for convolutional weights is set to 0.0005 and it is not applied to biases. Dropout (Srivastava et al., 2014) with probability 0.5 is used before the two last layers. All the networks were initialized with LSUV (Mishkin and Matas, 2016). Biases are initialized to zero. Since the LSUV initialization works under the assumption of preserving unit variance of the input, pixel intensities were scaled by 0.04, after subtracting the mean of the BGR pixel values (104, 117, 124).
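A minimal sketch of this input normalization and optimizer configuration, assuming a PyTorch-style training setup rather than the paper's Caffe solver files (net stands for any model, e.g. the CaffeNet128FC2048 sketch above):

```python
# Input scaling and SGD setup described in Section 2.1.2 (illustrative only).
import torch

BGR_MEAN = torch.tensor([104.0, 117.0, 124.0]).view(1, 3, 1, 1)

def preprocess(batch_bgr):
    """batch_bgr: float tensor of shape (N, 3, H, W) with raw 0-255 BGR pixels."""
    return 0.04 * (batch_bgr - BGR_MEAN)  # subtract the BGR mean, then scale by 0.04

def make_optimizer(net):
    # Weight decay 5e-4 on weights only; biases are excluded, as in the paper.
    decay, no_decay = [], []
    for name, p in net.named_parameters():
        (no_decay if name.endswith("bias") else decay).append(p)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": 5e-4},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=0.01, momentum=0.9)
```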
3. Single experiments

This section is devoted to the experiments with a single hyper-parameter or design choice per experiment.

3.1. Activation functions

3.1.1. Previous work
The activation functions for neural networks are a hot topic; many functions have been proposed since the ReLU discovery (Glorot et al., 2011). The first group are related to ReLU, i.e. LeakyReLU (Maas et al., 2013) and Very Leaky ReLU (Graham, 2014), RReLU (Xu et al., 2015), PReLU (He et al., 2015) and its generalized version APL (Agostinelli et al., 2014), and ELU (Clevert et al., 2016). Others are based on different ideas, e.g. maxout (Goodfellow et al., 2013), MBA (Li et al., 2016), etc. However, to the best of our knowledge, only a small fraction of these activation functions have been evaluated on an ImageNet-scale dataset. And when they have, e.g. ELU, the network architecture used in the evaluation was designed specifically for the experiment and is not commonly used.

Table 3
Non-linearities tested.

Name: Formula (Year)
none: y = x
sigmoid: y = 1/(1 + e^(-x)) (1986)
tanh: y = (e^(2x) - 1)/(e^(2x) + 1) (1986)
ReLU: y = max(x, 0) (2010)
(centered) SoftPlus: y = ln(e^x + 1) - ln 2 (2011)
LReLU: y = max(x, αx), α ≈ 0.01 (2011)
maxout: y = max(W1·x + b1, W2·x + b2) (2013)
APL: y = max(x, 0) + Σ_{s=1..S} a_i^s · max(0, -x + b_i^s) (2014)
VLReLU: y = max(x, αx), α ∈ (0.1, 0.5) (2014)
RReLU: y = max(x, αx), α = random(0.1, 0.5) (2015)
PReLU: y = max(x, αx), α is learnable (2015)
ELU: y = x if x ≥ 0, else α(e^x - 1) (2015)

3.1.2. Experiment
We have tested the most popular activation functions and all those with available or trivial implementations: ReLU, tanh, sigmoid, VLReLU, RReLU, PReLU, ELU, linear, maxout, APL, SoftPlus. Formulas and references are given in Table 3. We have selected APL and maxout with two linear pieces. Maxout is tested in two modifications: MaxW – having the same effective network width, which doubles the number of parameters and computation costs because of the two linear pieces, and MaxS – having the same computational complexity, with each piece √2 thinner. Besides this, we have tested the "optimally scaled" tanh, proposed by LeCun et al. (1998b). We have also tried to train a sigmoid (Rumelhart et al., 1986) network, but the initial loss never decreased. Finally, as proposed by Swietojanski et al. (2014), we have tested the combination of ReLU for the convolutional layers and maxout for the fully connected layers of the network.

Results are shown in Fig. 2. The best performing single activation function similar in complexity to ReLU is ELU. The parametric PReLU performed on par. The performance of the centered softplus is the same as for ELU. Surprisingly, Very Leaky ReLU, popular for DCGAN networks (Radford et al., 2015) and for small datasets, does not outperform vanilla ReLU. Interestingly, the network with no non-linearity has respectable performance – 38.9% top-1 accuracy on ImageNet, not much worse than the tanh network.

The Swietojanski et al. (2014) hypothesis about maxout power in the final layers is confirmed, and the combined ELU (after convolutional layers) + maxout (after fully connected layers) shows the best performance among non-linearities with a speed close to ReLU. Wide maxout outperforms the rest of the competitors at a higher computational cost.

Fig. 2. Top-1 accuracy gain over ReLU in the CaffeNet-128 architecture. MaxS stands for "maxout, same complexity", MaxW – maxout, same width, CSoftplus – centered softplus. The baseline, i.e. ReLU, accuracy is 47.1%.
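To make the formulas in Table 3 concrete, a short NumPy sketch of a few of the non-linearities is given below (a hypothetical helper, not part of the benchmark code; the parameter values follow Table 3):

```python
# Non-linearities from Table 3, written with NumPy for reference.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):           # LReLU; VLReLU uses a much larger slope from (0.1, 0.5)
    return np.maximum(x, alpha * x)

def elu(x, alpha=1.0):                   # identity for x >= 0, alpha*(e^x - 1) otherwise
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def centered_softplus(x):                # SoftPlus shifted so that f(0) = 0
    return np.log1p(np.exp(x)) - np.log(2.0)

def maxout(x, w1, b1, w2, b2):           # two-piece maxout on a feature vector x
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

x = np.linspace(-3, 3, 7)
print(elu(x))
print(centered_softplus(x))
```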


3.2. Pooling

3.2.1. Previous work
Pooling, combined with striding, is a common way to achieve a degree of invariance together with a reduction of the spatial size of feature maps. The most popular options are max pooling and average pooling. Among the recent advances are: stochastic pooling (Zeiler and Fergus, 2013), LP-Norm pooling (Gulcehre et al., 2013) and Tree-Gated pooling (Lee et al., 2015). Only the authors of the last paper have tested their pooling on ImageNet.

The pooling receptive field is another design choice. Krizhevsky et al. (2012) claimed the superiority of overlapping pooling with a 3 × 3 window size and stride 2, while VGGNet (Simonyan and Zisserman, 2015) uses a non-overlapping 2 × 2 window.

Table 4
Poolings tested.

Name: Formula (Year)
max: y = max_{i,j=1..h,w} x_{i,j} (1989)
average: y = (1/hw) Σ_{i,j=1..h,w} x_{i,j} (1989)
stochastic: y = x_{i,j} with probability x_{i,j} / Σ_{i,j=1..h,w} x_{i,j} (2013)
strided convolution: – (2014)
max + average: y = max_{i,j=1..h,w} x_{i,j} + (1/hw) Σ_{i,j=1..h,w} x_{i,j} (2015)

3.2.2. Experiment
We tested (see Table 4) average, max, stochastic, the sum of average and max pooling proposed by Lee et al. (2015), and skipping pooling altogether, replacing it with strided convolutions as proposed by Springenberg et al. (2014). We also tried Tree and Gated poolings (Lee et al., 2015), but we encountered convergence problems and the results depended strongly on the input image size. We do not know if it is a problem of the implementation or of the method itself, and therefore omitted the results.

The results are shown in Fig. 3, left. Stochastic pooling had very bad results. In order to check if this was due to the extreme randomization by the stochastic pooling and dropout, we trained a network without the dropout. This decreased accuracy even more. The best results were obtained by a combination of max and average pooling. Our guess is that max pooling brings selectivity and invariance, while average pooling allows using the gradients of all filters, instead of throwing away 3/4 of the information as done by non-overlapping 2 × 2 max pooling.

The second experiment is about the receptive field size. The results are shown in Fig. 3, right. Overlapping pooling is inferior to a non-overlapping 2 × 2 window, but wins if zero-padding is used. This can be explained by the fact that better results are obtained for larger outputs; 3 × 3/2 pooling leads to a 3 × 3 spatial size of the pool5 feature map, 2 × 2/2 leads to a 4 × 4 pool5, while 3 × 3/2 with padding 1 leads to 5 × 5. This observation means there is a speed–performance trade-off.

Fig. 3. Top-1 accuracy gain over max pooling for the CaffeNet-128 architecture. Left – different pooling methods, right – different receptive field sizes. Stoch stands for stochastic pooling, "stoch no dropout" – for a network with stochastic pooling and turned off drop6 and drop7 layers.
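A sketch of the best-performing variant, the element-wise sum of max and average pooling, assuming PyTorch modules (the benchmark itself uses a Caffe layer for this):

```python
# Sum of max and average pooling over the same 3x3/2 window (Section 3.2.2).
import torch
import torch.nn as nn

class MaxPlusAvgPool2d(nn.Module):
    def __init__(self, kernel_size=3, stride=2):
        super().__init__()
        self.max_pool = nn.MaxPool2d(kernel_size, stride=stride, ceil_mode=True)
        self.avg_pool = nn.AvgPool2d(kernel_size, stride=stride, ceil_mode=True)

    def forward(self, x):
        # Both branches see the same input and produce equally sized maps,
        # so their outputs can simply be added element-wise.
        return self.max_pool(x) + self.avg_pool(x)

pool = MaxPlusAvgPool2d()
print(pool(torch.randn(1, 96, 30, 30)).shape)  # torch.Size([1, 96, 15, 15])
```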
3.3. Learning rate policy

The learning rate is one of the most important hyper-parameters influencing the final CNN performance. Surprisingly, the most commonly used learning rate decay policy is "reduce the learning rate 10x when the validation error stops decreasing", adopted with no parameter search. While this works well in practice, such a policy can be sub-optimal. We have tested four learning rate policies: step, quadratic and square root decay (used for training GoogLeNet by BVLC (Jia et al., 2014)), and linear decay, which are shown in Table 5. The actual learning rate dynamics are shown in Fig. 4, left. The validation accuracy is shown on the right. Linear decay gives the best results.

Table 5
Learning rate decay policies tested in the paper. L0 – initial learning rate, M – number of learning iterations, i – current iteration, S – step iteration, γ – decay coefficient.

Name: Formula; Parameters; Accuracy
step: lr = L0 · γ^floor(i/S); S = 100k, γ = 0.1, M = 320k; 0.471
square: lr = L0 · (1 - i/M)^2; M = 320k; 0.483
square root: lr = L0 · sqrt(1 - i/M); M = 320k; 0.483
linear: lr = L0 · (1 - i/M); M = 320k; 0.493
linear: lr = L0 · (1 - i/M); M = 160k; 0.466

Fig. 4. Left: learning rate decay policy, right: validation accuracy. The formulae for each policy are given in Table 5.

3.4. Batch normalization

Batch normalization (BN) (Ioffe and Szegedy, 2015) is a recent method that solves the gradient exploding/vanishing problem and guarantees a near-optimal learning regime for the layer following the batch-normalized one. Following Mishkin and Matas (2016), we first tested the different options of where to put BN – before or after the non-linearity. The results presented in Table 6 are surprisingly inconsistent: the CaffeNet architecture prefers Conv-ReLU-BN-Conv, while GoogLeNet prefers the Conv-BN-ReLU-Conv placement. Moreover, the results for GoogLeNet are inferior to the plain network. The difference with respect to Ioffe and Szegedy (2015) is that we have not changed any other parameters except using BN, while in the original paper the authors decreased regularization (both weight decay and dropout), changed the learning rate decay policy and applied an additional training set re-shuffling. Also, GoogLeNet behavior seems different to CaffeNet and VGGNet w.r.t. other modifications, see Section 4.

Table 6
Top-1 accuracy on ImageNet-128px, batch normalization placement. ReLU activation is used.

Network: No BN; BN before; BN after
CaffeNet128-FC2048: 0.471; 0.478; 0.499
GoogLeNet128: 0.619; 0.603; 0.596

For the next experiment with BN and activations, we selected placement after the non-linearity. Results are shown in Fig. 5. Batch normalization washes out the differences between the ReLU-family variants, so there is no need to use the more complex variants. Sigmoid with BN outperforms ReLU without it, but, surprisingly, tanh with BN shows worse accuracy than sigmoid with BN.

Fig. 5. Top-1 accuracy gain over ReLU without batch normalization (BN) in the CaffeNet-128 architecture. The baseline – ReLU – accuracy is 47.1%.
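The two placements compared in Table 6 differ only in the order of the BatchNorm and activation modules; a PyTorch-style sketch (our own illustration, not the benchmark code):

```python
# Conv-BN-ReLU vs. Conv-ReLU-BN building blocks (Section 3.4).
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, **conv_kwargs):
    """BN placed before the non-linearity (the placement of Ioffe and Szegedy, 2015)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, bias=False, **conv_kwargs),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def conv_relu_bn(in_ch, out_ch, **conv_kwargs):
    """BN placed after the non-linearity (the better option for CaffeNet in Table 6)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, bias=False, **conv_kwargs),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )

x = torch.randn(4, 96, 15, 15)
print(conv_bn_relu(96, 192, kernel_size=3, padding=1)(x).shape)
print(conv_relu_bn(96, 192, kernel_size=3, padding=1)(x).shape)
```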
3.5. Classifier design

3.5.1. Previous work
The CNN architecture can be seen as an integration of a feature detector, which is followed by a classifier. Ren et al. (2015) proposed to consider the convolutional layers of AlexNet as a feature extractor and the fully-connected layers as a 2-layer MLP classifier. They argued that 2 fully-connected layers are not the optimal design and explored various architectures instead. But they considered only pre-trained CNNs or HOGs as feature extractors, and so explored mostly the transfer learning scenario, where most of the network weights are frozen. Also, they explored architectures with additional convolution layers, which can be seen not as a better classifier, but as an enhancement of the feature extractor.

There are three most popular approaches to classifier design. First – the final layer of the feature extractor is a max pooling layer and the classifier is a one or two layer MLP, as is done in LeNet (LeCun et al., 1998a), AlexNet (Krizhevsky et al., 2012) and VGGNet (Simonyan and Zisserman, 2015). Second – a spatial pooling pyramid layer (He et al., 2014) instead of the pooling layer, followed by a two layer MLP. And the third architecture consists of an average pooling layer, squashing the spatial dimensions, followed by a softmax classifier without any feature transform. This variant is used in GoogLeNet (Szegedy et al., 2015) and ResNet (He et al., 2016a).

3.5.2. Experiment
We have explored the following variants: the default 2-layer MLP, SPPNet with 2 and 3 pyramid levels, removing the pool5 layer, treating the fully-connected layers as convolutional – which allows the use of zero-padding and therefore increases the effective number of training examples for this layer – averaging features before the softmax layer, or averaging the spatial predictions of the softmax layer (Lin et al., 2013). The results are shown in Fig. 6. The best results are obtained in the fully convolutional setup with zero padding, followed by global average pooling applied to the classifier predictions. The advantage of the SPP over standard max pooling is less pronounced.

Fig. 6. Classifier design: Top-1 accuracy gain over the standard CaffeNet-128 architecture.
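A sketch of the winning variant – fully-convolutional "fc" layers with zero padding, followed by global average pooling of the per-location class predictions – again written as an illustrative PyTorch head rather than the paper's Caffe definition:

```python
# Fully-convolutional classifier with global average pooling of predictions (Section 3.5.2).
import torch
import torch.nn as nn

def conv_clf_head(in_ch=256, width=2048, num_classes=1000):
    return nn.Sequential(
        nn.Conv2d(in_ch, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # "fc6" as a padded conv
        nn.Dropout(0.5),
        nn.Conv2d(width, width, kernel_size=1), nn.ReLU(inplace=True),             # "fc7" as a 1x1 conv
        nn.Dropout(0.5),
        nn.Conv2d(width, num_classes, kernel_size=1),   # per-location class scores
        nn.AdaptiveAvgPool2d(1),                        # average the spatial predictions
        nn.Flatten(),                                   # (N, num_classes) logits
    )

features = torch.randn(2, 256, 3, 3)    # e.g. a pool5 map of a 128px network
print(conv_clf_head()(features).shape)  # torch.Size([2, 1000])
```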


Fig. 7. Batch size and initial learning rate impact on the accuracy.

3.6. Batch size and learning rate

The mini-batch size is always a trade-off. Early work by Wilson and Martinez (2003) showed that a mini-batch size of one achieved the highest accuracy and smallest generalization error. On the other hand, GPU architectures are inefficient for small batch sizes.

Here we explore the influence of the mini-batch size on the final accuracy. Experiments show that keeping a constant learning rate for different mini-batch sizes has a negative impact on performance. We also tested the heuristic proposed by Krizhevsky (2014) of changing the learning rate proportionally to the mini-batch size, so that their ratio stays constant. Results are shown in Fig. 7. The heuristic works, but large (512 and more) mini-batch sizes lead to a quite significant decrease in performance. On the other extreme, online training (a mini-batch with a single example) does not bring accuracy gains over 64 or 256, but significantly slows down the training wall-clock time.

3.7. Network width

All the advances in the ImageNet competition so far were caused by architectural improvements. To the best of our knowledge, there is no study of the network width – final accuracy dependence. Canziani et al. (2016) did a comparative analysis of the ImageNet winners in terms of accuracy, number of parameters and computational complexity, but it is a comparison of different architectures. In this subsection we evaluate how far one can get by increasing the CaffeNet width, with no other changes. The results are shown in Fig. 8. The original architecture is close to optimal in the accuracy-per-FLOPS sense: a decrease in the number of filters leads to a quick and significant accuracy drop, while making the network thicker brings gains, but they saturate quickly. Making the network thicker by more than 3 times leads to a very limited accuracy gain.

Fig. 8. Network width impact on the accuracy.
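The tested heuristic, re-stated as code (a hypothetical helper of ours; the reference point of batch size 256 with learning rate 0.01 follows Section 2.1.2, and the resulting values match the batch-size rows of Table A.8):

```python
# Scale the initial learning rate with the mini-batch size (Section 3.6).
def scaled_learning_rate(batch_size, ref_batch_size=256, ref_lr=0.01):
    """Keep lr / batch_size constant relative to the 256 / 0.01 reference setting."""
    return ref_lr * batch_size / ref_batch_size

for bs in (1, 32, 64, 128, 256, 512, 1024):
    print(bs, scaled_learning_rate(bs))
# e.g. batch size 1024 -> lr 0.04, batch size 64 -> lr 0.0025, batch size 1 -> lr ~0.000039
```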
3.8. Input image size

The input image size plays a very important role, as it brings additional information and training samples for the convolution filters. Our initial experiment, shown in Fig. 1, indicates that CaffeNet trained on 227 × 227 images can compete with the much more complex GoogLeNet architecture trained on smaller images. So the obvious question is what the dependence between image size and final accuracy is.

We have performed an experiment with different input image sizes: 96, 128, 180 and 224 pixels wide. The results are presented in Fig. 9. The bad news is that while accuracy depends linearly on image size, the needed computations grow quadratically, so it is a very expensive way to gain performance. In the second part of the experiment, we kept the spatial output size of the pool1 layer fixed while changing the input image size. To achieve this, we respectively change the stride and filter size of the conv1 layer. The results show that the gain from a large image size mostly (after some minimum value) comes from the larger spatial size of the deeper layers rather than from the unseen image details.

Fig. 9. Input image size impact on the accuracy.

3.9. Dataset size and noisy labels

3.9.1. Previous work
The performance of current deep neural networks is highly dependent on the dataset size. Unfortunately, not much research has been published on this topic. In DeepFace (Taigman et al., 2014), the authors show that a dataset reduction from 4.4M to 1.5M leads to a 1.74% accuracy drop. A similar dependence is shown by Schroff et al. (2015), but on an extra-large dataset: decreasing the dataset size from 260M to 2.6M leads to an accuracy drop of 10%. But these datasets are private and the experiments are not reproducible. Another important property of a dataset is the cleanliness of the data. For example, an estimate of human top-5 error on ImageNet is 5.1% (Russakovsky et al., 2015). To create ImageNet, each image was voted on by ten different people (Russakovsky et al., 2015).

3.9.2. Experiment
We explore the dependence between the accuracy and the dataset size/cleanliness on ImageNet. For the dataset size experiment, 200, 400, 600 and 800 thousand examples were randomly chosen from the full training set. For each reduced dataset, a CaffeNet is trained from scratch. For the cleanliness test, we replaced the labels with a random incorrect one for 5%, 10%, 15% and 32% of the examples. The labels are fixed, unlike the recent work on disturbing labels as a regularization method (Xie et al., 2016).
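The label-corruption protocol is simple enough to state as a few lines of Python (a hypothetical utility; the fraction and the fixed random choice of affected examples follow the description above):

```python
# Replace the labels of a fixed fraction of examples with random incorrect classes (Section 3.9.2).
import random

def corrupt_labels(labels, fraction, num_classes=1000, seed=0):
    """Return a copy of `labels` with `fraction` of entries replaced by wrong labels."""
    rng = random.Random(seed)                  # fixed seed -> the noisy labels stay fixed across epochs
    labels = list(labels)
    n_noisy = int(round(fraction * len(labels)))
    for idx in rng.sample(range(len(labels)), n_noisy):
        wrong = rng.randrange(num_classes - 1)
        labels[idx] = wrong if wrong < labels[idx] else wrong + 1  # never the true class
    return labels

clean = [i % 1000 for i in range(10000)]
noisy = corrupt_labels(clean, fraction=0.15)
print(sum(a != b for a, b in zip(clean, noisy)) / len(clean))  # ~0.15
```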


The results are shown in Fig. 10, which clearly shows that a bigger (and more diverse) dataset brings an improvement. There is a minimum size below which performance quickly degrades. Less clean data outperforms more noisy data: a clean dataset with 400k images performs on par with a 1.2M dataset containing 800k correctly labeled images.

Fig. 10. Training dataset size and cleanliness impact on the accuracy.

3.10. Bias in convolution layers

We conducted a simple experiment on the importance of the bias in the convolution and fully-connected layers. First, the network is trained as usual; for the second, biases are initialized with zeros and the bias learning rate is set to zero. The network without biases shows 2.6% lower accuracy than the default – see Table 7.

Table 7
Influence of the bias in convolution and fully-connected layers. Top-1 accuracy on ImageNet-128px.

Network: Accuracy
With bias: 0.471
Without bias: 0.445

4. Best-of-all experiments

Finally, we test how all the improvements which do not increase the computational cost perform together. We combine: the learned colorspace transform, ELU as the non-linearity for convolution layers and maxout for fully-connected layers, the linear learning rate decay policy, and average plus max pooling. The improvements are applied to CaffeNet128, CaffeNet224, VGGNet128 and GoogLeNet128. The first three demonstrated consistent performance growth (see Fig. 11), while the GoogLeNet performance degraded, as was found for batch normalization. Possibly, this is due to the complex and optimized structure of the GoogLeNet network. Unfortunately, the cost of training VGGNet224 is prohibitive – one month of GPU time – so we did not perform the test.

Fig. 11. Applying all improvements that do not change the feature map size: linear learning rate decay policy, colorspace transformation "F", ELU nonlinearity in convolution layers, maxout non-linearity in fully-connected layers and a sum of average and max pooling.
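As an illustration, the sketch below puts two of the cost-neutral improvements into code form (ELU after convolutions and a two-piece maxout for the fully-connected part); this is our PyTorch approximation, not the benchmark's Caffe definition, and the max+average pooling block was already sketched in Section 3.2:

```python
# Building blocks combining cost-neutral improvements from Section 4.
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Two-piece maxout over a linear layer: y = max(W1 x + b1, W2 x + b2)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, 2 * out_features)
        self.out_features = out_features

    def forward(self, x):
        a, b = self.linear(x).split(self.out_features, dim=-1)
        return torch.max(a, b)

def conv_elu_block(in_ch, out_ch, **kwargs):
    """Convolution + ELU, the preferred non-linearity for convolutional layers."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, **kwargs), nn.ELU(inplace=True))

x = torch.randn(2, 96, 15, 15)
y = conv_elu_block(96, 192, kernel_size=3, padding=1)(x)
z = Maxout(192, 128)(y.mean(dim=(2, 3)))   # maxout applied to a pooled feature vector
print(y.shape, z.shape)
```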
the commonly adopted recommendation is not to use any pre-
processing. There has not been much research on the optimal col-
The results are shown in Fig. 10 which clearly shows that big- orspace or pre-processing techniques for CNN. Fuad Rachmadi and
ger (and more diverse) dataset brings an improvement. There is Ketut Eddy Purnama (2015) explored different colorspaces for ve-
a minimum size below which performance quickly degrades. Less hicle color identification, Dong et al. (2015) compared YCrCb and
of clean data outperforms more noisy ones: a clean dataset with RGB channels for image super-resolution, Graham (2015) extracted
400k images performs on par with 1.2M dataset with 800k correct local average color from retina images in winning solution to the
images. Kaggle competition.
The pre-processing experiment is divided in two parts. First,
3.10. Bias in convolution layers we have tested popular handcrafted image pre-processing meth-
ods and colorspaces. Since all transformations were done on-the-
We conducted a simple experiment on the importance of the fly, we first tested if calculation of the mean pixel and variance
bias in the convolution and fully-connected layers. First, the net- over the training set can be replaced with applying batch normal-
work is trained as usual, for the second – biases are initialized ization to input images. It decreases final accuracy by 0.3% and
with zeros and the bias learning rate is set to zero. The network can be seen as baseline for all other methods. We have tested
without biases shows 2.6% less accuracy than the default – see HSV, YCrCb, Lab, RGB and single-channel grayscale colorspaces. Re-
Table 7. sults are shown in Fig. A.12. The experiment confirms that RGB is
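A sketch of the handcrafted variants from this experiment, using OpenCV conversions (our illustration; the benchmark applies equivalent transformations on-the-fly in the data layer, and applying histogram equalization per channel is an assumption here):

```python
# Handcrafted colorspace / pre-processing variants tested in Appendix A1.
import cv2
import numpy as np

def preprocess(img_bgr, mode="RGB"):
    """img_bgr: uint8 HxWx3 image as loaded by cv2.imread."""
    if mode == "RGB":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    if mode == "HSV":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    if mode == "YCrCb":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)
    if mode == "Lab":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    if mode == "grayscale":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)[..., None]
    if mode == "hist_eq":   # global histogram equalization, applied per channel
        return cv2.merge([cv2.equalizeHist(c) for c in cv2.split(img_bgr)])
    if mode == "CLAHE":     # local (contrast limited adaptive) histogram equalization
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        return cv2.merge([clahe.apply(c) for c in cv2.split(img_bgr)])
    raise ValueError(mode)

img = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)  # stand-in for a real image
print({m: preprocess(img, m).shape for m in ("RGB", "HSV", "grayscale", "CLAHE")})
```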


Fig. A.12. Left: performance of using various colorspaces and pre-processing. Right: learned colorspace transformations. Parameters are given in Table A.9.

The experiment confirms that RGB is the best suitable colorspace for CNNs (Table A.8). The Lab-based network did not improve the initial loss after 10K iterations. Removing the color information from images costs from 5.8% to 5.2% of the accuracy, for OpenCV RGB2Gray and learned decolorization respectively. Global (Hummel, 1977) and local (CLAHE (Zuiderveld, 1994)) histogram equalizations hurt performance as well.

Table A.8
Results of all tests on ImageNet-128px (top-1 accuracy, %).

Baseline: 47.1
Non-linearity: Linear 38.9; tanh 40.1; VLReLU 46.9; APL2 47.1; ReLU 47.1; RReLU 47.8; maxout (MaxS) 48.2; PReLU 48.5; ELU 48.8; maxout (MaxW) 51.7
Batch Normalization (BN): before non-linearity 47.4; after non-linearity 49.9
BN + Non-linearity: Linear 38.4; tanh 44.8; sigmoid 47.5; maxout (MaxS) 48.7; ELU 49.8; ReLU 49.9; RReLU 50.0; PReLU 50.3
Pooling: stochastic, no dropout 42.9; average 43.5; stochastic 43.8; max 47.1; strided convolution 47.2; max+average 48.3
Pooling window size: 3 × 3/2 47.1; 2 × 2/2 48.4; 3 × 3/2 pad=1 48.8
Learning rate decay policy: step 47.1; square 48.3; square root 48.3; linear 49.3
Colorspace & pre-processing: OpenCV grayscale 41.3; grayscale learned 41.9; histogram equalized 44.8; HSV 45.1; YCrCb 45.8; CLAHE 46.7; RGB 47.1
Classifier design: pooling-FC-FC-clf 47.1; SPP2-FC-FC-clf 47.1; pooling-C3-C1-clf-maxpool 47.3; SPP3-FC-FC-clf 48.3; pooling-C3-C1-avepool-clf 48.9; C3-C1-clf-avepool 49.1; pooling-C3-C1-clf-avepool 49.5
Percentage of noisy data: 5% 45.8; 10% 44.7; 15% 43.7; 32% 40.1
Dataset size: 1200k 47.1; 800k 43.8; 600k 42.5; 400k 39.3; 200k 30.5
Network width: 4√2 56.5; 4 56.3; 2√2 55.2; 2 53.3; √2 50.6; 1 47.1; 1/√2 46.0; 1/2 41.6; 1/(2√2) 31.8; 1/4 25.6
Batch size: BS=1024, lr=0.04 46.5; BS=1024, lr=0.01 41.9; BS=512, lr=0.02 46.9; BS=512, lr=0.01 45.5; BS=256, lr=0.01 47.0; BS=128, lr=0.005 47.0; BS=128, lr=0.01 47.2; BS=64, lr=0.0025 47.5; BS=64, lr=0.01 47.1; BS=32, lr=0.00125 47.0; BS=32, lr=0.01 46.3; BS=1, lr=0.000039 47.4
Bias: without 44.5; with 47.1
Architectures: CaffeNet128 47.1; CaffeNet128All 53.0; CaffeNet224 56.5; CaffeNet224All 61.3; VGGNet16-128 65.1; VGGNet16-128All 68.2; GoogLeNet128 61.9; GoogLeNet128All 60.6; GoogLeNet224 67.9

Second, we let the network learn a transformation via 1×1 convolutions, so no pixel neighbors are involved.


The mini-network architectures are described in Table A.9. The learning process is joint with the main network and can be seen as extending the CaffeNet architecture with several 1×1 convolutions at the input. The best performing network gave a 1.4% absolute accuracy gain without a significant computational cost.

Table A.9
Mini-networks for learned colorspace transformations, placed after the image and before the conv1 layer. In all cases RGB means the scaled and centered input 0.04 * (Img - (104, 117, 124)).

Name: Architecture; Non-linearity; Acc.
A: RGB → conv1×1×10 → conv1×1×3; tanh; 0.463
RGB: RGB; –; 0.471
B: RGB → conv1×1×3 → conv1×1×3; VLReLU; 0.480
C: RGB → conv1×1×10 → conv1×1×3 + RGB; VLReLU; 0.482
D: [RGB; log(RGB)] → conv1×1×10 → conv1×1×3; VLReLU; 0.482
E: RGB → conv1×1×16 → conv1×1×3; VLReLU; 0.483
F: RGB → conv1×1×10 → conv1×1×3; VLReLU; 0.485
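A sketch of variant "F" – the best performing learned colorspace transform – as a PyTorch module prepended to the network input (our illustration of Table A.9; VLReLU is approximated by a LeakyReLU with a large negative slope, and where exactly the non-linearity is applied is an assumption):

```python
# Learned colorspace transformation "F" from Table A.9: RGB -> conv1x1x10 -> conv1x1x3.
import torch
import torch.nn as nn

class LearnedColorspaceF(nn.Module):
    def __init__(self, negative_slope=0.33):          # VLReLU approximated by a leaky ReLU
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=1),           # 1x1 convs: no pixel neighbors involved
            nn.LeakyReLU(negative_slope, inplace=True),
            nn.Conv2d(10, 3, kernel_size=1),
            nn.LeakyReLU(negative_slope, inplace=True),
        )

    def forward(self, x):
        # x is the scaled and centered input, 0.04 * (img - BGR mean), as in Table A.9.
        return self.transform(x)

x = 0.04 * (torch.rand(2, 3, 128, 128) * 255 - 117.0)  # rough stand-in for real images
print(LearnedColorspaceF()(x).shape)  # torch.Size([2, 3, 128, 128])
```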
References

Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P., Learning activation functions to improve deep neural networks. arXiv:1412.6830.
Bengio, Y., 2012. Practical recommendations for gradient-based training of deep architectures. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 437–478.
Canziani, A., Paszke, A., Culurciello, E., An analysis of deep neural network models for practical applications. arXiv:1605.07678.
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., LeCun, Y., The loss surfaces of multilayer networks. arXiv:1412.0233.
Clevert, D.-A., Unterthiner, T., Hochreiter, S., Fast and accurate deep network learning by Exponential Linear Units (ELUs). arXiv:1511.07289.
Dai, J., He, K., Sun, J., Instance-aware semantic segmentation via multi-task network cascades. arXiv:1512.04412.
Dong, C., Change Loy, C., He, K., Tang, X., Image super-resolution using deep convolutional networks. arXiv:1501.00092.
Everingham, M., Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vision (IJCV) 88 (2), 303–338. doi:10.1007/s11263-009-0275-4.
Fuad Rachmadi, R., Ketut Eddy Purnama, I., Vehicle color recognition using convolutional neural network. arXiv:1510.07391.
Gao, T., Jojic, V., 2016. Degrees of freedom in deep neural networks. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16. AUAI Press, Arlington, Virginia, United States, pp. 232–241. http://dl.acm.org/citation.cfm?id=3020948.3020973
Glorot, X., Bordes, A., Bengio, Y., 2011. Deep sparse rectifier neural networks. In: Gordon, G.J., Dunson, D.B. (Eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), Vol. 15. Journal of Machine Learning Research - Workshop and Conference Proceedings, pp. 315–323.
Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A.C., Bengio, Y., 2013. Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, pp. 1319–1327.
Graham, B., 2014. Train you very own deep convolutional network. https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network.
Graham, B., 2015. Kaggle Diabetic Retinopathy Detection Competition Report. Tech. rep.
Gulcehre, C., Cho, K., Pascanu, R., Bengio, Y., Learned-norm pooling for deep feedforward and recurrent neural networks. arXiv:1311.1780.
He, K., Zhang, X., Ren, S., Sun, J., 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV. arXiv:1406.4729.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: surpassing human-level performance on imagenet classification, CVPR. arXiv:1502.01852.
He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition, CVPR. arXiv:1512.03385.
He, K., Zhang, X., Ren, S., Sun, J., 2016b. Identity mappings in deep residual networks, ECCV. arXiv:1603.05027.
Howard, A.G., Some improvements on deep convolutional neural network based image classification. arXiv:1312.5402.
Hummel, R., 1977. Image enhancement by histogram transformation. Comput. Graphics Image Process. 6 (2), 184–195. doi:10.1016/S0146-664X(77)80011-7.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Blei, D., Bach, F. (Eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML-15). JMLR Workshop and Conference Proceedings, pp. 448–456.
Ithapu, V.K., Ravi, S.N., Singh, V., On architectural choices in deep learning: from network structure to gradient convergence and parameter estimation. arXiv:1702.08670.
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A., Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., Caffe: convolutional architecture for fast feature embedding. arXiv:1408.5093.
Krizhevsky, A., One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, Vol. 25. Curran Associates, Inc., pp. 1097–1105.
Larsson, G., Maire, M., Shakhnarovich, G., Fractalnet: ultra-deep neural networks without residuals. arXiv:1605.07648.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998a. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, Vol. 86, pp. 2278–2324.
LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R., 1998b. Efficient backprop. In: Neural Networks: Tricks of the Trade. Springer-Verlag, London, UK, pp. 9–50.
Lee, C.-Y., Gallagher, P.W., Tu, Z., Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. arXiv:1509.08985.
Li, H., Ouyang, W., Wang, X., Multi-bias non-linear activation in deep neural networks. arXiv:1604.00676.
Lin, M., Chen, Q., Yan, S., Network in network. arXiv:1312.4400.
Luo, W., Li, Y., Urtasun, R., Zemel, R., 2016. Understanding the effective receptive field in deep convolutional neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 29. Curran Associates, Inc., pp. 4898–4906.
Maas, A.L., Hannun, A.Y., Ng, A.Y., Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML, Vol. 30.
Mishkin, D., Matas, J., 2016. All you need is a good init. In: Proceedings of ICLR.
Nam, H., Han, B., Learning multi-domain convolutional neural networks for visual tracking. arXiv:1510.07945.
Radford, A., Metz, L., Chintala, S., Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.
Ren, S., He, K., Girshick, R., Zhang, X., Sun, J., Object detection networks on convolutional feature maps. arXiv:1504.06066.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning internal representations by error propagation, Vol. 1. MIT Press, Cambridge, MA, USA, pp. 318–362.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision (IJCV) 1–42. doi:10.1007/s11263-015-0816-y.
Schmidhuber, J., 1997. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Netw. 10 (5), 857–873. doi:10.1016/S0893-6080(96)00127-X.
Schroff, F., Kalenichenko, D., Philbin, J., 2015. Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale visual recognition. In: Proceedings of ICLR.
Soudry, D., Carmon, Y., No bad local minima: data independent training error guarantees for multilayer neural networks. arXiv:1605.08361.
Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M., 2014. Striving for simplicity: the all convolutional net. In: Proceedings of ICLR Workshop. arXiv:1412.6806.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Swietojanski, P., Li, J., Huang, J.-T., 2014. Investigation of maxout networks for speech recognition. In: Proceedings of ICASSP.
Szegedy, C., Ioffe, S., Vanhoucke, V., Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: CVPR 2015.
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
Tolias, G., Sicre, R., Jégou, H., Particular object retrieval with integral max-pooling of CNN activations. arXiv:1511.05879.
Wilson, D.R., Martinez, T.R., 2003. The general inefficiency of batch training for gradient descent learning. Neural Netw. 16 (10), 1429–1451. doi:10.1016/S0893-6080(03)00138-2.
Xie, L., Wang, J., Wei, Z., Wang, M., Tian, Q., Disturblabel: regularizing CNN on the loss layer. arXiv:1605.00055.
Xu, B., Wang, N., Chen, T., Li, M., Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853.
Zagoruyko, S., Komodakis, N., Wide residual networks. arXiv:1605.07146.
Žbontar, J., LeCun, Y., Computing the stereo matching cost with a convolutional neural network. arXiv:1409.4326.
Zeiler, M.D., Fergus, R., Stochastic pooling for regularization of deep convolutional neural networks. arXiv:1301.3557.
Zoph, B., Le, Q.V., Neural architecture search with reinforcement learning. arXiv:1611.01578.
Zuiderveld, K., 1994. Contrast limited adaptive histogram equalization. In: Graphics Gems IV. Academic Press Professional, Inc., San Diego, CA, USA, pp. 474–485.
