ResNet Presentation
Authors:
Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun
How DEEP should we make our Neural Networks?
NO
Why is it not OK to just add more layers?
● Because it introduces problems during training, such as:
○ Vanishing/exploding gradients
■ Can be addressed by normalized initialization and intermediate normalization layers (see the sketch after this list)
○ Degradation problem
■ What should we do about it?
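A minimal, hypothetical sketch (not from the slides) of how gradients can shrink with depth in a plain stack without shortcuts; the depth, layer width, and tanh non-linearity are illustrative assumptions:

import torch
from torch import nn

# Build a deep plain stack with no shortcut connections.
depth = 50
layers = []
for _ in range(depth):
    layers += [nn.Linear(64, 64), nn.Tanh()]
net = nn.Sequential(*layers)

# Backpropagate a simple loss and compare gradient norms at the two ends:
# the layer closest to the input typically receives a much smaller gradient.
x = torch.randn(16, 64)
net(x).pow(2).mean().backward()
print("first layer grad norm:", net[0].weight.grad.norm().item())
print("last  layer grad norm:", net[-2].weight.grad.norm().item())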
Degradation problem in training deep networks
● Intuitively, if we have more parameters than needed, we would expect an “overfitting” problem
● However, as the depth of the network increases, training accuracy saturates and then degrades
● Now let’s compare two networks on a hypothetical image classification problem
Degradation problem … (continued)
[Diagram: a shallow network (conv → conv → conv → fc → softmax) reaching Acc. = X%, shown next to a deeper network built from the same layers plus extra identity layers]
Degradation problem … (continued)
[Diagram: with the extra layers set to identity mappings, the deeper network computes exactly the same function, so it should also reach Acc. = X%]
Degradation problem … (continued)
[Diagram: in practice, training the deeper plain network end-to-end gives Acc. < X%, while the shallower network still reaches Acc. = X%]
Degradation problem … (continued)
● Our current optimization solvers are unable to drive a stack of added non-linear layers toward the required identity mappings
● Otherwise, the accuracy of a deeper network should be at least that of a shallower one (see the constructed-solution sketch after this list)
● NOTE: this should not be confused with “overfitting”
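A small, hypothetical sketch of the “constructed solution” behind this argument: appending exact identity layers to any shallower model leaves its outputs, and therefore its accuracy, unchanged. The layer sizes below are placeholder assumptions:

import torch
from torch import nn

# Any "shallow" classifier (the layers here are placeholders).
shallow = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

# A "deeper" model: the same shallow model followed by identity layers.
deeper = nn.Sequential(shallow, nn.Identity(), nn.Identity(), nn.Identity())

# Identical outputs by construction; in practice, solvers fail to recover
# such a solution when the added layers are trainable non-linear stacks.
x = torch.randn(4, 3, 32, 32)
assert torch.allclose(shallow(x), deeper(x))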
Residual learning
● H(x) is the true mapping function we want to learn
● Let’s define a function F(x), and learn it instead of H(x)
x → H(x) → y
F(x) := H(x) − x, so H(x) = F(x) + x
x → F(x) → (+ x) → y
Residual block
● The residual architecture adds explicit identity (shortcut) connections throughout the network, which makes the required identity mappings easy to learn (a code sketch follows the diagram below)
[Diagram: residual block — x passes through two weight layers with a ReLU in between to produce F(x); the identity shortcut adds x back, and a final ReLU produces y = ReLU(F(x) + x)]
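A minimal PyTorch sketch of such a residual block. The 3×3 convolutions and batch normalization layers are assumptions chosen to mirror the paper's basic block, not code taken from the slides:

import torch
from torch import nn

class BasicBlock(nn.Module):
    """Two weight layers forming F(x), plus an identity shortcut adding x back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first weight layer + ReLU
        out = self.bn2(self.conv2(out))           # second weight layer: F(x)
        return self.relu(out + x)                 # y = ReLU(F(x) + x)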
Residual block (continued)
● Using this approach, the network can effectively decide how deep it needs to be
● These identity connections introduce no new parameters, so they add essentially no computational burden
● This method allows us to design deeper networks that can handle much more complicated problems and tasks
ResNet architecture
ResNet architecture
Linear projections for dimension matching: y = F(x, {Wi}) + Ws·x
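A sketch of a block whose shortcut applies the linear projection Ws, assumed here to be a strided 1×1 convolution followed by batch normalization, so that y = F(x, {Wi}) + Ws·x when the channel count or spatial size changes:

import torch
from torch import nn

class ProjectionBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Ws: 1x1 convolution that projects x to the new shape on the shortcut.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.proj(x))      # F(x) + Ws·x

For example, ProjectionBlock(64, 128)(torch.randn(1, 64, 56, 56)) produces a tensor of shape (1, 128, 28, 28).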
Experiments on ImageNet dataset
● ImageNet dataset has 1000 classes
● 1.28M images were used for training
● 50K images were used for validation
● 100K images were used for final testing
● Batch normalization
● Mini-batch size of 256
● Learning rate of 0.1, divided by 10 when the error plateaus
● Weight decay of 0.0001
● Momentum of 0.9
● Maximum of 600,000 iterations (see the setup sketch below)
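A hypothetical configuration sketch matching the hyperparameters listed above; the ResNet-34 model choice and the use of a plateau-based scheduler are assumptions about how the “divide by 10 when error plateaus” rule could be implemented:

import torch
import torchvision

model = torchvision.models.resnet34()  # placeholder model choice
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 whenever the validation error plateaus,
# by calling scheduler.step(val_error) after each validation pass.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)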
ResNet architectures for the ImageNet dataset
“18 layers vs 34 layers” on ImageNet dataset
Results on ImageNet dataset
The top-5 error is measured on the ImageNet test set, as reported by the test server.
Experiments on CIFAR-10 dataset
● CIFAR-10 dataset has 10 classes
● 45K images were used for training
● 5K images were used for validation
● 10K images were used for testing
● Batch normalization
● Mini-batch size of 128
● Learning rate of 0.1, divided by 10 at iterations 32K and 48K
● Weight decay of 0.0001
● Momentum of 0.9
● Training terminated at iteration 64K (see the sketch below)
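A short sketch of the schedule above using an iteration-based milestone scheduler; the placeholder model and the choice of scheduler are assumptions:

import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Step the scheduler once per mini-batch so the milestones count iterations:
# lr becomes 0.01 at 32K and 0.001 at 48K; training stops at 64K iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[32000, 48000], gamma=0.1)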
Results on CIFAR-10 dataset
Classification error on the CIFAR-10 test set. All methods use data augmentation. For ResNet-110, we ran it 5 times and report “best (mean ± std)”.
Effect of number of layers on the CIFAR-10 dataset
Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left:
plain networks. Right: ResNets.
Thank You