ResNet Presentation

This document presents the paper "Deep Residual Learning for Image Recognition" by Kaiming He et al. The paper proposes a residual learning framework to ease the training of networks substantially deeper than those used previously. The authors show that such networks can be trained by explicitly letting the stacked layers fit a residual mapping instead of the original, unreferenced mapping. On the ImageNet dataset, residual nets with a depth of up to 152 layers were evaluated, and an ensemble of these networks achieved 3.57% top-5 error on the test set. This represents a significant improvement over previous results and demonstrates the benefits of residual learning for training very deep convolutional networks.

Deep Residual Learning for Image Recognition

Authors:
Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun

Presenters: Syeda Faiza Ahmed & Kaies al Mahmud

28th August 2019


How DEEP should we make our Neural Networks?
● It depends on:
○ The complexity of the task at hand
○ The computational capacity available at training time
○ The computational capacity available at inference time (e.g. on edge devices)
● If the task needs a lot of parameters:
○ Can we train very deep networks efficiently with current optimization solvers?
○ Is training a better model as simple as adding more and more layers?
How DEEP should we make our Neural Networks?

● MNIST dataset (en.wikipedia.com):
○ 60'000 training samples
○ 10'000 test samples
○ 10 classes
● ImageNet dataset (https://patrykchrabaszcz.github.io/Imagenet32/):
○ 1'281'167 training samples
○ 100'000 test samples
○ 1000 classes
How DEEP should we make our Neural Networks?
● It depends on:
○ The complexity of the task at hand
○ The computational capacity available at training time
○ The computational capacity available at inference time (e.g. on edge devices)
● If the task needs a lot of parameters:
○ Can we train very deep networks efficiently with current optimization solvers?
○ Is training a better model as simple as adding more and more layers?

NO
Why is it not OK to just add more layers?
● Because it introduces problems during training, such as:
○ Vanishing/exploding gradients
■ Can be addressed by normalized initialization and intermediate normalization layers (see the sketch below)
○ The degradation problem
■ What should we do about it?
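As a concrete illustration of that first remedy, here is a minimal sketch (assuming PyTorch; the layer sizes are arbitrary and not from the paper) of a plain conv block that combines normalized ("He"/Kaiming) weight initialization with an intermediate batch-normalization layer:

    import torch.nn as nn

    def make_plain_block(in_ch, out_ch):
        # Normalized initialization: scale the conv weights for ReLU networks.
        conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")
        # Intermediate normalization: batch norm keeps activations well scaled.
        return nn.Sequential(conv, nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    block = make_plain_block(64, 64)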
Degradation problem in training of Deep networks
● Intuitively, if we have more parameters than needed, we would expect to end up with an "overfitting" problem
● However, as the depth of the network increases, training accuracy saturates and then even degrades
● Now let's compare two networks on a hypothetical image classification problem
Degradation problem … (continued)

[Diagram: a shallower network (a stack of conv layers followed by fc and softmax) achieves accuracy X%. A deeper counterpart is constructed by keeping the same conv layers and inserting extra layers that are identity mappings.]

Degradation problem … (continued)

[Diagram: by construction, the deeper network with identity layers computes the same function, so it also achieves accuracy X%. A third network replaces the identity layers with ordinary conv layers and is trained from scratch.]

Degradation problem … (continued)

[Diagram: in practice, the deeper plain network ends up with accuracy < X%, while the other two stay at X%.]
Degradation problem … (continued)
● Our current optimization solvers are not able to make a stack of added non-linear layers approximate identity mappings
● Otherwise, the accuracy of a deeper network should be at least the same as that of a shallower one (see the sketch below)
● NOTE: this should not be confused with "overfitting"
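The "otherwise" argument can be made concrete with a tiny sketch (assuming PyTorch; the stand-in classifier is hypothetical): appending identity layers to a trained shallower network leaves its outputs, and therefore its accuracy, unchanged, so a deeper network should never have to do worse.

    import torch
    import torch.nn as nn

    # Stand-in for any trained shallower classifier.
    shallow_net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))

    # A deeper network built by construction: the extra layers are identities.
    deeper_net = nn.Sequential(shallow_net, nn.Identity(), nn.Identity())

    x = torch.randn(4, 3, 32, 32)
    assert torch.equal(shallow_net(x), deeper_net(x))  # identical outputs

The degradation problem is that, when the extra layers are ordinary non-linear layers trained from scratch, the solver fails to find such a solution.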
Residual learning
● H(x) is the true mapping function we want to learn
● Let's define a function F(x) := H(x) − x and learn it instead of H(x); the original mapping is then recovered as F(x) + x (written out below)

[Diagram: original formulation X → H(x) → Y; residual formulation X → F(x), with X added back through a shortcut, → Y]
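Written out (following the paper's reasoning, with x denoting the input to the block of stacked layers):

    F(x) := H(x) − x   ⇔   H(x) = F(x) + x

In the extreme case where the optimal mapping is the identity, H(x) = x, the solver only has to drive the residual to zero, F(x) = 0, which is easier than fitting an identity mapping with a stack of non-linear layers.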
Residual block
● The residual architecture adds explicit identity connections (shortcuts) throughout the network to help the solver learn the required identity mappings (see the sketch below)

[Diagram: input X → weight layer → ReLU → weight layer → addition with the identity shortcut X → ReLU → output Y]
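A minimal sketch of such a block (assuming PyTorch; the class name BasicBlock and the fixed channel count are illustrative, not the authors' reference implementation):

    import torch.nn as nn

    class BasicBlock(nn.Module):
        """Two 3x3 conv layers with an identity shortcut: out = ReLU(F(x) + x)."""

        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            identity = x                              # parameter-free shortcut
            out = self.relu(self.bn1(self.conv1(x)))  # first weight layer + ReLU
            out = self.bn2(self.conv2(out))           # second weight layer
            return self.relu(out + identity)          # F(x) + x, then ReLU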
Residual block (continued)
● With this formulation, the network can in effect decide how deep it needs to be: layers that are not needed can simply fall back to identity mappings
● The identity connections introduce no new parameters to the network architecture, so they add essentially no computational burden
● This allows us to design deeper networks that can deal with much more complicated problems and tasks
ResNet architecture

[Figure: ResNet architecture diagram]
ResNet architecture
● When the dimensions of F(x) and x differ, a linear projection Ws is applied to the shortcut for dimension matching (see the sketch below):

Y = F(x, {Wi}) + Ws·x
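A sketch of the projection case (assuming PyTorch; the class name DownsampleBlock is mine): when the block changes the spatial resolution or channel count, the shortcut applies a strided 1×1 convolution playing the role of Ws so the two terms can be added.

    import torch.nn as nn

    class DownsampleBlock(nn.Module):
        """Residual block with a linear-projection shortcut: y = F(x, {Wi}) + Ws·x."""

        def __init__(self, in_ch, out_ch, stride=2):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_ch)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)
            # Ws: a 1x1 convolution that matches both the channel count and the stride.
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + self.proj(x))      # F(x, {Wi}) + Ws·x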
Experiments on ImageNet dataset
● ImageNet dataset has 1000 classes
● 1.28M images were used for training
● 50K images were used for validation
● 100K images were used for final testing
● Batch normalization
● Mini-batch size of 256
● Learning rate of 0.1, divided by 10 when the error plateaus (see the sketch below)
● Weight decay of 0.0001
● Momentum of 0.9
● Maximum of 600'000 iterations
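These settings map almost directly onto a standard optimizer configuration; a minimal sketch assuming PyTorch (the model is a placeholder, and ReduceLROnPlateau is just one way to approximate the "divide by 10 when the error plateaus" rule):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10)  # placeholder standing in for the ResNet
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                           factor=0.1)
    # After each validation pass: scheduler.step(validation_error)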
ResNet architectures for the ImageNet dataset

[Table: ResNet architectures for ImageNet]
“18 layers vs 34 layers” on the ImageNet dataset

[Figure: comparison of 18-layer and 34-layer networks on ImageNet]
Results on ImageNet dataset

[Table: Error rates (%, 10-crop testing) on ImageNet validation. ResNet-50/101/152 use option B.]
Results on ImageNet dataset

[Table: Error rates (%) of ensembles. The top-5 error is on the ImageNet test set, as reported by the test server.]
Experiments on CIFAR-10 dataset
● CIFAR-10 dataset has 10 classes
● 45K images were used for training
● 5K images were used for validation
● 10K images were used for testing
● Batch normalization
● Mini-batch size of 128
● Learning rate of 0.1, divided by 10 at iterations 32K and 48K (see the sketch below)
● Weight decay of 0.0001
● Momentum of 0.9
● Training terminated at iteration 64K
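The CIFAR-10 schedule is a fixed step decay rather than a plateau rule; a sketch assuming PyTorch, with the scheduler stepped once per training iteration so that the 32K/48K milestones are iteration counts:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10)  # placeholder standing in for the CIFAR ResNet
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Divide the learning rate by 10 at iterations 32K and 48K; training stops at 64K.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[32_000, 48_000],
                                                     gamma=0.1)
    # Call scheduler.step() once per iteration (not once per epoch).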
Results on CIFAR-10 dataset

[Table: Classification error on the CIFAR-10 test set. All methods use data augmentation. ResNet-110 is run 5 times and reported as "best (mean±std)".]
Effect of number of layers on the CIFAR-10 dataset

[Figure: Training on CIFAR-10. Dashed lines denote training error, bold lines denote testing error. Left: plain networks. Right: ResNets.]
Thank You
