
1

Lesson 3:
Training neural networks
(Part 1)
Viet-Trung Tran

2
Outline
• Activation functions
• Data preprocessing
• Weight initializations
• Normalizations

3
Activation functions

4
5
Sigmoid function
• Squashes numbers to the range [0, 1]
• Widely used historically because it resembles the firing rate of a neuron
• There are 3 downsides:
1. Vanishing gradient due to saturated neurons

6
Sigmoid function (2)

• What happens when x = -10?
• What happens when x = 0?
• What happens when x = 10?
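A quick numeric check of these three cases (a minimal numpy sketch, not on the original slide): the local gradient sigma'(x) = sigma(x)(1 - sigma(x)) is essentially zero at x = -10 and x = 10, and at most 0.25 at x = 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # local gradient d(sigma)/dx

for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigma(x)={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
# x = -10 and x = 10: gradient ~4.5e-5, so the upstream gradient is "killed"
# x = 0: gradient = 0.25, the largest value the sigmoid can ever pass through
```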

7
Sigmoid function (3)
• Squashes numbers to the range [0, 1]
• Widely used historically because it resembles the firing rate of a neuron
• There are 3 downsides:
1. Vanishing gradient due to saturated neurons
2. Outputs are not zero-centered

8
Sigmoid function (4)

• What if all of the neuron's inputs xi are positive?
• What does the gradient of the objective function with respect to w look like?
• Every element of dL/dw then has the same sign as the upstream gradient dL/df, because dL/dwi = (dL/df) · σ'(z) · xi with σ'(z) > 0 and every xi > 0: the components are either all positive or all negative.
• The weight update can therefore only move in a restricted set of directions of the search space (zig-zag dynamics).
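To make the sign argument concrete, here is a small numpy sketch (the inputs and upstream gradient are illustrative, not from the slides): with all-positive inputs, every component of dL/dw shares the sign of dL/df.

```python
import numpy as np

np.random.seed(0)
x = np.abs(np.random.randn(5))            # all inputs x_i > 0 (e.g., outputs of a previous sigmoid)
w = np.random.randn(5)

f = 1.0 / (1.0 + np.exp(-(w @ x)))        # sigmoid neuron output
dL_df = f - 1.0                           # some upstream gradient (illustrative)
dL_dw = dL_df * f * (1.0 - f) * x         # chain rule: dL/dw_i = dL/df * sigma'(.) * x_i

# f*(1-f) > 0 and x_i > 0, so every component has the sign of dL_df:
print(np.sign(dL_dw))                     # all -1 here (all +1 for a positive upstream gradient)
```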

9
Sigmoid function (5)
• Squashes numbers to the range [0, 1]
• Widely used historically because it resembles the firing rate of a neuron
• There are 3 downsides:
1. Vanishing gradient due to saturated neurons
2. Outputs are not zero-centered
3. Computing the exponential exp() is expensive

10
Tanh function
• Squashes numbers to the range [-1, 1]
• Zero-centered output
• Still kills gradients when saturated (vanishing gradient)

11
ReLU function
• Does not saturate in the positive region
• Very cheap to compute
• In practice, converges much faster than sigmoid/tanh (roughly 6x)
• Downsides:
1. Outputs are not zero-centered
2. And one more problem…

12
ReLU function (2)

• What happens when x = -10?
• What happens when x = 0?
• What happens when x = 10?

13
ReLU function (3)
• A ReLU unit can get knocked off the data manifold: if its pre-activation is negative for every training example, its output is always zero and its weights never receive a gradient update again.
• This is the dead ReLU (dying ReLU) problem.
• A common mitigation is to initialize ReLU neurons with a small positive bias (e.g., 0.01).
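A minimal numpy sketch of the dying-ReLU effect (illustrative numbers, not from the slides): once the pre-activation is negative for every input, the ReLU gate zeroes the gradient and the weights stop updating; a small positive bias makes this less likely at initialization.

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(100, 3)                  # a batch of inputs
w = np.random.randn(3) * 0.01
b = -5.0                                     # badly placed bias: pre-activation < 0 for every input

z = X @ w + b                                # pre-activation
a = np.maximum(0.0, z)                       # ReLU output: all zeros here
upstream = np.ones_like(a)                   # pretend upstream gradient
dw = X.T @ (upstream * (z > 0))              # ReLU gate kills everything -> dw == 0

print(a.max(), np.abs(dw).sum())             # 0.0 0.0 -> the unit is "dead" and cannot recover

b_better = 0.01                              # a small positive bias at init makes this failure less likely
```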

14
Leaky ReLU
• Does not saturate
• Computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g., 6x)
• Will not "die": a small negative slope keeps a non-zero gradient for x < 0

15
Parametric Rectifier (PReLU)
• Same benefits as Leaky ReLU: does not saturate, computationally efficient, converges much faster than sigmoid/tanh in practice (e.g., 6x), will not "die"
• The negative slope α is not fixed but learned by backpropagation
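A small sketch of both variants using the standard formulation f(x) = max(αx, x): Leaky ReLU uses a small fixed slope α, while PReLU treats α as a parameter updated by backprop.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # fixed small slope in the negative region
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # same form, but alpha is a parameter learned by backprop
    return np.where(x > 0, x, alpha * x)

def prelu_grads(x, alpha, upstream):
    dx = np.where(x > 0, 1.0, alpha) * upstream               # gradient w.r.t. the input
    dalpha = np.sum(np.where(x > 0, 0.0, x) * upstream)       # gradient w.r.t. the slope
    return dx, dalpha
```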

16
Exponential Linear Units
• All the benefits of ReLU
• Closer to zero-mean outputs
• The negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
• Computation requires exp()

17
Maxout function

• Does not have the basic dot-product -> nonlinearity form: it takes the max of two affine functions, max(w1^T x + b1, w2^T x + b2)
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
• Downside: doubles the number of parameters per neuron
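A sketch of a maxout layer under the usual two-piece formulation; with (W2, b2) = (0, 0) it reduces to ReLU, which is why it generalizes ReLU and Leaky ReLU but carries two full weight sets per layer.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # elementwise max of two affine maps
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

# two full sets of weights per layer -> twice the parameters of a ReLU layer
x = np.random.randn(4, 8)
W1, b1 = np.random.randn(8, 16), np.zeros(16)
W2, b2 = np.random.randn(8, 16), np.zeros(16)
h = maxout(x, W1, b1, W2, b2)      # shape (4, 16)
```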

18
Activation function recap
• In practice:
- Use ReLU. Be careful with your learning rates.
- Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains.
- Don't use sigmoid or tanh.
• Some newer activation functions:
- ReLU6 = min(6, ReLU(x))
- Swish: x · sigmoid(βx)
- Mish: x · tanh(softplus(x))
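For reference, these newer activations take only a few lines of numpy (standard definitions; Swish is shown with the common β = 1 variant, also known as SiLU):

```python
import numpy as np

def relu6(x):
    return np.minimum(6.0, np.maximum(0.0, x))

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))       # x * sigmoid(beta * x)

def softplus(x):
    return np.log1p(np.exp(x))

def mish(x):
    return x * np.tanh(softplus(x))            # x * tanh(softplus(x))
```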

19
Activation function recap

20
Data preprocessing

21
Data preprocessing
• Zero-center the data: subtract the per-feature sample mean from every data sample
• Normalize to unit standard deviation: divide each feature by its standard deviation

(Assume X [N x D] is the data matrix, one example per row)
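With X of shape [N x D], one example per row, the two steps look like this (a standard numpy sketch with synthetic data standing in for the real matrix):

```python
import numpy as np

X = np.random.randn(1000, 20) * 5.0 + 3.0   # [N x D] data matrix, one example per row

X -= np.mean(X, axis=0)         # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)          # normalize: divide by the per-feature standard deviation
```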


22
Data preprocessing (2)
• Beyond centering and scaling, PCA or data whitening can also be applied
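Both can be written in a few lines from the covariance of the centered data (a cs231n-style sketch; the data here is synthetic and the number of kept components is illustrative):

```python
import numpy as np

X = np.random.randn(1000, 20)               # stand-in for a real [N x D] dataset
X -= X.mean(axis=0)                         # whitening assumes zero-centered data
cov = X.T @ X / X.shape[0]                  # data covariance matrix [D x D]
U, S, _ = np.linalg.svd(cov)                # eigenvectors in the columns of U

Xrot = X @ U                                # decorrelate the data (PCA rotation)
Xrot_reduced = X @ U[:, :10]                # optional: keep only the top 10 components
Xwhite = Xrot / np.sqrt(S + 1e-5)           # whitening: unit variance along every eigen-direction
```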

23
Data preprocessing (3)
• In practice for images: center only
• Example: the CIFAR10 dataset with 32x32x3 images
• PCA and whitening are rarely used for images
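Centering an image dataset usually means subtracting either the full mean image or just a per-channel mean (a sketch with CIFAR10-shaped random data standing in for the real dataset):

```python
import numpy as np

X_train = np.random.rand(50000, 32, 32, 3)          # stand-in for CIFAR10 images

mean_image = X_train.mean(axis=0)                   # [32 x 32 x 3] mean image (e.g., AlexNet style)
per_channel_mean = X_train.mean(axis=(0, 1, 2))     # 3 numbers, one per channel (e.g., VGG style)

X_train_centered = X_train - per_channel_mean       # center only; no PCA / whitening
```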

24
Demo
• https://github.com/trungtv/Preprocessing-for-deep-learning

25
Weight initializations

26
Weight initialization
• What if all weights W are initialized to zero?
• Every neuron then computes the same output and receives the same gradient update, so they all learn the same features at every iteration (symmetry is never broken)
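A tiny sketch of the symmetry problem (illustrative; it uses a constant rather than zero initialization so the effect is visible with non-zero activations): when all weights start identical, every hidden unit computes the same value and receives the same gradient, so their weights can never diverge.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(8, 3)
W1 = np.full((3, 4), 0.1)           # every hidden unit starts with identical weights
W2 = np.full((4, 2), 0.1)

h = np.maximum(0.0, X @ W1)         # all 4 hidden columns are identical
scores = h @ W2
dscores = scores - 1.0              # some upstream gradient (illustrative)
dh = dscores @ W2.T
dW1 = X.T @ (dh * (h > 0))

print(np.allclose(h, h[:, :1]))     # True: identical activations
print(np.allclose(dW1, dW1[:, :1])) # True: identical gradients -> columns of W1 stay identical
```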

27
Weight initialization (2)
• First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
• Works ~okay for small networks, but causes problems with deeper networks

28
Weight Initialization: Small random numbers

• With std = 0.01, the activations shrink toward zero as the network gets deeper
• The gradients dL/dW (which involve the activations) then also collapse to zero: w is too small, so nothing learns

29
Increase Std

• Increasing the std instead drives the tanh units into saturation at ±1, so the local gradients are zero and dL/dW again vanishes: still no learning (see the sketch below)
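The classic way to see both failure modes is to push random data through a deep tanh network and inspect the per-layer activation statistics (a sketch in the spirit of the cs231n experiment; the layer sizes are illustrative):

```python
import numpy as np

def activation_stats(weight_std, num_layers=6, dim=512):
    np.random.seed(0)
    x = np.random.randn(1000, dim)
    for layer in range(num_layers):
        W = np.random.randn(dim, dim) * weight_std
        x = np.tanh(x @ W)
        print(f"std={weight_std:5.2f} layer {layer}: mean={x.mean():+.4f} std={x.std():.4f}")

activation_stats(0.01)   # activations collapse toward 0 -> dL/dW (which involves x) goes to 0
activation_stats(1.00)   # activations saturate at +/-1 -> local tanh gradients go to 0
```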

30
Xavier initialization
• Assume x and w are i.i.d. (independent and identically distributed) random variables with zero mean
• Calculation in the forward direction:
Var(y) = Var(w1 x1 + w2 x2 + ... + wNin xNin + b)
Var(wi xi) = E[xi]^2 Var(wi) + E[wi]^2 Var(xi) + Var(wi) Var(xi) = Var(wi) Var(xi)   (zero means)
Var(y) = Nin · Var(wi) · Var(xi)
To preserve the variance, require Nin · Var(wi) = 1, i.e. Var(wi) = 1 / Nin
• The same argument on the backward gradient signal gives:
Var(wi) = 1 / Nout
• Compromise (average of the two constraints):
Var(wi) = 2 / (Nin + Nout)
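In code, both variants are just a rescaled Gaussian (a minimal sketch):

```python
import numpy as np

def xavier_init(n_in, n_out):
    # forward-pass constraint: Var(w) = 1 / N_in
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

def xavier_init_avg(n_in, n_out):
    # compromise between forward and backward constraints: Var(w) = 2 / (N_in + N_out)
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))
```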
31
Xavier initialization (2)

32
Weight Initialization: Tanh to ReLU

Var(y) = Var(w1 x1 + w2 x2 + ... + wNin xNin + b)
ReLU sets half of its inputs to zero, so only about Nin/2 terms contribute:
Var(y) = (Nin / 2) · Var(wi) · Var(xi)
(Nin / 2) · Var(wi) = 1  =>  Var(wi) = 2 / Nin
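This gives the He/MSRA rule shown on the next slide; as a one-liner (a sketch):

```python
import numpy as np

def he_init(n_in, n_out):
    # Var(w) = 2 / N_in, compensating for ReLU zeroing half of the activations
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```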
33
He / MSRA Initialization

34
Normalizations

35
Batch Normalization
• Want the inputs to each activation function to have zero mean and unit standard deviation? Then just enforce it: for each feature, subtract the mini-batch mean and divide by the mini-batch standard deviation.

36
Batch Normalization
• Forcing exactly zero mean and unit standard deviation is too strict a constraint and can cause the model to underfit.
• Loosen it: give the model an escape route with learnable scale and shift parameters (γ, β), so it can undo the normalization if it does not want to be constrained.
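Putting the two slides together, a training-time batchnorm forward pass looks roughly like this (a numpy sketch; gamma and beta are the learnable scale and shift, eps is for numerical stability):

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    # x: [N x D] mini-batch, gamma/beta: [D] learnable parameters
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to ~zero mean, unit std
    out = gamma * x_hat + beta              # let the network undo it if it wants to
    return out, mu, var
```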

37
Batch Normalization: Test-time
• Estimates depend on the minibatch; we can't do this at test-time!
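At test time the per-batch statistics are replaced by running averages accumulated during training (a sketch continuing the function above; `momentum` is an assumed hyperparameter name):

```python
import numpy as np

def batchnorm_forward_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    # use fixed statistics estimated during training, not the current batch
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# during training, after each batch:
#   running_mu  = momentum * running_mu  + (1 - momentum) * mu
#   running_var = momentum * running_var + (1 - momentum) * var
```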

38
Batch Normalization: Test-time (2)

39
Batch Normalization

40
Batch Normalization: Usage

41
Batch Normalization: Pros & Cons
• Makes deep networks much easier to train!
• Improves gradient flow
• Allows higher learning rates, faster convergence
• Networks become more robust to initialization
• Acts as regularization during training
• Zero overhead at test-time: can be fused with the preceding conv!
• Behaves differently during training and testing: this is a very common source of bugs!

42
Layer Normalization

43
Instance Normalization

44
Comparison of Normalization Layers
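For an NCHW activation tensor, the four normalization layers differ only in which axes the mean and variance are taken over (a sketch; the shapes and group count of 4 are illustrative):

```python
import numpy as np

x = np.random.randn(8, 32, 14, 14)                      # [N, C, H, W]

bn_mu = x.mean(axis=(0, 2, 3), keepdims=True)           # BatchNorm: per channel, across the batch
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)           # LayerNorm: per sample, across all channels
in_mu = x.mean(axis=(2, 3), keepdims=True)              # InstanceNorm: per sample and per channel

G = 4                                                   # GroupNorm: per sample, per group of channels
xg = x.reshape(8, G, 32 // G, 14, 14)
gn_mu = xg.mean(axis=(2, 3, 4), keepdims=True)
```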

45
Group Normalization

46
References
1. http://cs231n.stanford.edu
2. https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/

47
Thank you
for your
attention!!!

48
