Lesson 3:
Training neural networks
(Part 1)
Viet-Trung Tran
Outline
• Activation functions
• Data preprocessing
• Weight initializations
• Normalizations
Activation functions
Sigmoid function
• Squashes numbers into the range [0, 1]
• Historically popular because it resembles the saturating firing rate of a neuron
• Three downsides:
1. Saturated neurons kill the gradient (vanishing gradient)
Sigmoid function (2)
Sigmoid function (3)
• Squashes numbers into the range [0, 1]
• Historically popular because it resembles the saturating firing rate of a neuron
• Three downsides:
1. Saturated neurons kill the gradient (vanishing gradient)
2. Outputs are not zero-centered
Sigmoid function (4)
Sigmoid function (5)
• Squashes numbers into the range [0, 1]
• Historically popular because it resembles the saturating firing rate of a neuron
• Three downsides:
1. Saturated neurons kill the gradient (vanishing gradient; see the sketch below)
2. Outputs are not zero-centered
3. Computing the exponential exp() is expensive
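A small numpy sketch (not from the original slides; the input values are arbitrary) makes downside 1 concrete: the local gradient sigmoid(x) * (1 - sigmoid(x)) is at most 0.25 and is essentially zero once the neuron saturates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # local gradient of the sigmoid

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    # At |x| = 10 the gradient is ~4.5e-5: the neuron is saturated and
    # almost no gradient flows back through it during backprop.
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.6f}")
```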
Tanh function
• Squashes numbers into the range [-1, 1]
• Outputs are zero-centered
• Still suffers from saturated neurons and vanishing gradients
ReLU function
• Does not saturate in the positive region
• Very cheap to compute
• In practice, converges much faster than sigmoid/tanh (roughly 6x)
• Downsides:
1. Outputs are not zero-centered
2. And one more problem…
ReLU function (2)
ReLU function (3)
• If the data (or a large gradient update) pushes a unit's pre-activation to be negative for every input, the unit always outputs zero, receives zero gradient, and its weights are never updated again.
• This is the dead ReLU (dying ReLU) problem.
• A common mitigation is to initialize ReLU neurons with a small positive bias (e.g., 0.01), as sketched below.
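A minimal numpy sketch of this mitigation; the layer sizes and batch are illustrative, and the 0.01 bias follows the slide. With a zero bias roughly half of the units would be inactive on a random batch; the small positive bias nudges pre-activations upward so fewer units start out dead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256                     # illustrative layer sizes

W = 0.01 * rng.standard_normal((n_in, n_out))
b = np.full(n_out, 0.01)                   # small positive bias from the slide

x = rng.standard_normal((128, n_in))       # a random minibatch
h = np.maximum(0.0, x @ W + b)             # ReLU activation

# Fraction of (sample, unit) pairs that output exactly zero on this batch.
print("inactive fraction:", np.mean(h == 0.0))
```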
Leaky ReLU
• Does not saturate
• Computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g., 6x)
• Will not “die”
Parametric Rectifier (PReLU)
• Does not saturate
• Computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g., 6x)
• Will not “die”
• Unlike Leaky ReLU, the negative slope is a parameter learned by backpropagation
Exponential Linear Units (ELU)
• All the benefits of ReLU
• Outputs are closer to zero mean
• Compared with Leaky ReLU, the negative saturation regime adds some robustness to noise
• Computation requires exp() (a sketch of Leaky ReLU, PReLU, and ELU follows below)
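For reference, a numpy sketch of the three ReLU variants above. The 0.01 slope for Leaky ReLU and alpha = 1.0 for ELU are common defaults, and in PReLU the slope a is a learnable parameter, shown here as a fixed value for illustration.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

def prelu(x, a):
    # Same form as Leaky ReLU, but 'a' is learned by backprop
    # (typically one value per channel).
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # Smoothly saturates to -alpha for very negative inputs,
    # which pushes mean activations closer to zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(leaky_relu(x))
print(prelu(x, a=0.25))
print(elu(x))
```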
Maxout function
• Generalizes ReLU and Leaky ReLU: outputs the maximum of k linear functions, e.g., max(w_1^T x + b_1, w_2^T x + b_2)
• Does not saturate and does not die, but multiplies the number of parameters per neuron by k
Activation function recap
• In practice:
- Use ReLU, but be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains
- Don’t use sigmoid or tanh
• Some newer activation functions (sketched below):
- ReLU6 = min(6, ReLU(x))
- Swish
- Mish
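The newer activations listed above have simple closed forms; a small sketch using their standard definitions (ReLU6 as on the slide, Swish as x * sigmoid(beta * x), Mish as x * tanh(softplus(x))):

```python
import numpy as np

def relu6(x):
    # min(6, ReLU(x)): ReLU clipped at 6 on the positive side
    return np.minimum(6.0, np.maximum(0.0, x))

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); with beta = 1 this is also known as SiLU
    return x / (1.0 + np.exp(-beta * x))

def mish(x):
    # x * tanh(softplus(x)), where softplus(x) = log(1 + exp(x))
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-4.0, 8.0, 7)
print(relu6(x))
print(swish(x))
print(mish(x))
```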
Activation function recap
Data preprocessing
Data preprocessing
• Transform the data to zero mean: subtract the sample mean from every data sample
• Transform the data to unit standard deviation: divide by the per-feature standard deviation (see the sketch below)
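A concrete sketch of the two steps above, computed per feature over a hypothetical training matrix X of shape (N, D); the statistics must be estimated on the training set and reused for validation/test data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 20))   # hypothetical (N, D) data

mean = X.mean(axis=0)                  # per-feature sample mean
std = X.std(axis=0) + 1e-8             # epsilon guards against division by zero

X_centered = X - mean                  # step 1: zero mean
X_standardized = X_centered / std      # step 2: unit standard deviation

print(X_standardized.mean(axis=0).round(3))   # ~0 for every feature
print(X_standardized.std(axis=0).round(3))    # ~1 for every feature
```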
Data preprocessing (3)
• In practice, for images: centering only
• Example: the CIFAR-10 dataset, with 32x32x3 images (sketch below)
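A sketch of the "center only" practice for image data. Random pixels stand in for CIFAR-10 here; subtracting one mean per color channel (or, alternatively, a full 32x32x3 mean image) is the common choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for CIFAR-10: N images of shape 32x32x3 with pixel values in [0, 255]
images = rng.integers(0, 256, size=(5000, 32, 32, 3)).astype(np.float32)

channel_mean = images.mean(axis=(0, 1, 2))    # one mean per color channel, shape (3,)
images_centered = images - channel_mean       # centering only, no scaling

print("channel means before:", channel_mean)
print("channel means after: ", images_centered.mean(axis=(0, 1, 2)))
```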
Demo
• https://github.com/trungtv/Preprocessing-for-deep-learning
Weight initializations
Weight initialization
• What if all weights W are initialized to zero?
• Every neuron then computes the same output and receives the same gradient, so all neurons learn the same features at every iteration
Weight initialization (2)
• First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
Weight Initialization: Small random numbers
• The weights are too small: activations shrink toward zero with depth, so the gradient dL/dW ≈ 0 and learning stalls (see the simulation below)
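A quick simulation (hypothetical depth and widths) of why "w is too small" kills learning in a deep tanh network: the activation standard deviation collapses toward zero layer by layer, and since the weight gradients are proportional to these activations, dL/dW vanishes as well.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 8                 # illustrative network size
h = rng.standard_normal((1000, width))

for layer in range(depth):
    W = 0.01 * rng.standard_normal((width, width))   # "small random numbers"
    h = np.tanh(h @ W)
    # The std shrinks by roughly 0.01 * sqrt(width) ~ 0.23x per layer,
    # so after a few layers the activations (and gradients) are ~0.
    print(f"layer {layer + 1}: activation std = {h.std():.6f}")
```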
Increase the std
• With a larger standard deviation (e.g., 1.0), tanh neurons saturate at ±1, so the gradient dL/dW is again ≈ 0
Xavier initialization
• Assume x and w are i.i.d. (independent and identically distributed) random variables with zero mean
• Forward pass:
var(y) = var(w_1 x_1 + w_2 x_2 + ... + w_Nin x_Nin + b)
var(w_i x_i) = E[x_i]^2 var(w_i) + E[w_i]^2 var(x_i) + var(w_i) var(x_i)
With zero means, var(y) = Nin * var(w_i) * var(x_i)
Requiring var(y) = var(x_i) gives Nin * var(w_i) = 1, i.e. var(w_i) = 1 / Nin
• The same argument on the backward gradient signal gives: var(w_i) = 1 / Nout
• Averaging the two (see the sketch below):
var(w_i) = 2 / (Nin + Nout)
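A sketch of Xavier/Glorot initialization following the derivation above, showing both the forward-only condition var(w) = 1/Nin and the averaged condition var(w) = 2/(Nin + Nout); the layer sizes are illustrative.

```python
import numpy as np

def xavier_forward(n_in, n_out, rng):
    # var(w) = 1 / n_in keeps var(y) = var(x) in the forward pass
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def xavier_averaged(n_in, n_out, rng):
    # var(w) = 2 / (n_in + n_out): compromise between forward and backward
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 512))
y = x @ xavier_forward(512, 512, rng)
print(f"var(x) = {x.var():.3f}, var(y) = {y.var():.3f}")   # roughly equal
```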
Xavier initialization (2)
Weight Initialization: Tanh to ReLU
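The slide title points at the move from tanh to ReLU: the Xavier derivation assumes zero-mean activations, but ReLU zeroes out half of its inputs, so activations shrink with depth. The standard correction, Kaiming/He initialization with var(w) = 2/Nin, is sketched here under that assumption.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # var(w) = 2 / n_in compensates for ReLU discarding half of each distribution
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
h = rng.standard_normal((1000, 512))
for layer in range(6):
    h = np.maximum(0.0, h @ he_init(512, 512, rng))
    # With He init the activation scale stays roughly constant with depth.
    print(f"layer {layer + 1}: activation std = {h.std():.3f}")
```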
Normalizations
Batch Normalization
• Want the activations to have a distribution with zero mean and unit standard deviation? Then just transform them so that they do!
Batch Normalization
• Forcing exactly zero mean and unit standard deviation is too strict and can make the model underfit.
• Loosen the constraint: learnable scale (gamma) and shift (beta) parameters give the model an exit, letting it undo the normalization if it does not want to be bound (see the sketch below).
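A minimal training-time sketch of the last two slides: normalize each feature over the minibatch, then apply learnable scale (gamma) and shift (beta) so the network can relax, or even undo, the strict zero-mean/unit-std constraint. Shapes and epsilon are illustrative.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    # x: (N, D) minibatch; statistics are computed per feature over the batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit std per feature
    # gamma/beta are learned by backprop; setting gamma = sqrt(var) and
    # beta = mu would recover the original (unnormalized) distribution.
    return gamma * x_hat + beta, mu, var

rng = np.random.default_rng(0)
x = rng.normal(3.0, 5.0, size=(64, 10))
gamma, beta = np.ones(10), np.zeros(10)        # typical initialization
out, mu, var = batchnorm_forward_train(x, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```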
Batch Normalization: Test-time
• The mean/std estimates depend on the minibatch, so we can’t compute them the same way at test time; instead, use running averages of the statistics collected during training.
Batch Normalization: Test-time (2)
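A sketch of the usual test-time fix (continuing the hypothetical batchnorm_forward_train above): keep exponential running averages of the per-batch mean and variance during training, then use those fixed values for inference so the output for one sample no longer depends on the rest of the batch.

```python
import numpy as np

def batchnorm_forward_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    # Uses fixed statistics accumulated during training, not batch statistics
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# During training, after each minibatch with statistics (mu, var):
#   running_mu  = momentum * running_mu  + (1 - momentum) * mu
#   running_var = momentum * running_var + (1 - momentum) * var
# where momentum is typically around 0.9 or 0.99.
```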
Batch Normalization
Batch Normalization: Usage
Batch Normalization: Pros & Cons
• Makes deep networks much easier to train!
• Improves gradient flow
• Allows higher learning rates, faster convergence
• Networks become more robust to initialization
• Acts as regularization during training
• Zero overhead at test-time: can be fused with conv!
• Behaves differently during training and testing: this is a very common source of bugs!
Layer Normalization
Instance Normalization
Comparison of Normalization Layers
Group Normalization
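The four normalization layers above differ mainly in which axes the mean and variance are computed over. A numpy sketch for an (N, C, H, W) activation tensor, with a hypothetical group count G for Group Normalization (the learnable scale/shift is omitted for brevity):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
N, C, H, W = 8, 32, 16, 16
x = rng.standard_normal((N, C, H, W))

batch_norm    = normalize(x, axes=(0, 2, 3))   # per channel, over batch and space
layer_norm    = normalize(x, axes=(1, 2, 3))   # per sample, over channels and space
instance_norm = normalize(x, axes=(2, 3))      # per sample and channel, over space

G = 8                                          # hypothetical number of groups
group_norm = normalize(x.reshape(N, G, C // G, H, W),
                       axes=(2, 3, 4)).reshape(N, C, H, W)
print(batch_norm.shape, layer_norm.shape, instance_norm.shape, group_norm.shape)
```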
References
1. http://cs231n.stanford.edu
2. https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
Thank you
for your
attention!!!