
1

Lesson 3:
Training neural networks
(Part 1)
Viet-Trung Tran

2
Outline
• Activation functions
• Data preprocessing
• Weight initializations
• Normalizations

3
Activation functions

4
5
Sigmoid function
• Squashes numbers to the range [0, 1]
• Widely used historically because it resembles the firing rate of a neuron
• There are 3 downsides:
1. Vanishing gradient due to saturated neurons

6
Sigmoid function (2)

• What happens when x = -10?
• What happens when x = 0?
• What happens when x = 10?
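A quick numeric check of these three cases (a minimal numpy sketch, not on the original slide): the local gradient sigma'(x) = sigma(x)(1 - sigma(x)) is essentially zero at x = -10 and x = 10, and at most 0.25 at x = 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # local gradient d(sigma)/dx

for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigma(x)={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
# x = -10 and x = 10: gradient ~4.5e-5, so the upstream gradient is "killed"
# x = 0: gradient = 0.25, the largest value the sigmoid can ever pass through
```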

7
Sigmoid function (3)
• Squashes numbers to the range [0, 1]
• Widely used historically because it resembles the firing rate of a neuron
• There are 3 downsides:
1. Vanishing gradient due to saturated neurons
2. Outputs are not zero-centered

8
Sigmoid function (4)

• What if all of the neuron's inputs xi are positive?
• What does the gradient of the objective function with respect to w look like?
• Every element of dL/dw then has the same sign as the upstream gradient dL/df, because dL/dwi = (dL/df) · σ'(z) · xi with σ'(z) > 0 and every xi > 0: the components are either all positive or all negative.
• The weight update can therefore only move in a restricted set of directions of the search space (zig-zag dynamics).
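To make the sign argument concrete, here is a small numpy sketch (the inputs and upstream gradient are illustrative, not from the slides): with all-positive inputs, every component of dL/dw shares the sign of dL/df.

```python
import numpy as np

np.random.seed(0)
x = np.abs(np.random.randn(5))            # all inputs x_i > 0 (e.g., outputs of a previous sigmoid)
w = np.random.randn(5)

f = 1.0 / (1.0 + np.exp(-(w @ x)))        # sigmoid neuron output
dL_df = f - 1.0                           # some upstream gradient (illustrative)
dL_dw = dL_df * f * (1.0 - f) * x         # chain rule: dL/dw_i = dL/df * sigma'(.) * x_i

# f*(1-f) > 0 and x_i > 0, so every component has the sign of dL_df:
print(np.sign(dL_dw))                     # all -1 here (all +1 for a positive upstream gradient)
```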

9
Sigmoid function (5)
• Squashes numbers to the range [0, 1]
• Widely used historically because it resembles the firing rate of a neuron
• There are 3 downsides:
1. Vanishing gradient due to saturated neurons
2. Outputs are not zero-centered
3. Computing the exponential exp() is expensive

10
Tanh function
• Squashes numbers to the range [-1, 1]
• Zero-centered output
• Still kills gradients when saturated (vanishing gradient)

11
ReLU function
• Does not saturate in the positive region
• Very cheap to compute
• In practice, converges much faster than sigmoid/tanh (roughly 6x)
• Downsides:
1. Outputs are not zero-centered
2. And one more problem…

12
ReLU function (2)

• What happens when x = -10?
• What happens when x = 0?
• What happens when x = 10?

13
ReLU function (3)
• A ReLU unit can get knocked off the data manifold: if its pre-activation is negative for every training example, its output is always zero and its weights never receive a gradient update again.
• This is the dead ReLU (dying ReLU) problem.
• A common mitigation is to initialize ReLU neurons with a small positive bias (e.g., 0.01).
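A minimal numpy sketch of the dying-ReLU effect (illustrative numbers, not from the slides): once the pre-activation is negative for every input, the ReLU gate zeroes the gradient and the weights stop updating; a small positive bias makes this less likely at initialization.

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(100, 3)                  # a batch of inputs
w = np.random.randn(3) * 0.01
b = -5.0                                     # badly placed bias: pre-activation < 0 for every input

z = X @ w + b                                # pre-activation
a = np.maximum(0.0, z)                       # ReLU output: all zeros here
upstream = np.ones_like(a)                   # pretend upstream gradient
dw = X.T @ (upstream * (z > 0))              # ReLU gate kills everything -> dw == 0

print(a.max(), np.abs(dw).sum())             # 0.0 0.0 -> the unit is "dead" and cannot recover

b_better = 0.01                              # a small positive bias at init makes this failure less likely
```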

14
Leaky ReLU
• Does not saturate
• Computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g., 6x)
• Will not "die": a small negative slope keeps a non-zero gradient for x < 0

15
Parametric Rectifier (PReLU)
• Same benefits as Leaky ReLU: does not saturate, computationally efficient, converges much faster than sigmoid/tanh in practice (e.g., 6x), will not "die"
• The negative slope α is not fixed but learned by backpropagation
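A small sketch of both variants using the standard formulation f(x) = max(αx, x): Leaky ReLU uses a small fixed slope α, while PReLU treats α as a parameter updated by backprop.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # fixed small slope in the negative region
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # same form, but alpha is a parameter learned by backprop
    return np.where(x > 0, x, alpha * x)

def prelu_grads(x, alpha, upstream):
    dx = np.where(x > 0, 1.0, alpha) * upstream               # gradient w.r.t. the input
    dalpha = np.sum(np.where(x > 0, 0.0, x) * upstream)       # gradient w.r.t. the slope
    return dx, dalpha
```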

16
Exponential Linear Units
• All the benefits of ReLU
• Closer to zero-mean outputs
• The negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
• Computation requires exp()

17
Maxout function

• Does not have the basic dot-product -> nonlinearity form: it takes the max of two affine functions, max(w1^T x + b1, w2^T x + b2)
• Generalizes ReLU and Leaky ReLU
• Linear regime! Does not saturate! Does not die!
• Downside: doubles the number of parameters per neuron
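A sketch of a maxout layer under the usual two-piece formulation; with (W2, b2) = (0, 0) it reduces to ReLU, which is why it generalizes ReLU and Leaky ReLU but carries two full weight sets per layer.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # elementwise max of two affine maps
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

# two full sets of weights per layer -> twice the parameters of a ReLU layer
x = np.random.randn(4, 8)
W1, b1 = np.random.randn(8, 16), np.zeros(16)
W2, b2 = np.random.randn(8, 16), np.zeros(16)
h = maxout(x, W1, b1, W2, b2)      # shape (4, 16)
```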

18
Activation function recap
• In practice:
- Use ReLU. Be careful with your learning rates.
- Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains.
- Don't use sigmoid or tanh.
• Some newer activation functions:
- ReLU6 = min(6, ReLU(x))
- Swish: x · sigmoid(βx)
- Mish: x · tanh(softplus(x))
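For reference, these newer activations take only a few lines of numpy (standard definitions; Swish is shown with the common β = 1 variant, also known as SiLU):

```python
import numpy as np

def relu6(x):
    return np.minimum(6.0, np.maximum(0.0, x))

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))       # x * sigmoid(beta * x)

def softplus(x):
    return np.log1p(np.exp(x))

def mish(x):
    return x * np.tanh(softplus(x))            # x * tanh(softplus(x))
```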

19
Activation function recap

20
Data preprocessing

21
Data preprocessing
• Zero-center the data: subtract the per-feature sample mean from every data sample
• Normalize to unit standard deviation: divide each feature by its standard deviation

(Assume X [N x D] is the data matrix, one example per row)
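With X of shape [N x D], one example per row, the two steps look like this (a standard numpy sketch with synthetic data standing in for the real matrix):

```python
import numpy as np

X = np.random.randn(1000, 20) * 5.0 + 3.0   # [N x D] data matrix, one example per row

X -= np.mean(X, axis=0)         # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)          # normalize: divide by the per-feature standard deviation
```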


22
Data preprocessing (2)
• Beyond centering and scaling, PCA or data whitening can also be applied
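Both can be written in a few lines from the covariance of the centered data (a cs231n-style sketch; the data here is synthetic and the number of kept components is illustrative):

```python
import numpy as np

X = np.random.randn(1000, 20)               # stand-in for a real [N x D] dataset
X -= X.mean(axis=0)                         # whitening assumes zero-centered data
cov = X.T @ X / X.shape[0]                  # data covariance matrix [D x D]
U, S, _ = np.linalg.svd(cov)                # eigenvectors in the columns of U

Xrot = X @ U                                # decorrelate the data (PCA rotation)
Xrot_reduced = X @ U[:, :10]                # optional: keep only the top 10 components
Xwhite = Xrot / np.sqrt(S + 1e-5)           # whitening: unit variance along every eigen-direction
```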

23
Data preprocessing (3)
• In practice for images: center only
• Example: the CIFAR10 dataset with 32x32x3 images
• PCA and whitening are rarely used for images
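Centering an image dataset usually means subtracting either the full mean image or just a per-channel mean (a sketch with CIFAR10-shaped random data standing in for the real dataset):

```python
import numpy as np

X_train = np.random.rand(50000, 32, 32, 3)          # stand-in for CIFAR10 images

mean_image = X_train.mean(axis=0)                   # [32 x 32 x 3] mean image (e.g., AlexNet style)
per_channel_mean = X_train.mean(axis=(0, 1, 2))     # 3 numbers, one per channel (e.g., VGG style)

X_train_centered = X_train - per_channel_mean       # center only; no PCA / whitening
```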

24
Demo
• https://github.com/trungtv/Preprocessing-for-deep-learning

25
Weight initializations

26
Weight initialization
• What if all weights W are initialized to zero?
• Every neuron then computes the same output and receives the same gradient update, so they all learn the same features at every iteration (symmetry is never broken)
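A tiny sketch of the symmetry problem (illustrative; it uses a constant rather than zero initialization so the effect is visible with non-zero activations): when all weights start identical, every hidden unit computes the same value and receives the same gradient, so their weights can never diverge.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(8, 3)
W1 = np.full((3, 4), 0.1)           # every hidden unit starts with identical weights
W2 = np.full((4, 2), 0.1)

h = np.maximum(0.0, X @ W1)         # all 4 hidden columns are identical
scores = h @ W2
dscores = scores - 1.0              # some upstream gradient (illustrative)
dh = dscores @ W2.T
dW1 = X.T @ (dh * (h > 0))

print(np.allclose(h, h[:, :1]))     # True: identical activations
print(np.allclose(dW1, dW1[:, :1])) # True: identical gradients -> columns of W1 stay identical
```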

27
Weight initialization (2)
• First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
• Works ~okay for small networks, but causes problems with deeper networks

28
Weight Initialization: Small random numbers

• With std = 0.01, the activations shrink toward zero as the network gets deeper
• The gradients dL/dW (which involve the activations) then also collapse to zero: w is too small, so nothing learns

29
Increase Std

• Increasing the std instead drives the tanh units into saturation at ±1, so the local gradients are zero and dL/dW again vanishes: still no learning (see the sketch below)
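The classic way to see both failure modes is to push random data through a deep tanh network and inspect the per-layer activation statistics (a sketch in the spirit of the cs231n experiment; the layer sizes are illustrative):

```python
import numpy as np

def activation_stats(weight_std, num_layers=6, dim=512):
    np.random.seed(0)
    x = np.random.randn(1000, dim)
    for layer in range(num_layers):
        W = np.random.randn(dim, dim) * weight_std
        x = np.tanh(x @ W)
        print(f"std={weight_std:5.2f} layer {layer}: mean={x.mean():+.4f} std={x.std():.4f}")

activation_stats(0.01)   # activations collapse toward 0 -> dL/dW (which involves x) goes to 0
activation_stats(1.00)   # activations saturate at +/-1 -> local tanh gradients go to 0
```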

30
Xavier initialization
• Assume x and w are i.i.d. (independent and identically distributed) random variables with zero mean
• Calculation in the forward direction:
Var(y) = Var(w1 x1 + w2 x2 + ... + wNin xNin + b)
Var(wi xi) = E[xi]^2 Var(wi) + E[wi]^2 Var(xi) + Var(wi) Var(xi) = Var(wi) Var(xi)   (zero means)
Var(y) = Nin · Var(wi) · Var(xi)
To preserve the variance, require Nin · Var(wi) = 1, i.e. Var(wi) = 1 / Nin
• The same argument on the backward gradient signal gives:
Var(wi) = 1 / Nout
• Compromise (average of the two constraints):
Var(wi) = 2 / (Nin + Nout)
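In code, both variants are just a rescaled Gaussian (a minimal sketch):

```python
import numpy as np

def xavier_init(n_in, n_out):
    # forward-pass constraint: Var(w) = 1 / N_in
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

def xavier_init_avg(n_in, n_out):
    # compromise between forward and backward constraints: Var(w) = 2 / (N_in + N_out)
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))
```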
31
Xavier initialization (2)

32
Weight Initialization: Tanh to ReLU

Var(y) = Var(w1 x1 + w2 x2 + ... + wNin xNin + b)
ReLU sets half of its inputs to zero, so only about Nin/2 terms contribute:
Var(y) = (Nin / 2) · Var(wi) · Var(xi)
(Nin / 2) · Var(wi) = 1  =>  Var(wi) = 2 / Nin
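This gives the He/MSRA rule shown on the next slide; as a one-liner (a sketch):

```python
import numpy as np

def he_init(n_in, n_out):
    # Var(w) = 2 / N_in, compensating for ReLU zeroing half of the activations
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```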
33
He / MSRA Initialization

34
Normalizations

35
Batch Normalization
• Want the inputs to each activation function to have zero mean and unit standard deviation? Then just enforce it: for each feature, subtract the mini-batch mean and divide by the mini-batch standard deviation.

36
Batch Normalization
• Forcing exactly zero mean and unit standard deviation is too strict a constraint and can cause the model to underfit.
• Loosen it: give the model an escape route with learnable scale and shift parameters (γ, β), so it can undo the normalization if it does not want to be constrained.
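Putting the two slides together, a training-time batchnorm forward pass looks roughly like this (a numpy sketch; gamma and beta are the learnable scale and shift, eps is for numerical stability):

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    # x: [N x D] mini-batch, gamma/beta: [D] learnable parameters
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to ~zero mean, unit std
    out = gamma * x_hat + beta              # let the network undo it if it wants to
    return out, mu, var
```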

37
Batch Normalization: Test-time
• Estimates depend on the minibatch; we can't do this at test-time!
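At test time the per-batch statistics are replaced by running averages accumulated during training (a sketch continuing the function above; `momentum` is an assumed hyperparameter name):

```python
import numpy as np

def batchnorm_forward_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    # use fixed statistics estimated during training, not the current batch
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# during training, after each batch:
#   running_mu  = momentum * running_mu  + (1 - momentum) * mu
#   running_var = momentum * running_var + (1 - momentum) * var
```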

38
Batch Normalization: Test-time (2)

39
Batch Normalization

40
Batch Normalization: Usage

41
Batch Normalization: Pros & Cons
• Makes deep networks much easier to train!
• Improves gradient flow
• Allows higher learning rates, faster convergence
• Networks become more robust to initialization
• Acts as regularization during training
• Zero overhead at test-time: can be fused with the preceding conv!
• Behaves differently during training and testing: this is a very common source of bugs!

42
Layer Normalization

43
Instance Normalization

44
Comparison of Normalization Layers
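For an NCHW activation tensor, the four normalization layers differ only in which axes the mean and variance are taken over (a sketch; the shapes and group count of 4 are illustrative):

```python
import numpy as np

x = np.random.randn(8, 32, 14, 14)                      # [N, C, H, W]

bn_mu = x.mean(axis=(0, 2, 3), keepdims=True)           # BatchNorm: per channel, across the batch
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)           # LayerNorm: per sample, across all channels
in_mu = x.mean(axis=(2, 3), keepdims=True)              # InstanceNorm: per sample and per channel

G = 4                                                   # GroupNorm: per sample, per group of channels
xg = x.reshape(8, G, 32 // G, 14, 14)
gn_mu = xg.mean(axis=(2, 3, 4), keepdims=True)
```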

45
Group Normalization

46
References
1. http://cs231n.stanford.edu
2. https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/

47
Thank you
for your
attention!!!

48
