02 Training
Mei-Chen Yeh
Neuron
A neuron computes a function f: R^K → R, y = f(a):
  z = a1 w1 + a2 w2 + ... + aK wK + b      (weights w1, ..., wK and bias b)
  y = σ(z)
Activation function (sigmoid): σ(z) = 1 / (1 + e^(-z))
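As a small illustration (not part of the slides), the neuron above can be written directly in NumPy; the input, weight, and bias values are arbitrary:

```python
import numpy as np

def neuron(a, w, b):
    """A single neuron: weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(a, w) + b              # z = a1*w1 + ... + aK*wK + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation, output between 0 and 1

# Example with K = 3 (made-up numbers, just to show the computation)
a = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.1, -0.3])
b = 0.2
print(neuron(a, w, b))
```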
Neural Network
Figure: a fully connected network built from neurons. Input layer (x1, x2, ..., xN) → hidden layers (Layer 1, Layer 2, ..., Layer L) → output layer (y1, y2, ..., yM).
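A minimal sketch of the forward pass through such a network: each layer repeats the single-neuron computation for all of its neurons. The layer sizes and random weights below are arbitrary illustrations, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Feed x through every layer: each layer computes sigmoid(W a + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a  # the output layer values (y1, ..., yM)

# Toy network: N = 4 inputs, hidden layers of 5 and 3 neurons, M = 2 outputs
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
print(forward(rng.normal(size=4), weights, biases))
```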
Figure: the same network with a Softmax layer at the output (y1, ..., yM), used for multi-class classification (inputs x1, x2, ..., xN; hidden layers in between).
Multi-class classification
Softmax converts the outputs into probabilities: 0 ≤ y_i ≤ 1 and Σ_i y_i = 1
• Example: #class = 3
  y_i = e^(z_i) / Σ_j e^(z_j), for i = 1, 2, 3
Softmax gives probabilities: 0 ≤ y_i ≤ 1 and Σ_i y_i = 1
• Worked example: #class = 3
  z1 = 3:   e^(z1) ≈ 20    →  y1 = 20 / (20 + 2.7 + 0.05) ≈ 0.88
  z2 = 1:   e^(z2) ≈ 2.7   →  y2 ≈ 0.12
  z3 = -3:  e^(z3) ≈ 0.05  →  y3 ≈ 0
In general, y_i = e^(z_i) / Σ_j e^(z_j).
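The worked example can be checked with a few lines of NumPy (a sketch; the printed values match the rounded numbers above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z))  # approximately [0.88, 0.12, 0.00]
```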
Example (Project #1)
• Stock price prediction: do we need a softmax layer?
Figure: past price data (a ?-dim input vector) goes into a neural network, which outputs either a scalar y or a 5-dim vector (y1, ..., y5).
Input: a ?-dim vector
Output: a scalar or a 5-dim vector
Hidden layers: Feature extraction
The hidden layers act as a feature extractor, replacing hand-crafted feature engineering.
Figure: input layer (x1, ..., xN) → hidden layers → extracted features (f1, ..., fK) → output layer (y1, ..., yM).
SVM: a nonlinear (hand-crafted) kernel function, followed by a linear classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
Deep Learning: a learned feature mapping φ(x) (a learned "kernel"), followed by a simple classifier.
Figure: input x (x1, ..., xN) → hidden layers computing φ(x) → output layer (y1, ..., yM).
Deeper is better?

Layer X Size    Word Error Rate (%)    Layer X Size    Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                   1 X 3772        22.5
7 X 2k          17.1                   1 X 4634        22.6
                                       1 X 16k         22.1

Not surprising: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Universal approximation theorem
Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Figure: a shallow network (one wide hidden layer) vs. a deep network (many narrow layers), both on inputs x1, x2, ..., xN.
Fat + Short vs. Thin + Tall

Layer X Size    Word Error Rate (%)    Layer X Size    Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                   1 X 3772        22.5
7 X 2k          17.1                   1 X 4634        22.6
                                       1 X 16k         22.1

Why?

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Modularization
• Deep → Modularization
Analogy: don't put everything in your main function.
http://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html
Modularization
• Deep → Modularization
Training one classifier per combined class directly on images:
  Classifier 1: females with long hair, F(L)
  Classifier 2: males with long hair, M(L) (weak: few examples)
  Classifier 3: females with short hair, F(S)
  Classifier 4: males with short hair, M(S)
Modularization
• Deep → Modularization
Train basic classifiers for the attributes first:
  Male or female?  F(L), F(S) vs. M(L), M(S)
  Long or short hair?  F(L), M(L) vs. F(S), M(S)
Each basic classifier can have sufficient training examples.
Modularization: the classifiers can now be trained with fewer data
• Deep → Modularization
The basic classifiers (male or female? long or short hair?) are shared by the following classifiers as modules:
  Classifier 1: females with long hair
  Classifier 2: males with long hair (fine even with few data)
  Classifier 3: females with short hair
  Classifier 4: males with short hair
Modularization
• Deep → Modularization → Less training data?
The modularization is automatically learned from data.
Figure: the first layer learns the most basic classifiers; the second layer uses the 1st layer as modules to build classifiers; the third layer uses the 2nd layer as modules; and so on (inputs x1, x2, ..., xN).
Modularization - Image
• Deep → Modularization
Figure: the same layer-by-layer modularization applied to images: the first layer learns the most basic detectors, the second layer builds on them as modules, and so on.
Reference: Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. ECCV, 2014.
Questions
Figure: handwritten digit recognition. A 16 x 16 image (256 pixels; ink → 1, no ink → 0) is flattened into inputs x1, ..., x256 and fed to a network with a Softmax output y1, ..., y10 (e.g., y2 = 0.7 means "is 2", y10 = 0.2 means "is 0").
Set the network parameters θ such that:
  Input an image of "1": y1 has the maximum value
  Input an image of "2": y2 has the maximum value
How to achieve this?
Learning from training data
• Preparing training data: images and their labels
Figure: given a set of parameters, an image of "1" (x1, ..., x256) passes through the network with a Softmax output (y1, ..., y10). The output should be as close as possible to the label: y1 close to 1 and y2, ..., y10 close to 0. The per-example loss l measures this distance.
Total loss over all R training examples: L = Σ_{r=1}^{R} l_r
For all training data:
  x1 → NN → y1, compared with the target ŷ1 → loss l1
  x2 → NN → y2, compared with the target ŷ2 → loss l2
  x3 → NN → y3, compared with the target ŷ3 → loss l3
  ...
Find a function in the function set (i.e., a set of network parameters) that makes the total loss L as small as possible.
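As a minimal sketch of this objective, `network` and `per_example_loss` below are hypothetical placeholders for the model f(x; θ) and the per-example loss l_r:

```python
def total_loss(theta, training_data, network, per_example_loss):
    """Total loss L: the sum of the per-example losses l_r over all training pairs (x_r, target_r)."""
    return sum(per_example_loss(network(x, theta), target)
               for x, target in training_data)
```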
An iterative approach!

Training data (inputs → class):
  1.4  2.7  1.9  →  0
  3.8  3.4  3.2  →  0
  6.4  2.8  1.7  →  1
  4.1  0.1  0.2  →  0
  etc.

1. Present a training pattern: (1.4, 2.7, 1.9)
2. Feed it through the network to get the output: 0.8
3. Compare with the target output: target 0, loss 0.8
4. Adjust the weights based on the error.

Present the next pattern: (6.4, 2.8, 1.7)
1. Feed it through to get the output: 0.9
2. Compare with the target output: target 1, loss 0.1
3. Adjust the weights based on the error.

And so on ...
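Here is a hedged sketch of this iterative procedure on the toy data above, using a single sigmoid neuron and squared error so that "adjust the weights based on the error" becomes an explicit gradient step; the learning rate and number of passes are arbitrary choices, not values from the slides:

```python
import numpy as np

training_data = [([1.4, 2.7, 1.9], 0),
                 ([3.8, 3.4, 3.2], 0),
                 ([6.4, 2.8, 1.7], 1),
                 ([4.1, 0.1, 0.2], 0)]

rng = np.random.default_rng(0)
w, b, lr = rng.normal(scale=0.1, size=3), 0.0, 0.05

for epoch in range(100):
    for x, t in training_data:                     # present a training pattern
        x = np.asarray(x)
        y = 1.0 / (1.0 + np.exp(-(w @ x + b)))     # feed it through to get the output
        error = y - t                              # compare with the target output
        grad_z = error * y * (1.0 - y)             # gradient of 0.5*(y - t)^2 w.r.t. z
        w -= lr * grad_z * x                       # adjust the weights based on the error
        b -= lr * grad_z
```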
Examples of loss functions (for the same "1" example, with target t = (1, 0, ..., 0)):
• Cross entropy: l = −Σ_i t_i log(y_i)
• Mean squared error: l = (1/n) Σ_i (t_i − y_i)^2
Figure: given a set of parameters, the Softmax outputs (y1, ..., y10) should be as close as possible to the targets (t1 = 1, t2 = 0, ..., t10 = 0).
Example (MSE, 3 classes, target (1, 0, 0), output (0.5, 0.3, 0.2)):
Loss = (1/3)[(1 − 0.5)^2 + (0 − 0.3)^2 + (0 − 0.2)^2] = 0.126...
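The numbers on this slide can be reproduced directly (a sketch for the 3-class example with target (1, 0, 0) and output (0.5, 0.3, 0.2)):

```python
import numpy as np

t = np.array([1.0, 0.0, 0.0])   # one-hot target
y = np.array([0.5, 0.3, 0.2])   # network output

mse = np.mean((t - y) ** 2)             # (1/3)[(1-0.5)^2 + (0-0.3)^2 + (0-0.2)^2]
cross_entropy = -np.sum(t * np.log(y))  # only the true class contributes: -log(0.5)

print(mse)            # 0.1266...
print(cross_entropy)  # 0.693...
```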
Cross entropy vs. MSE
• Example: binary classification, label = 1
Figure: the two losses plotted against the prediction y for label 1 (cross entropy = −log y, MSE = (1 − y)^2).
Question
• Why don't we directly use the classification error (over the training set) as the loss?
Minimizing cross entropy ≈
Minimizing KL divergence ≈
Maximum likelihood estimation
Maximum Likelihood Estimation
• Consider a set of m examples X = {x^(1), ..., x^(m)} drawn independently from an unknown data-generating distribution p_data(x)
• Let p_model(x; θ) be a parametric family of probability distributions over the same space, indexed by θ
• The maximum likelihood estimator for θ is
  θ_ML = argmax_θ p_model(X; θ) = argmax_θ Π_{i=1}^{m} p_model(x^(i); θ)
• Taking the log does not change the argmax:
  θ_ML = argmax_θ Σ_{i=1}^{m} log p_model(x^(i); θ)
These objectives are the same:
Minimizing cross entropy ≈ minimizing KL divergence ≈ maximum likelihood estimation
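A small numeric sketch of this equivalence (the class probabilities and labels below are made up): for one-hot labels, the average cross entropy over the training set equals the negative average log-likelihood, so minimizing one maximizes the other:

```python
import numpy as np

labels = np.array([0, 2, 1])          # true class indices for m = 3 examples
probs = np.array([[0.7, 0.2, 0.1],    # p_model(class | x_i; theta) for each example
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

log_likelihood = np.sum(np.log(probs[np.arange(3), labels]))
avg_cross_entropy = -np.mean(np.log(probs[np.arange(3), labels]))

print(log_likelihood)       # sum_i log p_model(x_i; theta)
print(avg_cross_entropy)    # = -log_likelihood / m
```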
Historical Notes
• Two factors improved neural network performance from 1986 to 2015:
  • Large datasets → better generalization
  • Powerful computers → larger networks
• In addition, two algorithmic changes were:
  1. The replacement of MSE with the cross-entropy family of loss functions
  2. The replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units (ReLU) (next week)
Coming next
• Training a Deep Neural Network (part 2)