02 Training

Announcements

• Project #1 was announced last week.
• The TA will give a tutorial today.
Lecture 2
Training a Deep Neural Network: Part 1

Mei-Chen Yeh

Slides are modified from the lecture slides of
Prof. H. Y. Lee, National Taiwan University
Review: Neuron

A neuron is a function f: R^K → R with model parameters: weights w1, …, wK and bias b.

  z = a1 w1 + a2 w2 + ⋯ + aK wK + b
  y = σ(z)

σ is the activation function; here it is the sigmoid function σ(z) = 1 / (1 + e^−z).
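A minimal NumPy sketch of this neuron; the input, weight, and bias values below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    """Single neuron: weighted sum of inputs plus bias, then sigmoid."""
    z = np.dot(a, w) + b
    return sigmoid(z)

# Illustrative values (not from the slides)
a = np.array([1.0, -2.0, 0.5])   # inputs a1..aK
w = np.array([0.3, 0.1, -0.4])   # weights w1..wK
b = 0.2                          # bias
print(neuron(a, w, b))           # y = sigma(z), a value in (0, 1)
```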
Neural Network

[Figure: a fully connected network. The input layer x1, …, xN feeds the neurons of Layer 1; each layer's outputs feed the next layer (Layer 2, …, Layer L), and the last layer produces the outputs y1, …, yM. The layers between the input layer and the output layer are the hidden layers.]
“Deep” ≈ many hidden layers


Output layer

Used to interpret the outputs.

[Figure: the same network, with a softmax applied at the output layer so that y1, …, yM can be read as class scores for multi-class classification.]
Multi-class classification

Softmax makes the outputs behave like probabilities: 0 ≤ yi ≤ 1 and Σi yi = 1.

• Example: #class = 3, with a per-output sigmoid activation:
  y1 = σ(z1), y2 = σ(z2), y3 = σ(z3)
Softmax

Probabilities: 0 ≤ yi ≤ 1 and Σi yi = 1.

• Example: #class = 3

  yi = e^zi / Σj e^zj

  z1 = 3   →  e^z1 ≈ 20    →  y1 ≈ 0.88
  z2 = 1   →  e^z2 ≈ 2.7   →  y2 ≈ 0.12
  z3 = −3  →  e^z3 ≈ 0.05  →  y3 ≈ 0
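A minimal NumPy sketch of the softmax computation, using the z values from the example above:

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate each score, then normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z))  # approximately [0.88, 0.12, 0.00]
```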
Example (Project #1)

• Stock price prediction: past price data is fed into a neural network, which outputs either a single value y or five values y1, …, y5.

  Input: a ?-dim vector
  Output: a scalar or a 5-dim vector
  Do we need a softmax layer?
Hidden layers: Feature extraction

The hidden layers act as a feature extractor, replacing hand-crafted feature engineering.

[Figure: the same network, with the last hidden layer's activations f1, …, fK serving as the features from which the output layer computes y1, …, yM.]

SVM: a nonlinear (hand-crafted) kernel function, followed by a linear classifier.

Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
Deep Learning: a learned "kernel" φ(x), followed by a simple classifier.

[Figure: the input x = (x1, …, xN) is mapped by the hidden layers to φ(x), and a simple classifier on φ(x) produces y1, …, yM.]
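A minimal sketch of this view, assuming a two-hidden-layer network; every size and weight below is an illustrative (untrained) assumption, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative shapes: 4 inputs -> 8 -> 8 hidden units -> 3 classes
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)

def phi(x):
    """Hidden layers = learned feature extractor (the 'kernel' phi(x))."""
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return h2

def classify(x):
    """Output layer = a simple linear + softmax classifier on phi(x)."""
    return softmax(W3 @ phi(x) + b3)

print(classify(rng.normal(size=4)))  # class scores for one input
```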
Deeper is better?

Layer X Size   Word Error Rate (%)        Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2                       1 X 3772       22.5
7 X 2k         17.1                       1 X 4634       22.6
                                          1 X 16k        22.1

Not surprising: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Universal approximation theorem

Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).

Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why a deep neural network, not a fat neural network?

Fat + Short vs. Thin + Tall

[Figure: two networks over the same inputs x1, …, xN with the same number of parameters: a shallow, wide one (fat + short) and a deep, narrow one (thin + tall).]

Which one is better?
Fat + Short vs. Thin + Tall

Layer X Size   Word Error Rate (%)        Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2                       1 X 3772       22.5
7 X 2k         17.1                       1 X 4634       22.6
                                          1 X 16k        22.1

Why?

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Modularization

• Deep → Modularization

Don't put everything in your main function.

http://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html
Modularization

• Deep → Modularization

Image →
  Classifier 1: females with long hair, F(L)
  Classifier 2: males with long hair, M(L)  (weak: few examples)
  Classifier 3: females with short hair, F(S)
  Classifier 4: males with short hair, M(S)
Modularization

• Deep → Modularization

Each basic classifier can have sufficient training examples.

Image → basic classifiers (classifiers for the attributes):
  Male or female?  F(L), F(S) vs. M(L), M(S)
  Long or short?   F(L), M(L) vs. F(S), M(S)
Modularization

• Deep → Modularization

Classifiers 1–4 share the basic classifiers (male or female? long or short?) as modules, so each can be trained with fewer data:
  Classifier 1: females with long hair
  Classifier 2: males with long hair  (fine even with little data)
  Classifier 3: females with short hair
  Classifier 4: males with short hair
Modularization

• Deep → Modularization → Less training data?

The modularization is automatically learned from data.

[Figure: the network over inputs x1, …, xN. The 1st layer learns the most basic classifiers; the 2nd layer uses the 1st layer as modules to build classifiers; the 3rd layer uses the 2nd layer as modules; and so on.]
Modularization - Image

• Deep → Modularization

[Figure: the same network over inputs x1, …, xN: the 1st layer learns the most basic classifiers, the 2nd layer uses the 1st layer as modules to build classifiers, the 3rd layer uses the 2nd layer as modules, and so on.]

Reference: Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. ECCV, 2014.
Questions

Q: How many layers? How many neurons in each layer?
• Trial and error + intuition

Q: Can the structure be automatically determined?
• Yes, but not widely studied yet (e.g., evolutionary artificial neural networks)

Q: Can we design the network structure?
• Variants of neural networks (coming up)
Machine Learning: 3 Steps

Step 1: Define a set of functions

Step 2: Measure goodness of a function

Step 3: Find the best function


Learning network parameters

θ = {W^1, b^1, W^2, b^2, ⋯, W^L, b^L}

[Figure: handwritten-digit recognition. A 16 x 16 = 256-pixel image (ink → 1, no ink → 0) is fed in as x1, …, x256; the softmax outputs y1, …, y10 score each digit, e.g. y1 = 0.1 "is 1", y2 = 0.7 "is 2", …, y10 = 0.2 "is 0".]

Set the network parameters θ such that:
  Input "1" → y1 has the maximum value
  Input "2" → y2 has the maximum value
  ……

How to achieve this?
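A minimal sketch of the forward pass with parameters θ = {W^1, b^1, …, W^L, b^L}, assuming sigmoid hidden layers and a softmax output; the layer sizes and random (untrained) weights below are illustrative, not a trained digit recognizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# theta = {W^1, b^1, ..., W^L, b^L}; illustrative sizes: 256 -> 64 -> 64 -> 10
sizes = [256, 64, 64, 10]
theta = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
         for n, m in zip(sizes[:-1], sizes[1:])]

def network(x, theta):
    """Forward pass: sigmoid hidden layers, softmax output layer."""
    a = x
    for W, b in theta[:-1]:
        a = sigmoid(W @ a + b)
    W, b = theta[-1]
    return softmax(W @ a + b)

x = rng.integers(0, 2, size=256).astype(float)  # fake 16x16 binary image (ink/no ink)
y = network(x, theta)
print(y.argmax())  # index of the largest output = predicted digit class
```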
Learning from training data

• Preparing training data: images and their labels

[Figure: example digit images labelled "5", "0", "4", "1", "9", "2", "1", "3".]

We use the training data to find the network parameters.
Loss

A good model should make the total loss over all examples as small as possible.

[Figure: given a set of parameters, an image of "1" is fed in as x1, …, x256; the softmax outputs y1, …, y10 should be as close as possible to the target (1, 0, …, 0). The loss 𝑙 measures the gap.]

Loss measures the difference between the network output and the target.
Total Loss

For all training data:

  L = Σ_{r=1}^{R} 𝑙r

[Figure: each training input x^r passes through the NN to give the output y^r, which is compared with its target ŷ^r to give the loss 𝑙r, for r = 1, …, R.]

Make the total loss L as small as possible: find the function in the function set, i.e. the network parameters θ*, that minimizes L.
Training data

  Inputs           class
  1.4  2.7  1.9    0
  3.8  3.4  3.2    0
  6.4  2.8  1.7    1
  4.1  0.1  0.2    0
  etc …

An iterative approach!

• Initialise with random weights.
• Present a training pattern: (1.4, 2.7, 1.9).
• Feed it through to get the output: 0.8.
• Compare with the target output, 0: loss = 0.8.
• Adjust the weights based on the error.
• Present the next training pattern: (6.4, 2.8, 1.7).
• Feed it through to get the output: 0.9.
• Compare with the target output, 1: loss = 0.1.
• Adjust the weights based on the error.
• And so on …
Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments.

Algorithms for weight adjustment are designed to make changes that will reduce the loss.
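A minimal sketch of this iterative loop for a single sigmoid neuron trained with squared error and plain gradient descent. The data rows are the ones from the table above; the learning rate, epoch count, and exact update rule are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data from the table: 3 inputs, binary class label
X = np.array([[1.4, 2.7, 1.9],
              [3.8, 3.4, 3.2],
              [6.4, 2.8, 1.7],
              [4.1, 0.1, 0.2]])
t = np.array([0.0, 0.0, 1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = rng.normal(scale=0.1, size=3)   # initialise with random weights
b = 0.0
lr = 0.1                            # illustrative learning rate

for epoch in range(1000):
    total_loss = 0.0
    for i in rng.permutation(len(X)):       # take training instances in random order
        x, target = X[i], t[i]
        y = sigmoid(np.dot(w, x) + b)       # feed it through to get the output
        loss = (y - target) ** 2            # compare with the target output
        total_loss += loss                  # total loss L = sum of per-example losses
        # adjust the weights slightly in the direction that reduces the loss
        grad_z = 2 * (y - target) * y * (1 - y)
        w -= lr * grad_z * x
        b -= lr * grad_z

print(sigmoid(X @ w + b))  # outputs should move toward the targets 0, 0, 1, 0
```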
Loss

A good model should make the total loss over all examples as small as possible.

Example loss functions:
• cross entropy: 𝑙 = Σi −ti log(yi)
• mean squared error: 𝑙 = (1/n) Σi (ti − yi)^2

[Figure: given a set of parameters, an image of "1" is fed in as x1, …, x256; the softmax outputs y1, …, y10 should be as close as possible to the targets t1 = 1, t2 = 0, …, t10 = 0. The loss 𝑙 measures the gap.]

Loss measures the difference between the network output and the target.
Cross Entropy: 𝑙 = Σi −ti ln(yi)

  Loss = −(1 × ln 0.5 + 0 × ln 0.3 + 0 × ln 0.2) = 0.693…

Mean Squared Error: 𝑙 = (1/n) Σi (ti − yi)^2

  Loss = (1/3) [(1 − 0.5)^2 + (0 − 0.3)^2 + (0 − 0.2)^2] = 0.126…
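A quick NumPy check of the two worked examples above, with targets t = (1, 0, 0) and outputs y = (0.5, 0.3, 0.2):

```python
import numpy as np

t = np.array([1.0, 0.0, 0.0])   # one-hot target
y = np.array([0.5, 0.3, 0.2])   # network outputs

cross_entropy = -np.sum(t * np.log(y))   # ~= 0.693
mse = np.mean((t - y) ** 2)              # ~= 0.126
print(cross_entropy, mse)
```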
Cross entropy vs. MSE

• Example: binary classification, label = 1
Question
• Why don’t we directly use the
classification error (over the training set)
as the loss?
Minimizing cross entropy ≈
Minimizing KL divergence ≈
Maximum likelihood estimation
Maximum Likelihood Estimation

• Consider a set of m examples X = {x^(1), ..., x^(m)} drawn independently from an unknown distribution p_data(x).
• Let p_model(x; θ) be a parametric family of probability distributions over the same space, indexed by θ.
• The maximum likelihood estimator for θ is

  θ_ML = arg max_θ p_model(X; θ) = arg max_θ ∏_{i=1}^{m} p_model(x^(i); θ)

Taking the log does not change the arg max:

  θ_ML = arg max_θ Σ_{i=1}^{m} log p_model(x^(i); θ)

Rescaling (dividing by m) does not change the arg max:

  θ_ML = arg max_θ E_{x∼p̂_data} log p_model(x; θ)

The KL divergence between the training set (its empirical distribution p̂_data) and the model distribution is:

  D_KL(p̂_data ‖ p_model) = E_{x∼p̂_data} [ log p̂_data(x) − log p_model(x; θ) ]

The first term has nothing to do with the model, so to minimize D_KL we only need to minimize

  −E_{x∼p̂_data} log p_model(x; θ)

Same! (Minimizing this is exactly maximizing the likelihood, and it is the cross entropy between p̂_data and p_model.)
Minimizing cross entropy ≈
Minimizing KL divergence ≈
Maximum likelihood estimation
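A small numerical sketch of this equivalence on a toy discrete distribution (the probabilities below are made up for illustration): the cross entropy equals the KL divergence plus the entropy of the data distribution, and that entropy does not depend on θ, so minimizing one minimizes the other.

```python
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])    # empirical distribution (illustrative)
p_model = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative)

cross_entropy = -np.sum(p_data * np.log(p_model))
entropy = -np.sum(p_data * np.log(p_data))        # does not depend on the model
kl = np.sum(p_data * np.log(p_data / p_model))

print(cross_entropy, kl + entropy)    # equal: CE = KL + H(p_data)
```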
Historical Notes

• Two improvements in neural network performance from 1986 to 2015:
  • Large datasets → improve generalization
  • Powerful computers → allow larger networks
• In addition, two key algorithmic changes were:
  1. The replacement of MSE with the cross-entropy family of loss functions
  2. The replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units (ReLU) (next week)
Coming next
• Training a Deep Neural Network (part 2)
