02 Training
Mei-Chen Yeh
Neuron
A neuron computes a function f: R^K → R, y = f(a):
  z = a1 w1 + a2 w2 + ... + aK wK + b      (weights w1, ..., wK and bias b)
  y = σ(z)
Activation function (sigmoid): σ(z) = 1 / (1 + e^(-z))
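As a small illustration (not part of the slides), the neuron above can be written directly in NumPy; the input, weight, and bias values are arbitrary:

```python
import numpy as np

def neuron(a, w, b):
    """A single neuron: weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(a, w) + b              # z = a1*w1 + ... + aK*wK + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation, output between 0 and 1

# Example with K = 3 (made-up numbers, just to show the computation)
a = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.1, -0.3])
b = 0.2
print(neuron(a, w, b))
```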
Neural Network
Figure: a fully connected network built from neurons. Input layer (x1, x2, ..., xN) → hidden layers (Layer 1, Layer 2, ..., Layer L) → output layer (y1, y2, ..., yM).
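A minimal sketch of the forward pass through such a network: each layer repeats the single-neuron computation for all of its neurons. The layer sizes and random weights below are arbitrary illustrations, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Feed x through every layer: each layer computes sigmoid(W a + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a  # the output layer values (y1, ..., yM)

# Toy network: N = 4 inputs, hidden layers of 5 and 3 neurons, M = 2 outputs
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
print(forward(rng.normal(size=4), weights, biases))
```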
Figure: the same network with a Softmax layer at the output (y1, ..., yM), used for multi-class classification (inputs x1, x2, ..., xN; hidden layers in between).
Multi-class classification
Softmax converts the outputs into probabilities: 0 ≤ y_i ≤ 1 and Σ_i y_i = 1
• Example: #class = 3
  y_i = e^(z_i) / Σ_j e^(z_j), for i = 1, 2, 3
Softmax gives probabilities: 0 ≤ y_i ≤ 1 and Σ_i y_i = 1
• Worked example: #class = 3
  z1 = 3:   e^(z1) ≈ 20    →  y1 = 20 / (20 + 2.7 + 0.05) ≈ 0.88
  z2 = 1:   e^(z2) ≈ 2.7   →  y2 ≈ 0.12
  z3 = -3:  e^(z3) ≈ 0.05  →  y3 ≈ 0
In general, y_i = e^(z_i) / Σ_j e^(z_j).
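The worked example can be checked with a few lines of NumPy (a sketch; the printed values match the rounded numbers above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z))  # approximately [0.88, 0.12, 0.00]
```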
Example (Project #1)
• Stock price prediction: do we need a softmax layer?
Figure: past price data (a ?-dim input vector) goes into a neural network, which outputs either a scalar y or a 5-dim vector (y1, ..., y5).
Input: a ?-dim vector
Output: a scalar or a 5-dim vector
Hidden layers: Feature extraction
The hidden layers act as a feature extractor, replacing hand-crafted feature engineering.
Figure: input layer (x1, ..., xN) → hidden layers → extracted features (f1, ..., fK) → output layer (y1, ..., yM).
SVM: a nonlinear (hand-crafted) kernel function, followed by a linear classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
Deep Learning: a learned feature mapping φ(x) (a learned "kernel"), followed by a simple classifier.
Figure: input x (x1, ..., xN) → hidden layers computing φ(x) → output layer (y1, ..., yM).
Deeper is better?

Layer X Size    Word Error Rate (%)    Layer X Size    Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                   1 X 3772        22.5
7 X 2k          17.1                   1 X 4634        22.6
                                       1 X 16k         22.1

Not surprising: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Universal approximation theorem
Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Figure: a shallow network (one wide hidden layer) vs. a deep network (many narrow layers), both on inputs x1, x2, ..., xN.
Fat + Short vs. Thin + Tall

Layer X Size    Word Error Rate (%)    Layer X Size    Word Error Rate (%)
1 X 2k          24.2
2 X 2k          20.4
3 X 2k          18.4
4 X 2k          17.8
5 X 2k          17.2                   1 X 3772        22.5
7 X 2k          17.1                   1 X 4634        22.6
                                       1 X 16k         22.1

Why?

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.
Modularization
• Deep → Modularization
Analogy: don't put everything in your main function.
http://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html
Modularization
• Deep → Modularization
Training one classifier per combined class directly on images:
  Classifier 1: females with long hair, F(L)
  Classifier 2: males with long hair, M(L) (weak: few examples)
  Classifier 3: females with short hair, F(S)
  Classifier 4: males with short hair, M(S)
Modularization
• Deep → Modularization
Train basic classifiers for the attributes first:
  Male or female?  F(L), F(S) vs. M(L), M(S)
  Long or short hair?  F(L), M(L) vs. F(S), M(S)
Each basic classifier can have sufficient training examples.
Modularization: the classifiers can now be trained with fewer data
• Deep → Modularization
The basic classifiers (male or female? long or short hair?) are shared by the following classifiers as modules:
  Classifier 1: females with long hair
  Classifier 2: males with long hair (fine even with few data)
  Classifier 3: females with short hair
  Classifier 4: males with short hair
Modularization
• Deep → Modularization → Less training data?
The modularization is automatically learned from data.
Figure: the first layer learns the most basic classifiers; the second layer uses the 1st layer as modules to build classifiers; the third layer uses the 2nd layer as modules; and so on (inputs x1, x2, ..., xN).
Modularization - Image
• Deep → Modularization
Figure: the same layer-by-layer modularization applied to images: the first layer learns the most basic detectors, the second layer builds on them as modules, and so on.
Reference: Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. ECCV, 2014.
Questions
Figure: handwritten digit recognition. A 16 x 16 image (256 pixels; ink → 1, no ink → 0) is flattened into inputs x1, ..., x256 and fed to a network with a Softmax output y1, ..., y10 (e.g., y2 = 0.7 means "is 2", y10 = 0.2 means "is 0").
Set the network parameters θ such that:
  Input an image of "1": y1 has the maximum value
  Input an image of "2": y2 has the maximum value
How to achieve this?
Learning from training data
• Preparing training data: images and their labels
Figure: given a set of parameters, an image of "1" (x1, ..., x256) passes through the network with a Softmax output (y1, ..., y10). The output should be as close as possible to the label: y1 close to 1 and y2, ..., y10 close to 0. The per-example loss l measures this distance.
Total loss over all R training examples: L = Σ_{r=1}^{R} l_r
For all training data:
  x1 → NN → y1, compared with the target ŷ1 → loss l1
  x2 → NN → y2, compared with the target ŷ2 → loss l2
  x3 → NN → y3, compared with the target ŷ3 → loss l3
  ...
Find a function in the function set (i.e., a set of network parameters) that makes the total loss L as small as possible.
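As a minimal sketch of this objective, `network` and `per_example_loss` below are hypothetical placeholders for the model f(x; θ) and the per-example loss l_r:

```python
def total_loss(theta, training_data, network, per_example_loss):
    """Total loss L: the sum of the per-example losses l_r over all training pairs (x_r, target_r)."""
    return sum(per_example_loss(network(x, theta), target)
               for x, target in training_data)
```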
An iterative approach!

Training data (inputs → class):
  1.4  2.7  1.9  →  0
  3.8  3.4  3.2  →  0
  6.4  2.8  1.7  →  1
  4.1  0.1  0.2  →  0
  etc.

1. Present a training pattern: (1.4, 2.7, 1.9)
2. Feed it through the network to get the output: 0.8
3. Compare with the target output: target 0, loss 0.8
4. Adjust the weights based on the error.

Present the next pattern: (6.4, 2.8, 1.7)
1. Feed it through to get the output: 0.9
2. Compare with the target output: target 1, loss 0.1
3. Adjust the weights based on the error.

And so on ...
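Here is a hedged sketch of this iterative procedure on the toy data above, using a single sigmoid neuron and squared error so that "adjust the weights based on the error" becomes an explicit gradient step; the learning rate and number of passes are arbitrary choices, not values from the slides:

```python
import numpy as np

training_data = [([1.4, 2.7, 1.9], 0),
                 ([3.8, 3.4, 3.2], 0),
                 ([6.4, 2.8, 1.7], 1),
                 ([4.1, 0.1, 0.2], 0)]

rng = np.random.default_rng(0)
w, b, lr = rng.normal(scale=0.1, size=3), 0.0, 0.05

for epoch in range(100):
    for x, t in training_data:                     # present a training pattern
        x = np.asarray(x)
        y = 1.0 / (1.0 + np.exp(-(w @ x + b)))     # feed it through to get the output
        error = y - t                              # compare with the target output
        grad_z = error * y * (1.0 - y)             # gradient of 0.5*(y - t)^2 w.r.t. z
        w -= lr * grad_z * x                       # adjust the weights based on the error
        b -= lr * grad_z
```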
Examples of loss functions (for the same "1" example, with target t = (1, 0, ..., 0)):
• Cross entropy: l = −Σ_i t_i log(y_i)
• Mean squared error: l = (1/n) Σ_i (t_i − y_i)^2
Figure: given a set of parameters, the Softmax outputs (y1, ..., y10) should be as close as possible to the targets (t1 = 1, t2 = 0, ..., t10 = 0).
Example (MSE, 3 classes, target (1, 0, 0), output (0.5, 0.3, 0.2)):
Loss = (1/3)[(1 − 0.5)^2 + (0 − 0.3)^2 + (0 − 0.2)^2] = 0.126...
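The numbers on this slide can be reproduced directly (a sketch for the 3-class example with target (1, 0, 0) and output (0.5, 0.3, 0.2)):

```python
import numpy as np

t = np.array([1.0, 0.0, 0.0])   # one-hot target
y = np.array([0.5, 0.3, 0.2])   # network output

mse = np.mean((t - y) ** 2)             # (1/3)[(1-0.5)^2 + (0-0.3)^2 + (0-0.2)^2]
cross_entropy = -np.sum(t * np.log(y))  # only the true class contributes: -log(0.5)

print(mse)            # 0.1266...
print(cross_entropy)  # 0.693...
```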
Cross entropy vs. MSE
• Example: binary classification, label = 1
Figure: the two losses plotted against the prediction y for label 1 (cross entropy = −log y, MSE = (1 − y)^2).
Question
• Why don't we directly use the classification error (over the training set) as the loss?
Minimizing cross entropy ≈
Minimizing KL divergence ≈
Maximum likelihood estimation
Maximum Likelihood Estimation
• Consider a set of m examples X = {x^(1), ..., x^(m)} drawn independently from an unknown data-generating distribution p_data(x)
• Let p_model(x; θ) be a parametric family of probability distributions over the same space, indexed by θ
• The maximum likelihood estimator for θ is
  θ_ML = argmax_θ p_model(X; θ) = argmax_θ Π_{i=1}^{m} p_model(x^(i); θ)
• Taking the log does not change the argmax:
  θ_ML = argmax_θ Σ_{i=1}^{m} log p_model(x^(i); θ)
These objectives are the same:
Minimizing cross entropy ≈ minimizing KL divergence ≈ maximum likelihood estimation
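A small numeric sketch of this equivalence (the class probabilities and labels below are made up): for one-hot labels, the average cross entropy over the training set equals the negative average log-likelihood, so minimizing one maximizes the other:

```python
import numpy as np

labels = np.array([0, 2, 1])          # true class indices for m = 3 examples
probs = np.array([[0.7, 0.2, 0.1],    # p_model(class | x_i; theta) for each example
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

log_likelihood = np.sum(np.log(probs[np.arange(3), labels]))
avg_cross_entropy = -np.mean(np.log(probs[np.arange(3), labels]))

print(log_likelihood)       # sum_i log p_model(x_i; theta)
print(avg_cross_entropy)    # = -log_likelihood / m
```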
Historical Notes
• Two factors improved neural network performance from 1986 to 2015:
  • Large datasets → better generalization
  • Powerful computers → larger networks
• In addition, two algorithmic changes were:
  1. The replacement of MSE with the cross-entropy family of loss functions
  2. The replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units (ReLU) (next week)
Coming next
• Training a Deep Neural Network (part 2)