
InfoLab

Deep Learning Seminar


CH 6. Deep Feedforward Networks

2017-07-26
Eunjeong Yi
Chapter 6. Deep Feedforward Networks

6.1 Example: Learning XOR


6.2 Gradient-Based Learning
6.3 Hidden Units
6.4 Architecture Design
6.5 Back-Propagation and Other Differentiation Algorithms
6.6 Historical Notes

2 InfoLab
Deep feedforward network
No feedback connection
Structure of model
Ø Input layer, hidden layer, output layer
Ø Depth and width of the model
Ø Cost function

[Figure: feedforward network diagram — input layer, hidden layer, output layer; depth = number of layers, width = number of units per layer]

3 InfoLab
Example: Learning XOR
XOR function (“exclusive or”)

x₁   x₂   x₁ XOR x₂
0    0    0
0    1    1
1    0    1
1    1    0

𝕏 = { [0, 0]ᵀ, [0, 1]ᵀ, [1, 0]ᵀ, [1, 1]ᵀ }

Mean squared error (MSE) loss function

J(θ) = (1/4) Σ_{x ∈ 𝕏} ( f*(x) − f(x; θ) )²

• f*(x): correct answer (target XOR output)
• f(x; θ): output computed by the neural network

4 InfoLab
Example: Learning XOR
[Figure: two-layer network — inputs x₁, x₂ feed hidden units h₁, h₂ through weights w₁, and the hidden units feed the output f(x; θ) through weights w₂; input layer, hidden layer, output layer]

h = g(w₁ᵀ x + b)          (hidden layer)
output = w₂ᵀ h + c         (output layer)
(g = activation function; b, c = biases)

f(x; θ) = f(x; w₁, b, w₂, c) = w₂ᵀ g(w₁ᵀ x + b) + c
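A minimal NumPy sketch of this network on the XOR task, assuming g = ReLU. The weight values below are one known exact solution used for illustration (an assumption, not numbers given on the slides):

```python
import numpy as np

# f(x; θ) = w2ᵀ g(w1ᵀ x + b) + c with g = ReLU (assumed).
# Weights below are one known exact XOR solution, used only as an example.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # hidden-layer weights (2 inputs -> 2 hidden units)
b  = np.array([0.0, -1.0])    # hidden-layer bias
w2 = np.array([1.0, -2.0])    # output-layer weights
c  = 0.0                      # output-layer bias

def relu(z):
    return np.maximum(0.0, z)

def f(x):
    h = relu(W1.T @ x + b)    # hidden-layer activations
    return w2 @ h + c         # scalar network output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0.0, 1.0, 1.0, 0.0])        # x1 XOR x2
outputs = np.array([f(x) for x in X])

# MSE loss J(θ) = (1/4) Σ_{x∈𝕏} (f*(x) − f(x; θ))² — zero for this solution
print(outputs)                                  # [0. 1. 1. 0.]
print(np.mean((targets - outputs) ** 2))        # 0.0
```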

5 InfoLab
Gradient-based Learning
The loss function of a neural network is non-convex
Ø Trained using iterative, gradient-based optimizers

Cost function
Ø Learning Conditional Distribution
Ø Learning Conditional Statistics

Output layers
Ø Linear Units for Gaussian Output Distributions
Ø Sigmoid Units
Ø Softmax Units

6 InfoLab
Learning Conditional Distribution
Negative log-likelihood as cross-entropy between
training data and model distribution

J(θ) = H( p̂_data, p_model(y | x) ) = − Σ_{x,y ~ p̂_data} p̂_data log p_model(y | x)

J(θ) = − E_{x,y ~ p̂_data} [ log p_model(y | x) ]

• p̂_data : training-data-generating (empirical) distribution
• p_model(y | x) : probability distribution estimating p̂_data
• H( p̂_data, p_model(y | x) ) : cross-entropy between p̂_data and p_model

The log function undoes the exp of some output units
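A minimal sketch of this cost: the expectation over p̂_data becomes an average of −log p_model(y | x) over the training examples. The probabilities and labels below are made-up illustrative values, not taken from the slides:

```python
import numpy as np

# J(θ) = −E_{x,y ~ p̂_data} log p_model(y | x), assuming a classifier that
# already outputs a probability for each class (illustrative values only).
probs = np.array([[0.7, 0.2, 0.1],     # p_model(y | x) for 4 training examples
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])
labels = np.array([0, 1, 2, 1])        # observed classes y from the training data

# Average negative log-likelihood over the training set
nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
print(nll)
```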

7 InfoLab
Learning Conditional Statistics

Mean Square Error (MSE)


f* = argmin_f E_{x,y ~ p_data} ‖ y − f(x) ‖²

Mean Absolute Error (MAE)

f* = argmin_f E_{x,y ~ p_data} ‖ y − f(x) ‖₁

MSE and MAE often lead to poor results when used with gradient-based learning, because output units that saturate then produce very small gradients

8 InfoLab
Linear Units for Gaussian Output Distributions
Given features h, a layer of linear output units produces a vector ŷ

ŷ = Wᵀ h + b

Often used to produce the mean of a conditional Gaussian distribution

p(y | x) = 𝒩(y; ŷ, I)

Linear units do not saturate → they pose little difficulty for gradient-based optimization
[Figure: output layer — h = f(x; θ) comes from the hidden layers, and the linear output unit computes ŷ = Wᵀ h + b]
9 InfoLab
Sigmoid Units
Binary classification
Output: ŷ = σ(wᵀ h + b)

Ø σ(z) = 1 / (1 + exp(−z))
Saturate to 0 and 1

[Figure: sigmoid activation function. Image: © 2015-2017 CodeReclaimers, http://neat-python.readthedocs.io/en/latest/activation.html]
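A quick numeric sketch (my own illustration, not from the slides) of that saturation: for large |z| the output is pinned near 0 or 1 and the gradient σ'(z) = σ(z)(1 − σ(z)) becomes vanishingly small:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigmoid(z)
    grad = s * (1.0 - s)               # derivative of the sigmoid
    print(f"z={z:6.1f}  sigmoid={s:.5f}  gradient={grad:.5f}")
# At z = ±10 the gradient is ~4.5e-05, which is why saturation slows
# gradient-based learning unless the loss (e.g., cross-entropy) undoes the exp.
```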

10 InfoLab
Softmax Units
Multiclass classification
To generalize to the case of a discrete variable with n values, produce a vector ŷ with ŷᵢ = P(y = i | x)

z = Wᵀ h + b,  with zᵢ = log P̂(y = i | x)

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
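A minimal sketch of the softmax output layer. Shifting z by its maximum is a standard numerical-stability trick that does not change the result; the h, W, and b values are made-up examples:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)               # guard against overflow in exp
    e = np.exp(z)
    return e / np.sum(e)            # softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

h = np.array([0.5, -1.2, 3.0])      # features from the last hidden layer (example)
W = np.random.randn(3, 4)           # 3 features -> 4 classes (example weights)
b = np.zeros(4)

z = W.T @ h + b                     # unnormalized log-probabilities
y_hat = softmax(z)                  # ŷ_i = P(y = i | x)
print(y_hat, y_hat.sum())           # probabilities summing to 1
```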

11 InfoLab
Hidden Units
How to choose the type of hidden unit to use in the
hidden layers of the model
Input: z = Wᵀ x + b
Activation function g(z)
Ex) Rectified Linear Units (ReLU), Logistic Sigmoid and Hyperbolic Tangent

[Figure: the two-layer network from the XOR example — inputs x₁, x₂, hidden units h₁, h₂ with h = g(w₁ᵀ x + b), output f(x; θ); input layer, hidden layer, output layer (g = activation function; b, c = biases)]

12 InfoLab
Rectified Linear Units (ReLU)
Activation function: g(z) = max(0, z)

If a model's behavior is closer to linear, the model is easier to optimize

[Figure: ReLU activation function. Image: © 2015-2017 CodeReclaimers, http://neat-python.readthedocs.io/en/latest/activation.html]

13 InfoLab
Logistic Sigmoid and Hyperbolic Tangent
Activation function
g(z) = σ(z)
or
g(z) = tanh(z) = 2σ(2z) − 1

The saturation of sigmoidal units makes gradient-based learning difficult

[Figure: sigmoid and tanh activation functions. Image: © 2015-2017 CodeReclaimers, http://neat-python.readthedocs.io/en/latest/activation.html]

14 InfoLab
Back-Propagation

A method to calculate the gradient of the loss function with respect to the weights in an artificial neural network
Forward propagation result

y = w₂ᵀ h + b = w₂ᵀ g(w₁ᵀ x + c) + b
(g = activation function; b, c = biases)

[Figure: network — inputs x₁, x₂ connect to hidden units h₁, h₂ through weights w₁; the hidden units connect to the output y through weights w₂]
15 InfoLab
Back-Propagation

A method to calculate the gradient of the loss function with respect to the weights in an artificial neural network

[Figure: the same network — back-propagation first computes dy/dh at the hidden units h₁, h₂ (through the output weights w₂); inputs x₁, x₂ connect through w₁]

16 InfoLab
Back-Propagation

A method to calculate the gradient of the loss function with respect to the weights in an artificial neural network

[Figure: the same network — dy/dh at the hidden units h₁, h₂ and dh/dx at the inputs x₁, x₂]

17 InfoLab
Back-Propagation

A method to calculate the gradient of the loss function with respect to the weights in an artificial neural network

[Figure: the same network — dy/dh at the hidden units and dh/dx at the inputs]

dy/dx = (dy/dh) × (dh/dx)

18 InfoLab
Back-Propagation

∂y/∂w₁,₁ = (∂y/∂h₁) × (∂h₁/∂w₁,₁)

Update w₁,₁:
w₁,₁ ← w₁,₁ − η · (∂y/∂w₁,₁)
(η: learning rate)

[Figure: the output y connects to h₁ through w₂; h₁ connects to the inputs x₁, x₂ through w₁,₁ and w₁,₂]

In the same way, all of the weights are updated, as in the sketch below
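A minimal back-propagation sketch for the 2-2-1 network on these slides, assuming g = sigmoid and a squared-error loss (both assumptions); the training loop is my own illustration of the chain rule and of the update w ← w − η ∂/∂w, not code from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, c = rng.normal(size=(2, 2)), np.zeros(2)   # hidden layer: h = g(W1ᵀ x + c)
w2, b = rng.normal(size=2), 0.0                # output layer: y = w2ᵀ h + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])             # XOR targets
eta = 0.5                                      # learning rate

for _ in range(5000):
    for x, t in zip(X, T):
        # forward propagation
        h = sigmoid(W1.T @ x + c)
        y = w2 @ h + b
        # backward propagation (chain rule)
        dL_dy = y - t                          # d(½(y−t)²)/dy
        dL_dw2 = dL_dy * h                     # ∂L/∂w2 = ∂L/∂y · ∂y/∂w2
        dL_db = dL_dy
        dL_dh = dL_dy * w2                     # ∂L/∂h = ∂L/∂y · ∂y/∂h
        dL_dz = dL_dh * h * (1 - h)            # through the sigmoid derivative
        dL_dW1 = np.outer(x, dL_dz)            # ∂L/∂W1[i,j] = x_i · ∂L/∂z_j
        dL_dc = dL_dz
        # gradient-descent updates: w ← w − η ∂L/∂w
        w2 -= eta * dL_dw2; b -= eta * dL_db
        W1 -= eta * dL_dW1; c -= eta * dL_dc

print([round(float(w2 @ sigmoid(W1.T @ x + c) + b), 2) for x in X])
# typically close to [0, 1, 1, 0] after training
```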

19 InfoLab
Next Deep learning seminar

Chapter 7. Regularization for Deep Learning

7.1 Parameter Norm Penalties


7.2 Norm Penalties as Constrained Optimization
7.3 Regularization and Under-Constrained Problems
7.4 Dataset Augmentation
7.5 Noise Robustness
7.6 Semi-Supervised Learning
7.7 Multitask Learning

20 InfoLab
InfoLab

Thank you

21
