Deep Learning
Week 3a. Backprop Variations
Alan Blair
School of Computer Science and Engineering
June 5, 2024
Outline
➛ Cross Entropy
➛ Maximum Likelihood
➛ Softmax
➛ Weight Decay
➛ Bayesian Inference
➛ Second Order Methods
Cross Entropy
For function approximation, we normally use the sum squared error (SSE) loss:
E = \frac{1}{2} \sum_i (z_i - t_i)^2

where z_i is the output of the network, and t_i is the target output.
However, for classification tasks, where the target output t_i is either 0 or 1, it is more logical to use the cross entropy loss:

E = \sum_i \bigl[ -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr]
The motivation for these loss functions can be explained using the mathematical
concept of Maximum Likelihood.
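As a quick illustration, here is a minimal NumPy sketch (not part of the original slides) computing both losses for some hypothetical outputs z and binary targets t:

import numpy as np

# Hypothetical outputs z (strictly between 0 and 1) and binary targets t.
z = np.array([0.9, 0.2, 0.7, 0.4])
t = np.array([1.0, 0.0, 1.0, 1.0])

sse = 0.5 * np.sum((z - t) ** 2)                                   # E = 1/2 sum_i (z_i - t_i)^2
cross_entropy = np.sum(-t * np.log(z) - (1 - t) * np.log(1 - z))   # cross entropy loss
print(sse, cross_entropy)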
Maximum Likelihood (5.5)
Least Squares Fit
[Figure: least squares fit of f(x) to data points; axes f(x) vs. x]
Derivation of Least Squares
Due to the Central Limit Theorem, an accumulation of small errors will tend to
produce “noise” in the form of a Gaussian distribution.
Suppose the data are generated by a linear function f() plus Gaussian noise with mean zero and standard deviation \sigma. Then

\mathrm{Prob}(D \mid h) = \mathrm{Prob}(\{t_i\} \mid f) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2\sigma^2}(t_i - f(x_i))^2}

\log \mathrm{Prob}(\{t_i\} \mid f) = \sum_i \Bigl[ -\frac{1}{2\sigma^2} (t_i - f(x_i))^2 - \log(\sigma) - \frac{1}{2}\log(2\pi) \Bigr]

f_{ML} = \mathrm{argmax}_{f \in H} \log \mathrm{Prob}(\{t_i\} \mid f) = \mathrm{argmin}_{f \in H} \sum_i (t_i - f(x_i))^2
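The equivalence can be checked numerically. The sketch below (hypothetical data and candidate functions, assuming NumPy) shows that the candidate with the smaller sum squared error also has the smaller negative log likelihood, since the two differ only by terms that do not depend on f:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = np.linspace(0.0, 1.0, 50)
t = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.shape)   # data = f(x) + Gaussian noise

def neg_log_likelihood(pred, t, sigma):
    # -log Prob({t_i} | f) for Gaussian noise with standard deviation sigma
    return np.sum((t - pred) ** 2 / (2 * sigma ** 2) + np.log(sigma) + 0.5 * np.log(2 * np.pi))

def sse(pred, t):
    return np.sum((t - pred) ** 2)

f_good = 2.0 * x + 1.0        # candidate close to the true function
f_bad = 1.5 * x + 0.8         # a worse candidate
print(sse(f_good, t) < sse(f_bad, t),
      neg_log_likelihood(f_good, t, sigma) < neg_log_likelihood(f_bad, t, sigma))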
Derivation of Cross Entropy (3.9.1)
For binary classification tasks, the target value t_i is either 0 or 1. It makes sense to interpret the output f(x_i) of the neural network as the probability of the true value being 1, i.e.

P(1 \mid f(x_i)) = f(x_i)
P(0 \mid f(x_i)) = 1 - f(x_i)

i.e. \quad P(t_i \mid f(x_i)) = f(x_i)^{t_i} \, (1 - f(x_i))^{(1 - t_i)}

\log P(\{t_i\} \mid f) = \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

f_{ML} = \mathrm{argmax}_{f \in H} \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))
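A small sketch (hypothetical values, assuming NumPy) confirming that this log likelihood is exactly the negative of the cross entropy loss defined earlier, so maximizing one is the same as minimizing the other:

import numpy as np

f_x = np.array([0.9, 0.2, 0.7])    # network outputs, interpreted as P(true value = 1)
t = np.array([1.0, 0.0, 0.0])

log_likelihood = np.sum(t * np.log(f_x) + (1 - t) * np.log(1 - f_x))
cross_entropy = np.sum(-t * np.log(f_x) - (1 - t) * np.log(1 - f_x))
print(np.isclose(log_likelihood, -cross_entropy))   # True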
Cross Entropy and Backprop
Cross Entropy loss is often used in combination with sigmoid activation at the
output node, which guarantees an output strictly between 0 and 1, and also makes
the backprop computations a bit simpler, as follows:
E = \sum_i \bigl[ -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr]

\frac{\partial E}{\partial z_i} = -\frac{t_i}{z_i} + \frac{1 - t_i}{1 - z_i} = \frac{z_i - t_i}{z_i (1 - z_i)}

If z_i = \frac{1}{1 + e^{-s_i}}, then \frac{\partial E}{\partial s_i} = \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial s_i} = z_i - t_i
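This simplification can be verified with a finite-difference check; the sketch below (hypothetical values of s and t, assuming NumPy) compares the numerical derivative of E with respect to s against the z - t formula:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(s, t):
    # cross entropy loss for a single output with sigmoid activation
    z = sigmoid(s)
    return -t * np.log(z) - (1 - t) * np.log(1 - z)

s, t, eps = 0.3, 1.0, 1e-6
numeric = (loss(s + eps, t) - loss(s - eps, t)) / (2 * eps)   # finite-difference dE/ds
analytic = sigmoid(s) - t                                     # the z - t formula above
print(np.isclose(numeric, analytic))                          # True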
Cross Entropy and KL-Divergence
If we consider p_i = \langle t_i, 1 - t_i \rangle and q_i = \langle f(x_i), 1 - f(x_i) \rangle as discrete probability distributions, the Cross Entropy loss can be written as:

\log P(\{t_i\} \mid f) = \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

= \sum_i \Bigl[ t_i \bigl(\log f(x_i) - \log(t_i)\bigr) + (1 - t_i)\bigl(\log(1 - f(x_i)) - \log(1 - t_i)\bigr) - \bigl( -t_i \log(t_i) - (1 - t_i) \log(1 - t_i) \bigr) \Bigr]

= \sum_i \bigl[ -D_{KL}(p_i \,\|\, q_i) - H(p_i) \bigr]

Since H(p_i) is fixed, minimizing the Cross Entropy loss is the same as minimizing \sum_i D_{KL}(p_i \,\|\, q_i).
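A numerical sketch of this decomposition (using a hypothetical soft target, since a hard 0/1 target makes H(p_i) = 0 and involves the 0 log 0 = 0 convention):

import numpy as np

t, f = 0.8, 0.6
p = np.array([t, 1.0 - t])      # target distribution p_i
q = np.array([f, 1.0 - f])      # network distribution q_i

cross_entropy = -np.sum(p * np.log(q))    # H(p, q), the per-item cross entropy loss
kl = np.sum(p * np.log(p / q))            # D_KL(p || q)
entropy = -np.sum(p * np.log(p))          # H(p)
print(np.isclose(cross_entropy, kl + entropy))   # True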
Cross Entropy and Outliers
SSE and Cross Entropy behave a bit differently when it comes to outliers.
SSE is more likely to misclassify outliers, because the squared error for each item is bounded between 0 and 1, so even a badly misclassified item contributes only a limited amount to the total loss.
Cross Entropy is more likely to keep outliers correctly classified, because the loss for an item grows without bound (logarithmically) as the difference between the target and the network output approaches 1.
For this reason, Cross Entropy works particularly well for classification tasks that
are unbalanced in terms of negative items vastly outnumbering positive ones (or
vice versa).
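A rough numerical illustration of the bounded versus unbounded behaviour above (hypothetical outputs for a single positive item, assuming NumPy):

import numpy as np

# Per-item loss for a positive item (t = 1) that the network pushes towards 0:
# the squared error saturates, while the cross entropy keeps growing.
t = 1.0
for z in [0.5, 0.1, 0.01, 0.001]:
    sse_item = 0.5 * (z - t) ** 2
    ce_item = -t * np.log(z) - (1 - t) * np.log(1 - z)
    print(f"z = {z:6.3f}   SSE = {sse_item:.3f}   CE = {ce_item:.3f}")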
Softmax (6.2.2)
For multi-class classification with N classes, the network computes one output z_i for each class i, and these outputs are converted into probabilities using the Softmax function:

\mathrm{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
Log Softmax and Backprop
If the correct class is k, we can treat -\log \mathrm{Prob}(k) as our cost function, and the gradient is

\frac{d}{dz_i} \log \mathrm{Prob}(k) = \delta_{ik} - \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = \delta_{ik} - \mathrm{Prob}(i)

where \delta_{ik} = 1 if i = k, and 0 otherwise.
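A finite-difference sketch of this gradient (hypothetical outputs, assuming NumPy):

import numpy as np

def log_softmax(z):
    z = z - np.max(z)                        # subtract max for numerical stability
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1.0, -0.5, 2.0])               # hypothetical outputs
k = 0                                        # index of the correct class
probs = np.exp(log_softmax(z))

eps = 1e-6
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric = (log_softmax(z_plus)[k] - log_softmax(z_minus)[k]) / (2 * eps)
    analytic = (1.0 if i == k else 0.0) - probs[i]    # delta_ik - Prob(i)
    print(i, np.isclose(numeric, analytic))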
Softmax, Boltzmann and Sigmoid
If you have studied mathematics or physics, you may be interested to know that
Softmax is related to the Boltzmann Distribution, with the negative of the output z_i playing the role of the "energy" for "state" i.
The Sigmoid function can also be seen as a special case of Softmax, with two
classes and one output, as follows:
Consider a simplified case where there is a choice between two classes, Class 0
and Class 1. We consider the output z of the network to be associated with Class 1
and we imagine a fixed “output” for Class 0 which is always equal to zero. In this
case, the Softmax becomes:
\mathrm{Prob}(1) = \frac{e^z}{e^z + e^0} = \frac{1}{1 + e^{-z}}
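A one-line numerical check of this correspondence (hypothetical value of z, assuming NumPy):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = 0.7                                   # hypothetical output for Class 1
p = softmax(np.array([z, 0.0]))           # Class 0 has a fixed "output" of zero
print(np.isclose(p[0], sigmoid(z)))       # Prob(1) from softmax equals sigmoid(z)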
Weight Decay (5.2.2)
Sometimes we add a penalty term to the loss function which encourages the neural network weights w_j to remain small:

E = \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2
This can prevent the weights from “saturating” to very high values.
It is sometimes referred to as “elastic weights” because the weights experience a
force as if there were a spring pulling them back towards the origin according to
Hooke’s Law.
The scaling factor \lambda needs to be determined empirically, by trial and error.
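As a sketch of how the penalty enters training, the code below (a hypothetical linear model and data, assuming NumPy) adds the \lambda term to the loss and to the gradient, so each update pulls the weights back towards zero:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # hypothetical inputs
t = rng.normal(size=20)                   # hypothetical targets
w = rng.normal(size=3)                    # hypothetical weights
lam = 0.01                                # the scaling factor lambda

z = X @ w
loss = 0.5 * np.sum((z - t) ** 2) + 0.5 * lam * np.sum(w ** 2)
grad = X.T @ (z - t) + lam * w            # the penalty adds lam * w to the gradient
w = w - 0.1 * grad                        # one gradient descent step with weight decay
print(loss)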
Bayesian Inference
H is a class of hypotheses.
Prob(D | h) = probability of data D being generated under hypothesis h ∈ H.
Prob(h | D) = probability that h is correct, given that data D were observed.
Bayes’ Theorem:
\mathrm{Prob}(h \mid D)\,\mathrm{Prob}(D) = \mathrm{Prob}(D \mid h)\,\mathrm{Prob}(h)

\mathrm{Prob}(h \mid D) = \frac{\mathrm{Prob}(D \mid h)\,\mathrm{Prob}(h)}{\mathrm{Prob}(D)}
Prob(h) is called the prior because it is our estimate of the probability of h before
the data have been observed.
Prob(h | D) is called the posterior because it is our estimate of the probability of h
after the data have been observed.
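A minimal numerical illustration of Bayes' Theorem (two hypothetical hypotheses with made-up priors and likelihoods, assuming NumPy):

import numpy as np

prior = np.array([0.7, 0.3])              # Prob(h) for hypotheses h0, h1
likelihood = np.array([0.1, 0.6])         # Prob(D | h) for the observed data D

evidence = np.sum(likelihood * prior)             # Prob(D)
posterior = likelihood * prior / evidence         # Prob(h | D), by Bayes' Theorem
print(posterior, posterior.sum())                 # posterior probabilities sum to 1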
Weight Decay as MAP Estimation (5.6.1)
We assume a Gaussian prior distribution for the weights, i.e.
P(w) = \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_0} \, e^{-w_j^2 / 2\sigma_0^2}

Then

P(w \mid t) = \frac{P(t \mid w)\,P(w)}{P(t)} = \frac{1}{P(t)} \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2\sigma^2}(z_i - t_i)^2} \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_0} \, e^{-w_j^2 / 2\sigma_0^2}

\log P(w \mid t) = -\frac{1}{2\sigma^2} \sum_i (z_i - t_i)^2 - \frac{1}{2\sigma_0^2} \sum_j w_j^2 + \text{constant}

w_{MAP} = \mathrm{argmax}_{w \in H} \log P(w \mid t) = \mathrm{argmin}_{w \in H} \Bigl[ \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2 \Bigr], \quad \text{where } \lambda = \sigma^2 / \sigma_0^2
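A numerical sketch of this correspondence (hypothetical residuals and weights, assuming NumPy): up to an additive constant, the log posterior and the weight decay objective differ only by a factor of -1/\sigma^2, so they are optimized by the same weights when \lambda = \sigma^2 / \sigma_0^2:

import numpy as np

rng = np.random.default_rng(0)
sigma, sigma0 = 0.5, 2.0
lam = sigma ** 2 / sigma0 ** 2

residuals = rng.normal(size=10)           # hypothetical z_i - t_i
w = rng.normal(size=5)                    # hypothetical weights

log_posterior = (-np.sum(residuals ** 2) / (2 * sigma ** 2)
                 - np.sum(w ** 2) / (2 * sigma0 ** 2))        # omitting the constant
weight_decay_loss = 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.sum(w ** 2)
print(np.isclose(log_posterior, -weight_decay_loss / sigma ** 2))   # True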
Second Order Methods
Some optimization methods involve computing second order partial derivatives of
the loss function with respect to each pair of weights:
\frac{\partial^2 E}{\partial w_i \, \partial w_j}
➛ Conjugate Gradients
→ approximate the landscape with a quadratic function (paraboloid) and
jump to the minimum of this quadratic function
➛ Natural Gradients (Amari, 1995)
→ use methods from information geometry to find a “natural” re-scaling of
the partial derivatives
These methods are not normally used for deep learning, because the number of
weights is too high. In practice, the Adam optimizer tends to provide similar
benefits with low computational cost.
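For intuition only, the sketch below (a toy quadratic loss, assuming NumPy) takes a single second order step using the Hessian to jump directly to the minimum of the quadratic; it illustrates the quadratic-jump idea, not the conjugate gradient or natural gradient algorithms themselves:

import numpy as np

# Toy quadratic loss E(w) = 0.5 * w^T A w - b^T w, whose matrix of second order
# partial derivatives (the Hessian) is A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

w = np.zeros(2)
w = w - np.linalg.solve(A, grad(w))       # jump straight to the minimum of the quadratic
print(np.allclose(grad(w), 0.0))          # True: the gradient vanishes at the minimum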