
COMP9444: Neural Networks and

Deep Learning
Week 3a. Backprop Variations

Alan Blair
School of Computer Science and Engineering
June 5, 2024
Outline

➛ Cross Entropy (3.13)


➛ Maximum Likelihood (5.5)
➛ Softmax (6.2.2)
➛ Weight Decay (5.2.2)
➛ Bayesian Inference and MAP Estimation (5.6.1)
➛ Second Order Methods

2
Cross Entropy
For function approximation, we normally use the sum squared error (SSE) loss:
E = \frac{1}{2} \sum_i (z_i - t_i)^2

where z_i is the output of the network, and t_i is the target output.
However, for classification tasks, where the target output t_i is either 0 or 1, it is
more logical to use the cross entropy loss:

E = \sum_i \big( -\, t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \big)
The motivation for these loss functions can be explained using the mathematical
concept of Maximum Likelihood.
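As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of both losses for a small batch of binary targets; the example values and the eps clipping constant are assumptions added only to keep the code runnable:

    import numpy as np

    def sse_loss(z, t):
        # sum squared error: E = 1/2 * sum_i (z_i - t_i)^2
        return 0.5 * np.sum((z - t) ** 2)

    def cross_entropy_loss(z, t, eps=1e-12):
        # cross entropy: E = sum_i ( -t_i log z_i - (1 - t_i) log(1 - z_i) )
        z = np.clip(z, eps, 1 - eps)   # clip to avoid log(0); an implementation detail
        return np.sum(-t * np.log(z) - (1 - t) * np.log(1 - z))

    t = np.array([1.0, 0.0, 1.0])      # binary targets t_i
    z = np.array([0.9, 0.2, 0.6])      # network outputs z_i in (0, 1)
    print(sse_loss(z, t), cross_entropy_loss(z, t))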

3
Maximum Likelihood (5.5)

Let H be a class of hypotheses for predicting observed data D.


Prob(D | h) = probability of data D being generated under hypothesis h ∈ H.
Prob(D | h), viewed as a function of h, is called the likelihood of D given h (its logarithm is the log likelihood).
ML Principle: choose the h ∈ H which maximizes this likelihood,
i.e. maximize Prob(D | h) or, equivalently, maximize log Prob(D | h).
Here, the data D are the target values {t_i} corresponding to input features {x_i},
and each hypothesis h is a function f() determined by a neural network with
specified weights or, to give a simpler example, f() could be a straight line with a
specified slope and y-intercept.

4
Least Squares Fit
[Figure: least squares fit of a straight line f(x) to data points in the (x, f(x)) plane.]
5
Derivation of Least Squares
Due to the Central Limit Theorem, an accumulation of small errors will tend to
produce “noise” in the form of a Gaussian distribution.
Suppose the data are generated by a linear function f () plus Gaussian noise with
mean zero and standard deviation σ . Then
\text{Prob}(D \mid h) = \text{Prob}(\{t_i\} \mid f) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(t_i - f(x_i))^2}

\log \text{Prob}(\{t_i\} \mid f) = \sum_i \Big( -\frac{1}{2\sigma^2}\big(t_i - f(x_i)\big)^2 - \log(\sigma) - \frac{1}{2}\log(2\pi) \Big)

f_{ML} = \text{argmax}_{f \in H} \log \text{Prob}(\{t_i\} \mid f) = \text{argmin}_{f \in H} \sum_i \big(t_i - f(x_i)\big)^2

(Note: we do not need to know σ)
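To connect the formula to practice, here is a small sketch (not from the slides; the slope, intercept and noise level are arbitrary illustrative values) that fits a straight line by minimizing the squared error, which by the derivation above is the maximum likelihood fit under Gaussian noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    t = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)   # line plus Gaussian noise

    # Minimizing sum_i (t_i - (a*x_i + b))^2 gives the maximum likelihood line.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    print(a, b)   # recovered slope and intercept, close to 2.0 and 1.0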

6
Derivation of Cross Entropy (3.9.1)
For binary classification tasks, the target value t_i is either 0 or 1.
It makes sense to interpret the output f(x_i) of the neural network as the probability
of the true value being 1, i.e.

P(1 \mid f(x_i)) = f(x_i)
P(0 \mid f(x_i)) = 1 - f(x_i)
\text{i.e.}\quad P(t_i \mid f(x_i)) = f(x_i)^{t_i}\, (1 - f(x_i))^{(1 - t_i)}

\log P(\{t_i\} \mid f) = \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

f_{ML} = \text{argmax}_{f \in H} \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

(Can also be generalized to multiple classes.)
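The compact form f(x)^{t}(1 − f(x))^{1−t} simply selects the right probability for each target; the following sketch (example values assumed) checks that its negative log equals the per-item cross entropy term:

    import numpy as np

    def bernoulli_prob(t, f):
        # P(t | f(x)) = f^t * (1 - f)^(1 - t), valid for t in {0, 1}
        return f ** t * (1 - f) ** (1 - t)

    f, t = 0.8, 1   # assumed example values: network output f(x) = 0.8, target t = 1
    nll = -np.log(bernoulli_prob(t, f))              # negative log likelihood of one item
    ce = -t * np.log(f) - (1 - t) * np.log(1 - f)    # its cross entropy term
    print(np.isclose(nll, ce))                       # True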

7
Cross Entropy and Backprop

Cross Entropy loss is often used in combination with sigmoid activation at the
output node, which guarantees an output strictly between 0 and 1, and also makes
the backprop computations a bit simpler, as follows:
E = \sum_i \big( -\, t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \big)

\frac{\partial E}{\partial z_i} = -\frac{t_i}{z_i} + \frac{1 - t_i}{1 - z_i} = \frac{z_i - t_i}{z_i (1 - z_i)}

\text{If } z_i = \frac{1}{1 + e^{-s_i}}, \text{ then } \frac{\partial E}{\partial s_i} = \frac{\partial E}{\partial z_i}\,\frac{\partial z_i}{\partial s_i} = z_i - t_i
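The simplification ∂E/∂s_i = z_i − t_i is easy to check numerically; this sketch (with arbitrary test values, not from the slides) compares a central-difference estimate of the gradient against z − t:

    import numpy as np

    def ce_of_s(s, t):
        # cross entropy of a single item, as a function of the pre-activation s
        z = 1.0 / (1.0 + np.exp(-s))          # sigmoid output
        return -t * np.log(z) - (1 - t) * np.log(1 - z)

    s, t, h = 0.7, 1.0, 1e-6                  # arbitrary test values and step size
    z = 1.0 / (1.0 + np.exp(-s))
    numeric = (ce_of_s(s + h, t) - ce_of_s(s - h, t)) / (2 * h)   # central difference
    analytic = z - t                                              # dE/ds from the slide
    print(np.isclose(numeric, analytic))                          # True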

8
Cross Entropy and KL-Divergence
If we consider p_i = \langle t_i,\, 1 - t_i \rangle and q_i = \langle f(x_i),\, 1 - f(x_i) \rangle as discrete probability
distributions, the Cross Entropy loss can be written in terms of the KL-Divergence as follows:

\log P(\{t_i\} \mid f) = \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

= \sum_i \Big[ t_i \big(\log f(x_i) - \log(t_i)\big) + (1 - t_i)\big(\log(1 - f(x_i)) - \log(1 - t_i)\big) - \big( -t_i \log(t_i) - (1 - t_i) \log(1 - t_i) \big) \Big]

= -\sum_i \big( D_{KL}(p_i \,\|\, q_i) + H(p_i) \big)

so the Cross Entropy loss is E = -\log P(\{t_i\} \mid f) = \sum_i \big( D_{KL}(p_i \,\|\, q_i) + H(p_i) \big).

Since H(p_i) is fixed, minimizing the Cross Entropy loss is the same as minimizing \sum_i D_{KL}(p_i \,\|\, q_i).
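The decomposition of the per-item cross entropy into D_KL(p_i || q_i) plus H(p_i) can be verified directly; this sketch (distributions chosen arbitrarily, with a soft target so that both terms are nonzero) checks the identity for one item:

    import numpy as np

    def cross_entropy(p, q):
        return -np.sum(p * np.log(q))       # H(p, q)

    def entropy(p):
        return -np.sum(p * np.log(p))       # H(p)

    def kl(p, q):
        return np.sum(p * np.log(p / q))    # D_KL(p || q)

    p = np.array([0.9, 0.1])                # plays the role of <t_i, 1 - t_i>
    q = np.array([0.7, 0.3])                # plays the role of <f(x_i), 1 - f(x_i)>
    print(np.isclose(cross_entropy(p, q), kl(p, q) + entropy(p)))   # True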

9
Cross Entropy and Outliers

SSE and Cross Entropy behave a bit differently when it comes to outliers.
SSE is more likely to misclassify outliers, because its loss for each individual item
is bounded between 0 and 1, so even a confidently wrong prediction contributes only a limited penalty.
Cross Entropy is more likely to keep outliers correctly classified, because its loss
grows logarithmically, without bound, as the difference between the target and the
network output approaches 1.
For this reason, Cross Entropy works particularly well for classification tasks that
are unbalanced in terms of negative items vastly outnumbering positive ones (or
vice versa).
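To make the contrast concrete, here is a small comparison (values chosen purely for illustration) of the two per-item losses on a positive example that the network gets badly wrong:

    import numpy as np

    # A positive item (t = 1) that the network confidently classifies as negative.
    t, z = 1.0, 0.01                                   # illustrative values only
    sse = 0.5 * (z - t) ** 2                           # about 0.49: bounded
    ce = -t * np.log(z) - (1 - t) * np.log(1 - z)      # about 4.6, unbounded as z -> 0
    print(sse, ce)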

10
Softmax (6.2.2)

➛ classification task with N classes


➛ neural network with N outputs z_1, ..., z_N
➛ assume the network's estimate for the probability of the correct class being j
is proportional to exp(z_j)
➛ because the probabilities must add up to 1, we need to normalize by dividing
by their sum:

\text{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}

\log \text{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)
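A direct implementation of these formulas (not from the slides; the max-subtraction is a common numerical safeguard added here as an assumption) might look like this:

    import numpy as np

    def log_softmax(z):
        # Subtracting the maximum is a standard trick to avoid overflow in exp();
        # it does not change the result because softmax is shift-invariant.
        z = z - np.max(z)
        return z - np.log(np.sum(np.exp(z)))

    def softmax(z):
        return np.exp(log_softmax(z))

    z = np.array([2.0, 1.0, -1.0])           # example outputs z_1, ..., z_N
    print(softmax(z), softmax(z).sum())      # probabilities, summing to 1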

11
Log Softmax and Backprop

If the correct class is k, we can treat − log Prob(k) as our cost function, and the
gradient is

\frac{d}{dz_i} \log \text{Prob}(k) = \delta_{ik} - \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = \delta_{ik} - \text{Prob}(i),

where \delta_{ik} is the Kronecker delta.


This gradient pushes up the correct class i = k in proportion to the difference
between its assigned probability and 1, and it pushes down the incorrect classes
i ≠ k in proportion to the probabilities assigned to them by the network.
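As with the sigmoid case, the formula δ_ik − Prob(i) can be verified by finite differences; in this sketch (added for illustration) the outputs and the choice of correct class k = 0 are arbitrary:

    import numpy as np

    def log_prob(z, k):
        # log Prob(k) = z_k - log sum_j exp(z_j)
        return z[k] - np.log(np.sum(np.exp(z)))

    z = np.array([2.0, 1.0, -1.0])           # arbitrary outputs; correct class k = 0
    k, h = 0, 1e-6
    prob = np.exp(z) / np.sum(np.exp(z))
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = h
        numeric = (log_prob(z + e, k) - log_prob(z - e, k)) / (2 * h)   # central difference
        analytic = (1.0 if i == k else 0.0) - prob[i]                   # delta_ik - Prob(i)
        print(i, np.isclose(numeric, analytic))                         # True for every i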

12
Softmax, Boltzmann and Sigmoid

If you have studied mathematics or physics, you may be interested to know that
Softmax is related to the Boltzmann Distribution, with the negative of the output z_i
playing the role of the “energy” for “state” i.
The Sigmoid function can also be seen as a special case of Softmax, with two
classes and one output, as follows:
Consider a simplified case where there is a choice between two classes, Class 0
and Class 1. We consider the output z of the network to be associated with Class 1
and we imagine a fixed “output” for Class 0 which is always equal to zero. In this
case, the Softmax becomes:
\text{Prob}(1) = \frac{e^{z}}{e^{z} + e^{0}} = \frac{1}{1 + e^{-z}}

13
Weight Decay (5.2.2)

Sometimes we add a penalty term to the loss function which encourages the
neural network weights w_j to remain small:

E = \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2

This can prevent the weights from “saturating” to very high values.
It is sometimes referred to as “elastic weights” because the weights experience a
force as if there were a spring pulling them back towards the origin according to
Hooke’s Law.
The scaling factor λ needs to be determined empirically, or from experience with similar tasks.
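In code, the penalty is simply added to the data loss; this sketch (names and values are illustrative, with lam standing for λ and eta for the learning rate in the comment) shows the idea:

    import numpy as np

    def loss_with_decay(z, t, weights, lam):
        # SSE term plus the weight decay penalty (lambda / 2) * sum_j w_j^2
        sse = 0.5 * np.sum((z - t) ** 2)
        penalty = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
        return sse + penalty

    # The penalty adds lam * w_j to each weight's gradient, so a gradient descent
    # step becomes  w_j <- w_j - eta * (dE_data/dw_j + lam * w_j),
    # i.e. every weight is pulled back towards zero at each step.
    weights = [np.array([0.5, -2.0]), np.array([1.5])]     # illustrative weight vectors
    print(loss_with_decay(np.array([0.8]), np.array([1.0]), weights, lam=0.01))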

14
Bayesian Inference
H is a class of hypotheses.
Prob(D | h) = probability of data D being generated under hypothesis h ∈ H.
Prob(h | D) = probability that h is correct, given that data D were observed.
Bayes’ Theorem:

\text{Prob}(h \mid D)\, \text{Prob}(D) = \text{Prob}(D \mid h)\, \text{Prob}(h)

\text{Prob}(h \mid D) = \frac{\text{Prob}(D \mid h)\, \text{Prob}(h)}{\text{Prob}(D)}

Prob(h) is called the prior because it is our estimate of the probability of h before
the data have been observed.
Prob(h | D) is called the posterior because it is our estimate of the probability of h
after the data have been observed.

15
Weight Decay as MAP Estimation (5.6.1)
We assume a Gaussian prior distribution for the weights, i.e.
P(w) = \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-w_j^2 / 2\sigma_0^2}

Then

P(w \mid t) = \frac{P(t \mid w)\, P(w)}{P(t)} = \frac{1}{P(t)} \prod_i \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(z_i - t_i)^2} \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-w_j^2 / 2\sigma_0^2}

\log P(w \mid t) = -\frac{1}{2\sigma^2} \sum_i (z_i - t_i)^2 - \frac{1}{2\sigma_0^2} \sum_j w_j^2 + \text{constant}

w_{MAP} = \text{argmax}_{w \in H} \log P(w \mid t) = \text{argmin}_{w \in H}\; \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2, \quad \text{where } \lambda = \sigma^2 / \sigma_0^2

This is known as Maximum A Posteriori (MAP) estimation.
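As a tiny illustration (not from the slides), here is MAP estimation for a one-weight linear model z_i = w·x_i, with σ and σ_0 chosen arbitrarily; the weight-decay coefficient comes out as λ = σ²/σ_0²:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    t = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)   # data noise sigma = 0.3 (assumed)
    sigma, sigma0 = 0.3, 1.0                           # noise and prior standard deviations
    lam = sigma ** 2 / sigma0 ** 2                     # lambda = sigma^2 / sigma0^2

    # For the single-weight model z_i = w * x_i, minimizing
    #   (1/2) * sum_i (w * x_i - t_i)^2 + (lambda / 2) * w^2
    # has the closed-form (ridge regression style) solution below.
    w_map = np.sum(x * t) / (np.sum(x ** 2) + lam)
    print(w_map)   # slightly shrunk towards 0 compared with the unregularized fit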

16
Second Order Methods
Some optimization methods involve computing second order partial derivatives of
the loss function with respect to each pair of weights:
\frac{\partial^2 E}{\partial w_i\, \partial w_j}
➛ Conjugate Gradients
→ approximate the landscape with a quadratic function (paraboloid) and
jump to the minimum of this quadratic function
➛ Natural Gradients (Amari, 1995)
→ use methods from information geometry to find a “natural” re-scaling of
the partial derivatives
These methods are not normally used for deep learning, because the number of
weights is too high. In practice, the Adam optimizer tends to provide similar
benefits with low computational cost.

17
