Deep Learning
Week 3a. Backprop Variations
Alan Blair
School of Computer Science and Engineering
June 5, 2024
Outline
➛ Cross Entropy
➛ Maximum Likelihood
➛ Softmax
➛ Weight Decay
➛ Bayesian Inference
➛ Second Order Methods
Cross Entropy
For function approximation, we normally use the sum squared error (SSE) loss:
E = \frac{1}{2} \sum_i (z_i - t_i)^2

where z_i is the output of the network, and t_i is the target output.
However, for classification tasks, where the target output t_i is either 0 or 1, it is more logical to use the cross entropy loss:

E = \sum_i \bigl[ -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr]
The motivation for these loss functions can be explained using the mathematical
concept of Maximum Likelihood.
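As a quick illustration, here is a minimal NumPy sketch (not part of the original slides) computing both losses for some hypothetical outputs z and binary targets t:

import numpy as np

# Hypothetical outputs z (strictly between 0 and 1) and binary targets t.
z = np.array([0.9, 0.2, 0.7, 0.4])
t = np.array([1.0, 0.0, 1.0, 1.0])

sse = 0.5 * np.sum((z - t) ** 2)                                   # E = 1/2 sum_i (z_i - t_i)^2
cross_entropy = np.sum(-t * np.log(z) - (1 - t) * np.log(1 - z))   # cross entropy loss
print(sse, cross_entropy)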
Maximum Likelihood (5.5)
Least Squares Fit
[Figure: least squares fit of f(x) to data points; axes f(x) vs. x]
Derivation of Least Squares
Due to the Central Limit Theorem, an accumulation of small errors will tend to
produce “noise” in the form of a Gaussian distribution.
Suppose the data are generated by a linear function f() plus Gaussian noise with mean zero and standard deviation \sigma. Then

\mathrm{Prob}(D \mid h) = \mathrm{Prob}(\{t_i\} \mid f) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2\sigma^2}(t_i - f(x_i))^2}

\log \mathrm{Prob}(\{t_i\} \mid f) = \sum_i \Bigl[ -\frac{1}{2\sigma^2} (t_i - f(x_i))^2 - \log(\sigma) - \frac{1}{2}\log(2\pi) \Bigr]

f_{ML} = \mathrm{argmax}_{f \in H} \log \mathrm{Prob}(\{t_i\} \mid f) = \mathrm{argmin}_{f \in H} \sum_i (t_i - f(x_i))^2
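The equivalence can be checked numerically. The sketch below (hypothetical data and candidate functions, assuming NumPy) shows that the candidate with the smaller sum squared error also has the smaller negative log likelihood, since the two differ only by terms that do not depend on f:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = np.linspace(0.0, 1.0, 50)
t = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.shape)   # data = f(x) + Gaussian noise

def neg_log_likelihood(pred, t, sigma):
    # -log Prob({t_i} | f) for Gaussian noise with standard deviation sigma
    return np.sum((t - pred) ** 2 / (2 * sigma ** 2) + np.log(sigma) + 0.5 * np.log(2 * np.pi))

def sse(pred, t):
    return np.sum((t - pred) ** 2)

f_good = 2.0 * x + 1.0        # candidate close to the true function
f_bad = 1.5 * x + 0.8         # a worse candidate
print(sse(f_good, t) < sse(f_bad, t),
      neg_log_likelihood(f_good, t, sigma) < neg_log_likelihood(f_bad, t, sigma))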
Derivation of Cross Entropy (3.9.1)
For binary classification tasks, the target value t_i is either 0 or 1. It makes sense to interpret the output f(x_i) of the neural network as the probability of the true value being 1, i.e.

P(1 \mid f(x_i)) = f(x_i)
P(0 \mid f(x_i)) = 1 - f(x_i)

i.e. \quad P(t_i \mid f(x_i)) = f(x_i)^{t_i} \, (1 - f(x_i))^{(1 - t_i)}

\log P(\{t_i\} \mid f) = \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

f_{ML} = \mathrm{argmax}_{f \in H} \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))
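A small sketch (hypothetical values, assuming NumPy) confirming that this log likelihood is exactly the negative of the cross entropy loss defined earlier, so maximizing one is the same as minimizing the other:

import numpy as np

f_x = np.array([0.9, 0.2, 0.7])    # network outputs, interpreted as P(true value = 1)
t = np.array([1.0, 0.0, 0.0])

log_likelihood = np.sum(t * np.log(f_x) + (1 - t) * np.log(1 - f_x))
cross_entropy = np.sum(-t * np.log(f_x) - (1 - t) * np.log(1 - f_x))
print(np.isclose(log_likelihood, -cross_entropy))   # True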
Cross Entropy and Backprop
Cross Entropy loss is often used in combination with sigmoid activation at the
output node, which guarantees an output strictly between 0 and 1, and also makes
the backprop computations a bit simpler, as follows:
E = \sum_i \bigl[ -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr]

\frac{\partial E}{\partial z_i} = -\frac{t_i}{z_i} + \frac{1 - t_i}{1 - z_i} = \frac{z_i - t_i}{z_i (1 - z_i)}

If z_i = \frac{1}{1 + e^{-s_i}}, then \frac{\partial E}{\partial s_i} = \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial s_i} = z_i - t_i
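This simplification can be verified with a finite-difference check; the sketch below (hypothetical values of s and t, assuming NumPy) compares the numerical derivative of E with respect to s against the z - t formula:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(s, t):
    # cross entropy loss for a single output with sigmoid activation
    z = sigmoid(s)
    return -t * np.log(z) - (1 - t) * np.log(1 - z)

s, t, eps = 0.3, 1.0, 1e-6
numeric = (loss(s + eps, t) - loss(s - eps, t)) / (2 * eps)   # finite-difference dE/ds
analytic = sigmoid(s) - t                                     # the z - t formula above
print(np.isclose(numeric, analytic))                          # True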
Cross Entropy and KL-Divergence
If we consider p_i = \langle t_i, 1 - t_i \rangle and q_i = \langle f(x_i), 1 - f(x_i) \rangle as discrete probability distributions, the Cross Entropy loss can be written as:

\log P(\{t_i\} \mid f) = \sum_i t_i \log f(x_i) + (1 - t_i) \log(1 - f(x_i))

= \sum_i \Bigl[ t_i \bigl(\log f(x_i) - \log(t_i)\bigr) + (1 - t_i)\bigl(\log(1 - f(x_i)) - \log(1 - t_i)\bigr) - \bigl( -t_i \log(t_i) - (1 - t_i) \log(1 - t_i) \bigr) \Bigr]

= \sum_i \bigl[ -D_{KL}(p_i \,\|\, q_i) - H(p_i) \bigr]

Since H(p_i) is fixed, minimizing the Cross Entropy loss is the same as minimizing \sum_i D_{KL}(p_i \,\|\, q_i).
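A numerical sketch of this decomposition (using a hypothetical soft target, since a hard 0/1 target makes H(p_i) = 0 and involves the 0 log 0 = 0 convention):

import numpy as np

t, f = 0.8, 0.6
p = np.array([t, 1.0 - t])      # target distribution p_i
q = np.array([f, 1.0 - f])      # network distribution q_i

cross_entropy = -np.sum(p * np.log(q))    # H(p, q), the per-item cross entropy loss
kl = np.sum(p * np.log(p / q))            # D_KL(p || q)
entropy = -np.sum(p * np.log(p))          # H(p)
print(np.isclose(cross_entropy, kl + entropy))   # True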
Cross Entropy and Outliers
SSE and Cross Entropy behave a bit differently when it comes to outliers.
SSE is more likely to misclassify outliers, because the squared error for each item is bounded between 0 and 1, so even a badly misclassified item contributes only a limited amount to the total loss.
Cross Entropy is more likely to keep outliers correctly classified, because the loss for an item grows without bound (logarithmically) as the difference between the target and the network output approaches 1.
For this reason, Cross Entropy works particularly well for classification tasks that
are unbalanced in terms of negative items vastly outnumbering positive ones (or
vice versa).
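A rough numerical illustration of the bounded versus unbounded behaviour above (hypothetical outputs for a single positive item, assuming NumPy):

import numpy as np

# Per-item loss for a positive item (t = 1) that the network pushes towards 0:
# the squared error saturates, while the cross entropy keeps growing.
t = 1.0
for z in [0.5, 0.1, 0.01, 0.001]:
    sse_item = 0.5 * (z - t) ** 2
    ce_item = -t * np.log(z) - (1 - t) * np.log(1 - z)
    print(f"z = {z:6.3f}   SSE = {sse_item:.3f}   CE = {ce_item:.3f}")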
Softmax (6.2.2)
For multi-class classification with N classes, the network computes one output z_i for each class i, and these outputs are converted into probabilities using the Softmax function:

\mathrm{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
Log Softmax and Backprop
If the correct class is k, we can treat -\log \mathrm{Prob}(k) as our cost function, and the gradient is

\frac{d}{dz_i} \log \mathrm{Prob}(k) = \delta_{ik} - \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = \delta_{ik} - \mathrm{Prob}(i)

where \delta_{ik} = 1 if i = k, and 0 otherwise.
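A finite-difference sketch of this gradient (hypothetical outputs, assuming NumPy):

import numpy as np

def log_softmax(z):
    z = z - np.max(z)                        # subtract max for numerical stability
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1.0, -0.5, 2.0])               # hypothetical outputs
k = 0                                        # index of the correct class
probs = np.exp(log_softmax(z))

eps = 1e-6
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric = (log_softmax(z_plus)[k] - log_softmax(z_minus)[k]) / (2 * eps)
    analytic = (1.0 if i == k else 0.0) - probs[i]    # delta_ik - Prob(i)
    print(i, np.isclose(numeric, analytic))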
Softmax, Boltzmann and Sigmoid
If you have studied mathematics or physics, you may be interested to know that
Softmax is related to the Boltzmann Distribution, with the negative of the output z_i playing the role of the "energy" for "state" i.
The Sigmoid function can also be seen as a special case of Softmax, with two
classes and one output, as follows:
Consider a simplified case where there is a choice between two classes, Class 0
and Class 1. We consider the output z of the network to be associated with Class 1
and we imagine a fixed “output” for Class 0 which is always equal to zero. In this
case, the Softmax becomes:
\mathrm{Prob}(1) = \frac{e^z}{e^z + e^0} = \frac{1}{1 + e^{-z}}
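A one-line numerical check of this correspondence (hypothetical value of z, assuming NumPy):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = 0.7                                   # hypothetical output for Class 1
p = softmax(np.array([z, 0.0]))           # Class 0 has a fixed "output" of zero
print(np.isclose(p[0], sigmoid(z)))       # Prob(1) from softmax equals sigmoid(z)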
Weight Decay (5.2.2)
Sometimes we add a penalty term to the loss function which encourages the neural network weights w_j to remain small:

E = \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2
This can prevent the weights from “saturating” to very high values.
It is sometimes referred to as “elastic weights” because the weights experience a
force as if there were a spring pulling them back towards the origin according to
Hooke’s Law.
The scaling factor \lambda needs to be determined empirically, by trial and error.
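As a sketch of how the penalty enters training, the code below (a hypothetical linear model and data, assuming NumPy) adds the \lambda term to the loss and to the gradient, so each update pulls the weights back towards zero:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # hypothetical inputs
t = rng.normal(size=20)                   # hypothetical targets
w = rng.normal(size=3)                    # hypothetical weights
lam = 0.01                                # the scaling factor lambda

z = X @ w
loss = 0.5 * np.sum((z - t) ** 2) + 0.5 * lam * np.sum(w ** 2)
grad = X.T @ (z - t) + lam * w            # the penalty adds lam * w to the gradient
w = w - 0.1 * grad                        # one gradient descent step with weight decay
print(loss)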
Bayesian Inference
H is a class of hypotheses.
Prob(D | h) = probability of data D being generated under hypothesis h ∈ H.
Prob(h | D) = probability that h is correct, given that data D were observed.
Bayes’ Theorem:
\mathrm{Prob}(h \mid D)\,\mathrm{Prob}(D) = \mathrm{Prob}(D \mid h)\,\mathrm{Prob}(h)

\mathrm{Prob}(h \mid D) = \frac{\mathrm{Prob}(D \mid h)\,\mathrm{Prob}(h)}{\mathrm{Prob}(D)}
Prob(h) is called the prior because it is our estimate of the probability of h before
the data have been observed.
Prob(h | D) is called the posterior because it is our estimate of the probability of h
after the data have been observed.
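A minimal numerical illustration of Bayes' Theorem (two hypothetical hypotheses with made-up priors and likelihoods, assuming NumPy):

import numpy as np

prior = np.array([0.7, 0.3])              # Prob(h) for hypotheses h0, h1
likelihood = np.array([0.1, 0.6])         # Prob(D | h) for the observed data D

evidence = np.sum(likelihood * prior)             # Prob(D)
posterior = likelihood * prior / evidence         # Prob(h | D), by Bayes' Theorem
print(posterior, posterior.sum())                 # posterior probabilities sum to 1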
Weight Decay as MAP Estimation (5.6.1)
We assume a Gaussian prior distribution for the weights, i.e.
P(w) = \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_0} \, e^{-w_j^2 / 2\sigma_0^2}

Then

P(w \mid t) = \frac{P(t \mid w)\,P(w)}{P(t)} = \frac{1}{P(t)} \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2\sigma^2}(z_i - t_i)^2} \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_0} \, e^{-w_j^2 / 2\sigma_0^2}

\log P(w \mid t) = -\frac{1}{2\sigma^2} \sum_i (z_i - t_i)^2 - \frac{1}{2\sigma_0^2} \sum_j w_j^2 + \text{constant}

w_{MAP} = \mathrm{argmax}_{w \in H} \log P(w \mid t) = \mathrm{argmin}_{w \in H} \Bigl[ \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2 \Bigr], \quad \text{where } \lambda = \sigma^2 / \sigma_0^2
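A numerical sketch of this correspondence (hypothetical residuals and weights, assuming NumPy): up to an additive constant, the log posterior and the weight decay objective differ only by a factor of -1/\sigma^2, so they are optimized by the same weights when \lambda = \sigma^2 / \sigma_0^2:

import numpy as np

rng = np.random.default_rng(0)
sigma, sigma0 = 0.5, 2.0
lam = sigma ** 2 / sigma0 ** 2

residuals = rng.normal(size=10)           # hypothetical z_i - t_i
w = rng.normal(size=5)                    # hypothetical weights

log_posterior = (-np.sum(residuals ** 2) / (2 * sigma ** 2)
                 - np.sum(w ** 2) / (2 * sigma0 ** 2))        # omitting the constant
weight_decay_loss = 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.sum(w ** 2)
print(np.isclose(log_posterior, -weight_decay_loss / sigma ** 2))   # True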
Second Order Methods
Some optimization methods involve computing second order partial derivatives of
the loss function with respect to each pair of weights:
\frac{\partial^2 E}{\partial w_i \, \partial w_j}
➛ Conjugate Gradients
→ approximate the landscape with a quadratic function (paraboloid) and
jump to the minimum of this quadratic function
➛ Natural Gradients (Amari, 1995)
→ use methods from information geometry to find a “natural” re-scaling of
the partial derivatives
These methods are not normally used for deep learning, because the number of
weights is too high. In practice, the Adam optimizer tends to provide similar
benefits with low computational cost.
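For intuition only, the sketch below (a toy quadratic loss, assuming NumPy) takes a single second order step using the Hessian to jump directly to the minimum of the quadratic; it illustrates the quadratic-jump idea, not the conjugate gradient or natural gradient algorithms themselves:

import numpy as np

# Toy quadratic loss E(w) = 0.5 * w^T A w - b^T w, whose matrix of second order
# partial derivatives (the Hessian) is A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

w = np.zeros(2)
w = w - np.linalg.solve(A, grad(w))       # jump straight to the minimum of the quadratic
print(np.allclose(grad(w), 0.0))          # True: the gradient vanishes at the minimum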