
EEE 485-585, SPRING 2019

Chapter 8 - Perceptron, Neural Networks and Backpropagation

Slide 8.6

The perceptron is an online algorithm that takes one data instance at each iteration and updates the weights based on its performance on that particular data instance. When the classes are linearly separable, it converges in a finite number of iterations and is able to classify all training data points correctly. The learning rate of the perceptron can be taken as a constant $\eta(n) = \eta$.
When the prediction at the $n$th iteration is correct, we have $y(n) - \hat{y}(n) = 0$. This implies that $w(n+1) = w(n)$.
When the prediction at the $n$th iteration is wrong, there are two possibilities:
• If $\hat{y}(n) = 1$ and $y(n) = 0$, then $w(n+1) = w(n) - \eta(n)\, x(n)$, so the weight moves in the direction opposite to $x(n)$.
• If $\hat{y}(n) = 0$ and $y(n) = 1$, then $w(n+1) = w(n) + \eta(n)\, x(n)$, so the weight moves in the direction of $x(n)$.
This ensures that the weights are updated in a way that corrects mistakes.
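As a concrete illustration of the update rule above, here is a minimal perceptron sketch in Python/NumPy. The function name, the inclusion of the bias as $x_0 = 1$, and the stopping criterion are choices made for this example, not part of the slides.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Online perceptron for labels y in {0, 1}.

    X is an (n, d) array of inputs; a bias input x_0 = 1 is prepended here.
    Returns the weight vector, or the last iterate if the data are not
    separated within max_epochs passes.
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])        # prepend bias input x_0 = 1
    w = np.zeros(d + 1)
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(Xb, y):
            y_hat = 1 if w @ x_n >= 0 else 0    # step activation
            w = w + eta * (y_n - y_hat) * x_n   # covers both mistake cases; no change when correct
            errors += int(y_hat != y_n)
        if errors == 0:                         # all training points classified correctly
            break
    return w
```

Note that the single line `w + eta * (y_n - y_hat) * x_n` implements both cases: the factor $y(n) - \hat{y}(n)$ is $-1$, $+1$, or $0$.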

Slide 8.10

Single-layer perceptrons are only capable of learning linearly separable patterns. The XOR problem is not linearly separable. However, multilayer neural networks can solve this problem; moreover, they can approximate much more general functions. In class, we discussed how the XOR problem can be solved by a neural network with one hidden layer, where we calculated the weights by hand. This is not possible for high-dimensional and complex datasets. Thus, we need methods that automatically learn the weights from the training dataset.
The perceptron we considered used the step function as the activation function. Due to its discontinuity at 0 and its zero derivative everywhere except 0, the step function is not practical when learning the weights via gradient descent (the standard method for training neural networks). Therefore, other activation functions are used to train neural networks with many neurons and many layers. Some examples are given below; the first two are sigmoid nonlinearities.
Logistic function:
$$\phi(v) = \frac{1}{1 + \exp(-av)}, \quad a > 0,$$
where the hyperparameter $a$ can be taken as 1. The derivative is $\phi'(v) = a\phi(v)(1 - \phi(v))$.
Hyperbolic tangent function:
$$\phi(v) = a \tanh(bv), \quad \text{where } \tanh(z) = \frac{\sinh z}{\cosh z} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$$
Softplus:
$$\phi(v) = \ln(1 + e^{v})$$
Rectified linear unit (ReLU):
$$\phi(v) = \max\{0, v\}$$

What is the relation between softplus and ReLU?
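The following sketch implements the four activations above in NumPy and prints softplus and ReLU side by side on a small grid, which can help with the question above. The function names and the grid of test points are choices made for this example.

```python
import numpy as np

def logistic(v, a=1.0):
    """Logistic (sigmoid) activation with slope hyperparameter a."""
    return 1.0 / (1.0 + np.exp(-a * v))

def tanh_act(v, a=1.0, b=1.0):
    """Scaled hyperbolic tangent activation a * tanh(b v)."""
    return a * np.tanh(b * v)

def softplus(v):
    """Softplus activation ln(1 + e^v)."""
    return np.log1p(np.exp(v))

def relu(v):
    """Rectified linear unit max{0, v}."""
    return np.maximum(0.0, v)

# Compare softplus and ReLU on a grid of inputs.
v = np.linspace(-5, 5, 11)
print(np.round(relu(v), 3))
print(np.round(softplus(v), 3))   # a smooth curve that stays close to ReLU away from 0
```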

Slide 8.11

In this chapter, we consider feedforward neural networks. A feedforward neural network has a very special structure. As shown in the figure, every neuron in layer $l$ is connected to all neurons in layer $l+1$. There are no direct connections between neurons in non-adjacent layers or between neurons in the same layer. Although the figure shows a neural network with two hidden layers, we will present the results for neural networks with an arbitrary number of layers.
Feedforward neural network structure:
Consider a neural network with $L - 1$ hidden layers. We call the input layer layer 0. Hidden layers are indexed by $l \in \{1, \dots, L-1\}$, and the output layer is layer $L$. We use the following notation:

• $d^{(l)}$: number of neurons in layer $l$. We index the bias neuron (bias term) by 0.
• $x_i^{(l-1)}$, $0 \le i \le d^{(l-1)}$: output of the $i$th neuron at layer $l-1$.
• $x_j^{(l)}$, $0 \le j \le d^{(l)}$: output of the $j$th neuron at layer $l$.
• $w_{ij}^{(l)}$, $0 \le i \le d^{(l-1)}$, $1 \le j \le d^{(l)}$: weight between the $i$th neuron in layer $l-1$ and the $j$th neuron in layer $l$.
• $w = \{w_{ij}^{(l)}\}$: set of all weights.
• $v_j^{(l)}$, $1 \le j \le d^{(l)}$: induced local field of the $j$th neuron at layer $l$.

We have for $j > 0$:
$$v_j^{(l)} = \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)}.$$
Thus, the output of neuron $j$ at layer $l$ is given as
$$x_j^{(l)} = \phi\big(v_j^{(l)}\big).$$

Performing a forward pass to compute the output given the input:

In a feedforward neural network, given a $d^{(0)}$-dimensional input $x = [x_1^{(0)}, \dots, x_{d^{(0)}}^{(0)}]^T$, the $d^{(L)}$-dimensional output $x^{(L)} = [x_1^{(L)}, \dots, x_{d^{(L)}}^{(L)}]^T$ can be computed by calculating the outputs of layers 1 to $L$ consecutively:
$$x = \begin{bmatrix} x_1^{(0)} \\ \vdots \\ x_{d^{(0)}}^{(0)} \end{bmatrix} \rightarrow \begin{bmatrix} x_1^{(1)} \\ \vdots \\ x_{d^{(1)}}^{(1)} \end{bmatrix} \rightarrow \dots \rightarrow \begin{bmatrix} x_1^{(l)} = \phi\big(\sum_{i=0}^{d^{(l-1)}} w_{i1}^{(l)} x_i^{(l-1)}\big) \\ \vdots \\ x_{d^{(l)}}^{(l)} = \phi\big(\sum_{i=0}^{d^{(l-1)}} w_{i d^{(l)}}^{(l)} x_i^{(l-1)}\big) \end{bmatrix} \rightarrow \dots \rightarrow \begin{bmatrix} x_1^{(L)} \\ \vdots \\ x_{d^{(L)}}^{(L)} \end{bmatrix} = x^{(L)}$$

Assuming that the activation functions are fixed, the output $x^{(L)}$ is a (vector-valued) function of the input and the weights in the network, and thus it can be written as $f_w(x) = [f_{w,1}(x), \dots, f_{w,d^{(L)}}(x)]^T = x^{(L)}$. The $k$th component of $f_w(x)$ is $f_{w,k}(x) = x_k^{(L)}$.
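The forward pass above can be written compactly with one weight matrix per layer. Below is a minimal NumPy sketch under the convention that row 0 of each weight matrix holds the bias weights and the bias output $x_0^{(l)} = 1$ is prepended at every layer; this layout and the example dimensions are assumptions of the sketch, not part of the notes.

```python
import numpy as np

def forward_pass(x, weights, phi):
    """Compute x^(L) for a feedforward network.

    x       : input vector of length d^(0) (without the bias entry).
    weights : list of arrays; weights[l] has shape (d^(l-1) + 1, d^(l)),
              with row 0 holding the bias weights w_{0j}^(l).
    phi     : activation function applied elementwise at every layer.
    """
    out = np.asarray(x, dtype=float)
    for W in weights:
        out = np.concatenate(([1.0], out))   # prepend bias output x_0 = 1
        v = W.T @ out                        # induced local fields v_j^(l)
        out = phi(v)                         # layer outputs x_j^(l) = phi(v_j^(l))
    return out                               # x^(L)

# Example: a 2-3-1 network with logistic activations and random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
phi = lambda v: 1.0 / (1.0 + np.exp(-v))
print(forward_pass([0.5, -1.0], weights, phi))
```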
Loss functions:
Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$. The loss on data instance $i$ is given as $l(f_w(x_i), y_i)$, where $l(\cdot, \cdot)$ is usually taken as a continuously differentiable function for the sake of computing gradients. The loss over the entire dataset $D$ is given as
$$J(w) = \sum_{i=1}^{n} l(f_w(x_i), y_i).$$

In a regression problem usually the identity activation is used at the output layer. Thus, $x_k^{(L)} = v_k^{(L)}$. In this case, $l(\cdot, \cdot)$ can be taken as the squared error
$$l(f_w(x_i), y_i) = \sum_{k=1}^{d^{(L)}} \big(y_{ik} - f_{w,k}(x_i)\big)^2.$$
In a regression problem usually $d^{(L)} = 1$. In this case, the loss becomes
$$l(f_w(x_i), y_i) = \big(y_i - f_w(x_i)\big)^2.$$
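A small sketch of the squared-error loss for both the multi-output and the scalar-output case; the function name and test values are made up for illustration.

```python
import numpy as np

def squared_error(y, y_pred):
    """Squared-error loss (y_ik - f_{w,k}(x_i))^2, summed over output dimensions k."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    y_pred = np.atleast_1d(np.asarray(y_pred, dtype=float))
    return float(np.sum((y - y_pred) ** 2))

print(squared_error(1.5, 1.2))                 # d^(L) = 1: (1.5 - 1.2)^2 = 0.09
print(squared_error([1.0, 0.0], [0.8, 0.3]))   # multi-output case, summed over k
```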

When using a neural network for a classification problem, each class is assigned to one output neuron. Thus, if there are $K$ classes, then $d^{(L)} = K$. It is also customary for the output to represent a probability distribution over the classes. For this, usually the softmax activation function is used at the output layer. Then, the output becomes
$$x_k^{(L)} = \frac{e^{v_k^{(L)}}}{\sum_{j=1}^{d^{(L)}} e^{v_j^{(L)}}}, \quad 1 \le k \le d^{(L)}.$$
Predictions can be produced by the argmax operation, i.e., $\hat{y} = \arg\max_{1 \le k \le d^{(L)}} x_k^{(L)}$.
In addition, labels of data instances are one-hot encoded, i.e., if data instance $i$ belongs to class $k$, then its label is represented by $y_i = [y_{i1}, \dots, y_{iK}]$ where $y_{ik} = 1$ and $y_{ij} = 0$ for $j \ne k$. A suitable loss function for classification problems is the cross-entropy
$$l(f_w(x_i), y_i) = -\sum_{k=1}^{d^{(L)}} y_{ik} \log\big(f_{w,k}(x_i)\big).$$
Since class probabilities are always in $[0, 1]$, this loss depends on the log of the probability that the output puts on the correct label.
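Here is a small sketch of the softmax output and the cross-entropy loss for a one-hot label. The max-subtraction inside softmax is a standard numerical-stability trick added for this example, and the sample local fields are made up.

```python
import numpy as np

def softmax(v):
    """Softmax over the output-layer local fields v^(L)."""
    z = np.exp(v - np.max(v))          # subtract max for numerical stability
    return z / np.sum(z)

def cross_entropy(y_onehot, probs):
    """Cross-entropy -sum_k y_k log(p_k) for a one-hot label."""
    return float(-np.sum(y_onehot * np.log(probs)))

v_L = np.array([2.0, 0.5, -1.0])       # local fields of the K = 3 output neurons
probs = softmax(v_L)                   # network output x^(L): a distribution over classes
y = np.array([1.0, 0.0, 0.0])          # one-hot label: instance belongs to class 1
print(probs, probs.argmax())           # predicted class via argmax
print(cross_entropy(y, probs))         # loss = -log(probability of the correct class)
```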
Minimizing the loss:
Similar to what is done in the previous chapters, our objective is to select a set of weights that minimizes the loss, i.e.,
$$w^{*} = \arg\min_{w} J(w).$$
This is a daunting task since there are many parameters to optimize. Moreover, $J(w)$ can have many local minimizers whose performance is much worse than the global minimizer. Nevertheless, in the following part, we will explain how gradient descent can be implemented in a feedforward neural network in a computationally efficient way to minimize $J(w)$.
Batch gradient descent: Start with a random set of initial weights $w(0)$. At each iteration perform the following update
$$w(n+1) = w(n) - \eta \nabla J(w(n)),$$
where $J(w(n))$ denotes the loss of the neural network with weights $w(n)$ on the entire training data. This is not the preferred method since we need to evaluate the loss on the entire dataset at each iteration, which is computationally prohibitive.
Stochastic gradient descent (SGD): Start with a random set of initial weights $w(0)$. At each iteration randomly pick one data instance $(x(n), y(n))$ from $D$, and perform the following update:
$$w(n+1) = w(n) - \eta \nabla l\big(f_{w(n)}(x(n)), y(n)\big).$$
This allows faster updates. Moreover, the randomization also helps by avoiding bad local minima, and the expectation of the gradient taken over the randomness of the chosen data sample is equal to the batch gradient.
To make SGD less noisy, usually mini-batch gradient descent is preferred, where the gradient is computed over a mini-batch of samples (e.g., 10 samples) instead of a single sample.
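The three variants differ only in how many samples enter each gradient evaluation. Below is a schematic mini-batch SGD loop; grad_loss is a hypothetical function returning the gradient of the loss over the given samples (for a neural network it would be supplied by backpropagation, discussed next), and the batch size and number of epochs are arbitrary choices for this sketch.

```python
import numpy as np

def minibatch_sgd(w0, X, Y, grad_loss, eta=0.1, batch_size=10, epochs=50, seed=0):
    """Generic mini-batch SGD loop over arrays X (inputs) and Y (labels).

    grad_loss(w, X_batch, Y_batch) must return the gradient of the loss on the
    batch with respect to the weights w. batch_size=1 recovers SGD;
    batch_size=len(X) recovers batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                       # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad_loss(w, X[idx], Y[idx])   # w(n+1) = w(n) - eta * gradient
    return w
```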

Slide 8.14

Here, we explain how to compute gradients using backpropagation. We focus on SGD. Let
$$e(w) = l(f_w(x), y).$$
For each layer-$l$ neuron $j$, let
$$\delta_j^{(l)} = -\frac{\partial e(w)}{\partial v_j^{(l)}}.$$
By the chain rule, we can calculate $\delta_i^{(l-1)}$ for each layer-$(l-1)$ neuron $i$ if we know $\delta_j^{(l)}$ for each layer-$l$ neuron $j$, as follows:
$$\delta_i^{(l-1)} = -\frac{\partial e(w)}{\partial v_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \left(-\frac{\partial e(w)}{\partial v_j^{(l)}}\right) \times \frac{\partial v_j^{(l)}}{\partial x_i^{(l-1)}} \times \frac{\partial x_i^{(l-1)}}{\partial v_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)} w_{ij}^{(l)} \phi'\big(v_i^{(l-1)}\big)$$

Thus, all gradients can be computed in the following way:

• First compute $\delta_j^{(L)}$ for $j = 1, \dots, d^{(L)}$.
• Use $\delta_j^{(l)}$, $j = 1, \dots, d^{(l)}$, to compute $\delta_j^{(l-1)}$ for all $j = 1, \dots, d^{(l-1)}$.

To update the weights, we need to compute the gradients with respect to the weights. This can be done easily after the $\delta$s are computed. Note that
$$\frac{\partial e(w)}{\partial w_{ij}^{(l)}} = \frac{\partial e(w)}{\partial v_j^{(l)}} \times \frac{\partial v_j^{(l)}}{\partial w_{ij}^{(l)}} = -\delta_j^{(l)} x_i^{(l-1)}.$$

Finally, we show how the $\delta$s can be computed at the output layer. We illustrate this using the squared-error loss function. A similar analysis can be done for the cross-entropy loss function as well. Let
$$e(w) = \frac{1}{2} \sum_{k=1}^{d^{(L)}} e_k^2(w), \quad \text{where } e_k(w) = (y_k - \hat{y}_k) = \big(y_k - \phi(v_k^{(L)})\big).$$
Then,
$$\delta_j^{(L)} = -\frac{\partial \left(\frac{1}{2} \sum_{k=1}^{d^{(L)}} e_k^2(w)\right)}{\partial v_j^{(L)}} = -e_j(w) \frac{\partial e_j(w)}{\partial v_j^{(L)}} = e_j(w)\, \phi'\big(v_j^{(L)}\big).$$
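Putting the pieces together, here is a sketch of one backpropagation pass for a network with logistic activations in every layer and the squared-error loss above. The weight-matrix layout (row 0 for bias weights) matches the earlier forward-pass sketch, and the example dimensions and learning rate are assumptions of this illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop(x, y, weights):
    """One forward + backward pass; returns de/dw for every layer.

    weights[l] has shape (d^(l-1) + 1, d^(l)), with row 0 holding bias weights.
    Uses logistic activations everywhere and e(w) = 0.5 * sum_k (y_k - xhat_k)^2,
    so phi'(v) = x * (1 - x) with x = phi(v).
    """
    # Forward pass, storing each layer's outputs with the bias x_0 = 1 prepended.
    outputs = [np.concatenate(([1.0], np.asarray(x, dtype=float)))]
    for W in weights:
        v = W.T @ outputs[-1]
        outputs.append(np.concatenate(([1.0], sigmoid(v))))
    x_L = outputs[-1][1:]                               # network output x^(L)

    # Output-layer deltas: delta_j^(L) = e_j(w) * phi'(v_j^(L)).
    delta = (np.asarray(y, dtype=float) - x_L) * x_L * (1.0 - x_L)

    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # de/dw_ij = -delta_j * x_i^(l-1), including the bias entry i = 0.
        grads[l] = -np.outer(outputs[l], delta)
        if l > 0:
            x_prev = outputs[l][1:]                     # x_i^(l-1) for i >= 1 (skip bias)
            # delta_i^(l-1) = sum_j delta_j^(l) w_ij^(l) * phi'(v_i^(l-1)).
            delta = (weights[l][1:, :] @ delta) * x_prev * (1.0 - x_prev)
    return grads

# One SGD step on a 2-3-1 network with random weights, for illustration only.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
grads = backprop([0.5, -1.0], [1.0], weights)
eta = 0.1
weights = [W - eta * g for W, g in zip(weights, grads)]
```

The backward loop computes the $\delta$s layer by layer exactly in the order listed above, and the returned arrays are the gradients $\partial e(w)/\partial w_{ij}^{(l)}$ used in the SGD update.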
