Multilayer neural
networks
Agenda
Some historical notes
Multilayer neural networks
Backpropagation
Error surfaces and feature mapping
Speed ups
Conclusions
Some historical notes
Rosenblatt’s perceptron (1958): a single neuron
with adjustable synaptic weights and bias.
Perceptron convergence theorem (Rosenblatt,
1962): if the patterns used to train the
perceptron are drawn from two linearly
separable classes, then the algorithm
converges and positions the decision surface in
the form of a hyperplane between the two classes.
The limitations of networks implementing linear
discriminants were well known in the 1950s and
1960s.
Some historical notes
A single neuron -> the class of solutions that
can be obtained is not general enough ->
multilayer perceptron
Widrow & Hoff (1960): the least-mean-square
algorithm (LMS) = delta rule: three-layer
networks designed by hand (fixed input-to-
hidden weights, trained hidden-to-output
weights).
The development of backpropagation: Kalman
filtering (Kalman 1960; Bryson, Denham, Dreyfus
1963, 1969).
Some historical notes
Werbos: Beyond Regression: New
Tools for Prediction and Analysis in
the Behavioral Sciences, Ph.D. thesis,
Harvard University, 1974
Parker 1982, 1985
Rumelhart, Hinton, Williams: Learning
representations by back-propagating
errors, Nature 323, pp. 533-536, 1986
Multilayer Neural Networks
Input layer
Hidden layer
Output layer
Bias unit = neuron
net_j = w_jᵀx
y_j = f(net_j)
f(net) = sgn(net)
z_k = f(net_k)
Multilayer Neural Networks
The XOR problem:
(x1 OR x2) AND NOT (x1 AND x2)
y1: decision boundary x1 + x2 + 0.5 = 0;
if x1 + x2 + 0.5 > 0 -> y1 = 1, otherwise −1
y2: decision boundary x1 + x2 − 1.5 = 0;
if x1 + x2 − 1.5 > 0 -> y2 = 1, otherwise −1
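These two hidden units can be wired, together with one output unit, into a network that computes XOR on bipolar (±1) inputs. A minimal sketch: the hidden-unit thresholds come from the slide, while the output weights (0.7, −0.4, bias −1.0) are one illustrative choice that realizes "y1 AND NOT y2".

```python
def sgn(v):
    return 1 if v > 0 else -1

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)   # hidden unit 1: acts as OR on +/-1 inputs
    y2 = sgn(x1 + x2 - 1.5)   # hidden unit 2: acts as AND
    # output weights are an illustrative choice realizing y1 AND NOT y2
    return sgn(0.7 * y1 - 0.4 * y2 - 1.0)

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, '->', xor_net(a, b))
```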
Multilayer Neural Networks
Any desired
continuous function
can be implemented by
a three-layer network,
given a sufficient number
of hidden units, proper
nonlinearities, and
weights (Kolmogorov)
Feedforward
Learning
(Demo chapter 11)
Multilayer Neural Networks
Nonlinear multilayer
networks
How do we set the
weights of a three-
layer neural network
based on the training
patterns and the
desired output?
Backpropagation Algorithm
The LMS
algorithm exists
for linear
systems
Training error
Learning rule
Backpropagation Algorithm
Learning rule
Hidden-to-output
Input-to-hidden
Note that the weights wij are
initialized with
random values
Demo Chapter 11
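The two learning rules can be sketched as one stochastic backpropagation step for a d-nH-c network with tanh units and squared error J = ½‖t − z‖². The architecture size, learning rate, and epoch count below are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(x, t, W1, W2, eta=0.1):
    """One stochastic backprop step; W1, W2 are updated in place."""
    y = np.tanh(W1 @ x)                            # hidden: y_j = f(net_j)
    z = np.tanh(W2 @ y)                            # output: z_k = f(net_k)
    delta_out = (t - z) * (1.0 - z**2)             # output-unit sensitivity
    delta_hid = (1.0 - y**2) * (W2.T @ delta_out)  # error backpropagated to hidden units
    W2 += eta * np.outer(delta_out, y)             # hidden-to-output rule
    W1 += eta * np.outer(delta_hid, x)             # input-to-hidden rule

# XOR with bipolar targets; the third input component is the bias unit
X = np.array([[-1, -1, 1], [-1, 1, 1], [1, -1, 1], [1, 1, 1]], float)
T = np.array([[-1.0], [1.0], [1.0], [-1.0]])

W1 = rng.uniform(-0.5, 0.5, (3, 3))   # weights initialized with random values
W2 = rng.uniform(-0.5, 0.5, (1, 3))

def total_error(W1, W2):
    return sum(0.5 * np.sum((t - np.tanh(W2 @ np.tanh(W1 @ x)))**2)
               for x, t in zip(X, T))

err_before = total_error(W1, W2)
for epoch in range(2000):
    for x, t in zip(X, T):
        train_step(x, t, W1, W2)
err_after = total_error(W1, W2)
```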
Backpropagation Algorithm
Compare with the LMS
algorithm.
1) Method of Steepest
Descent
The direction of
steepest descent is in the
direction opposite to
the gradient vector g = ∇E(w):
w(n+1) = w(n) − ηg(n)
where η is the step size or
learning-rate parameter
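A minimal sketch of the update w(n+1) = w(n) − ηg(n) on a simple quadratic bowl; the step size is an illustrative choice (for this E it must satisfy η < 2/λmax = 0.2 to converge).

```python
import numpy as np

def grad(w):
    # g = gradient of E(w) = 0.5*(w1**2 + 10*w2**2)
    return np.array([w[0], 10.0 * w[1]])

w = np.array([4.0, 1.0])
eta = 0.1                      # learning-rate parameter (illustrative)
for _ in range(100):
    w = w - eta * grad(w)      # w(n+1) = w(n) - eta * g(n)
# w approaches the minimum at the origin
```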
Backpropagation Algorithm
Training set = a set of patterns with known
labels
Stochastic training = patterns are chosen
randomly from the training set
Batch training = all patterns are presented to
the network before learning takes place
On-line protocol = each pattern is presented
once and only once, no memory in use
Epoch = a single presentation of all patterns in
the training set. The number of epochs is an
indication of the relative amount of learning.
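The stochastic and batch protocols above differ only in when the weight update takes place; a sketch of one epoch under each, where `grad_on(pattern, w)` is a hypothetical per-pattern gradient function.

```python
import random

def stochastic_epoch(patterns, w, grad_on, eta):
    # stochastic: a pattern is chosen randomly, weights updated after each one
    for _ in range(len(patterns)):
        p = random.choice(patterns)
        g = grad_on(p, w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def batch_epoch(patterns, w, grad_on, eta):
    # batch: all patterns are presented before learning takes place
    total = [0.0] * len(w)
    for p in patterns:
        total = [ti + gi for ti, gi in zip(total, grad_on(p, w))]
    return [wi - eta * ti for wi, ti in zip(w, total)]
```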
Backpropagation Algorithm
Stopping criterion
Backpropagation Algorithm
Learning set
Validation set
Test set
Stopping criterion
Learning curve, the
average error per
pattern
Cross-validation
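The validation-set stopping criterion can be sketched as follows: train while monitoring the average error per pattern on the validation set, and stop when it no longer improves. `train_epoch` and `val_error` are hypothetical callbacks, and the patience threshold is an illustrative choice.

```python
def train_with_early_stopping(train_epoch, val_error, max_epochs=100, patience=5):
    best, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = val_error()        # one point on the validation learning curve
        if err < best:
            best, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error stopped improving
                break
    return best_epoch, best
```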
Error Surfaces and Feature
Mapping
Note that error backpropagation is based
on gradient descent in a criterion
function J(w) that is represented
as an error surface over the weights.
Error Surfaces and Feature
Mapping
The total training
error is minimized.
It usually
decreases
monotonically,
even though this is
not the case for the
error on each
individual pattern.
Error Surfaces and Feature
Mapping
Hidden-to-output
weights ~ a linear
discriminant
Input-to-hidden
weights ~ ”matched
filter”
Practical Techniques for
Improving Backpropagation
How to improve
convergence,
performance, and
results?
Neuron:
Sigmoid function =
centered at zero and
antisymmetric
Practical Techniques for
Improving Backpropagation
Scaling input variables
= the input patterns should be shifted so
that the average over the training set of
each feature is zero.
= the full data set should be scaled to have
the same variance in each feature
component
Note that this standardization can only be done
for stochastic and batch learning!
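A sketch of this standardization, shifting each feature to zero mean over the training set and scaling to unit variance; the same μ and σ are kept so test data can be transformed identically.

```python
import numpy as np

def standardize(X):
    mu = X.mean(axis=0)          # per-feature average over the training set
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0      # guard against constant features
    return (X - mu) / sigma, mu, sigma

X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
Xs, mu, sigma = standardize(X)   # Xs: zero mean, unit variance per feature
```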
Practical Techniques for
Improving Backpropagation
When the training set is small one can generate
surrogate training patterns.
In the absence of problem-specific information,
the surrogate patterns should be made by
adding d-dimensional Gaussian noise to true
training points, the category label should be left
unchanged.
If we know about the source of variation among
patterns we can manufacture training data.
The number of hidden units should be less than
the total number of training points n, say roughly
n/10.
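Generating surrogate patterns as described, by adding d-dimensional Gaussian noise to true training points while leaving the category labels unchanged; the noise scale and number of copies are illustrative assumptions.

```python
import numpy as np

def augment(X, labels, copies=5, scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Xa, ya = [X], [labels]
    for _ in range(copies):
        Xa.append(X + rng.normal(0.0, scale, size=X.shape))  # jittered points
        ya.append(labels)                                    # labels unchanged
    return np.vstack(Xa), np.concatenate(ya)
```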
Practical Techniques for
Improving Backpropagation
We cannot initialize the weights to 0.
Initializing weights = uniform learning ->
choose weights randomly from a
single distribution:
Input-to-hidden weights: −1/√d < w_ji < +1/√d,
where d is the number of input units
Hidden-to-output weights: −1/√n_H < w_kj < +1/√n_H,
where n_H is the number of hidden
units
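A sketch of this initialization, drawing each layer's weights uniformly from the stated ranges (using the common −1/√d to +1/√d convention, where d is the fan-in of the layer).

```python
import numpy as np

def init_weights(d, n_H, c, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1/np.sqrt(d),   1/np.sqrt(d),   size=(n_H, d))  # input-to-hidden
    W2 = rng.uniform(-1/np.sqrt(n_H), 1/np.sqrt(n_H), size=(c, n_H))  # hidden-to-output
    return W1, W2

W1, W2 = init_weights(d=4, n_H=5, c=2)
```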
Practical Techniques for
Improving Backpropagation
Learning Rates
Demo Chapter 9
The optimal rate
Practical Techniques for
Improving Backpropagation
Momentum: allows
the network to learn
more quickly when
plateaus in the
error surface exist.
Demo chapter 12.
The momentum parameter is typically about 0.9.
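A sketch of a momentum update, blending the current gradient step with the previous weight change (momentum parameter around 0.9, as above; the learning rate is an illustrative assumption).

```python
import numpy as np

def momentum_step(w, delta_prev, grad, eta=0.1, alpha=0.9):
    # Delta w(n) = -eta * g(n) + alpha * Delta w(n-1)
    delta = -eta * grad + alpha * delta_prev
    return w + delta, delta

w, d = np.array([1.0]), np.zeros(1)
w, d = momentum_step(w, d, np.array([1.0]))  # first step: plain gradient step
# on a plateau (grad ~ 0) the alpha*delta_prev term keeps the weights moving
```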
Practical Techniques for
Improving Backpropagation
Weight decay to avoid overfitting = to
start with a network with too many
weights and decay all the weights
during training:
w_new = w_old(1 − ε), where 0 < ε < 1
The weights that are not needed for
reducing the error function become
smaller and smaller -> eliminated
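The decay rule above, w_new = w_old(1 − ε), applied directly; ε here is an illustrative small constant.

```python
import numpy as np

def decay(w, eps=0.01):
    # w_new = w_old * (1 - eps), 0 < eps < 1
    return w * (1.0 - eps)

w = np.array([2.0, -1.0])
for _ in range(100):
    w = decay(w)   # weights not reinforced by the error gradient shrink toward zero
```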
Practical Techniques for
Improving Backpropagation
If we have insufficient
training data for the
desired classification
accuracy, learning
with hints can help:
add output units for
addressing an ancillary
problem, one different
from but related to the
specific classification
problem at hand.
Practical Techniques for
Improving Backpropagation
Stopped Training
Number of hidden layers: typically 2-3.
Criterion Function: cross entropy,
Minkowski Error
Practical Techniques for
Improving Backpropagation
Second-order methods
to speed up
learning:
Newton’s Method,
demo chapter 9
Quickprop
Conjugate Gradient
Descent requires batch
training, demo chapter
9
Practical Techniques for
Improving Backpropagation
2) Newton’s method
The idea is to minimize the quadratic
approximation of the cost function E(w) around
the current point w(n),
using a second-order Taylor series expansion
of the cost function around the point w(n):
ΔE(w(n)) ≈ gᵀ(n)Δw(n) + ½ Δwᵀ(n)H(n)Δw(n)
g(n) is the m-by-1 gradient vector of the cost
function E(w) evaluated at the point w(n). The
matrix H(n) is the m-by-m Hessian matrix of
E(w) (second derivatives), H = ∇²E(w)
Practical Techniques for
Improving Backpropagation
H = ∇²E(w) requires the cost function E(w) to
be twice continuously differentiable with
respect to the elements of w.
Differentiating ΔE(w(n)) ≈ gᵀ(n)Δw(n) +
½ Δwᵀ(n)H(n)Δw(n) with respect to Δw, the
change ΔE(n) is minimized when
g(n) + H(n)Δw(n) = 0 -> Δw(n) = −H⁻¹(n)g(n)
w(n+1) = w(n) + Δw(n)
w(n+1) = w(n) − H⁻¹(n)g(n)
where H⁻¹(n) is the inverse of the Hessian of
E(w).
Practical Techniques for
Improving Backpropagation
Newton’s method converges quickly
asymptotically and does not exhibit the
zigzagging behavior of steepest descent.
Newton’s method requires the
Hessian H(n) to be a positive
definite matrix for all n!
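On a quadratic cost the Newton update reaches the minimum in a single step, which is the sense in which it converges quickly; a sketch with an illustrative positive definite Hessian.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # Hessian H, positive definite
b = np.array([1.0, 1.0])
# E(w) = 0.5 * w^T A w - b^T w, so g(n) = A w(n) - b and H(n) = A

w = np.zeros(2)
g = A @ w - b
w = w - np.linalg.solve(A, g)   # w(n+1) = w(n) - H^{-1}(n) g(n)
# for a quadratic cost this single step lands on the minimizer A^{-1} b
```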
Practical Techniques for
Improving Backpropagation
3) Gauss-Newton Method
It is applicable to a cost function that is expressed
as the sum of error squares:
E(w) = ½ Σᵢ₌₁ⁿ e²(i)
Note that all the error terms are
calculated on the basis of a weight vector w that is
fixed over the entire observation interval 1 ≤ i ≤ n.
The error signal e(i) is a function of the adjustable
weight vector w. Given an operating point w(n), we
linearize the dependence of e(i) on w by writing
e′(i, w) = e(i) + [∂e(i)/∂w]ᵀ|w=w(n) (w − w(n)),  i = 1, 2, ..., n
Practical Techniques for
Improving Backpropagation
e′(n, w) = e(n) + J(n)(w − w(n))
where e(n) is the error vector e(n) =
[e(1), e(2), ..., e(n)]ᵀ and J(n) is the n-by-m
Jacobian matrix of e(n). (The Jacobian J(n) is
the transpose of the m-by-n gradient matrix
∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)].)
w(n+1) = arg min_w {½‖e′(n, w)‖²}
½‖e′(n, w)‖² = ½‖e(n)‖² + eᵀ(n)J(n)(w − w(n)) +
½(w − w(n))ᵀJᵀ(n)J(n)(w − w(n))
Differentiating the expression with respect to w and
setting the result equal to zero:
Practical Techniques for
Improving Backpropagation
Jᵀ(n)e(n) + Jᵀ(n)J(n)(w − w(n)) = 0
w(n+1) = w(n) − (Jᵀ(n)J(n))⁻¹Jᵀ(n)e(n)
The Gauss-Newton method requires only the Jacobian
matrix of the error vector e(n).
For the Gauss-Newton iteration to be
computable, the matrix product Jᵀ(n)J(n) must
be nonsingular. Jᵀ(n)J(n) is always nonnegative
definite, but to ensure that it is nonsingular, the
Jacobian J(n) must have row rank n. -> add
the diagonal matrix δI to the matrix Jᵀ(n)J(n),
where the parameter δ is a small positive constant.
Practical Techniques for
Improving Backpropagation
Jᵀ(n)J(n) + δI is positive definite for all n.
-> The Gauss-Newton method is
implemented in the following form:
w(n+1) = w(n) − (Jᵀ(n)J(n) + δI)⁻¹Jᵀ(n)e(n)
This is the solution to the modified cost
function:
E(w) = ½{δ‖w − w(0)‖² + Σᵢ₌₁ⁿ e²(i)}
where w(0) is the initial value of w.
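A sketch of the regularized Gauss-Newton step for a linear least-squares residual e(w) = Xw − t (so the Jacobian is simply X); the data and the value of δ are illustrative.

```python
import numpy as np

def gauss_newton_step(w, X, t, delta=1e-6):
    e = X @ w - t                             # error vector e(n)
    J = X                                     # Jacobian of e(w) = Xw - t
    H = J.T @ J + delta * np.eye(len(w))      # J^T J + delta*I, positive definite
    return w - np.linalg.solve(H, J.T @ e)    # regularized Gauss-Newton update

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 2.0, 3.0])
w = gauss_newton_step(np.zeros(2), X, t)      # one step solves a linear problem
```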
Regularization, Complexity
Adjustment and Pruning
If we have too many
weights and train too
long -> overfitting
Wald Statistic: we can
estimate the
importance of a
parameter in a model
The Optimal Brain
Damage, The Optimal
Brain Surgeon:
eliminate the
parameter having the
least importance
Summary
Multilayer nonlinear neural networks trained
by gradient descent methods such as
backpropagation perform a maximum-
likelihood estimation of the weight values in
the model defined by the network topology.
The nonlinearity f(net) at the hidden units allows the
networks to form an arbitrary decision
boundary, so long as there are sufficiently
many hidden units.