Lecture 2: Multilayer Perceptrons
CS460: Deep Learning
What is a neural network?
A neural network is a machine that is
designed to model the way in which
the brain performs a particular task
or function of interest.
To achieve good performance, neural
networks employ a massive
interconnection of simple computing
cells referred to as "neurons" or
"processing units."
Inspired by humans
The brain is a highly complex,
nonlinear, and parallel computer. It
has the capability to organize its
neurons, so as to perform certain
computations (e.g., pattern
recognition, perception, and motor
control) many times faster than the
fastest digital computer in existence
today.
Biological Neural
Networks
[Figure: biological neurons, with the dendrites, soma, axon, and synapse labeled]
Modeling the single
neuron
Learning in simple
neurons
If we have two groups of objects, one
group of several written A's, and the
other of B's, we may want our
neuron to tell the A's from the B's, as
in the figure.
We want it to output a 1 when an A
is presented and a 0 when it sees a
B.
Biology analogy
Biological    Artificial
Soma          Node/neuron
Dendrites     Input
Axon          Output
Synapse       Weight
The perceptron
The simplest kind of neural network is a single-layer
perceptron network, which consists of a single layer
of output nodes; the inputs are fed directly to the
outputs via a series of weights. The sum of the
products of the weights and the inputs is calculated in
each node, and if the value is above some threshold
the neuron fires and takes the activated value;
otherwise it takes the deactivated value.
Neurons with this kind of activation function are also
called artificial neurons or linear threshold units.
In the literature the term perceptron often refers to
networks consisting of just one of these units.
The perceptron (cont’d)
The perceptron is an algorithm for learning
a binary classifier called a
threshold function: a function that maps its
input x (a real-valued vector) to an output
value f(x) (a single binary value):
$$ f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} $$
where w is a vector of real-valued
weights, w · x is the dot product
$\sum_{i=1}^{m} w_i x_i$,
m is the number of inputs to the
perceptron, and b is the bias.
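The threshold function can be sketched in a few lines. This is a minimal illustration, assuming the usual convention that the unit outputs 1 when w · x + b > 0 and 0 otherwise; the function name `perceptron_output` and the AND weights are hypothetical choices, not part of the lecture.

```python
def perceptron_output(w, x, b):
    """Threshold unit: 1 if the weighted sum plus bias is positive, else 0."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if weighted_sum + b > 0 else 0


# Illustrative (assumed) weights that make the unit compute logical AND.
w_and = [1.0, 1.0]
b_and = -1.5
and_outputs = [perceptron_output(w_and, [x1, x2], b_and)
               for x1 in (0, 1) for x2 in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above the threshold, so the unit fires for that input alone.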
The perceptron (cont’d)
Perceptrons can be trained by a simple
learning algorithm that is usually called
the delta rule. It calculates the errors
between calculated output and sample
output data, and uses this to create an
adjustment to the weights, thus
implementing a form of gradient descent.
Single-layer perceptrons are only capable
of learning linearly separable patterns.
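The following sketch shows one common form of this error-driven update for a threshold unit, under the assumption that each weight is changed by learning rate × (target − output) × input. The names `predict`, `train_perceptron`, `AND_SAMPLES`, and `XOR_SAMPLES` are illustrative. It trains on AND (linearly separable) and on XOR (not linearly separable) to contrast the two cases.

```python
def predict(w, b, x):
    """Threshold unit: 1 if w . x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(samples, lr=0.1, max_epochs=100):
    """Error-driven weight updates; stops early once an epoch has no errors."""
    n_inputs = len(samples[0][0])
    w, b = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            err = target - predict(w, b, x)   # -1, 0, or +1
            if err != 0:
                errors += 1
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        if errors == 0:
            break
    return w, b


AND_SAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
XOR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

w_and, b_and = train_perceptron(AND_SAMPLES)
w_xor, b_xor = train_perceptron(XOR_SAMPLES)
and_correct = all(predict(w_and, b_and, x) == t for x, t in AND_SAMPLES)
xor_correct = all(predict(w_xor, b_xor, x) == t for x, t in XOR_SAMPLES)
```

After training, `and_correct` is true, while `xor_correct` is false regardless of how long training runs, because no single line separates the XOR classes.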
Linearly Separable
XOR Function
It is impossible for a single-layer
perceptron network to learn an
XOR function, because XOR is not linearly separable.
Non-linear
transformations
A single-layer neural network can
compute a continuous output instead
of a step function. A common choice
is the so-called logistic function:
$$ f(x) = \frac{1}{1 + e^{-x}} $$
Non-linear
transformations
The logistic function is one of the
family of functions called
sigmoid functions. It has a
continuous derivative, which allows
it to be used in backpropagation.
This function is also preferred
because its derivative is easily
calculated:
$$ f'(x) = f(x)\,(1 - f(x)) $$
where f(x) is the logistic function.
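As a quick check of the derivative property, the sketch below compares the closed form f(x)(1 − f(x)) with a central finite-difference estimate. The function names and the step size `h` are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Closed-form derivative: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


difference = abs(sigmoid_prime(1.2) - numeric_derivative(sigmoid, 1.2))
```

The two values agree to within numerical precision, which is why the derivative can be computed from the cached output alone.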
Sigmoid function
Multi Layer Perceptron
(MLP)
An MLP is a class of feedforward (acyclic) artificial neural
network (ANN).
Each neuron in one layer has directed connections to
the neurons of the subsequent layer. In many
applications the units of these networks apply a
sigmoid function as an activation function.
MLPs are the most basic deep neural networks,
composed of a series of fully connected layers.
Each new layer is a set of nonlinear functions of a
weighted sum of all outputs (fully connected) from the
prior one.
Multilayer feed-forward networks, given enough hidden
units and enough training samples, can closely
approximate any continuous function.
The Architecture
MLP with one hidden layer
[Figure: inputs x1, x2, x3 enter the input layer and pass through a hidden layer to an output layer that produces Y1; each processing element (PE) applies a weighted sum (S) followed by a transfer function (f)]
MLP processing
PE: Processing Element (or neuron)
(a) Single neuron, inputs x1, x2 with weights w1, w2:
$$ Y = X_1 W_1 + X_2 W_2 $$
(b) Multiple neurons, inputs x1, x2 with weights w11, w12, w21, w22, w23:
$$ Y_1 = X_1 W_{11} + X_2 W_{21} $$
$$ Y_2 = X_1 W_{12} + X_2 W_{22} $$
$$ Y_3 = X_2 W_{23} $$
MLP processing (cont’d)
Inputs: X1 = 3, X2 = 1, X3 = 2
Weights: W1 = 0.2, W2 = 0.4, W3 = 0.1
Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2
Transfer function: YT = 1/(1 + e^(-1.2)) = 0.77
[Figure: a processing element (PE) receiving X1, X2, X3 and producing Y = 1.2, YT = 0.77]
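The worked example can be reproduced directly. This is a minimal sketch using the slide's inputs (3, 1, 2) and the weights implied by its summation (0.2, 0.4, 0.1); the function name `processing_element` is an illustrative choice.

```python
import math


def processing_element(inputs, weights):
    """Return (weighted sum, logistic transfer of that sum) for one PE."""
    y = sum(x * w for x, w in zip(inputs, weights))
    y_t = 1.0 / (1.0 + math.exp(-y))
    return y, y_t


y, y_t = processing_element([3, 1, 2], [0.2, 0.4, 0.1])
print(round(y, 2), round(y_t, 2))  # 1.2 0.77
```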
Designing the MLP
Before training can begin, the user must decide on
the network topology by specifying:
the number of units in the input layer,
the number of hidden layers (if more than one), the
number of units in each hidden layer, and
the number of units in the output layer.
Normalizing the input values (e.g., to between 0.0 and 1.0)
for each attribute measured in the training tuples
helps speed up the learning phase and reduces the
risk of exploding gradients.
Discrete-valued attributes may be encoded such
that there is one input unit per domain value.
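Both preprocessing steps can be sketched briefly, assuming min-max scaling for numeric attributes and one-hot encoding for discrete ones. The helper names and the sample values (`ages`, the colour domain) are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, domain):
    """One input unit per domain value: 1 for the matching value, else 0."""
    return [1 if value == d else 0 for d in domain]


ages = [20, 35, 50]                    # hypothetical numeric attribute
scaled_ages = min_max_normalize(ages)
colour_inputs = one_hot("green", ["red", "green", "blue"])
print(scaled_ages, colour_inputs)      # [0.0, 0.5, 1.0] [0, 1, 0]
```

With this encoding, a three-valued attribute occupies three input units, which should be counted when choosing the size of the input layer.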
Choice of the transfer function
Transformation (transfer) function
Linear function
Sigmoid (logistic activation) function, range [0, 1]
Hyperbolic tangent function, range [-1, 1]
MLP: Design issues
Neural networks can be used for both
classification (to predict the class label
of a given tuple) and numeric prediction
(to predict a continuous-valued output).
For classification, one output unit may
be used to represent two classes (where
the value 1 represents one class, and
the value 0 represents the other).
If there are more than two classes, then
one output unit per class is used.
MLP: Design issues
There are no clear rules as to the “best”
number of hidden layer units.
Network design is a trial-and-error process, and
the chosen design may affect the accuracy of the
resulting trained network.
The initial values of the weights may also affect
the resulting accuracy.
If a trained network’s accuracy is not
considered acceptable, it is
common to repeat the training process with
a different network topology or
a different set of initial weights.
The XOR function -
revisited
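One way to see why a hidden layer resolves the XOR problem is a two-layer network with hand-picked weights. This is a minimal sketch assuming step-threshold units; the weights and the names `step` and `xor_mlp` are illustrative and not taken from the lecture.

```python
def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """XOR as 'OR but not AND' using two hidden units and one output unit."""
    h_or = step(x1 + x2 - 0.5)        # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)       # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # fires if OR is on and AND is off


xor_table = [xor_mlp(a, b) for a in (0, 1) for b in (0, 1)]
print(xor_table)  # [0, 1, 1, 0]
```

Each hidden unit draws one line in the input plane, and the output unit combines the two half-planes, which a single-layer network cannot do.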
MLP Box Office prediction
example
The Learning algorithm
It adjusts the weights of the
machine, in order to minimize the
average squared error.
Learning in MLP
The learning algorithm procedure
1. Initialize weights with random values and set
other network parameters
2. Read in the inputs and the desired outputs
3. Compute the actual output (by working
forward through the layers)
4. Compute the error (difference between the
actual and desired output)
5. Change the weights by working backward
through the hidden layers
6. Repeat steps 2-5 until weights stabilize
Learning in MLP (cont’d)
Backpropagation learns by iteratively
processing a data set of training tuples,
comparing the network’s prediction for each
tuple with the actual known target value.
The target value may be the known class label
of the training tuple (for classification
problems) or a continuous value (for numeric
prediction).
For each training tuple, the weights are
modified so as to minimize the mean-squared
error between the network’s prediction and the
actual target value.
Learning in MLP (cont’d)
These modifications are made in the
“backwards” direction (i.e., from the
output layer, through each hidden
layer, down to the first hidden layer),
hence the name backpropagation.
Although it is not guaranteed, in
general the weights will eventually
converge, and the learning process
stops.
MLPs Bottlenecks
1. Dimensionality issue
Rule of thumb: The number of
training samples should be at least 5
to 10 times the number of weights in
the network.
Otherwise, the network is prone to
overfitting.
2. Overfitting
2. Overfitting (cont’d)
3. The black-box syndrome
A common criticism of ANNs: the lack of
transparency/explainability
Answer: sensitivity analysis
Conducted on a trained ANN
The inputs are perturbed while the
relative change on the output is
measured/recorded
Results illustrate the relative importance
of input variables
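The perturbation procedure can be sketched as follows. This is a minimal illustration, assuming a one-at-a-time perturbation of size `delta` and a model exposed as a plain function; `toy_model` is a hypothetical stand-in for a trained network, with made-up coefficients.

```python
import math


def sensitivity(model, x, delta=0.01):
    """Perturb each input in turn and record the change in output per unit change."""
    base = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += delta
        scores.append(abs(model(perturbed) - base) / delta)
    return scores


def toy_model(x):
    """Hypothetical stand-in for a trained network (made-up coefficients)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] + 0.1 * x[1])))


scores = sensitivity(toy_model, [0.0, 0.0])
most_important = scores.index(max(scores))
```

Here the first input dominates the output, so it receives the larger score. In practice the perturbation would usually be repeated over many data points and averaged.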
sensitivity analysis
4. Vanishing gradient
problem
In machine learning, the vanishing gradient problem is
encountered when training artificial neural networks with
gradient-based learning methods and backpropagation. In such
methods, during each iteration of training each of the neural
network's weights receives an update proportional to the
partial derivative of the error function with respect to the current
weight. The problem is that in some cases, the gradient will be
vanishingly small, effectively preventing the weight from changing
its value. In the worst case, this may completely stop the neural
network from further training. As one example of the cause,
traditional activation functions such as the hyperbolic tangent
function have gradients in the range (0,1], and backpropagation
computes gradients by the chain rule. This has the effect of
multiplying n of these small numbers to compute gradients of the
early layers in an n-layer network, meaning that the gradient (error
signal) decreases exponentially with n while the early layers train
very slowly.
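The exponential shrinkage is easy to reproduce numerically. The sketch below multiplies tanh derivatives along a chain of layers, under the simplifying assumptions that every unit sees the same pre-activation value (1.0 here, chosen arbitrarily) and that the weight factors in the chain rule are ignored.

```python
import math


def tanh_prime(z):
    """Derivative of tanh: 1 - tanh(z)^2, which lies in (0, 1]."""
    return 1.0 - math.tanh(z) ** 2


def chained_gradient(pre_activations):
    """Product of activation derivatives along one path through the layers."""
    gradient = 1.0
    for z in pre_activations:
        gradient *= tanh_prime(z)
    return gradient


shallow = chained_gradient([1.0] * 2)    # 2-layer path
deep = chained_gradient([1.0] * 20)      # 20-layer path
```

With these assumed values each layer contributes a factor of roughly 0.42, so the 20-layer product is several orders of magnitude smaller than the 2-layer one.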
Building Neural
Networks
The architecture of a neural network is driven
by the task it is intended to address:
classification, regression, clustering,
general optimization, association, …
Most popular architecture: Feedforward
multi-layered perceptron with
backpropagation learning algorithm
Used for both classification and regression
type problems
Others – Recurrent, self-organizing feature
maps, Hopfield networks, …
Development of NNs
Backpropagation
Multi-layer networks use a variety of learning techniques, the most
popular being back-propagation.
The output values are compared with the correct answer to compute the
value of some predefined error-function. By various techniques, the error
is then fed back through the network.
The algorithm adjusts the weights of each connection in order to reduce
the value of the error function by some small amount.
After repeating this process for a sufficiently large number of training
cycles, the network will usually converge to some state where the error
of the calculations is small.
In this case, one would say that the network has learned a certain target
function. To adjust weights properly, one applies a general method for
non-linear optimization that is called gradient descent. For this, the
network calculates the derivative of the error function with respect to the
network weights, and changes the weights such that the error decreases
(thus going downhill on the surface of the error function).
For this reason, back-propagation can only be applied on networks with
differentiable activation functions.
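The "going downhill" idea can be illustrated on a one-parameter error function. This is a minimal sketch, assuming a made-up error E(w) = (w − 3)^2 rather than a real network error; the function name and the learning rate are illustrative.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient of the error function."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


# Hypothetical error E(w) = (w - 3)^2, whose derivative is dE/dw = 2 (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Each step moves `w` a fraction of the way toward the minimum at 3. Backpropagation applies the same rule to every weight, using the chain rule to obtain each partial derivative.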
The steps of
backpropagation
Initialize the weights:
The weights in the network are
initialized to small random numbers
(e.g., ranging from −1.0 to 1.0, or −0.5
to 0.5).
Each unit has a bias associated with it,
as explained later.
The biases are similarly initialized to
small random numbers.
Each training tuple, X, is processed by
the following steps.
Propagate the inputs
forward:
First, the training tuple is fed to the
network’s input layer.
The inputs pass through the input units,
unchanged.
That is, for an input unit, j, its output, Oj, is
equal to its input value, Ij.
Next, the net input and output of each unit in
the hidden and output layers are computed.
The net input to a unit in the hidden or
output layers is computed as a linear
combination of its inputs.
The steps of
backpropagation
Propagate the inputs forward:
Each hidden layer or output layer unit has a
number of inputs to it that are, in fact, the
outputs of the units connected to it in the
previous layer.
Propagate the inputs
forward
To compute the net input to the unit, each input
connected to the unit is multiplied by its
corresponding weight, and this is summed.
Given a unit j in a hidden or output layer, the net
input, Ij, to unit j is
$$ I_j = \sum_i w_{ij} O_i + \theta_j $$
where wij is the weight of the connection from unit i
in the previous layer to unit j; Oi is the output of
unit i from the previous layer; and θj is the bias of
unit j.
The bias acts as a threshold in that it serves to vary
the activity of the unit.
Propagate the inputs
forward
Each unit in the hidden and output layers takes its net
input and then applies an activation function to it.
The function symbolizes the activation of the neuron
represented by the unit.
The logistic, or sigmoid, function is used. Given the
net input Ij to unit j, then Oj, the output of unit j, is
computed as
$$ O_j = \frac{1}{1 + e^{-I_j}} $$
The logistic function is nonlinear and differentiable,
allowing the backpropagation algorithm to model
classification problems that are linearly inseparable.
Propagate the inputs
forward
We compute the output values, Oj, for
each hidden layer, up to and including
the output layer, which gives the
network’s prediction.
In practice, it is a good idea to cache (i.e.,
save) the intermediate output values at
each unit as they are required again later
when back propagating the error.
This trick can substantially reduce the
amount of computation required.
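A forward pass that caches every layer's outputs might look like the sketch below. It assumes logistic units, a net input of the form (sum of weight × previous output) + bias, and a hypothetical data layout in which `weights[i][j]` is the weight from unit i in the previous layer to unit j. The 2-2-1 network and its weights are made up.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Return the cached outputs of every layer, input layer first.

    layers is a list of (weights, biases) pairs, one per hidden/output layer.
    """
    outputs = [list(x)]                # input units pass values through unchanged
    for weights, biases in layers:
        prev = outputs[-1]
        layer_out = []
        for j, theta_j in enumerate(biases):
            net_j = sum(weights[i][j] * prev[i] for i in range(len(prev))) + theta_j
            layer_out.append(logistic(net_j))
        outputs.append(layer_out)      # cache for the backward pass
    return outputs


# Hypothetical 2-2-1 network with made-up weights and biases.
layers = [
    ([[0.5, -0.5], [0.5, -0.5]], [0.0, 0.0]),   # input -> hidden
    ([[1.0], [1.0]], [-1.0]),                   # hidden -> output
]
cache = forward([1.0, 1.0], layers)
prediction = cache[-1][0]
```

The last entry of `cache` is the network's prediction, and the earlier entries are the values the backward pass reuses instead of recomputing.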
Back propagate the
error
The error is propagated backward by updating
the weights and biases to reflect the error of
the network’s prediction. For a unit j in the
output layer, the error Errj is computed by
$$ Err_j = O_j (1 - O_j)(T_j - O_j) $$
where Oj is the actual output of unit j, and Tj is
the known target value of the given training
tuple.
Note that Oj(1−Oj) is the derivative of the
logistic function.
Back propagate the
error
To compute the error of a hidden layer unit j,
the weighted sum of the errors of the units
connected to unit j in the next layer is
considered.
The error of a hidden layer unit j is
$$ Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk} $$
where wjk is the weight of the connection from
unit j to a unit k in the next higher layer, and
Errk is the error of unit k.
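The two error computations can be sketched as small functions. This assumes logistic units, so that the derivative factor is Oj(1 − Oj), with the output error scaled by (Tj − Oj) and the hidden error by the weighted sum of next-layer errors. The function names, the `next_weights[j][k]` layout, and the numbers are illustrative.

```python
def output_errors(outputs, targets):
    """Err_j = O_j (1 - O_j) (T_j - O_j) for each output unit."""
    return [o * (1.0 - o) * (t - o) for o, t in zip(outputs, targets)]


def hidden_errors(hidden_outputs, next_weights, next_errors):
    """Err_j = O_j (1 - O_j) * sum_k Err_k w_jk for each hidden unit.

    next_weights[j][k] is the weight from hidden unit j to unit k in the next layer.
    """
    errors = []
    for j, o_j in enumerate(hidden_outputs):
        back = sum(next_errors[k] * next_weights[j][k]
                   for k in range(len(next_errors)))
        errors.append(o_j * (1.0 - o_j) * back)
    return errors


# Made-up values: one output unit, two hidden units.
out_err = output_errors([0.5], [1.0])
hid_err = hidden_errors([0.5, 0.8], [[2.0], [-1.0]], out_err)
```

A hidden unit connected through a negative weight receives an error of the opposite sign, as the second hidden unit does here.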
Back propagate the
error
The weights and biases are updated to reflect the
propagated errors.
Weights are updated by the following equations,
where delta(wij), written \Delta w_{ij} below, is the change in weight wij:
$$ \Delta w_{ij} = (l)\, Err_j\, O_i $$
$$ w_{ij} = w_{ij} + \Delta w_{ij} $$
The variable l is the learning rate, a constant typically
having a value between 0.0 and 1.0.
The learning rate helps avoid getting stuck at a local
minimum in decision space. If the learning rate is too
small, then learning will occur at a very slow pace. If
the learning rate is too large, then the weights may
oscillate between inadequate solutions.
Back propagate the
error
Biases are updated by the following equations, where
delta(θj), written \Delta\theta_j below, is the change in bias θj:
$$ \Delta\theta_j = (l)\, Err_j $$
$$ \theta_j = \theta_j + \Delta\theta_j $$
Updating the weights and biases after the
presentation of each tuple is referred to as case updating.
Alternatively, the weight and bias increments could
be accumulated in variables, so that the weights and
biases are updated after all the tuples in the training
set have been presented (called epoch updating).
Batch/mini-batch updating: the weights and biases are
updated after several samples.
One iteration through the training set is an epoch.
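Putting the pieces together, one case update for a single training tuple might look like the sketch below. It assumes logistic units and the update rule "change = learning rate × error of the receiving unit × output of the sending unit" for weights, and "learning rate × error" for biases. The data layout (`weights[i][j]`), the 2-2-1 network, its initial weights, and the tuple are all made up for illustration.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Cached outputs of every layer, input layer first."""
    outputs = [list(x)]
    for weights, biases in layers:
        prev = outputs[-1]
        outputs.append([
            logistic(sum(weights[i][j] * prev[i] for i in range(len(prev))) + biases[j])
            for j in range(len(biases))
        ])
    return outputs


def case_update(x, target, layers, l=0.5):
    """One in-place case update: forward pass, backward errors, then updates."""
    outputs = forward(x, layers)
    # Errors of the output layer, then of each hidden layer working backward.
    errors = [[o * (1 - o) * (t - o) for o, t in zip(outputs[-1], target)]]
    for depth in range(len(layers) - 1, 0, -1):
        next_weights, next_err = layers[depth][0], errors[0]
        errors.insert(0, [
            o_j * (1 - o_j) * sum(next_err[k] * next_weights[j][k]
                                  for k in range(len(next_err)))
            for j, o_j in enumerate(outputs[depth])
        ])
    # All errors were computed with the old weights; now apply the changes.
    for depth, (weights, biases) in enumerate(layers):
        for j, err_j in enumerate(errors[depth]):
            for i, o_i in enumerate(outputs[depth]):
                weights[i][j] += l * err_j * o_i
            biases[j] += l * err_j


# Hypothetical 2-2-1 network and a single training tuple with target 1.
layers = [
    ([[0.2, -0.3], [0.4, 0.1]], [-0.4, 0.2]),
    ([[-0.3], [-0.2]], [0.1]),
]
x, target = [1.0, 0.0], [1.0]
before = forward(x, layers)[-1][0]
case_update(x, target, layers)
after = forward(x, layers)[-1][0]
```

For this tuple the prediction moves from about 0.47 toward the target of 1 after a single update. Epoch or mini-batch updating would accumulate the same increments over several tuples before applying them.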
Terminating condition
Training stops when:
All delta(wij) in the previous epoch are so small
as to be below some specified threshold, or
The percentage of tuples misclassified in the
previous epoch is below some threshold, or
A pre-specified number of epochs has expired.
In practice, several hundreds of thousands
of epochs may be required before the
weights will converge.
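The three stopping rules can be sketched in a small training loop. To keep the example short it trains a single logistic unit with case updating rather than a full MLP; the thresholds, learning rate, seed, function name, and the AND data set are all illustrative assumptions.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_unit(samples, l=0.5, max_epochs=1000, delta_tol=1e-4,
               error_tol=0.05, seed=0):
    """Train one logistic unit; return (weights, bias, epochs used, stop reason)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(samples[0][0]))]
    theta = rng.uniform(-0.5, 0.5)
    for epoch in range(1, max_epochs + 1):
        largest_delta = 0.0
        misclassified = 0
        for x, t in samples:
            o = logistic(sum(wi * xi for wi, xi in zip(w, x)) + theta)
            if (o >= 0.5) != (t == 1):          # counted while weights still change
                misclassified += 1
            err = o * (1 - o) * (t - o)
            for i, xi in enumerate(x):
                delta = l * err * xi
                w[i] += delta
                largest_delta = max(largest_delta, abs(delta))
            theta += l * err
        if largest_delta < delta_tol:
            return w, theta, epoch, "weight changes below threshold"
        if misclassified / len(samples) < error_tol:
            return w, theta, epoch, "misclassification below threshold"
    return w, theta, max_epochs, "epoch limit reached"


AND_SAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, theta, epochs_used, reason = train_unit(AND_SAMPLES)
```

Which rule fires first depends on the data and the thresholds. The epoch limit acts as a safeguard when neither of the other two conditions is ever met.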