Introduction To Artificial Neural Networks and Perceptron
• The inputs and output are now numbers (instead of binary on/off
values) and each input connection is associated with a weight.
• wi,j is the connection weight between the ith input neuron and the
jth output neuron.
• xi is the ith input value of the current training instance.
• ŷj is the output of the jth output neuron for the current training instance.
• yj is the target output of the jth output neuron for the current training
instance.
• η is the learning rate.
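For reference, the Perceptron weight-update (learning) rule that uses these quantities is commonly written, in LaTeX notation, as

    w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i

i.e., each weight is nudged in proportion to its input xi and to the output error (yj – ŷj).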
• The decision boundary of each output neuron is linear, so Perceptrons
are incapable of learning complex patterns (just like Logistic Regression
classifiers).
The Exclusive OR problem
Minsky & Papert (1969) offered a solution to the XOR problem by
combining perceptron unit responses using a second layer of
units: piecewise linear classification using an MLP with
threshold (perceptron) units.
• In particular, an MLP can solve the XOR problem: with inputs (0, 0) or (1, 1)
the network outputs 0, and with inputs (0, 1) or (1, 0) it outputs 1
(see the sketch below).
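A minimal Python sketch of this idea, hand-wiring a two-layer network of threshold (step) units that computes XOR. The particular weights and thresholds are one illustrative choice (an OR unit, an AND unit, and an "OR but not AND" output unit), not necessarily those of the original figure:

def step(z):
    # Threshold (perceptron) unit: fires iff its weighted input is >= 0.
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit 1: x1 OR x2
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: x1 AND x2
    return step(h_or - h_and - 0.5)  # output: OR and not AND  ->  XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_mlp(a, b))       # prints 0, 1, 1, 0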
Multi-Layer Perceptron and Its Properties
• For each training instance the backpropagation algorithm:
1. First makes a prediction (forward pass),
2. Measures the error,
3. Then goes through each layer in reverse to
measure the error contribution from each
connection (reverse pass),
4. and finally slightly tweaks the connection weights
to reduce the error (Gradient Descent step).
• This reverse pass efficiently measures the error gradient across all the
connection weights in the network by propagating the error gradient
backward through the network (hence the name of the algorithm).
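A minimal NumPy sketch of these four steps for a tiny network with one sigmoid hidden layer and a squared-error loss. The 2-2-1 architecture, the learning rate, and the XOR training data are assumptions made for this illustration, not taken from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)   # output-layer weights and biases
eta = 0.5                                       # learning rate

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

for epoch in range(10000):
    for x, t in zip(X, y):
        # 1. Forward pass: make a prediction.
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # 2. Measure the error.
        err = o - t
        # 3. Reverse pass: propagate the error gradient backward, layer by layer.
        delta_o = err * o * (1 - o)               # local gradient at the output unit
        delta_h = (delta_o @ W2.T) * h * (1 - h)  # local gradients at the hidden units
        # 4. Gradient Descent step: slightly tweak the connection weights.
        W2 -= eta * np.outer(h, delta_o); b2 -= eta * delta_o
        W1 -= eta * np.outer(x, delta_h); b1 -= eta * delta_h

# After training, outputs should be close to [0, 1, 1, 0]; how close depends on
# the random initialisation and the number of epochs.
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel().round(2))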
Key change to the MLP’s architecture
• For the algorithm to work properly, the step function is replaced
with the logistic function, σ(z) = 1 / (1 + exp(–z)).
• This is because:
• The step function contains only flat segments, so Gradient
Descent cannot move on a flat surface.
• The logistic function has a well-defined nonzero derivative
everywhere, allowing Gradient Descent to make some
progress at every step.
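Concretely, the logistic function’s derivative is strictly positive for every input, which is what gives Gradient Descent a usable slope; in LaTeX notation:

    \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) > 0 \quad \text{for all } z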
• The backpropagation algorithm can also be used with two other
popular activation functions:
• The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1
• It is S-shaped, continuous, and differentiable, but its output
value ranges from –1 to 1 (instead of 0 to 1 in the case of the
logistic function),
• which tends to make each layer’s output more or less
normalized (i.e., centered around 0) at the beginning of
training. This often helps speed up convergence.
• The ReLU function: ReLU(z) = max(0, z).
• It is continuous but unfortunately not differentiable at z = 0
(the slope changes abruptly, which can make Gradient
Descent bounce around).
• In practice it works very well and has the advantage of being
fast to compute.
• Most importantly, the fact that it does not have a maximum
output value also helps reduce some issues during Gradient
Descent.
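A short Python sketch of the three activation functions discussed above, together with two of their derivatives. The ReLU derivative is undefined at z = 0; returning 0 there is merely a common convention adopted in this sketch:

import numpy as np

def sigmoid(z):                 # logistic function: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                    # hyperbolic tangent: output in (-1, 1)
    return 2.0 * sigmoid(2.0 * z) - 1.0   # equivalent to np.tanh(z)

def relu(z):                    # ReLU: output in [0, +inf), cheap to compute
    return np.maximum(0.0, z)

def d_sigmoid(z):               # nonzero everywhere
    s = sigmoid(z)
    return s * (1.0 - s)

def d_relu(z):                  # undefined at z = 0; 0 is used here by convention
    return (z > 0).astype(float)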
Conceptually: Forward Activity - Backward Error
Forward Propagation of Activity
• Step 1: Initialise weights at random, choose a
learning rate η
• Until network is trained:
• For each training example i.e. input pattern and
target output(s):
• Step 2: Do forward pass through net (with fixed
weights) to produce output(s)
• i.e., in Forward Direction, layer by layer:
• Inputs applied
• Multiplied by weights
• Summed
• ‘Squashed’ by sigmoid activation function
• Output passed to each neuron in next layer
• Repeat above until network output(s) produced
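In matrix form, the per-layer computation just described (multiply by weights, sum, squash, pass on) can be written, in LaTeX notation, as

    a^{(\ell)} = \sigma\left(W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}\right), \qquad a^{(0)} = \text{the input pattern},

where W^{(\ell)} and b^{(\ell)} are layer ℓ’s weight matrix and bias vector (the bias vector is an added assumption of this notation).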
Step 3: Back-propagation of error
Compute the error (delta, or local gradient) δk for each output unit.
Layer by layer, compute the error (delta, or local gradient) δj for each
hidden unit by backpropagating the errors (as shown previously).
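For sigmoid units and a squared-error loss, these local gradients take the standard generalized-delta-rule form (tk is the target of output unit k, ok and oj are unit outputs, and wjk is the weight from hidden unit j to output unit k); in LaTeX notation:

    \delta_k = (t_k - o_k)\, o_k (1 - o_k)                 (output unit k)
    \delta_j = o_j (1 - o_j) \sum_k w_{jk}\, \delta_k      (hidden unit j)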
‘Back-prop’ algorithm summary (with NO Maths!)
‘Back-prop’ algorithm summary (with Maths!) (Not Examinable)
MLP/BP: A worked example
Worked example: Forward Pass
Worked example: Backward Pass
Worked example: Update Weights Using Generalized Delta Rule (BP)
Update = LearningFactor · (DesiredOutput − ActualOutput) · Input
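Written with the local gradients δj from Step 3, the same update for a general weight wij (from unit i to unit j) is commonly stated, in LaTeX notation, as

    \Delta w_{ij} = \eta\, \delta_j\, o_i, \qquad w_{ij} \leftarrow w_{ij} + \Delta w_{ij},

where oi is the output of unit i (the value carried into the connection) and η is the learning rate.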
Similarly for all the other weights wij.
Verification that it works
• An MLP is often used for classification, with each output
corresponding to a different binary class
• e.g., spam/ham, urgent/not-urgent, and so on.
• When the classes are exclusive (e.g., classes 0 through 9 for
digit image classification), the output layer is typically
modified by replacing the individual activation functions by a
shared softmax function.
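A minimal Python sketch of such a softmax output layer; subtracting the maximum logit is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(z):
    # z: the output layer's weighted sums (logits), one per class.
    z = z - np.max(z)      # numerical stability only
    e = np.exp(z)
    return e / e.sum()     # non-negative values that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # the highest score gets the largest probability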
• Note that the signal flows only in one direction (from the
inputs to the outputs), so this architecture is an example of
a feedforward neural network (FNN).
• Biological neurons seem to implement a roughly sigmoid (S-shaped)
activation function, so researchers stuck to sigmoid functions for a very
long time.
• But it turns out that the ReLU activation function generally works better
in ANNs.