Lesson 14 ANN Supervised
Supervised models
Lesson 14
Daniele Tonini
[email protected]
Agenda
1. Machine Learning
2. Artificial Neural Networks (ANN)
3. Perceptron
4. Multilayer Perceptron (MLP)
5. Final remarks
Machine Learning
Definition
• «A computer program is said to learn from experience E, with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E» (T. Mitchell, 1997)
• In simpler words ... Machine Learning is the part of artificial intelligence that deals with developing algorithms which, starting from input data and following specific learning processes, create new knowledge
Machine Learning
ML vs. Statistical Modeling
1. Machine Learning focuses on the precision / generalization of predictions, while statistical modeling also places great importance on interpretation and the identification of causal dynamics
2. In statistical modeling particular attention is paid to the hypotheses / assumptions underlying the models (distribution of variables, independence of observations, etc.), while machine learning is less rigorous and formal about these topics
3. Machine Learning uses various methodologies from different disciplines (mathematics, statistics, optimization, text analysis, programming and computer science, etc.), while statistical modeling is more focused on purely statistical aspects
Artificial Neural Networks (ANN)
Overview
An Artificial Neural Network (ANN) is a system loosely modeled on the human brain: it is a processing device whose design was motivated by the structure and functioning of the human brain and its components
A first rough description is as follows:
• An ANN is a network of many simple processors ("units"), each possibly having a small amount of local memory
• The units are connected by communication channels ("connections"), which carry numeric (as opposed to
symbolic) data
• The units operate only on their local data and on the inputs they receive via the connections
• So ANNs are distributed systems, made up of simple processing units (“artificial neurons”)
And also..
• Most neural networks have some sort of "training" rule whereby the weights of connections are adjusted on the basis of presented patterns: knowledge is acquired by the network from its environment through a learning process
• In other words, neural networks "learn" from examples, just like children learn to recognize dogs
from examples of dogs, and exhibit some structural capability for generalization
• Neural networks normally have great potential for parallelism, since the computations of the
components are independent of each other
Artificial Neural Networks (ANN)
Overview: typical tasks
Clustering
A clustering algorithm explores the similarity between patterns and places similar patterns in a cluster
Classification/Pattern recognition
The task of pattern recognition is to assign an input pattern (like a handwritten symbol) to one of many classes. This category includes algorithmic implementations such as associative memory
Function approximation
The task of function approximation is to find an estimate of an unknown function f() subject to noise. Various engineering and scientific disciplines require function approximation
Forecasting
The task is to forecast some future values of time-sequenced data. Prediction has a significant impact on decision support systems. Prediction differs from function approximation by taking the time factor into account:
here the system is dynamic and may produce different results for the same input data depending on the system state (time)
Anomaly detection
Identification of objects, events or observations which do not conform to an expected pattern
Artificial Neural Networks (ANN)
Overview: properties
• The ANN structure consists of n interconnected elementary units, called artificial neurons (or units)
• These processing units are organized in layers: a neural network may have a variable number of neurons
depending also on the number of layers
• The following figure shows a simple multi-layer neural network
[Figure: a simple multi-layer network with an input layer (N11, N12), an intermediate “hidden” layer (N21–N24) and an output layer]
N11 and N12 are the input neurons that receive the input data: these neurons process the input signal, according to a certain function, and distribute the result to the neurons of the next layer
The information is not simply forwarded to the next layer, but is weighted (W11, W12, W13, … are the weights)
The neurons N21, N22, N23, N24 constitute the intermediate layer
Each neuron in the intermediate layer sums up the received inputs, where each input is the product of the output of an input-layer neuron and the weight of the connection
The result of this sum is again processed in the intermediate layer on the basis of a specific function and forwarded to the next layer (that is, in this example, the output layer)
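As a minimal sketch of this flow (the input values, the random weights and the choice of a sigmoid as the neurons' "specific function" are illustrative assumptions, not values from the slide), the weighted-sum-and-transform step looks like:

```python
import numpy as np

def sigmoid(z):
    # squashing function used here as the neurons' "specific function"
    return 1.0 / (1.0 + np.exp(-z))

# two input neurons (N11, N12) feeding four intermediate neurons (N21..N24)
x = np.array([0.5, -1.2])                                # input data (arbitrary example values)
W_input_hidden = np.random.uniform(-1, 1, size=(4, 2))   # W11, W12, W13, ... as a 4x2 matrix
W_hidden_output = np.random.uniform(-1, 1, size=(1, 4))  # hidden-to-output weights

hidden_net = W_input_hidden @ x                  # each intermediate neuron sums its weighted inputs
hidden_out = sigmoid(hidden_net)                 # ...and processes the sum with its function
output = sigmoid(W_hidden_output @ hidden_out)   # result forwarded to the output layer
print(output)
```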
Artificial Neural Networks (ANN)
Structure: Connectivity
Feedforward networks
• Information moves in only one direction, forward, from the input nodes, through the hidden
nodes (if any) and to the output nodes
• Connections between the units do not form a cycle
Recurrent networks
• These have directed cycles in their connection graph (they can have complicated dynamics)
• More biologically realistic
Artificial Neural Networks (ANN)
Structure: Examples
[Figure: a single-layer network, with the input layer connected directly to the output layer]
Artificial Neural Networks (ANN)
Structure: Examples
[Figure: a 2-layer (1-hidden-layer) fully connected network]
The basic computational element is the artificial neuron (also called node or unit); there are different types (or
models) of single artificial neuron, but the general structure/functioning can be presented as follows:
[Figure: a generic artificial neuron with input data, input weights, an input function, an activation function, and a connection to the next layer]
• Each artificial neuron receives input from other units, or from an external source (if it’s in the first layer)
• Its output, in turn, can serve as input to other units (next layer)
*NOTE: in order to insert more flexibility in the unit, it’s also possible to consider a third function called «output function» that transforms the result of the activation function; for our purpose we will assume “output function = identity function”, so that the final output is just the result of the activation function.
Perceptron
Structure
The simplest type of artificial network is called “Perceptron” and it’s a feed-forward net consisting of one single unit
A perceptron takes several inputs 𝑥𝑗 and produces a single binary output ( 𝑦 ) according to a simple “step”
activation function that compares the net input value with a specified threshold 𝜽 (“theta”)
Considering its structure, the perceptron is useful for classification applications (where the target variable is a binary variable)
[Figure: perceptron with inputs x1, x2, x3, …, xp, weights w1, w2, w3, …, wp, threshold θ and binary output y = (0 or 1)]
y = f(net, θ) = 0 if net = Σj wj·xj ≤ θ
y = f(net, θ) = 1 if net = Σj wj·xj > θ
Note: here we don’t use the subscript i in the equations, because we have just one single neuron in the net! The perceptron is a single-unit network
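A minimal sketch of this forward computation (the input values, weights and threshold below are arbitrary placeholders, not taken from the slides):

```python
import numpy as np

def perceptron_output(x, w, theta):
    # input function: weighted sum of the inputs
    net = np.dot(w, x)
    # step activation: compare the net input with the threshold theta
    return 1 if net > theta else 0

x = np.array([0.8, 0.3, 0.5])   # example inputs x1..x3
w = np.array([0.6, 0.2, -0.4])  # example weights w1..w3
print(perceptron_output(x, w, theta=0.1))  # -> 1 for these values
```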
Perceptron
Learning process
The learning process in the perceptron is based on the so called “Perceptron Learning Rule”
• This rule is a way of training a perceptron so that the weights are continuously adjusted (i.e. online learning) to produce correct detection results: the adjustment is done by calculating the errors the perceptron has made while detecting a target class
• Practically speaking, the change in value for one single weight j (in each iteration of the learning process) is
computed as the difference between the perceptron's output and the expected “real” output (= the error!),
multiplied by the perceptron’s input j and also multiplied by a small constant called learning rate:
Change in Weight j = Learning Rate × Current Value of Input j × (Expected Output - Current Output)
More formally:
∆wj = α · xj · (y − ŷ)
Where ∆wj is the change in weight from input j to the perceptron node, α is the learning rate, y is the target for the current instance, ŷ is the current output, and xj is the jth input
The learning rate 𝜶 is a constant, normally selected a-priori in a range between 0 and 1 (frequently = 0.1 or 0.2), that
controls the speed of weights adjustment*
*NOTE: If the learning rate is set to a large value, then the neural network may learn more quickly, but if there is a large variability in the input set then the network may not learn very well or at all.. it may “bounce around” the correct weights. In real terms, setting the learning rate to a large value is analogous to giving a child a spanking, but that is counter-productive to learning if the offense is as simple as forgetting to tie their shoelaces.
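As a small sketch (the function and variable names are mine, not from the slides), the update for a single weight is just:

```python
def weight_update(learning_rate, x_j, target, output):
    # Perceptron Learning Rule: delta_w_j = alpha * x_j * (y - y_hat)
    return learning_rate * x_j * (target - output)

# example: input 0.4, target 0, current output 1, learning rate 0.1
print(weight_update(0.1, 0.4, 0, 1))  # -> -0.04
```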
Perceptron
Learning process
Steps of learning process using the perceptron learning rule are as follows:
1. Randomly choose the weights for each input variable (normally in the range −1 to 1)
2. Training observations are presented to the perceptron one by one starting from the beginning, and its output is observed for each training observation
3. If the output is correct (error = 0) then the next training observation is presented to the perceptron
4. If the output is incorrect then the weights are modified using the following formula: ∆wj = α · xj · (y − ŷ)
Notes:
One pass through the whole training set is called an “epoch” of training
If after some epochs the network outputs match the targets for all the training patterns, all the ∆wj are zero and the training process ceases: we then say that the training process has converged to a solution
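Putting the steps together, a minimal training loop under these rules might look like the sketch below (function and variable names are mine; the random-initialization range, default learning rate and epoch cap are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, theta=0.0, alpha=0.1, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, size=X.shape[1])    # step 1: random weights in [-1, 1]
    for epoch in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):             # step 2: present observations one by one
            net = np.dot(w, x_i)               # input function
            y_hat = 1 if net > theta else 0    # step activation
            if y_hat != y_i:                   # steps 3-4: adjust the weights only on errors
                w += alpha * x_i * (y_i - y_hat)
                errors += 1
        if errors == 0:                        # one full pass = one epoch; stop if converged
            print(f"converged after {epoch + 1} epochs")
            break
    return w
```

With linearly separable data the loop stops as soon as an epoch produces no errors; otherwise it keeps running until max_epochs.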
Perceptron
Learning example
Training set (two observations):
x1 = 0.8, x2 = 0.3, y = 1
x1 = 0.4, x2 = 0.1, y = 0
Initial weights: w1 = 0.6, w2 = 0.2; threshold θ = 0.1
Perceptron
Learning example
Present the first observation (x1 = 0.8, x2 = 0.3, y = 1) to the perceptron, with weights w1 = 0.6, w2 = 0.2 and θ = 0.1
Apply input function: net = 0.8 · 0.6 + 0.3 · 0.2 = 0.54
Apply activation function: net = 0.54 > θ = 0.1 → ŷ = 1
Expected value: y = 1 → no error! In the next iteration no adjustment to the weights
Perceptron
Learning example
STEP 3/4: Present the second observation (x1 = 0.4, x2 = 0.1, y = 0) to the perceptron, without changing the weights (w1 = 0.6, w2 = 0.2, θ = 0.1), because there was no error in the previous iteration
Apply input function: net = 0.4 · 0.6 + 0.1 · 0.2 = 0.26
Apply activation function: net = 0.26 > θ = 0.1 → ŷ = 1
Expected value: y = 0 → wrong! In the next iteration adjust the weights according to the perceptron learning rule
Note: the first epoch ends here; the first cycle of iterations is concluded because we used all the observations in the training set.. If the algorithm has not converged yet, let’s start with another epoch!
Perceptron
Learning example
Second epoch: present the first observation (x1 = 0.8, x2 = 0.3, y = 1) with the updated weights w1 = 0.52, w2 = 0.18 and θ = 0.1
Apply input function: net = 0.8 · 0.52 + 0.3 · 0.18 = 0.47
Apply activation function: net = 0.47 > θ = 0.1 → ŷ = 1
Expected value: y = 1 → no error! In the next iteration no adjustment to the weights
Perceptron
Learning example
STEP 6: Repeat steps 2-5 until the total training set error ceases to improve (or all the observations are correctly classified)
After the 6th epoch both observations are correctly classified, so the algorithm stops
[Table: epoch-by-epoch trace of x1, x2, y, W1, W2, net, θ, activation output, error and convergence status]
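The trace can be reproduced with the training loop sketched earlier. The learning rate is not stated on the slide, but α = 0.2 is consistent with the weight change shown (0.6 → 0.52 and 0.2 → 0.18 after the error on the second observation); a small check assuming that value:

```python
import numpy as np

X = np.array([[0.8, 0.3],
              [0.4, 0.1]])
y = np.array([1, 0])
w = np.array([0.6, 0.2])      # initial weights from the example
theta, alpha = 0.1, 0.2       # threshold from the slide; learning rate inferred from the updates shown

for epoch in range(1, 20):
    errors = 0
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(w, x_i) > theta else 0
        if y_hat != y_i:
            w += alpha * x_i * (y_i - y_hat)
            errors += 1
    if errors == 0:
        # matches the slide: both observations correctly classified after the 6th epoch
        print(f"converged after epoch {epoch}, weights = {w}")
        break
```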
Perceptron
Introducing Bias term
As you can easily understand from the previous slides, the perceptron learning process and outcome are strongly influenced by the threshold θ. In most cases you need to adjust the value of the threshold to obtain better results and a quick convergence time
This can be done by introducing the concept of bias: just move the threshold to the left side of the activation function expression:
f(net, θ) = 0 if net = Σj wj·xj ≤ θ,  1 if net = Σj wj·xj > θ
which becomes
f(net, θ) = 0 if net = Σj wj·xj − θ ≤ 0,  1 if net = Σj wj·xj − θ > 0
and, writing b = −θ,
f(net) = 0 if net = Σj wj·xj + b ≤ 0,  1 if net = Σj wj·xj + b > 0
And then let the artificial unit estimate the proper value for the bias.. How? Just use a constant input with its own weight!
Perceptron
Introducing Bias term
[Figure: on the left, a perceptron with inputs x1 … xp, weights w1 … wp and a specific threshold θ; on the right, the same perceptron with an extra constant input x0 = 1 carrying its own weight w0, and the threshold always set to zero]
• Now consider w0 = b and let this weight change during the learning process, increasing the flexibility of the model to fit the data
• You can think of the bias as the intercept in a regression model! Without the bias, it’s like forcing the regression line to pass through the origin
• In more complex networks you typically have one bias term for each node of the network (sometimes one bias per layer)
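A minimal sketch of the trick (array values are arbitrary placeholders): prepend a constant input of 1 so that its weight plays the role of the bias and the comparison is always against zero.

```python
import numpy as np

def perceptron_with_bias(x, w):
    # w[0] is the bias weight w0 = b, paired with the constant input x0 = 1
    x_augmented = np.concatenate(([1.0], x))
    net = np.dot(w, x_augmented)        # net = sum_j w_j * x_j + b
    return 1 if net > 0 else 0          # threshold is now fixed at zero

x = np.array([0.8, 0.3])                # example inputs
w = np.array([-0.1, 0.6, 0.2])          # example [b, w1, w2]; b is learned like any other weight
print(perceptron_with_bias(x, w))       # -> 1 for these values
```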
Perceptron
Enhancing the perceptron: more activation functions
In the standard perceptron, a small change in the weights (or bias) can sometimes cause the output of that perceptron
to completely flip, from 0 to 1; and also it can be useful to have something different from a simple binary step output!
By changing the activation function you change the output of the neuron*
The most commonly used activation functions are:
• Sigmoid logistic function
• Sigmoid hyperbolic tangent function
• Rectified Linear Unit (ReLU) function
*NOTE: here we’re still assuming that the output of the activation function is always the output of the neuron, but in some cases this is not true (e.g.
when you have a specific output function that changes the result of the activation function)
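For reference, a quick sketch of these three functions (NumPy-based, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    # logistic function: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # hyperbolic tangent: output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # rectified linear unit: 0 for negative inputs, identity otherwise
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```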
Perceptron
Enhancing the perceptron: beyond linearly separable classes
As we said before, the aim of the perceptron is to classify the observations into two classes, for example C1 and C2, given a set of inputs x1, x2, . . ., xp
Mathematically, the equations used in the perceptron describe a hyperplane in the input space: this hyperplane (a line in a two-dimensional space) is used to separate the two classes C1 and C2
But perceptron classification can only happen when the classes are linearly separable!!
• In other cases the perceptron (and in general any single-layer ANN) is unable to find a solution (the algorithm doesn’t converge)
• Consider these two examples, where it’s impossible to separate the classes using just one line:
[Figure: decision boundary w1·x1 + w2·x2 + b = 0 in the (x1, x2) plane, separating the decision region for C1 (w1·x1 + w2·x2 + b > 0) from the decision region for C2 (w1·x1 + w2·x2 + b ≤ 0); Examples 1 and 2 show class configurations that cannot be separated by a single line]
How can we overcome this issue?
Adding one or more layers to the artificial network!!
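The classic non-separable case is the XOR pattern: a single step unit cannot reproduce it, but two hidden step units plus one output unit can. A sketch with hand-picked (not learned) weights for illustration:

```python
import numpy as np

def step_unit(x, w, theta):
    # single perceptron-style unit with step activation
    return 1 if np.dot(w, x) > theta else 0

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    h_or  = step_unit(x, np.array([1.0, 1.0]), 0.5)   # hidden unit 1: fires if x1 OR x2
    h_and = step_unit(x, np.array([1.0, 1.0]), 1.5)   # hidden unit 2: fires if x1 AND x2
    # output unit: OR but not AND -> exclusive OR
    return step_unit(np.array([h_or, h_and]), np.array([1.0, -1.0]), 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_mlp(a, b))   # prints 0, 1, 1, 0
```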
Multilayer Perceptron (MLP)
Structure
As the name suggests, the Multilayer Perceptron adds an additional layer (or layers) of neurons to a perceptron
• The additional layers are called hidden (or intermediate) layers
• A MLP is a feedforward network and, typically, the activation function has a sigmoidal shape
• A MLP with one hidden layer of sufficient size can approximate any continuous function to any desired accuracy
Consequently, whereas a standard perceptron performs just binary classification, a multilayer perceptron is free to perform classification or regression, depending upon its activation function and layer configuration
Multilayer Perceptron (MLP)
Dealing with different Non-Linearly Separable Problems
• Single-layer (no hidden layers): decision regions are half planes bounded by a hyperplane
• Two layers (1 hidden layer): convex open or closed regions
• Three layers (2 hidden layers): arbitrary regions (complexity limited by the number of nodes)
[Figure: example decision regions separating classes A and B for each of the three architectures]
Multilayer Perceptron (MLP)
Dealing with different Non-Linearly Separable Problems
MLPs with three layers are capable of approximating any desired bounded continuous function:
The units in the first hidden layer generate hyperplanes to divide the input space in half-spaces.
Units in the second hidden layer form convex regions as intersections of these hyperplanes.
Output units form unions of the convex regions into arbitrarily shaped, convex, non-convex or disjoint regions
Credits: BI3S - lab - Hamburg
Multilayer Perceptron (MLP)
Learning procedure
The learning process adjusts neural network weights in order to effectively map inputs to outputs
• For MLP we can’t use the Perceptron Learning Rule because no target output values are available for the hidden units
• Several methods exist to train MLP but the Backpropagation Algorithm is the most commonly
used
• The final purpose is to learn to generalize!
Multilayer Perceptron (MLP)
Learning procedure: few words on gradient descent rule
1- Define the error:
E(w) = Σd∈D Σk∈K (y_kd − ŷ_kd)²
where D is the set of training observations, K is the set of output units, and y_kd and ŷ_kd are, respectively, the target and current output for unit k for observation d
2- The descent rule is basically to change the weights by taking a small step (determined by the learning rate α) in the direction opposite the gradient:
∆wj = −α · ∂E(w)/∂wj
Warning: Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely
To avoid local-minima problems, run several trials starting with different random weights (random restarts)
*NOTE: In order to calculate the partial derivatives, the activation function must be continuous and differentiable.. That is the reason why it’s
convenient to use sigmoid functions instead of step functions
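As a sketch of the descent rule for a single sigmoid output unit (a deliberately simplified case; the names and toy data are illustrative assumptions, not from the slides), the gradient of the squared error can be computed and followed downhill:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, X, y, alpha):
    # squared error E(w) = sum_d (y_d - sigmoid(w . x_d))^2 for one output unit
    y_hat = sigmoid(X @ w)
    grad = -2 * X.T @ ((y - y_hat) * y_hat * (1 - y_hat))   # dE/dw
    return w - alpha * grad                                  # small step opposite the gradient

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy targets
w = rng.uniform(-1, 1, size=3)              # random restart: a different random starting point each trial
for _ in range(200):
    w = gradient_step(w, X, y, alpha=0.1)
print(w)
```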
Multilayer Perceptron (MLP)
Learning procedure: the backpropagation algorithm
The Back Propagation algorithm is composed of two parts that get repeated over and over
After the initialization of each weight to some small random value, every iteration performs the
following:
• Part I, the feedforward pass: the activation values of the hidden and then output units are
computed
• Part II, the backpropagation pass: the weights of the network are updated (starting with the hidden-to-output weights and followed by the input-to-hidden weights) with respect to the loss function, through a series of weight updates according to the gradient descent rule
Repeat the two steps until the tuning-set error stops improving or until a pre-set maximum number of epochs is reached
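A compact sketch of the two passes for a network with one hidden layer (sigmoid activations everywhere, squared-error loss; all sizes, names and toy data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # toy non-linearly-separable targets

# small random initial weights (input -> hidden -> output), biases included
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
alpha = 0.5

for epoch in range(2000):
    # Part I - feedforward pass: hidden and then output activations
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Part II - backpropagation pass: output-layer error first, then propagated to the hidden layer
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # gradient descent updates: hidden-to-output weights first, then input-to-hidden
    W2 -= alpha * h.T @ delta_out / len(X)
    b2 -= alpha * delta_out.mean(axis=0)
    W1 -= alpha * X.T @ delta_hid / len(X)
    b1 -= alpha * delta_hid.mean(axis=0)

print("training accuracy:", ((y_hat > 0.5) == y).mean())
```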
Final remarks
Wrap up
Perceptrons, with threshold logic function as activation function, are suitable for pattern classification
tasks that involve linearly separable classes
Multilayer feedforward neural networks like MLP are suitable for pattern classification tasks that
involve nonlinearly separable classes:
• Complexity of the model to be used for a given task depends on
o Dimension of the input pattern vector and number of classes to be classified
o Shapes of the decision surfaces to be formed
• Architecture of the model is empirically determined
• A large number of training observations is required when the complexity of the model is high
• Local minima problems can occur when using the gradient descent rule
• Complex networks, if not managed well, suffer from serious overfitting problems
A multilayer feedforward neural network with just one hidden layer is now called a shallow network; consequently, if you have many hidden layers you obtain a deep network (… and the so-called “deep learning”)
Final remarks
ANN big family
The artificial neural network family is really large.. many ANN architectures exist for different goals
Final remarks
Real world applications
In general: natural language processing, image / video and handwriting recognition
In marketing: consumer spending pattern classification, intelligent ads, human activity recognition, sales forecasting
In finance: fraud detection, signature verification and bank note verification, foreign exchange rate and stock market forecasting
In automotive: lane & sign recognition, pedestrian detection
In security: motion detection, surveillance image analysis and fingerprint matching
In medicine: ultrasound and electrocardiogram images, EEGs, medical diagnosis
In manufacturing: predictive maintenance
In agriculture: crop yield forecasting
In meteorology: weather prediction
Final remarks
Use case: Fraud prevention
Final remarks
Future?