
Artificial Neural Networks:

Supervised models

Lesson 14

20538 – Predictive Analytics for data driven decision making

Daniele Tonini
[email protected]
Agenda

1. Introduction: Machine Learning

2. Overview and structure of Artificial Neural Networks

3. Simple Perceptron model

4. Multi Layer Perceptron model

5. Final remarks

Machine Learning
Definition

• «A computer program is said to learn from experience E, with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E» (T. Mitchell, 1997)

• In simpler words ... Machine Learning is the part of artificial intelligence that deals with the development of
algorithms that learn from input data, according to specific learning processes, with the aim of creating
new knowledge

Source: SAS Institute

Machine Learning
ML vs. Statistical Modeling

To some extent it is just a matter of terminology:

Machine Learning              Statistical modeling
Algorithms                    Models
Supervised Learning           Dependence Analysis (e.g. Regression / Classification)
Unsupervised Learning         Inter-Dependence Analysis (e.g. Clustering)
Weights                       Parameters
Learning                      Estimation

But there are some other key differences, such as:

1. Machine Learning is focused on the precision / generalization of the prediction, while in statistical modeling the
interpretation and identification of causal dynamics are also very important

2. In statistical modeling particular attention is paid to the hypotheses / assumptions underlying the models (distribution of
variables, independence of observations, etc.), while machine learning is less rigorous and formal about these topics

3. Machine Learning uses various methodologies from different disciplines (mathematics, statistics, optimization, text analysis,
programming and computer science, etc.), while statistical modeling is more focused on purely statistical aspects

Artificial Neural Networks (ANN)
Overview

An Artificial Neural Network (ANN) is a system loosely modeled on the human brain  An ANN is a processing
device whose design was motivated by the design and functioning of human brains and components thereof
A first rough description is as follows:

• An ANN is a network of many simple processors ("units"), each possibly having a small amount of local memory

• The units are connected by communication channels ("connections"), which carry numeric (as opposed to
symbolic) data

• The units operate only on their local data and on the inputs they receive via the connections

• So ANNs are distributed systems, made up of simple processing units (“artificial neurons”)

[Figure: (A) Human neuron; (B) Artificial neuron or hidden unit; (C) Biological synapse; (D) ANN synapses]
Artificial Neural Networks (ANN)
Overview

And also..

• Most neural networks have some sort of "training" rule whereby the weights of connections are
adjusted on the basis of presented patterns  Knowledge is acquired by the network from its
environment through a learning process

• In other words, neural networks "learn" from examples, just like children learn to recognize dogs
from examples of dogs, and exhibit some structural capability for generalization

• Neural networks normally have great potential for parallelism, since the computations of the
components are independent of each other

Neural Network types can be classified based on the following attributes:

Learning Methods:                Topology/Architecture:    Connection Type:
- Supervised (many types)        - Single-layer            - Feedforward
- Unsupervised (many types)      - Multi-layer             - Recurrent
- Self-organized

Artificial Neural Networks (ANN)
Overview: analytical applications

Neural Network analytical applications can be grouped into the following categories:

Clustering
A clustering algorithm explores the similarity between patterns and places similar patterns in a cluster

Classification/Pattern recognition
The task of pattern recognition is to assign an input pattern (like a handwritten symbol) to one of many classes. This
category includes algorithmic implementations such as associative memory

Function approximation
The task of function approximation is to find an estimate of an unknown function f(·) subject to noise. Various
engineering and scientific disciplines require function approximation

Forecasting
The task is to forecast future values of time-sequenced data. Prediction has a significant impact on decision
support systems. Prediction differs from function approximation by considering the time factor:
here the system is dynamic and may produce different results for the same input data depending on the system state (time)

Anomaly detection
Identification of objects, events or observations which do not conform to an expected pattern
Artificial Neural Networks (ANN)
Overview: properties

ANNs’ properties (considering modeling issues):


+ Inputs are flexible: any real values / highly correlated or independent
+ Target variable can be qualitative (binary or multi-class) or quantitative
+ Resistant to errors in the training data
+ Fast evaluation of new observations once the network is trained
+ Manage non-linear dynamics
- Long training time / computationally intensive
- Overfitting risk
- Black box tool (difficult interpretation of relations between input and output)

When to consider artificial neural networks for modeling data:

• Input is high-dimensional discrete or real-valued (complex inputs)


• Output is qualitative or quantitative
• Possibly noisy data
• Form of target function is unknown
• Human readability of the result is not important
• Small improvement in the performance of the model can result in a significant difference
in operational efficiency
Artificial Neural Networks (ANN)
Structure: Topology

• The ANN structure consists of n interconnected elementary units, called artificial neurons (or units)
• These processing units are organized in layers: a neural network may have a variable number of neurons
depending also on the number of layers
• The following figure shows a simple multi-layer neural network

[Figure: input layer, intermediate “hidden” layer, output layer]

 N11 and N12 are the input neurons that receive the input data: these neurons process the input signal, according to a
certain function, and distribute the result to the neurons of the next layer
 The information is not simply forwarded to the next layer, but is weighted (W11, W12, W13, … are the weights)
 The neurons N21, N22, N23, N24 constitute the intermediate layer
 Each neuron in the intermediate layer sums up the received inputs, each equal to the product between the output of a
neuron of the input layer and the weight of the connection
 The result of this sum is again processed in the intermediate layer on the basis of a specific function and forwarded to
the next layer (that is, in this example, the output layer)

Artificial Neural Networks (ANN)
Structure: Connectivity

Feedforward networks
• Information moves in only one direction, forward, from the input nodes, through the hidden
nodes (if any) and to the output nodes
• Connections between the units do not form a cycle

Recurrent networks
• These have directed cycles in their connection graph (they can have complicated dynamics)
• More biologically realistic

Artificial Neural Networks (ANN)
Structure: Examples

Single layer feed-forward networks


• Input layer projecting into the output layer. Generally the input layer is not considered
a real processing layer because it just takes the information and passes it forward to the
next layer

[Figure: single-layer network (input layer and output layer)]
Artificial Neural Networks (ANN)
Structure: Examples

Multi-layer feed-forward networks


• One or more hidden layers.

[Figure: 2-layer (1-hidden-layer) fully connected network: input layer, hidden layer, output layer]
Artificial Neural Networks (ANN)
Structure: Single Artificial Unit

The basic computational element is the artificial neuron (also called node or unit); there are different types (or
models) of single artificial neuron, but the general structure/functioning can be presented as follows:

[Figure: a single artificial unit i, with input data x1 … xp, input weights wi1 … wip, a linear input function producing
net_i, an activation function f(net_i), and an output signal* sent to the next layer]

• Each artificial neuron receives input from other units, or from an external source (if it’s in the first layer)

• Each j-th input has an associated weight w_ij, which can be modified during the learning process

• The unit computes a transformation of the weighted inputs according to two functions:

  o A linear input function that simply sums up the weighted inputs and produces the “net input”, net_i = Σ_j w_ij x_j

  o A typically “non-linear” activation function that transforms the net value received from the input function  this
    function introduces non-linearities in the model, which are desirable in most ANNs in order to detect non-linear
    features in the data and to learn more complex functions

• Its output, in turn, can serve as input to other units (next layer)

*NOTE: in order to insert more flexibility in the unit, it’s also possible to consider a third function called «output function» that transforms the result of the
activation function; for our purpose we will assume “output function = identity function”, so that the final output is just the result of the activation
function.
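The behavior of a single unit can be sketched in a few lines of Python (not part of the original slides; the function name, the tanh activation and the numeric values are only illustrative choices):

```python
import math

def artificial_unit(inputs, weights, activation=math.tanh):
    """One artificial unit i: a linear input function (weighted sum of the inputs)
    followed by an activation function applied to the net input."""
    net = sum(w * x for w, x in zip(weights, inputs))  # net_i = sum_j w_ij * x_j
    return activation(net)                             # output signal for the next layer

# Illustrative call: three inputs and arbitrary weights
print(artificial_unit([1.0, 0.5, -0.2], [0.4, -0.1, 0.3]))
```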
Perceptron
Structure

The simplest type of artificial network is called the “Perceptron” and it’s a feed-forward net consisting of one single unit
 A perceptron takes several inputs x_j and produces a single binary output ( y ) according to a simple “step”
activation function that compares the net input value with a specified threshold θ (“theta”)
 Considering its structure, the perceptron is useful for classification applications (where the target variable is a
binary variable)

[Figure: perceptron with inputs x1 … xp, weights w1 … wp, and a single binary output y = (0 or 1)]

The output is calculated as:

y = f(net, θ) =  0, if net = Σ_j w_j x_j ≤ θ
                 1, if net = Σ_j w_j x_j > θ

Note: here we don’t use the subscript i in the equations, because we have just one single neuron in the net! The
perceptron is one single unit network
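As a minimal Python sketch (the function name is mine, and the numeric values simply anticipate the worked example later in this lesson), the perceptron output can be written as:

```python
def perceptron_output(x, w, theta):
    """Perceptron: step activation comparing the net input with the threshold theta."""
    net = sum(wj * xj for wj, xj in zip(w, x))   # net = sum_j w_j * x_j
    return 1 if net > theta else 0               # y = 0 if net <= theta, 1 if net > theta

# Illustrative call
print(perceptron_output([0.8, 0.3], [0.6, 0.2], theta=0.1))   # net = 0.54 > 0.1 -> 1
```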
Perceptron
Learning process

The learning process in the perceptron is based on the so called “Perceptron Learning Rule”

• This rule is a way of training a perceptron so that the weights are continuously adjusted (i.e. online learning)
to produce correct detection results  the adjustment is based on the errors the perceptron makes
while detecting a target class

• Practically speaking, the change in value for one single weight j (in each iteration of the learning process) is
computed as the difference between the perceptron's output and the expected “real” output (= the error!),
multiplied by the perceptron’s input j and also multiplied by a small constant called learning rate:

Change in Weight j = Learning Rate × Current Value of Input j × (Expected Output - Current Output)

More formally:

Δw_j = α · x_j · (y − ŷ)

Where Δw_j is the change in the weight from input j to the perceptron node, α is the learning rate, y is the target for the current
instance, ŷ is the current output, and x_j is the j-th input

The learning rate 𝜶 is a constant, normally selected a-priori in a range between 0 and 1 (frequently = 0.1 or 0.2), that
controls the speed of weights adjustment*

*NOTE: If the learning rate is set to a large value, then the neural network may learn more quickly, but if there is large variability in the input set then
the network may not learn very well or at all: it may “bounce around” the correct weights. In real terms, setting the learning rate to a large value is
analogous to giving a child a spanking, but that is counter-productive to learning if the offense is as simple as forgetting to tie their shoelaces.
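A one-line sketch of this rule (hypothetical name; the numbers are the ones that will appear in the worked example below):

```python
def delta_w(alpha, x_j, y_target, y_output):
    """Perceptron learning rule for one weight:
    change = learning rate * current value of input j * (expected output - current output)."""
    return alpha * x_j * (y_target - y_output)

# Illustrative numbers: alpha = 0.2, x_j = 0.4, expected output 0, current output 1
print(delta_w(0.2, 0.4, 0, 1))   # -0.08
```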
Perceptron
Learning process

Steps of learning process using the perceptron learning rule are as follows:

1. Randomly choose the weights for each input variable (normally in the range -1 to 1)
2. Training observations are presented to the perceptron one by one, starting from the beginning, and its output is
observed for each training observation
3. If the output is correct (error = 0) then the next training observation is presented to the perceptron
4. If the output is incorrect then the weights are modified using the following formula:

Updated weight = Previous weight + Delta weight  →  w′_j = w_j + Δw_j

5. Repeat steps 2-4 with the modified weights

6. Repeat steps 2-5 until all the observations are correctly classified

Notes:
 One pass through all the weights for the whole training set is called an “epoch” of training
 If after some epochs the network outputs match the targets for all the training patterns, all the
Δw_j are zero and the training process ceases: we then say that the training process has converged to a solution

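A compact Python sketch of the whole procedure (steps 1-6). The function name and the max_epochs safeguard are my own additions; the step activation with a fixed threshold θ matches the perceptron described so far, without a bias term:

```python
import random

def train_perceptron(X, y, alpha=0.2, theta=0.1, w=None, max_epochs=1000):
    """Perceptron learning process: present the observations one by one,
    adjust the weights on every error, stop when a whole epoch has no errors."""
    if w is None:
        w = [random.uniform(-1, 1) for _ in X[0]]        # step 1: random weights in [-1, 1]
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in zip(X, y):                      # step 2: one observation at a time
            net = sum(wj * xj for wj, xj in zip(w, x))
            output = 1 if net > theta else 0             # step activation
            if output != target:                         # steps 3-4
                errors += 1
                for j in range(len(w)):                  # w'_j = w_j + alpha * x_j * (y - y_hat)
                    w[j] += alpha * x[j] * (target - output)
        if errors == 0:                                  # step 6: all observations correct
            return w, epoch
    return w, max_epochs                                 # no convergence within max_epochs
```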
Perceptron
Learning example

STEP 1: Weights random initialization; learning rate & threshold definition

Training Dataset (very simple: two variables, two observations):

x1     x2     y
0.8    0.3    1
0.4    0.1    0

Perceptron structure in the first step, with random weights: w1 = 0.6 (on x1), w2 = 0.2 (on x2)

Learning rate: α = 0.2          Threshold: θ = 0.1

Activation function:

y = f(net, θ) =  0, if net = Σ_j w_j x_j ≤ θ = 0.1
                 1, if net = Σ_j w_j x_j > θ = 0.1
Perceptron
Learning example

STEP 2: Present first observation to the perceptron

First observation into the perceptron: x1 = 0.8, x2 = 0.3, expected value y = 1
Current weights: w1 = 0.6, w2 = 0.2; threshold θ = 0.1

Apply input function:        net = 0.8 · 0.6 + 0.3 · 0.2 = 0.54
Apply activation function:   net = 0.54 > θ = 0.1  →  ŷ = 1

No error! In the next iteration no adjustment to the weights
Perceptron
Learning example

STEP 3/4: Present second observation to the perceptron, without changing the weights, because there was no error in the previous iteration

Second observation into the perceptron: x1 = 0.4, x2 = 0.1, expected value y = 0
Current weights: w1 = 0.6, w2 = 0.2; threshold θ = 0.1

Apply input function:        net = 0.4 · 0.6 + 0.1 · 0.2 = 0.26
Apply activation function:   net = 0.26 > θ = 0.1  →  ŷ = 1

Wrong! In the next iteration adjust the weights according to these changes:

Δw1 = α · x1 · (y − ŷ) = 0.2 · 0.4 · (0 − 1) = −0.08
Δw2 = α · x2 · (y − ŷ) = 0.2 · 0.1 · (0 − 1) = −0.02

Note: the first epoch ends here; the first cycle of iterations is concluded because we used all the observations in the
training set. If the algorithm has not converged yet, let’s start with another epoch!
Perceptron
Learning example

STEP 5: Repeat steps 2-4 with the modified weights

w1′ = w1 + Δw1 = 0.6 − 0.08 = 0.52
w2′ = w2 + Δw2 = 0.2 − 0.02 = 0.18

First observation into the perceptron: x1 = 0.8, x2 = 0.3, expected value y = 1
Current weights: w1 = 0.52, w2 = 0.18; threshold θ = 0.1

Apply input function:        net = 0.8 · 0.52 + 0.3 · 0.18 = 0.47
Apply activation function:   net = 0.47 > θ = 0.1  →  ŷ = 1

No error! In the next iteration no adjustment to the weights
Perceptron
Learning example

STEP 6: Repeat steps 2-5 until the total training set error ceases to improve (or all the observations are correctly classified)

After the 6th epoch both the observations are correctly classified so the algorithm stops

Epoch   x1    x2    y    W1     W2     net     θ     Activation output   Error   Converged?
1       0.8   0.3   1    0.6    0.2    0.54    0.1   1                    0
1       0.4   0.1   0    0.6    0.2    0.26    0.1   1                   -1      Not Converged
2       0.8   0.3   1    0.52   0.18   0.47    0.1   1                    0
2       0.4   0.1   0    0.52   0.18   0.226   0.1   1                   -1      Not Converged
3       0.8   0.3   1    0.44   0.16   0.4     0.1   1                    0
3       0.4   0.1   0    0.44   0.16   0.192   0.1   1                   -1      Not Converged
4       0.8   0.3   1    0.36   0.14   0.33    0.1   1                    0
4       0.4   0.1   0    0.36   0.14   0.158   0.1   1                   -1      Not Converged
5       0.8   0.3   1    0.28   0.12   0.26    0.1   1                    0
5       0.4   0.1   0    0.28   0.12   0.124   0.1   1                   -1      Not Converged
6       0.8   0.3   1    0.2    0.1    0.19    0.1   1                    0
6       0.4   0.1   0    0.2    0.1    0.09    0.1   0                    0      Converged

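Reusing the train_perceptron sketch given after the learning-process steps, this worked example can be reproduced with the same starting values (the expected result is the one shown in the table):

```python
X = [[0.8, 0.3], [0.4, 0.1]]                 # the two training observations
y = [1, 0]
w, epochs = train_perceptron(X, y, alpha=0.2, theta=0.1, w=[0.6, 0.2])
print(w, epochs)                             # [0.2, 0.1] after 6 epochs, matching the table
```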
Perceptron
Introducing Bias term

As you can easily understand from the previous slides, the perceptron learning process and outcome are strongly
influenced by the threshold θ  In most cases you need to adjust the value of the threshold to obtain
better results and a quicker convergence time

This can be done by introducing the concept of bias  just move the threshold to the left side of the activation function
expression:

f(net, θ) =  0, if net = Σ_j w_j x_j ≤ θ            f(net, θ) =  0, if net = Σ_j w_j x_j − θ ≤ 0
             1, if net = Σ_j w_j x_j > θ                         1, if net = Σ_j w_j x_j − θ > 0

The bias can be defined as: b = −θ

The perceptron activation function then becomes:

f(net) =  0, if net = Σ_j w_j x_j + b ≤ 0
          1, if net = Σ_j w_j x_j + b > 0

And then let the artificial unit estimate the proper value for the bias. How? Just use a constant input with its own weight!

Perceptron
Introducing Bias term

[Figure: left, a perceptron with inputs x1 … xp, weights w1 … wp and a specific threshold θ; right, a perceptron with an
additional constant input x0 = 1 whose weight w0 is the bias, with the threshold always set to zero]

• Now consider w0 = b and let this weight change during the learning process, increasing the flexibility of the
model to fit the data
• You can think of the bias as the intercept in a regression model! Without the bias, it’s like forcing the
regression line to pass through the origin
• In more complex networks you typically have one bias term for each node of the network (sometimes one bias
per layer)

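A small sketch of the bias trick (illustrative names; the numbers reuse the earlier example, with b = −θ = −0.1):

```python
def perceptron_with_bias(x, w):
    """Bias as the weight w[0] of a constant input x0 = 1; threshold fixed at zero."""
    net = w[0] * 1 + sum(wj * xj for wj, xj in zip(w[1:], x))   # net = sum_j w_j x_j + b
    return 1 if net > 0 else 0

# Same decision as the threshold version with theta = 0.1
print(perceptron_with_bias([0.8, 0.3], [-0.1, 0.6, 0.2]))
```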
Perceptron
Enhancing the perceptron: more activation functions

In the standard perceptron, a small change in the weights (or bias) can sometimes cause the output of that perceptron
to completely flip, from 0 to 1; it can also be useful to have something different from a simple binary step output!
By changing the activation function you change the output of the neuron*
Below are the most commonly used activation functions:

Sigmoid Logistic function:             output = e^net / (1 + e^net) = 1 / (1 + e^(−net))

Sigmoid Hyperbolic Tangent function:   output = tanh(net) = (e^net − e^(−net)) / (e^net + e^(−net))

Rectified Linear Unit (ReLU) function: output = max(0, net)

A perceptron with a different activation function, strictly speaking, is not a perceptron anymore (for example it becomes
a “sigmoid neuron” if you use a sigmoid function)

*NOTE: here we’re still assuming that the output of the activation function is always the output of the neuron, but in some cases this is not true (e.g.
when you have a specific output function that changes the result of the activation function)
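The three activation functions above, written as plain Python for reference (function names are mine; the 0.5 test value is arbitrary):

```python
import math

def logistic(net):          # sigmoid logistic: e^net / (1 + e^net) = 1 / (1 + e^-net)
    return 1.0 / (1.0 + math.exp(-net))

def tanh(net):              # sigmoid hyperbolic tangent
    return math.tanh(net)

def relu(net):              # rectified linear unit
    return max(0.0, net)

for f in (logistic, tanh, relu):
    print(f.__name__, round(f(0.5), 4))
```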
Perceptron
Enhancing the perceptron: beyond classes linear separability

As we said before, the aim of the perceptron is to classify the observations into two classes, for example C1 and C2,
given a set of inputs x1, x2, …, xp
Mathematically, the equations used in the perceptron describe a hyperplane in the input space  this hyperplane (a line
in a two-dimensional space) is used to separate the two classes C1 and C2:

Decision region for C1:   w1x1 + w2x2 + b > 0
Decision boundary:        w1x1 + w2x2 + b = 0
Decision region for C2:   w1x1 + w2x2 + b ≤ 0

But perceptron classification can only happen when the classes are linearly separable!!
• In other cases the perceptron (and in general any single-layer ANN) is unable to find a solution (the algorithm doesn’t
converge)
• Consider these two examples, where it’s impossible to separate the classes using just one line:

[Figure: Example 1 and Example 2, two class configurations that cannot be separated by a single line]
How can we overcome this issue?
 Adding one or more layers to the artificial network!!

Multilayer Perceptron (MLP)
Structure

As the name suggests, the Multilayer Perceptron adds an additional layer (or layers) of neurons to a perceptron
• The additional layers are called hidden (or intermediate) layers
• An MLP is a feedforward network and, typically, the activation function has a sigmoidal shape
• An MLP with one hidden layer of sufficient size can approximate any continuous function to any desired accuracy

NOTE: To be more precise, the term "multilayer perceptron" may cause confusion, because:

• the model is not a single perceptron that has multiple layers, but it contains many perceptrons that are
organised into layers (at least one hidden)

• moreover, these "perceptrons" are not really perceptrons in the strictest possible sense: as we know, perceptrons are a
special case of artificial neurons that use a step activation function, whereas the artificial neurons in a multilayer
perceptron are free to take on any arbitrary activation function

Consequently, whereas a standard perceptron performs just binary classification, a multilayer perceptron is free to perform classification or
regression, depending upon its activation function and its layer configuration

Multilayer Perceptron (MLP)
Structure

 The input layer


– Introduces input values into the network
– Typically no activation function or other processing

 The hidden layer(s)


– Perform feature extraction (extract salient features in the input
data which have predictive power with respect to the outputs)
– Two hidden layers are sufficient to solve a lot of problems, but for
very complex cases you need “deeper” networks

 The output layer


– Functionally just like the hidden layers
– Outputs are passed on to the world outside the neural network

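A sketch of the resulting forward pass for a 1-hidden-layer MLP with sigmoid units (layer sizes, weights and function names here are arbitrary illustrative choices, not taken from the slides):

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def mlp_forward(x, hidden_w, hidden_b, output_w, output_b):
    """Forward pass of a 1-hidden-layer MLP with sigmoid units.
    hidden_w[i] / hidden_b[i]: weights and bias of hidden unit i;
    output_w[k] / output_b[k]: weights and bias of output unit k."""
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x)) + b)      # hidden layer: feature extraction
              for ws, b in zip(hidden_w, hidden_b)]
    return [logistic(sum(w * h for w, h in zip(ws, hidden)) + b)     # output layer
            for ws, b in zip(output_w, output_b)]

# Illustrative 2-3-1 network with arbitrary weights
print(mlp_forward([0.8, 0.3],
                  hidden_w=[[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]], hidden_b=[0.0, 0.1, -0.1],
                  output_w=[[0.3, -0.5, 0.2]], output_b=[0.05]))
```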
Multilayer Perceptron (MLP)
Dealing with different Non-Linearly Separable Problems

Structure                           Types of Decision Regions
Single-Layer (no hidden layers)     Half plane bounded by a hyperplane
Two-Layers (1 hidden layer)         Convex open or closed regions
Three-Layers (2 hidden layers)      Arbitrary (complexity limited by the number of nodes)

[Figure: for each structure, the decision regions obtained on the Exclusive-OR problem, on classes with meshed regions,
and the most general region shapes]
Multilayer Perceptron (MLP)
Dealing with different Non-Linearly Separable Problems

MLPs with three layers are capable of approximating any desired bounded continuous function:
 The units in the first hidden layer generate hyperplanes to divide the input space into half-spaces
 Units in the second hidden layer form convex regions as intersections of these hyperplanes
 Output units form unions of the convex regions into arbitrarily shaped, convex, non-convex or disjoint regions

Credits: BI3S - lab - Hamburg
Multilayer Perceptron (MLP)
Learning procedure

The learning process adjusts neural network weights in order to effectively map inputs to outputs

• For MLPs we can’t use the Perceptron Learning Rule, because no target output values are available for the hidden units
• Several methods exist to train an MLP, but the Backpropagation Algorithm is the most commonly
used
• The final purpose is to learn to generalize!

Back Propagation Algorithm


• It’s a learning procedure which allows multi-layer feedforward
Neural Networks to be trained

• Can theoretically perform “any” input-output mapping

• Can learn to solve linearly inseparable problems

• Backpropagation works by applying the so called Gradient


Descent Rule to a feedforward network

Multilayer Perceptron (MLP)
Learning procedure: few words on gradient descent rule

Think of the n weights 𝑤1 , 𝑤2 , … , 𝑤𝑛 as a point in an n-dimensional space, and:


• Define the observed error (i.e. SSE or other loss/cost expressions) as a function of the weights  E(w_1, w_2, …, w_n)
• Try to move towards the minimum of this “error surface”

[Figure: example of an error surface with 2 weights]

1- Define the error:

E(w) = Σ_{d∈D} Σ_{k∈K} (y_kd − ŷ_kd)²

where D is the set of training observations, K is the set of output units, and y_kd and ŷ_kd are, respectively, the target
and current output for unit k for observation d

2- Compute the gradient as*:    ∇E[w] = [∂E/∂w_0, …, ∂E/∂w_n]

3- Change the i-th weight by:   Δw_i = −α · ∂E/∂w_i

The descent rule is basically to change the weights by taking a small step (determined by the learning rate 𝜶) in
the direction opposite the gradient
Warning: Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely
To avoid local-minima problems, run several trials starting with different random weights (random restarts)

*NOTE: In order to calculate the partial derivatives, the activation function must be continuous and differentiable. That is the reason why it’s
convenient to use sigmoid functions instead of step functions
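A small illustration of the descent rule. For simplicity the partial derivatives are approximated numerically rather than analytically, and the single-sigmoid-unit predictor, the data and the learning rate are arbitrary choices of this sketch:

```python
import math

def sse(w, data, predict):
    """Observed error E(w): sum of squared differences between targets and outputs."""
    return sum((y - predict(x, w)) ** 2 for x, y in data)

def gradient_descent_step(w, data, predict, alpha=0.1, h=1e-5):
    """One step of the descent rule: w_i <- w_i - alpha * dE/dw_i,
    with the partial derivatives approximated numerically for illustration."""
    base = sse(w, data, predict)
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        grad.append((sse(w_plus, data, predict) - base) / h)   # dE/dw_i
    return [wi - alpha * gi for wi, gi in zip(w, grad)]        # small step against the gradient

# Illustrative use with a single sigmoid unit as the predictor
predict = lambda x, w: 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
data = [([0.8, 0.3], 1.0), ([0.4, 0.1], 0.0)]
w = [0.6, 0.2]
for _ in range(20):
    w = gradient_descent_step(w, data, predict, alpha=0.5)
print(w, sse(w, data, predict))
```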
Multilayer Perceptron (MLP)
Learning procedure: the backpropagation algorithm

The Back Propagation algorithm is composed of two parts that get repeated over and over

After the initialization of each weight to some small random value, every iteration performs the
following:

• Part I, the feedforward pass: the activation values of the hidden and then output units are
computed

• Part II, the backpropagation pass: the weights of the network are updated (starting with the
hidden-to-output weights and followed by the input-to-hidden weights) with respect to the loss
function, through a series of weight updates according to the gradient descent rule

Repeat the two steps until the tuning-set error stops improving or until a pre-set maximum number of
epochs is reached

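A sketch of one backpropagation iteration for a 1-hidden-layer MLP with sigmoid units and a squared-error loss. The weight layout, function names and the XOR usage below are my own illustrative choices; as noted on the previous slide, convergence is not guaranteed and may require random restarts:

```python
import math, random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, target, W_h, W_o, alpha=0.5):
    """One backpropagation iteration. The last element of each weight list is the bias.
    Returns the current outputs and updates W_h and W_o in place."""
    # Part I - feedforward pass: hidden activations, then output activations
    h = [logistic(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1]) for ws in W_h]
    o = [logistic(sum(w * hi for w, hi in zip(ws[:-1], h)) + ws[-1]) for ws in W_o]

    # Part II - backpropagation pass: error terms for output and hidden units ...
    d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, target)]
    d_h = [hi * (1 - hi) * sum(dk * W_o[k][i] for k, dk in enumerate(d_o))
           for i, hi in enumerate(h)]
    # ... then gradient-descent updates: hidden-to-output weights first,
    # followed by the input-to-hidden weights
    for k, ws in enumerate(W_o):
        for i in range(len(h)):
            ws[i] += alpha * d_o[k] * h[i]
        ws[-1] += alpha * d_o[k]
    for i, ws in enumerate(W_h):
        for j in range(len(x)):
            ws[j] += alpha * d_h[i] * x[j]
        ws[-1] += alpha * d_h[i]
    return o

# Illustrative use on XOR, a linearly inseparable problem (2 inputs, 2 hidden units, 1 output)
random.seed(0)
W_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W_o = [[random.uniform(-1, 1) for _ in range(3)]]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for epoch in range(5000):
    for x, t in data:
        backprop_step(x, t, W_h, W_o, alpha=0.5)
print([round(backprop_step(x, t, W_h, W_o, alpha=0.0)[0], 2) for x, t in data])
# Outputs should approach 0, 1, 1, 0; an unlucky initialization may get stuck in a
# local minimum instead (hence the random-restart advice given earlier)
```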
Final remarks
Wrap up

Perceptrons, with a threshold logic function as activation function, are suitable for pattern classification
tasks that involve linearly separable classes

Multilayer feedforward neural networks like MLP are suitable for pattern classification tasks that
involve nonlinearly separable classes:
• Complexity of the model to be used for a given task depends on
o Dimension of the input pattern vector and number of classes to be classified
o Shapes of the decision surfaces to be formed
• Architecture of the model is empirically determined
• A large number of training observations is required when the complexity of the model is high
• Local minima are a problem when using the gradient descent rule
• Complex networks, if not managed well, suffer from serious overfitting problems

A multilayer feedforward neural network with just one hidden layer is now called a shallow network;
consequently, if you have many hidden layers you obtain a deep network (… and the so-called “deep
learning”)

Final remarks
ANN big family

The artificial neural network family is really large: there are many ANN architectures for different goals

Final remarks
Real world applications

 Classification, recognition and identification
In general: natural language processing, image / video and handwriting recognition
In marketing: consumer spending pattern classification, intelligent ads, human activity recognition
In finance: fraud detection, signature verification and bank note verification
In automotive: lane & sign recognition, pedestrian detection
In security: motion detection, surveillance image analysis and fingerprint matching
In medicine: ultrasound and electrocardiogram images, EEGs, medical diagnosis

 Forecasting and prediction
In finance: foreign exchange rate and stock market forecasting
In manufacturing: predictive maintenance
In agriculture: crop yield forecasting
In marketing: sales forecasting
In meteorology: weather prediction

Final remarks
Use case: Fraud prevention

Fraudsters are becoming increasingly smart and adaptive, so:


• Need cost effective solutions that can model complex attack patterns not previously observed
• Need scalable and computationally efficient prediction models

Solution developed by Paypal©

[Figure: the “machine”, the ANN settings and the results of the Paypal© solution]
Final remarks
Future?

This can be true… But also…

