ML unit-2
UNIT-II
Multi-Layer Perceptron: Going Forwards, Going Backwards, Back Propagation Error, Multi-
layer perceptron in practice, Examples of using the MLP, Deriving Back-propagation.
Radial Basis Functions and Splines: Concepts, RBF Network, Curse of Dimensionality,
Interpolations and Basis Functions, Support Vector Machine
Input: (A, B) = (1, 0), i.e. A = 1, B = 0
At Neuron C:
1×1 + 0×1 + 1×(−0.5) = 1 + 0 − 0.5 = 0.5 > threshold 0
Neuron C fires, so its output is 1.
At Neuron D:
1×1 + 0×1 + 1×(−1) = 1 + 0 − 1 = 0, which is not greater than the threshold 0
Neuron D does not fire, so its output is 0.
At Neuron E (whose inputs are the outputs of C and D):
1×1 + 0×(−1) + 1×(−0.5) = 1 − 0 − 0.5 = 0.5 > threshold 0
Neuron E fires, so its output is 1.
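The same calculation can be written as a short Python sketch; the weights, the fixed bias input of 1, and the threshold of 0 are taken from the worked example above, and the function name is purely illustrative.

# Threshold neuron: fires (outputs 1) if the weighted sum of its inputs
# plus the bias contribution exceeds the threshold.
def threshold_neuron(inputs, weights, bias_weight, threshold=0.0):
    total = sum(i * w for i, w in zip(inputs, weights)) + 1 * bias_weight
    return 1 if total > threshold else 0

A, B = 1, 0
C = threshold_neuron([A, B], [1, 1], -0.5)   # 0.5 > 0      -> fires, output 1
D = threshold_neuron([A, B], [1, 1], -1.0)   # 0.0, not > 0 -> output 0
E = threshold_neuron([C, D], [1, -1], -0.5)  # 0.5 > 0      -> fires, output 1
print(C, D, E)  # 1 0 1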
Going Forwards:
Training the MLP consists of two parts: working out what the outputs are for the given
inputs and the current weights, and then updating the weights according to the error, which
is a function of the difference between the outputs and the targets.
These are generally known as going forwards and backwards through the network.
Each neuron in the network (whether it is in a hidden layer or the output layer) has one extra input with a fixed value; this extra input is called the bias.
Going Backwards- Back Propagation of Error:
Back-propagation of error makes it clear that the errors are sent backwards through the
network.
It is a form of gradient descent.
The problem is that when we try to adapt the weights of the Multi-layer Perceptron, we
have to work out which weights caused the error.
This could be the weights connecting the inputs to the hidden layer, or the weights
connecting the hidden layer to the output layer.
We use the sum-of-squares error function, which calculates the difference between the output y and the target t for each output node, squares these differences, and adds them all together.
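As a minimal sketch (the ½ factor is a common convention that simplifies the derivative; NumPy is assumed to be available):

import numpy as np

# Sum-of-squares error: squared difference between output y and target t
# for each output node, summed over the nodes.
def sum_of_squares_error(y, t):
    return 0.5 * np.sum((np.asarray(y) - np.asarray(t)) ** 2)

print(sum_of_squares_error([0.8, 0.2], [1.0, 0.0]))  # 0.04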
We need an activation function that looks like a threshold function but is differentiable
so that we can compute the gradient.
Activation Functions:
The activation function basically decides whether a neuron should be activated or not.
The activation function is a non-linear transformation that we do over the input before
sending it to the next layer of neurons or finalizing it as output.
Sigmoid Function:
The Sigmoid activation function, also known as the logistic activation function,
takes inputs and turns them into outputs ranging between 0 and 1.
For this reason, the sigmoid is referred to as a “squashing function”; it is also differentiable.
Larger, more positive inputs produce output values close to 1.0, while smaller, more negative inputs produce outputs closer to 0.0.
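A small sketch of the sigmoid and its derivative (the derivative σ(x)·(1 − σ(x)) is what makes it convenient for gradient descent; NumPy assumed):

import numpy as np

# Logistic (sigmoid) activation: squashes any real input into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Its derivative, written in terms of the output itself.
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx [0.0067 0.5 0.9933]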
A local minimum is a point in the parameter space where the loss function is minimized
in a local neighborhood.
A global minimum is a point in the parameter space where the loss function is
minimized globally.
Picking Up Momentum:
Momentum in neural networks is a parameter optimization technique that accelerates
gradient descent by adding a fraction of the previous weight update to the current
weight update.
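A minimal sketch of a weight update with momentum (the learning rate eta and the momentum fraction alpha are illustrative values, not values from these notes):

# Gradient descent with momentum: the new update is the usual gradient step
# plus a fraction of the previous update, which smooths and speeds up learning.
eta, alpha = 0.1, 0.9

def momentum_step(weight, gradient, previous_update):
    update = -eta * gradient + alpha * previous_update
    return weight + update, update

w, prev = 0.5, 0.0
w, prev = momentum_step(w, gradient=0.2, previous_update=prev)  # first step
w, prev = momentum_step(w, gradient=0.1, previous_update=prev)  # second step gains momentum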
The training of the MLP requires that the algorithm runs over the entire dataset many
times, with the weights changing as the network makes errors in each iteration.
Two options:
o A predefined number of iterations
o A predefined minimum error is reached
Using both of these options together can help, as can terminating the learning once the
error stops decreasing.
We train the network for some predetermined amount of time, and then use the
validation set to estimate how well the network is generalising.
We then carry on training for a few more iterations, and repeat the whole process.
At some stage the error on the validation set will start increasing again, because the
network has stopped learning about the function that generated the data, and started to
learn about the noise that is in the data itself.
At this stage we stop the training. This technique is called early stopping.
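A minimal early-stopping sketch; train_for_a_while() and validation_error() are hypothetical placeholders standing in for training on the training set and measuring error on the validation set (here the validation error is faked so the loop terminates):

import math

def train_for_a_while(model):
    model["epochs"] += 10            # stand-in for a few training iterations
    return model

def validation_error(model):
    # stand-in curve: error falls, then rises once the network starts
    # learning the noise in the training data
    e = model["epochs"]
    return (e - 50) ** 2 / 2500 + 0.1

model = {"epochs": 0}
best_error = math.inf
while True:
    model = train_for_a_while(model)
    err = validation_error(model)
    if err >= best_error:            # validation error has started to rise
        break                        # -> stop training (early stopping)
    best_error = err
print("stopped after", model["epochs"], "epochs")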
Regression:
The loss functions that can be used in a regression MLP include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
MSE works well on datasets with few outliers, while MAE is a better measure on datasets that contain more outliers.
Example: Rainfall prediction, Stock price prediction
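Both losses are straightforward to compute; a quick sketch (NumPy assumed):

import numpy as np

# Mean Squared Error: average of squared differences.
def mse(y_pred, y_true):
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

# Mean Absolute Error: average of absolute differences.
def mae(y_pred, y_true):
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

y_pred, y_true = [2.5, 0.0, 2.0], [3.0, -0.5, 2.0]
print(mse(y_pred, y_true), mae(y_pred, y_true))  # 0.1667 0.3333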
Classification:
If the output variable is categorical, then we have to use classification for prediction.
Example: Iris Flower classification
The aim is to classify iris flowers among three species (Setosa, Versicolor, or Virginica)
from the sepals’ and petals’ length and width measurements.
The neural network used for this task has one input layer, two hidden layers and one output layer.
In the hidden layers we use sigmoid as an activation function for all neurons.
In the output layer, we use softmax as an activation function for the three output
neurons.
In this regard, all outputs are between 0 and 1, and their sum is 1.
The neural network has three outputs since the target variable contains three classes
(Setosa, Versicolor, and Virginica).
Working of Softmax:
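A short sketch of how softmax turns the three raw output scores into class probabilities (the scores shown are illustrative; NumPy assumed):

import numpy as np

# Softmax: exponentiate each score and normalise, so every output lies in
# (0, 1) and the outputs sum to 1.
def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])           # raw outputs for Setosa, Versicolor, Virginica
print(softmax(scores))                        # approx [0.659 0.242 0.099]
print(softmax(scores).sum())                  # 1.0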
The hidden nodes of an auto-associative network find a different representation of the input data, one that extracts the important components of the data and ignores the noise.
This auto-associative network can be used to compress images and other data.
Deriving Back-propagation:
Things to know:
1. The derivative of ½x² is x.
2. The chain rule: if y depends on u and u depends on x, then dy/dx = (dy/du) · (du/dx).
Note that i is an index over the input nodes, j is an index over the hidden layer neurons,
and k is an index over the output neurons.
The Error of the Network:
Writing the error function as E(v, w) reminds us that the only things we can change are the weights v and w.
We will choose the sum-of-squares error function.
We are going to use a gradient descent algorithm that adjusts each weight.
The gradient that we want to know is how the error function changes with respect to
the different weights
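As a sketch of the quantities involved (a reconstruction consistent with the surrounding discussion, using the indices defined above), the sum-of-squares error over the output neurons and the gradient-descent update for a second-layer weight can be written as:

E(v, w) = ½ Σ_k (y_k − t_k)²

w_jk ← w_jk − η · ∂E/∂w_jk

where η is the learning rate.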
There is a family of functions, called sigmoid functions because they are S-shaped, that satisfy all of these criteria.
Since we don’t know much about the inputs to a neuron, we just know about its output. That’s fine, because we can use the chain rule again.
The important thing that we need to remember is that inputs to the output layer neurons come
from the activations of the hidden layer neurons multiplied by the second layer weights:
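A sketch of where this leads, using the standard result for sigmoid output neurons (consistent with the derivation being outlined here, with a_j the activation of hidden neuron j and η the learning rate): the error term for output neuron k is

δ_k = (y_k − t_k) · y_k · (1 − y_k)

and the second-layer weights are updated as

w_jk ← w_jk − η · δ_k · a_j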
Radial Basis Function (RBF) Networks:
RBF Networks are conceptually similar to K-Nearest Neighbor (k-NN) models, though their implementation is distinct.
The fundamental idea is that an item’s predicted target value is influenced by nearby
items with similar predictor variable values.
Here’s how RBF Networks operate (a short code sketch follows this list):
o Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
o RBF Neurons: Each neuron in the hidden layer represents a prototype vector
(center, radius/spread) from the training set. The network computes the
Euclidean distance between the input vector and each neuron’s center.
o Activation Function: The Euclidean distance is transformed using a Radial
Basis Function (typically a Gaussian function) to compute the neuron’s
activation value. This value decreases exponentially as the distance increases.
o Output Nodes: Each output node calculates a score based on a weighted sum of
the activation values from all RBF neurons. For classification, the category with
the highest score is chosen.
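A minimal sketch of this forward pass; the centres, spread (sigma) and output weights below are illustrative values, whereas in practice the centres come from the training data and the output weights are learned (NumPy assumed):

import numpy as np

# RBF network forward pass: Euclidean distance to each centre, Gaussian
# activation of each RBF neuron, then a weighted sum per output node.
def rbf_forward(x, centers, sigma, output_weights):
    distances = np.linalg.norm(centers - x, axis=1)
    activations = np.exp(-(distances ** 2) / (2 * sigma ** 2))
    return output_weights @ activations

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
output_weights = np.array([[0.7, 0.3],    # one row of weights per output node
                           [0.2, 0.8]])
scores = rbf_forward(np.array([0.9, 1.1]), centers, sigma=0.5, output_weights=output_weights)
print(scores, scores.argmax())            # the class with the highest score is chosen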
Interpolation:
Example: if a child's height was measured at age 5 and at age 6, interpolation could be used to estimate the child's height at age 5.5.
Basis Function:
Radial basis functions and several other machine learning algorithms can be written in
this form:
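A common way of writing that form (a sketch using standard notation, since the original formula is not reproduced here): the output is a weighted sum of basis functions centred on points c_i,

f(x) = Σ_i w_i · φ(‖x − c_i‖)

where φ is a radial basis function such as the Gaussian φ(r) = exp(−r² / (2σ²)).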
Curse of Dimensionality:
The Curse of Dimensionality refers to the phenomenon where the efficiency and effectiveness of algorithms deteriorate as the dimensionality of the data increases.
It is crucial to understand this concept because as the number of features or dimensions in a dataset increases, the amount of data we need to generalize accurately grows exponentially.
Dimensions refer to the features or attributes of data.
For instance, if we consider a dataset of houses, the dimensions could include the
house's price, size, number of bedrooms, location, and so on.
What problems does it cause?
1. Data sparsity. As mentioned, data becomes sparse, meaning that most of the high-
dimensional space is empty. This makes clustering and classification tasks challenging.
2. Increased computation. More dimensions mean more computational resources and
time to process the data.
3. Overfitting. With higher dimensions, models can become overly complex, fitting to
the noise rather than the underlying pattern. This reduces the model's ability to
generalize to new data.
4. Distances lose meaning. In high dimensions, the difference in distances between data points tends to become negligible, making measures like Euclidean distance less meaningful (see the sketch after this list).
5. Performance degradation. Algorithms, especially those relying on distance
measurements like k-nearest neighbors, can see a drop in performance.
6. Visualization challenges. High-dimensional data is hard to visualize, making
exploratory data analysis more difficult.
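The following sketch illustrates point 4: for random points, the gap between the nearest and farthest neighbour shrinks relative to the distances themselves as the number of dimensions grows, so Euclidean distance becomes less informative (NumPy assumed; the numbers of points and dimensions are illustrative):

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))                           # 500 random points in d dimensions
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # distances from the first point
    print(d, (dists.max() - dists.min()) / dists.min())     # relative contrast shrinks with d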
Support Vector Machine (SVM):
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM:
Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, then it is called linearly separable data, and the classifier used is called the Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, then it is called non-linear data, and the classifier used is called the Non-linear SVM classifier.
Hyperplane:
We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
Suppose we have a dataset with two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
The SVM algorithm finds the points from both classes that are closest to the line. These points are called support vectors.
The distance between these vectors and the hyperplane is called the margin.
The goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
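A small sketch of this idea using scikit-learn (assumed to be available); the toy data and labels are illustrative: points of one class lie inside a circle and the rest outside, so adding z = x² + y² makes them separable by a plane.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)   # label by distance from the origin

z = (X ** 2).sum(axis=1, keepdims=True)               # the extra dimension z = x^2 + y^2
X3 = np.hstack([X, z])                                 # 3-D data (x, y, z)

clf = SVC(kernel="linear").fit(X3, y)                  # a linear SVM in the lifted 3-D space
print(clf.score(X3, y))                                # close to 1.0 on this toy data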
Kernels:
The most interesting feature of SVM is that it can even work with a non-linear dataset; for this, we use the “Kernel Trick”, which makes it easier to classify the points.
Suppose we have a dataset like this:
Here we see that we cannot draw a single line (hyperplane) that classifies the points correctly.
So we convert this lower-dimensional space into a higher-dimensional space using some quadratic functions, which allows us to find a decision boundary that clearly divides the data points.
The functions that help us do this are called kernels, and which kernel to use is purely determined by hyperparameter tuning.
So we basically need to compute X1², X2² and X1·X2, and now we can see that the 2 original dimensions have been converted into 5 dimensions (X1, X2, X1², X2², X1·X2).
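A tiny sketch of this explicit quadratic feature map (the kernel trick computes the corresponding inner products without ever building these features explicitly; NumPy assumed):

import numpy as np

# Map each 2-D point (x1, x2) to the 5-D point (x1, x2, x1^2, x2^2, x1*x2).
def quadratic_features(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

X = np.array([[1.0, 2.0], [3.0, 0.5]])
print(quadratic_features(X))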
Sigmoid Kernel
It takes the inputs and maps them to values between 0 and 1 so that they can be separated by a simple straight line.
RBF Kernel
It creates non-linear combinations of our features to lift the samples onto a higher-dimensional feature space, where we can use a linear decision boundary to separate the classes.
It is the most widely used kernel in SVM classification; the following formula describes it mathematically:
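The formula referred to is the Gaussian (RBF) kernel, which is commonly written as

K(x, x′) = exp(−γ · ‖x − x′‖²)

where γ > 0 controls how quickly the similarity falls off with distance (often taken as γ = 1/(2σ²)).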
– identify the support vectors as those that are within some specified distance of the
closest point and dispose of the rest of the training data
– compute b* using equation
Advantages of SVM:
SVM works well when the data is linearly separable.
It is more effective in high dimensions.
With the help of the kernel trick, we can also solve complex, non-linear problems.
SVM is not very sensitive to outliers.
It can help us with image classification.
Disadvantages of SVM:
Choosing a good kernel is not easy.
It doesn’t show good results on large datasets.
The SVM hyperparameters are the cost C and gamma. It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact.
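A brief sketch of tuning C and gamma with a grid search and cross-validation (scikit-learn assumed; the grid values and the iris dataset are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)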
******