Unsupervised Machine Learning
Dr K Nagi Reddy
Professor of ECE
Introduction
Unsupervised learning is a type of machine learning
in which models are trained on an unlabeled
dataset and are allowed to act on that data without
any supervision.
Unsupervised learning cannot be directly applied to
a regression or classification problem because,
unlike supervised learning, we only have the input
data and no corresponding output data.
The goal of unsupervised learning is to find the
underlying structure of the dataset, group the
data according to similarities, and represent
the dataset in a compressed format.
Example: Suppose the unsupervised learning
algorithm is given an input dataset containing
images of different types of cats and dogs. The
algorithm has never been trained on the given dataset,
which means it has no prior knowledge of the
features of the dataset.
The task of the unsupervised learning algorithm is
to identify the image features on its own.
The algorithm will perform this task by clustering
the image dataset into groups according to the
similarities between the images.
Why use Unsupervised Learning?
Below are the main reasons why unsupervised
learning is important:
Unsupervised learning is helpful for finding useful
insights from the data.
Unsupervised learning is similar to how a human
learns to think from their own experiences, which makes
it closer to true AI.
Unsupervised learning works on unlabeled and
uncategorized data, which makes it all the more
important.
In the real world, we do not always have input data with
corresponding outputs; unsupervised learning is needed
to solve such cases.
Working of Unsupervised Learning
Here, we have taken unlabeled input data,
which means it is not categorized and no
corresponding outputs are given. Now,
this unlabeled input data is fed to the machine
learning model in order to train it. First, the model
interprets the raw data to find the hidden patterns
in the data, and then a suitable
algorithm such as k-means clustering, hierarchical
clustering, etc. is applied.
Once a suitable algorithm is applied, it divides
the data objects into groups according to the
similarities and differences between the objects.
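As a rough illustration (assuming scikit-learn and synthetic blob data, neither of which appears in these slides), k-means can group unlabeled points like this:

# Minimal k-means sketch: group unlabeled points purely by similarity.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled input: 300 points drawn from 3 hidden groups (the labels are discarded).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)   # cluster assignment for each point

print(cluster_ids[:10])               # first 10 assignments
print(kmeans.cluster_centers_)        # learned group centers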
Types of Unsupervised Learning Algorithm
Cosine Similarity
The cosine similarity between two vectors A and B is cosΘ = (A·B)/(||A|| ||B||).
Using this formula we get a value that tells us about the
similarity between the two vectors, and 1-cosΘ gives us their
cosine distance.
Using this distance we get values between 0 and 1, where 0
means the vectors are 100% similar to each other and 1 means
they are completely dissimilar.
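A small sketch of this calculation in Python, assuming NumPy and using example vectors chosen only for illustration:

import numpy as np

def cosine_distance(a, b):
    # Cosine distance = 1 - cos(theta), where cos(theta) is the cosine similarity.
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos_theta

a = np.array([1.0, 2.0, 3.0])
print(cosine_distance(a, 2 * a))                                     # 0.0 -> same direction, 100% similar
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 1.0 -> orthogonal, dissimilar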
Jaccard Distance
The Jaccard coefficient is a method of comparison
similar to Cosine Similarity, since both methods
compare how one type of attribute is distributed
among all the data.
The Jaccard approach looks at the two data sets
and finds the incidents where both values are equal
to 1.
The resulting value reflects how many 1-to-1
matches occur in comparison to the total number of
data points. In other words, it is the frequency of
1-to-1 matches, which is similar to what Cosine
Similarity looks for: how frequently a certain attribute
occurs.
It is extremely sensitive to small sample sizes and
may give erroneous results, especially with very small samples.
Jaccard distance is the complement of the Jaccard
index and can be found by subtracting the Jaccard
Index from 100%, thus the formula for Jaccard
distance is:
D(A,B) = 1 – J(A,B)
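A small sketch of this formula in plain Python, using two illustrative binary vectors (not taken from the slides):

def jaccard_distance(a, b):
    # D(A,B) = 1 - J(A,B) for two binary vectors of equal length.
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)   # 1-to-1 matches
    either_one = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)  # attributes present in either
    return 1.0 - both_one / either_one

a = [1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0]
print(jaccard_distance(a, b))   # intersection = 2, union = 4 -> J = 0.5, distance = 0.5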
Hamming Distance
Hamming distance is a metric for comparing two binary data strings.
While comparing two binary strings of equal length, Hamming
distance is the number of bit positions in which the two bits are
different.
The Hamming distance method looks at the whole data set and finds,
position by position, where the data points are similar and where they
are dissimilar. The Hamming distance gives the number of attributes that differ.
It is used mostly when you one-hot encode your data and need to
find the distance between two binary vectors.
Suppose we have two strings, “ABCDE” and “AGDDF”, of the same length,
and we want to find the Hamming distance between them. We go
letter by letter through each string and check whether the letters match:
the first letters of both strings are the same, the second letters are not, and
so on.
ABCDE and AGDDF
Doing this, we see that only two letters (the first and the fourth)
are the same, while three are different. Hence, the
Hamming distance here is 3. Note that the larger the Hamming
distance between two strings, the more dissimilar those strings are.
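The same comparison can be written as a short Python function; the first call uses the strings from the example above, the second an illustrative binary pair:

def hamming_distance(s1, s2):
    # Number of positions at which two equal-length strings differ.
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("ABCDE", "AGDDF"))   # 3 -> positions 2, 3 and 5 differ
print(hamming_distance("10110", "10011"))   # 2 -> useful for one-hot encoded vectors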
Feature Reduction
Feature reduction, also known as dimensionality
reduction, is the process of reducing the number of features in
a resource-heavy computation without losing
important information. Reducing the number of
features means the number of variables is reduced,
making the computer’s work easier and faster.
Feature reduction can be divided into two processes:
feature selection and feature extraction. There are
many techniques by which feature reduction is
accomplished.
Some of the most popular are generalized
discriminant analysis, autoencoders, non-negative
matrix factorization, and principal component analysis (PCA).
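A minimal sketch of feature reduction with principal component analysis, assuming scikit-learn and using the Iris dataset purely as an example:

# Reduce the 4-feature Iris data to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                   # 150 samples x 4 features (labels are not used)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # 150 samples x 2 features

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component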
Why is this Useful?
The purpose of feature reduction is to reduce the number of
features (or variables) that the computer must process to perform
its function. Feature reduction leads to the need for fewer
resources to complete computations or tasks. In
machine learning, feature reduction also removes multicollinearity,
which improves the machine learning model
in use.
Another benefit of feature reduction is that it makes data easier to
visualize for humans, particularly when the data is reduced to two
or three dimensions which can be easily displayed graphically.
An interesting problem that feature reduction can address is called the
curse of dimensionality. This refers to a group of phenomena in
which a problem has so many dimensions that the data
becomes sparse. Feature reduction is used to decrease the
number of dimensions, making the data less sparse and more
statistically significant for machine learning applications.
Feature reduction is commonly used in fields that deal with high-dimensional
data, such as speech recognition, signal processing,
bioinformatics, etc. It can also be used for data visualization.
The Curse of Dimensionality
Disadvantages of Naïve Bayes:
If a categorical variable was not observed in the training
data set, then the model will assign it a ‘0’ probability
and will be unable to make a prediction.
If the features are dependent, then the Naïve Bayes
classifier does not produce optimal results.
Bayesian Network
P(R) = P(R|W,C)*P(W∧C) + P(R|W',C)*P(W'∧C) + P(R|W,C')*P(W∧C') + P(R|W',C')*P(W'∧C')
= 0.002526
P(R') = 1 − P(R) = 0.9975
P(OFF) = P(OFF|R)*P(R) + P(OFF|R')*P(R') = 0.7*0.002526 + 0.01*0.9975 = 0.01174
From this, we can estimate that the probability of taking off is
0.01174.
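The arithmetic above can be checked with a few lines of Python, using only the probabilities quoted on this slide:

# Check of the take-off probability using the values quoted above.
p_rain = 0.002526                           # P(R) as computed above
p_no_rain = 1 - p_rain                      # P(R') ~ 0.9975

p_off = 0.7 * p_rain + 0.01 * p_no_rain     # P(OFF|R)P(R) + P(OFF|R')P(R')
print(round(p_no_rain, 4))                  # 0.9975
print(round(p_off, 5))                      # 0.01174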
[Figure: Bayesian network with nodes Windy (W), Cloudy (C), Rain (R), Wet grass (WG) and Take off (OFF). Priors: P(W) = 0.001, P(C) = 0.002. Conditional probability tables give P(R|W,C) for each combination of Windy and Cloudy (e.g. P(R|W,C) = 0.9, P(R|W',C) = 0.2), along with P(OFF|R) = 0.7, P(OFF|R') = 0.01 and P(WG|R) = 0.9 for the child nodes.]
Infer the value of P(John Calls | Burglary) with an accuracy
up to two decimal places
Example
Disadvantages
Computationally expensive.
Forces random variables to be in a cause-effect relationship. As a
result, it does not depict variables that are merely correlated.
A Bayesian network only encodes directional relationships, not
bi-directional ones. A BN does not provide any guarantee of
depicting the true cause-and-effect relationship.
Adding to point 2, a BN is a DAG. That said, if the data were
generated from a model in which at least three variables are
correlated with each other (a cyclic relationship), then Bayesian
networks (BNs) will not be able to model this relationship.
One of the most important issues with BNs is that some of the
sophisticated scoring functions require reliable priors in order to
find a structure closer to the original model.
Kernel Function
In machine learning, a “kernel”
is usually used to refer to
the kernel trick, a method of using
a linear classifier to solve a non-
linear problem.
The kernel function is what is
applied on each data instance to
map the original non-linear
observations into a higher-
dimensional space in which they
become separable.
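A rough sketch of the kernel trick in practice, assuming scikit-learn and a synthetic circular dataset (both chosen here only for illustration): a linear SVM struggles on data that is not linearly separable, while an RBF-kernel SVM separates it by implicitly mapping the points to a higher-dimensional space.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line in the original space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # typically near chance level
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # typically close to 1.0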
SVM Kernel Functions
For a fixed xi, the function K((z − xi)/h) = 1 inside the closed ball
of radius h centered at xi, and 0 otherwise, as shown in
Figure 2.
So, by choosing the argument of K(·), we have moved the
window to be centered at the point xi and to be of radius h.
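A small sketch of this window function in Python, with an illustrative center and radius:

import numpy as np

def window_kernel(z, x_i, h):
    # K((z - x_i)/h): 1 inside the closed ball of radius h centered at x_i, 0 otherwise.
    return 1.0 if np.linalg.norm(z - x_i) <= h else 0.0

x_i = np.array([0.0, 0.0])
print(window_kernel(np.array([0.5, 0.5]), x_i, h=1.0))   # 1.0 -> inside the ball
print(window_kernel(np.array([2.0, 0.0]), x_i, h=1.0))   # 0.0 -> outside the ball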
Examples of SVM Kernels
The sigmoid kernel has the form K(x, y) = tanh(α·xᵀy + C), where α is
the slope of the sigmoid kernel and C is a constant.
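Assuming the standard form K(x, y) = tanh(α·xᵀy + C), a small illustrative implementation:

import numpy as np

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    # K(x, y) = tanh(alpha * x.y + C); alpha is the slope, C is a constant.
    return np.tanh(alpha * np.dot(x, y) + c)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(sigmoid_kernel(x, y))   # tanh(0.01 * 32) = tanh(0.32)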
Bessel function of the first kind Kernel
The Bessel kernel is well known in the theory of
function spaces of fractional smoothness. It is given
by: