
Unsupervised Machine Learning

Dr K Nagi Reddy
Professor of ECE
Introduction
Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm has never been trained on the given dataset, which means it has no prior knowledge of the features of the dataset.
The task of the unsupervised learning algorithm is to identify the image features on its own. The algorithm performs this task by clustering the image dataset into groups according to the similarities between images.
Why use Unsupervised Learning?
Below are some of the main reasons that describe the importance of unsupervised learning:
Unsupervised learning is helpful for finding useful insights from the data.
Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to real AI.
Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
Here, we take unlabeled input data, which means it is not categorized and the corresponding outputs are also not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find hidden patterns, and then a suitable algorithm such as k-means clustering, hierarchical clustering, etc. is applied.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm

The unsupervised learning algorithm can be further categorized into two types of problems:
 Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in the same group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.
 Association: An association rule is an unsupervised learning method used to find relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
Unsupervised Learning algorithms
Below is the list of some popular
unsupervised learning algorithms:
K-means clustering
KNN (k-nearest neighbors)
Hierarchical clustering
Anomaly detection
Neural networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition
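As an illustration of the clustering family in the list above, the following is a minimal K-means sketch using scikit-learn; the 2-D points are invented purely for demonstration and are not part of the lecture material.

```python
# Minimal K-means clustering sketch (illustrative data; scikit-learn assumed installed)
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 8.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels :", kmeans.labels_)          # group index assigned to each point
print("Cluster centers:", kmeans.cluster_centers_)
```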
Advantages of Unsupervised Learning
•Unsupervised learning is used for more complex
tasks as compared to supervised learning
because, in unsupervised learning, we don't have
labeled input data.
•Unsupervised learning is preferable as it is easy
to get unlabeled data in comparison to labeled
data.
Disadvantages of Unsupervised Learning
•Unsupervised learning is intrinsically more
difficult than supervised learning as it does not
have corresponding output.
The result of the unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance.
Instance Based Learning
In machine learning, instance-based
learning (sometimes called memory-based learning) is a
family of learning algorithms that, instead of performing
explicit generalization, compare new
problem instances with instances seen in training, which
have been stored in memory.
The machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure.
It is called instance-based because it builds its hypotheses from the training instances. It is also known as memory-based learning or lazy learning.
The time complexity of this algorithm depends upon the
size of training data. The worst-case time complexity of
this algorithm is O(n), where n is the number of training
instances.
For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are already marked as spam, our spam filter would be programmed to also flag emails that are very similar to them.
This requires a measure of resemblance
between two emails. A similarity measure
between two emails could be the same
sender or the repetitive use of the same
keywords or something else.
Advantages:
 Instead of estimating the target function over the entire instance set, local approximations can be made to it.
 This algorithm can adapt easily to new data, including data collected as we go.
Disadvantages:
 Classification costs are high.
 A large amount of memory is required to store the data, and each query involves building a local model from scratch.
Some of the instance-based learning
algorithms are
 K Nearest Neighbor (KNN)
 Self-Organizing Map (SOM)
 Learning Vector Quantization (LVQ)
 Locally Weighted Learning (LWL)
K-Nearest Neighbor Learning (KNN)
The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has a major drawback of becoming significantly slower as the size of the data in use grows.
With the help of the KNN algorithm, we can classify a potential voter into classes such as “Will Vote”, “Will Not Vote”, “Will Vote for Party 'Congress'”, or “Will Vote for Party 'BJP'”. Other areas in which the KNN algorithm can be used are speech recognition, handwriting detection, image recognition and video recognition.
The KNN algorithm assumes that similar things
exist in close proximity. In other words, similar
things are near to each other. It is mainly used for
classification predictive problems in industry. The
following two properties would define KNN well −
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all the data for training while classifying.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn't assume anything about the underlying data.
Working of KNN Algorithm
 The k-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new data points, which means that a new data point will be assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of the following steps.
Step 1: For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
Step 2: Next, we need to choose the value of K, i.e. the number of nearest data points. K can be any integer.
Step 3: For each point in the test data do the following −
3.1 - Calculate the distance between the test data and each row of the training data with the help of any of the methods, namely Euclidean, Manhattan or Hamming distance. The most commonly used method to calculate distance is Euclidean.
3.2 - Now, based on the distance value, sort them in ascending order.
3.3 - Next, choose the top K rows from the sorted array.
3.4 - Finally, assign a class to the test point based on the most frequent class of these rows.
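The four steps above can be written almost line for line in code. The following is a minimal from-scratch sketch; the tiny training set and the query point are made up for illustration.

```python
# From-scratch KNN classification following steps 1-3.4 (illustrative data)
import numpy as np
from collections import Counter

# Step 1: load training data (features + labels) and a test point
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
y_train = np.array(["red", "red", "red", "brown", "brown", "brown"])
x_test = np.array([7.5, 8.5])

K = 3  # Step 2: choose K

# Step 3.1: Euclidean distance from the test point to every training row
distances = np.linalg.norm(X_train - x_test, axis=1)
# Steps 3.2-3.3: sort by distance and take the top K rows
nearest = np.argsort(distances)[:K]
# Step 3.4: majority vote among the K nearest neighbours
predicted = Counter(y_train[nearest]).most_common(1)[0][0]
print("Predicted class:", predicted)   # -> "brown"
```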
Example
The following is an example to understand the concept of K and the working of the KNN algorithm.
Suppose we have a dataset which can be plotted as follows.
Now, we need to classify a new data point, shown as a black dot (at point 60,60), into the brown or red class. Let us assume K = 3, i.e. the algorithm would find the three nearest data points, as shown.
From the figure, the three nearest neighbors of the black-dot data point can be seen. Among those three, two of them lie in the red class; hence the black dot will also be assigned to the red class.
Distance Metrics Used in KNN - Introduction
KNN is the most commonly used and one of the simplest algorithms for finding patterns in classification and regression problems.
 It is a supervised algorithm and is also known as a lazy learning algorithm.
It works by calculating the distance of one test observation from all the observations of the training dataset and then finding the K nearest neighbors of it.
This happens for each and every test observation, and that is how it finds similarities in the data.
 For calculating distances, KNN uses a distance metric from the list of available metrics.
Distance metrics
For the algorithm to work best on a particular
dataset we need to choose the most appropriate
distance metric accordingly. There are a lot of
different distance metrics available, but we are
only going to talk about a few widely used ones.
The Euclidean distance function is the most popular one among all of them, as it is the default in the scikit-learn KNN classifier in Python. So here are some of the distances used.
Figure: K-nearest neighbor classification example for k = 3 and k = 7.
Minkowski Distance
 It is a metric intended for real-valued vector spaces. We can
calculate Minkowski distance only in a normed vector space,
which means in a space where distances can be represented as
a vector that has a length and the lengths cannot be negative.
There are a few conditions that the distance metric must satisfy:
1. Non-negativity: d(x, y) >= 0
2. Identity: d(x, y) = 0 if and only if x == y
3. Symmetry: d(x, y) = d(y, x)
4. Triangle Inequality: d(x, y) + d(y, z) >= d(x, z)
The Minkowski distance between two points x and y is
D(x, y) = ( Σi |xi − yi|^p )^(1/p)
This formula for the Minkowski distance is in generalized form, and we can manipulate it to get different distance metrics. Based on the value of p, different distances can be calculated, i.e.
 p = 1, when p is set to 1 we get the Manhattan distance
 p = 2, when p is set to 2 we get the Euclidean distance
Manhattan Distance
 This distance is also known as taxicab distance or city block distance, because of the way it is calculated: the distance between two points is the sum of the absolute differences of their Cartesian coordinates. The formula for the Manhattan distance is obtained by substituting p = 1 in the Minkowski distance formula.
 Suppose we have two points, the red point (4,4) and the green point (1,1). The distance using the Manhattan metric is
 d = |4-1| + |4-1| = 6
 This distance is preferred over the Euclidean distance when we have a case of high dimensionality.
Euclidean Distance
This distance is the most widely used one, as it is the default metric that the scikit-learn library of Python uses for K-Nearest Neighbors. It is a measure of the true straight-line distance between two points in Euclidean space. It can be obtained by setting the value of p equal to 2 in the Minkowski distance metric.
Now suppose we have the same two points, the red (4,4) and the green (1,1). The distance using the Euclidean metric is sqrt((4-1)^2 + (4-1)^2) ≈ 4.24.
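Both worked examples above can be checked with a few lines of NumPy; this sketch just evaluates the Minkowski formula for p = 1 and p = 2 on the red and green points.

```python
# Minkowski distance for the red (4,4) and green (1,1) points
import numpy as np

def minkowski(x, y, p):
    """Generalized Minkowski distance: (sum |xi - yi|^p)^(1/p)."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

red, green = (4, 4), (1, 1)
print("Manhattan (p=1):", minkowski(red, green, 1))   # 6.0
print("Euclidean (p=2):", minkowski(red, green, 2))   # 4.2426...
```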
Cosine Distance
 This distance metric is used mainly to calculate the similarity between two vectors. It is measured by the cosine of the angle between the two vectors and determines whether they are pointing in the same direction.
 It is often used to measure document similarity in text analysis. When used with KNN, this distance gives us a new perspective on a business problem and lets us find hidden information in the data which we could not see using the above two distance metrics.
 The cosine similarity is cos Θ = (A · B) / (||A|| ||B||). Using this formula we get a value which tells us about the similarity between the two vectors, and 1 − cos Θ gives us their cosine distance.
 Using this distance we get values between 0 and 1, where 0 means the vectors are 100% similar to each other and 1 means they are completely dissimilar.
Jaccard Distance
The Jaccard coefficient is a method of comparison similar to cosine similarity, in that both methods compare one type of attribute distributed among all data.
The Jaccard approach looks at the two data sets and finds the incidents where both values are equal to 1.
So the resulting value reflects how many 1-to-1 matches occur in comparison to the total number of data points. This is similar to what cosine similarity looks for: how frequently a certain attribute occurs.
It is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.
Jaccard distance is the complement of the Jaccard index and can be found by subtracting the Jaccard index from 100%; thus the formula for Jaccard distance is:
D(A,B) = 1 – J(A,B)
Hamming Distance
 Hamming distance is a metric for comparing two binary data strings.
While comparing two binary strings of equal length, Hamming
distance is the number of bit positions in which the two bits are
different.
 The Hamming distance method looks at the whole data and finds
when data points are similar and dissimilar one to one. The Hamming
distance gives the result of how many attributes were different.
 This is used mostly when you one-hot encode your data and need to
find distances between the two binary vectors.
 Suppose we have two strings “ABCDE” and “AGDDF” of the same length, and we want to find the Hamming distance between them. We go letter by letter in each string and check whether the letters are the same: the first letters of both strings match, the second ones do not, and so on.
 ABCDE and AGDDF
 Doing this, we see that only two letter positions (the first and the fourth) match and three are different. Hence, the Hamming distance here is 3. Note that the larger the Hamming distance between two strings, the more dissimilar those strings are.
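The letter-by-letter walk above is easy to verify with a short sketch:

```python
# Hamming distance between two equal-length strings
def hamming(a: str, b: str) -> int:
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming("ABCDE", "AGDDF"))  # 3 positions differ
```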
Feature Reduction
Feature reduction is also known as dimensionality reduction.
 It is the process of reducing the number of features in a resource-heavy computation without losing important information. Reducing the number of features means the number of variables is reduced, making the computer's work easier and faster.
Feature reduction can be divided into two processes: feature selection and feature extraction. There are many techniques by which feature reduction is accomplished.
 Some of the most popular are generalized discriminant analysis, autoencoders, non-negative matrix factorization, and principal component analysis.
Why is this Useful?
 The purpose of using feature reduction is to reduce the number of features (or variables) that the computer must process to perform its function. Feature reduction leads to the need for fewer resources to complete computations or tasks. In machine learning, feature reduction removes multicollinearity, which results in an improvement of the machine learning model in use.
 Another benefit of feature reduction is that it makes data easier for humans to visualize, particularly when the data is reduced to two or three dimensions, which can be easily displayed graphically.
 An interesting problem that feature reduction addresses is the so-called curse of dimensionality. This refers to a group of phenomena in which a problem has so many dimensions that the data becomes sparse. Feature reduction is used to decrease the number of dimensions, making the data less sparse and more statistically significant for machine learning applications.
 It is commonly used in fields that deal with high-dimensional data, such as speech recognition, signal processing, bioinformatics, etc. It can also be used for data visualization and noise reduction.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples required also increases proportionally, and the chance of overfitting increases. If the machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality
reduction technique to the given dataset
are given below:
By reducing the dimensions of the features, the space required to store the dataset is also reduced.
Less computation/training time is required with reduced feature dimensions.
Reduced feature dimensions help in visualizing the data quickly.
It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction
There are also some disadvantages of applying dimensionality reduction, which are given below:
Some data may be lost due to dimensionality reduction.
In the PCA dimensionality reduction technique, the number of principal components to consider is sometimes unknown.
Approaches of Dimension Reduction
Feature Selection
Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset, in order to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset.
Three methods are used for feature selection:
Filter Methods
Wrapper Methods
Embedded Methods
Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are:
Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
Correlation
Correlation is used to summarize the strength of the linear relationship between two data variables. It can vary between -1 and 1, where 1 means a positive correlation: the values of one variable increase as the values of the other increase.
Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and the performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with. Some common techniques of wrapper methods are:
Forward Selection
Backward Selection
Bi-directional Elimination
Forward Selection
 Forward selection is an iterative method in which we start with no feature in the model. In each iteration, we keep adding the feature which best improves our model, until the addition of a new variable no longer improves the performance of the model.
Backward Selection
 Backward elimination is a feature selection technique used while building a machine learning model. It is used to remove those features that do not have a significant effect on the dependent variable or the prediction of the output.
 Below are the main steps used to apply the backward elimination process (see the sketch after these steps):
Step-1: First, select a significance level to stay in the model (e.g. SL = 0.05).
Step-2: Fit the complete model with all possible predictors/independent variables.
Step-3: Choose the predictor which has the highest P-value. If P-value > SL, go to Step 4; else finish, and our model is ready.
Step-4: Remove that predictor, refit the model, and return to Step 3.
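A minimal sketch of these steps using statsmodels OLS p-values; the random data, the 0.05 threshold, and the column names are all assumptions made for illustration, not part of the lecture material.

```python
# Backward elimination by p-value (illustrative; statsmodels assumed installed)
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(scale=0.5, size=100)  # x2, x4 are noise

SL = 0.05                                  # Step 1: significance level to stay
features = list(X.columns)
while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()   # Step 2: fit the model
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()                                   # Step 3: highest p-value
    if pvals[worst] > SL:
        features.remove(worst)                               # Step 4: remove and refit
    else:
        break
print("Selected features:", features)      # expected to keep roughly x1 and x3
```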
Embedded Methods
Embedded methods check the different
training iterations of the machine learning
model and evaluate the importance of each
feature. Some common techniques of
Embedded methods are:
LASSO
Elastic Net
Ridge Regression, etc.
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into a space with fewer dimensions.
This approach is useful when we want to keep the
whole information but use fewer resources while
processing the information.
Some common feature extraction techniques are:
1. Principal Component Analysis
2. Linear Discriminant Analysis
3. Kernel PCA
4. Quadratic Discriminant Analysis
Principal Component Analysis (PCA)
Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation.
These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modeling.
PCA works by considering the variance of each attribute, because a high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. Thus, PCA is a technique through which we can decompose the data along perpendicular directions where the spread of information in a feature is greatest.
Here, in the figure:
x = data points in the data set
f1 axis = feature 1 on the x-axis
f2 axis = feature 2 on the y-axis
In the figure, we can see that the maximum variance lies along the f1' axis (the new coordinates).
Geometric Intuition
1. We want to find a direction fi such that the variance of the xi's projected onto fi is maximum.
2. Rotate the previous axes to find the fi with maximum variance.
3. Drop f2.
Fig. Unit vector in the direction of maximum variance.
 The direction of fi for which the maximum variance occurs: from the figure, it is observed that the unit vector direction of maximum variance is Ui.
Mathematical Intuition
To find Ui which gives the maximum variance, the problem can also be understood as distance minimization.
Here, our aim is to find the vector which gives the minimum distances (d1, d2, …) when the xi's are projected onto Ui.
The equation of the above figure asks us to find the vector Ui for which the sum of the projection distances is minimum.
The solution to the above equations is the eigenvalues and eigenvectors, which can be evaluated as follows.
X = matrix of data points of shape (n x d), where n is the number of data points and d is the number of dimensions.
Steps to Find the Eigenvectors
1. Do the column standardization of X. (Important step for PCA.)
2. Covariance matrix: S = XᵀX
3. λ = eigenvalue and V = eigenvector
4. S V = λ V
Thus PCA explains:
 How one variable is associated with another variable (covariance matrix)
 The directions in which our data is spread (eigenvectors)
 The importance of these directions, or how much of the variance is explained in these directions (eigenvalues)
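A minimal NumPy sketch of exactly these steps (standardize, covariance, eigendecomposition, project onto the top direction); the random 2-feature data is invented for illustration.

```python
# PCA from scratch following the steps above (illustrative data)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated features

# Step 1: column standardization
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix S = XᵀX / n
S = Xs.T @ Xs / Xs.shape[0]

# Steps 3-4: eigenvalues (variance explained) and eigenvectors (directions)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]              # sort directions by importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("Explained variance ratio:", eigvals / eigvals.sum())
X_reduced = Xs @ eigvecs[:, :1]                # project onto the top principal component
print("Reduced shape:", X_reduced.shape)       # (200, 1)
```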
Importance of These Steps
1. The covariance matrix XᵀX contains estimates of how every variable in X relates to every other variable in X. Understanding how one variable is associated with another is quite powerful.
2. Eigenvalues and eigenvectors are important. Eigenvectors represent directions. Think of plotting your data on a multidimensional scatterplot; then one can think of an individual eigenvector as a particular “direction” in your scatterplot of data. Eigenvalues represent magnitude, or importance: bigger eigenvalues correspond to more important directions.
3. Finally, we make the assumption that more variability in a particular direction correlates with the behavior of the dependent variable. Lots of variability usually indicates signal, whereas little variability usually indicates noise. Thus, the more variability there is in a particular direction, the more important that direction is.
Advantages of PCA
1. Removes correlated features.
2. Improves algorithm performance by reducing the number of dimensions: the training time of the algorithms reduces significantly with fewer features.
3. Reduces overfitting of data: overfitting mainly occurs when there are too many variables in the dataset, so PCA helps in overcoming the overfitting issue by reducing the number of features.
4. Improves visualization: it is very hard to visualize and understand data in high dimensions. PCA transforms high-dimensional data to low-dimensional data (e.g. 2 dimensions) so that it can be visualized easily.
Disadvantages of PCA
 Independent variables become less interpretable.
 Data standardization is a must for PCA: principal components will be biased towards features with high variance, leading to false results. PCA is affected by scale, so you need to scale the features in your data before applying PCA.
 Categorical features require encoding, as PCA works only on numerical data.
 Information can be lost when data is spread in different structures/shapes: although principal components try to cover the maximum variance among the features in a dataset, if we don't select the number of principal components with care, they may miss some information compared to the original list of features.
Collaborative Filtering Based Recommendation
Introduction
A recommender system refers to a system that is capable of predicting the future preference of a set of items for a user, and recommending the top items.
One key reason why we need a recommender system in modern society is that people have too many options to choose from due to the prevalence of the Internet.
In the past, people used to shop in a physical store, in which the items available are limited. For instance, the number of movies that can be placed in a Blockbuster store depends on the size of that store.
Introduction
By contrast, nowadays the Internet allows people to access abundant resources online. Netflix, for example, has an enormous collection of movies.
Although the amount of available information has increased, a new problem arose, as people had a hard time selecting the items they actually want to see.
This is where the recommender system comes in.
This introduction covers two typical ways of building a recommender system.
Traditional Approach
Traditionally, there are two methods to construct a recommender system:
1. Content-based recommendation
2. Collaborative Filtering
 The first one analyzes the nature of each item. For instance, it can recommend poems to a user by performing Natural Language Processing on the content of each poem.
 Collaborative Filtering, on the other hand, does not require any information about the items or the users themselves. It recommends items based on users' past behavior.
Collaborative Filtering
Collaborative Filtering (CF) is a means of recommendation based on users' past behavior. There are two categories of CF:
User-based: measure the similarity between target users and other users.
Item-based: measure the similarity between the items that target users rate or interact with and other items.
The key idea behind CF is that similar users share the same interests and that similar items are liked by a user.
Assume there are m users and n items; we use a matrix of size m*n to denote the past behavior of users. Each cell in the matrix represents the associated opinion that a user holds. For instance, M_{i, j} denotes how user i likes item j. Such a matrix is called the utility matrix.
CF is like filling in the blanks (cells) in the utility matrix that a user has not seen/rated before, based on the similarity between users or items.
There are two types of opinions:
1. Explicit opinion
2. Implicit opinion
An explicit opinion directly shows how a user rates an item (think of it as rating an app or a movie), while an implicit opinion only serves as a proxy which provides us heuristics about how a user likes an item (e.g. number of likes, clicks, visits).
An explicit opinion is more straightforward than an implicit one, as we do not need to guess what that number implies.
For instance, there can be a song that a user likes very much, but he listens to it only once because he was busy while he was listening to it. Without an explicit opinion, we cannot be sure whether the user dislikes that item or not.
However, most of the feedback that we collect from users is implicit. Thus, handling implicit feedback properly is very important.
User-based Collaborative Filtering
We know that we need to compute the similarity between users in user-based CF. To measure the similarity there are two common options: Pearson correlation or cosine similarity.
Let u_{i, k} denote the similarity between user i and user k, and v_{i, j} denote the rating that user i gives to item j, with v_{i, j} = ? if the user has not rated that item. These two measures can be expressed accordingly.
Both measures are commonly used. The difference is that Pearson correlation is invariant to adding a constant to all elements.
Now, we can predict a user's opinion on the unrated items as a similarity-weighted average of the ratings given by the other users (sketched in code below).
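A minimal sketch of user-based CF with Pearson similarity and a similarity-weighted average; the small ratings matrix (with NaN for unrated items) is invented for illustration and is not the table from the example that follows.

```python
# User-based collaborative filtering sketch (illustrative ratings, NaN = unrated)
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {
        "Movie1": [5.0, 4.0, np.nan],
        "Movie2": [3.0, np.nan, 4.0],
        "Movie3": [4.0, 5.0, 5.0],
        "Movie4": [np.nan, 2.0, 3.0],
    },
    index=["UserA", "UserB", "UserC"],
)
target = "UserC"

# Pearson similarity between the target user and every other user (pairwise on co-rated items)
sims = ratings.T.corr(method="pearson")[target].drop(target)

# Predict the target's rating for an unrated item as a similarity-weighted average
item = "Movie1"
others = ratings[item].drop(target).dropna()
weights = sims[others.index]
prediction = np.dot(others, weights) / np.abs(weights).sum()
print(f"Predicted rating of {target} for {item}: {prediction:.2f}")
```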
Example: In the following matrices, each row represents a user, while the columns correspond to different movies, except the last column, which records the similarity between that user and the target user. Each cell represents the rating that the user gives to that movie. Assume user E is the target.
 Since users A and F do not share any movie ratings in common with user E, their similarities with user E are not defined by the Pearson correlation. Therefore, we only need to consider users B, C, and D.
 Now, we can start to fill in the blanks for the movies that user E has not rated, based on the other users.
 Although computing user-based CF is very simple, it suffers from several problems. One main issue is that users' preferences can change over time, which indicates that precomputing the matrix based on their neighboring users may lead to bad performance. To tackle this problem, we can apply item-based CF.
Item-based Collaborative Filtering
Instead of measuring the similarity between users, item-based CF recommends items based on their similarity with the items that the target user has rated. Likewise, the similarity can be computed with Pearson correlation or cosine similarity.
The major difference is that, with item-based collaborative filtering, we fill in the blanks vertically, as opposed to the horizontal manner of user-based CF.
The following table shows how to do so for the movie Me Before You.
It successfully avoids the problem posed by dynamic user preferences, as item-based CF is more static. However, several problems remain for this method.
First, the main issue is scalability. The computation grows with both the number of customers and the number of products. The worst-case complexity is O(mn) with m users and n items. In addition, sparsity is another concern.
 Take a look at the above table again: although there is only one user that rated both Matrix and Titanic, the similarity between them is 1.
 In extreme cases, we can have millions of users, and the similarity between two fairly different movies could be very high simply because they have a similar rank for the only user who ranked them both.
Singular Value Decomposition (SVD)
One way to handle the scalability and sparsity issues created by CF is to leverage a latent factor model to capture the similarity between users and items. Essentially, we want to turn the recommendation problem into an optimization problem.
We can view it as how good we are at predicting the ratings of items given a user. One common metric is Root Mean Square Error (RMSE): the lower the RMSE, the better the performance.
Since we do not know the ratings for the unseen items, we will temporarily ignore them. Namely, we are only minimizing RMSE on the known entries in the utility matrix.
To achieve minimal RMSE, Singular Value Decomposition (SVD) is adopted, as shown in the formula X = U S Vᵀ.

X denotes the utility matrix, and U is the left singular matrix, representing the relationship between users and latent factors. S is a diagonal matrix describing the strength of each latent factor, while Vᵀ is the right singular matrix, indicating the similarity between items and latent factors.
A latent factor is a broad idea which describes a property or concept that a user or an item has. For instance, for music, a latent factor can refer to the genre that the music belongs to. SVD decreases the dimension of the utility matrix by extracting its latent factors. Essentially, we map each user and each item into a latent space of dimension r.
Therefore, it helps us better understand the relationship between users and items, as they become directly comparable. The figure below illustrates this idea.
SVD has the great property that it achieves the minimal reconstruction Sum of Squared Errors (SSE); therefore, it is also commonly used in dimensionality reduction. The formula below is the same decomposition with X replaced by A, and S by Σ: A = U Σ Vᵀ.
It turns out that RMSE and SSE are monotonically related: the lower the SSE, the lower the RMSE. With the convenient property of SVD that it minimizes SSE, we know that it also minimizes RMSE. Thus, SVD is a great tool for this optimization problem. To predict the unseen items for a user, we simply multiply U, Σ, and Vᵀ (a short sketch follows).
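A minimal NumPy sketch of this idea: fill the missing entries of a small utility matrix, take a truncated SVD, and read predictions off the reconstruction. The tiny matrix and the choice r = 2 are assumptions made for illustration.

```python
# Truncated SVD on a small utility matrix (0 marks an unseen rating; illustrative only)
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Simple baseline fill for unknown entries (here: each user's mean rating)
filled = R.copy()
for i in range(R.shape[0]):
    filled[i, R[i] == 0] = R[i, R[i] > 0].mean()

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
r = 2                                   # keep the 2 strongest latent factors
approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print("Predicted rating of user 0 for item 2:", round(approx[0, 2], 2))
```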
SVD handles the problems of scalability and sparsity posed by CF successfully. However, SVD is not without flaws. The main drawback of SVD is that there is little to no explanation of the reason why we recommend an item to a user. This can be a huge problem if users are eager to know why a specific item is recommended to them.
Bayesian Learning
Bayesian learning and the frequentist method can be considered as two ways of looking at the task of estimating values of unknown parameters given some observations caused by those parameters. For certain tasks, either the concept of uncertainty is meaningless or interpreting prior beliefs is too complex.
Imagine a situation where your friend gives you a new coin and asks you about the fairness of the coin (or the probability of observing heads) without even flipping the coin once.
 In fact, you are also aware that your friend has not made the coin biased. In general, you have seen that coins are fair; thus you expect the probability of observing heads to be 0.5.
Suppose that you are allowed to flip the coin 10 times in order to determine its fairness. Your observations from the experiment will fall under one of the following cases:
Case 1: observing 5 heads and 5 tails.
Case 2: observing h heads and 10-h tails, where h ≠ 10-h.
If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the probability of observing heads is 0.5 with more confidence. If case 2 is observed you can either:
1. Neglect your prior beliefs, since now you have new data, and decide that the probability of observing heads is h/10, depending solely on recent observations.
2. Adjust your belief according to the value of h that you have just observed, and decide the probability of observing heads using your recent observations.
The first method suggests that we use the frequentist method, where we omit our beliefs when making decisions. However, the second method seems more reasonable, because 10 coin flips are insufficient to determine the fairness of a coin.
Therefore, we can make better decisions by combining our recent observations with the beliefs that we have gained through our past experiences. It is this thinking model, which uses our most recent observations together with our beliefs or inclinations for critical thinking, that is known as Bayesian thinking.
Bayesian learning comes into play on such occasions, where we are unable to use frequentist statistics due to its drawbacks. We can use Bayesian learning to address all these drawbacks, and it comes with additional capabilities.
Bayes' Theorem
Bayes' theorem describes how the conditional probability of an event or a hypothesis can be computed using evidence and prior knowledge.
It is similar to concluding that our code has no bugs given the evidence that it has passed all the test cases, combined with our prior belief that we have rarely observed any bugs in our code. However, this intuition goes beyond a simple hypothesis test where multiple events or hypotheses are involved.
Bayes' theorem is given by:
P(θ|X) = P(X|θ) P(θ) / P(X)
Consider the hypothesis that there are no bugs in our code. θ and X denote that our code is bug free and that it passes all the test cases, respectively.
P(θ) - Prior probability: the probability of the hypothesis.
P(X|θ) - Likelihood: the conditional probability of the evidence given a hypothesis.
P(X) - Evidence: the probability of the evidence or data. This can be expressed as a summation (or integral) of the probabilities of all possible hypotheses weighted by the likelihood of each.
Therefore, we can write P(X) as:
P(X) = Σθ P(X|θ) P(θ)
For a continuous θ we write P(X) as an integral:
P(X) = ∫ P(X|θ) P(θ) dθ
In the above example there are only two possible hypotheses: 1) observing no bugs in our code, or 2) observing a bug in our code. Therefore we can write the evidence as:
P(X) = P(X|θ) P(θ) + P(X|¬θ) P(¬θ)
¬θ denotes observing a bug in our code. Therefore, P(X|¬θ) is the conditional probability of passing all the tests even when there are bugs present in our code. This term depends on the test coverage of the test cases; we do not know its value without proper measurements, so let us assume that P(X|¬θ) = 0.5.
P(θ|X) - Posterior probability: the conditional probability of the hypothesis θ after observing the evidence X. This is the probability of observing no bugs in our code given that it passes all the test cases. The posterior probability is computed using the formula:
P(θ|X) = P(X|θ) P(θ) / ( P(X|θ) P(θ) + P(X|¬θ) P(¬θ) )
We can also calculate the probability of observing a bug given that our code passes all the test cases, P(¬θ|X).
We now know both conditional probabilities: of observing a bug in the code and of not observing a bug in the code. But we have to pick the valid hypothesis using these posterior probabilities, and for that Maximum a Posteriori (MAP) estimation is used.
Maximum a Posteriori (MAP)
 We can use MAP to determine the valid hypothesis from a set of hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is considered the valid hypothesis. Therefore, we can express the hypothesis θMAP that is concluded using MAP as follows:
θMAP = argmaxθi P(θi|X) = argmaxθi ( P(X|θi) P(θi) / P(X) )
 The argmax operator estimates the event or hypothesis θi that maximizes the posterior probability P(θi|X). Let us apply MAP to the above example in order to determine the true hypothesis.

Fig. P(θ|X) and P(¬θ|X) as the prior P(θ) = p changes.
The figure illustrates how the posterior probabilities of the possible hypotheses change with the value of the prior probability. Unlike frequentist statistics, where our belief or past experience has no influence on the concluded hypothesis, Bayesian learning is capable of incorporating our belief to improve the accuracy of predictions. Assuming that we have fairly good programmers, take the prior probability that the code is bug free to be P(θ) = 0.4; we can then find θMAP.
However, P(X) is independent of θ, and thus P(X) is the same for all events or hypotheses. Therefore, we can simplify the θMAP estimation by dropping the denominator of each posterior computation, as shown below:
θMAP = argmaxθi P(X|θi) P(θi)
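A quick numeric check of the example, assuming (as a hedge) that bug-free code passes every test, i.e. P(X|θ) = 1; this value is not stated in the slides, while P(θ) = 0.4 and P(X|¬θ) = 0.5 are the values used above.

```python
# MAP for the "is the code bug free?" example (P(X|theta)=1 is an assumption)
p_theta = 0.4          # prior: code is bug free
p_not_theta = 0.6
p_x_given_theta = 1.0  # assumed: bug-free code passes all tests
p_x_given_not = 0.5    # given: buggy code still passes all tests half the time

p_x = p_x_given_theta * p_theta + p_x_given_not * p_not_theta
posterior_theta = p_x_given_theta * p_theta / p_x
posterior_not = p_x_given_not * p_not_theta / p_x

print(f"P(theta|X)  = {posterior_theta:.3f}")   # ~0.571
print(f"P(~theta|X) = {posterior_not:.3f}")     # ~0.429
print("MAP hypothesis:", "bug free" if posterior_theta > posterior_not else "buggy")
```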
 MAP estimation algorithms do not compute the posterior probability of each hypothesis in order to decide which is the most probable hypothesis.
Assuming that our hypothesis space is continuous, where endless possible hypotheses are present even in the smallest range that the human mind can think of, or even for a discrete hypothesis space with a large number of possible outcomes for an event, we do not need to find the posterior of every hypothesis in order to decide which is the most probable one.
Therefore, practical implementations of MAP estimation algorithms use approximation techniques, which are capable of finding the most probable hypothesis without computing the posteriors, or by computing only some of them.
ML Hypothesis
Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed.
Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
 Bayes' theorem, together with the naive assumption, states the following relationship, given class variable y and dependent feature vector x1 through xn:
P(y | x1, …, xn) = P(y) Πi P(xi | y) / P(x1, …, xn)
Since P(x1, …, xn) is constant given the input, we can use the following classification rule:
ŷ = argmaxy P(y) Πi P(xi | y)
and we can use Maximum a Posteriori (MAP) estimation to estimate P(y) and P(xi|y); the former is then the relative frequency of class y in the training set.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi|y).
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering.
They require a small amount of training data to estimate the necessary parameters.
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.
The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
 There are various Naive Bayes methods:
1. Gaussian Naive Bayes
2. Multinomial Naive Bayes
3. Complement Naive Bayes
4. Bernoulli Naive Bayes
5. Categorical Naive Bayes
6. Out-of-core naive Bayes model fitting
Gaussian Naive Bayes
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
P(xi | y) = 1 / sqrt(2π σy²) · exp( −(xi − μy)² / (2 σy²) )
The parameters σy and μy are estimated using maximum likelihood.
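A minimal scikit-learn usage sketch for GaussianNB; the toy data is invented, and in practice you would fit on real labeled samples.

```python
# GaussianNB usage sketch (illustrative data)
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],   # class 0
              [3.8, 0.9], [4.1, 1.1], [3.9, 1.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.1, 2.0], [4.0, 1.0]]))   # -> [0 1]
print(clf.predict_proba([[1.1, 2.0]]))         # class probabilities
```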
Multinomial Naive Bayes
 MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification. The distribution is parametrized by vectors θy = (θy1, …, θyn) for each class y, where n is the number of features and θyi is the probability P(xi|y) of feature i appearing in a sample belonging to class y.
 The parameters θy are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
θ̂yi = (Nyi + α) / (Ny + α n)
 where Nyi is the number of times feature i appears in a sample of class y in the training set T, and Ny = Σi Nyi is the total count of all features for class y.
 The smoothing prior α > 0 accounts for features not present in the learning samples and prevents zero probabilities in further computations.
Setting α = 1 is called Laplace smoothing, while α < 1 is called Lidstone smoothing.
Complement Naive Bayes
 ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets.
 Specifically, CNB uses statistics from the complement of each class to compute the model's weights.
 The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB.
 Further, CNB regularly outperforms MNB on text classification tasks. The procedure for calculating the weights is as follows:
θ̂ci = (αi + Σj:yj≠c dij) / (α + Σj:yj≠c Σk dkj),   wci = log θ̂ci,   wci = wci / Σj |wcj|
where the summations are over all documents j not in class c, dij is either the count or the tf-idf value of term i in document j, αi is a smoothing hyperparameter like that found in MNB, and α = Σi αi. The second normalization addresses the tendency for longer documents to dominate parameter estimates in MNB.
 The classification rule is:
ĉ = argminc Σi ti wci
i.e., a document is assigned to the class that is the poorest complement match.
Bernoulli Naive Bayes
BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors. The decision rule for Bernoulli naive Bayes is based on
P(xi | y) = P(i | y) xi + (1 − P(i | y)) (1 − xi)
which differs from multinomial NB's rule in that it explicitly penalizes the non-occurrence of a feature i that is an indicator for class y, whereas the multinomial variant would simply ignore a non-occurring feature.
In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier.
 BernoulliNB might perform better on some datasets, especially those with shorter documents.
Categorical Naive Bayes
 CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index i, has its own categorical distribution.
 For each feature i in the training set X, CategoricalNB estimates a categorical distribution for each feature i of X conditioned on the class y. The index set of the samples is defined as J = {1, …, m}, with m as the number of samples.
 The probability of category t in feature i given class c is estimated as:
P(xi = t | y = c) = (Ntic + α) / (Nc + α ni)
 where Ntic is the number of times category t appears in the samples xi which belong to class c, Nc is the number of samples with class c, α is a smoothing parameter, and ni is the number of available categories of feature i.
 CategoricalNB assumes that the sample matrix X is encoded such that all categories for each feature i are represented with the numbers 0, ..., ni − 1, where ni is the number of available categories of feature i.
Out-of-core Naive Bayes Model Fitting
 Naive Bayes models can be used to tackle large-scale classification problems for which the full training set might not fit in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit method that can be used incrementally, as done with other classifiers, as demonstrated in Out-of-core classification of text documents. All naive Bayes classifiers support sample weighting.
Contrary to the fit method, the first call to partial_fit needs to be passed the list of all the expected class labels.
Example: There are 1000 fruits which could be either Banana, Orange or Other, so there are 3 possible classes. We have counts for the following X variables, all of which are binary:

Type   | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | Total
Banana | 400  | 100      | 350   | 150       | 450    | 50         | 500
Orange | 0    | 300      | 150   | 150       | 300    | 0          | 300
Other  | 100  | 100      | 150   | 50        | 50     | 150        | 200
Total  | 500  | 500      | 650   | 350       | 800    | 200        | 1000

The objective of the classifier is to predict whether a given fruit is 'Banana', 'Orange' or 'Other' when only these three features are known.
Let us say the given fruit is Long, Sweet and Yellow; then which fruit is it? (A worked calculation follows.)
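The question can be answered directly from the table with the naive Bayes rule; the short script below just carries out that arithmetic (prior times the product of per-feature likelihoods, ignoring the common denominator).

```python
# Naive Bayes by hand for the fruit table: which class given Long, Sweet, Yellow?
counts = {                       # (long, sweet, yellow, total) per class
    "Banana": (400, 350, 450, 500),
    "Orange": (0, 150, 300, 300),
    "Other":  (100, 150, 50, 200),
}
total = 1000

scores = {}
for fruit, (long_, sweet, yellow, n) in counts.items():
    prior = n / total
    likelihood = (long_ / n) * (sweet / n) * (yellow / n)
    scores[fruit] = prior * likelihood

print(scores)                          # Banana: 0.252, Orange: 0.0, Other: 0.01875
print("Prediction:", max(scores, key=scores.get))   # -> Banana
```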

Advantages:
 It is easy and fast to predict the class of a test data set.
 When the independence assumption holds, a Naive Bayes classifier performs very well.
 It performs well with categorical input variables.
Disadvantages:
 If a categorical variable was not observed in the training data set, the model will assign it a '0' probability and will be unable to make a prediction.
 If the features are dependent, the Naive Bayes classifier doesn't produce optimal results.
Bayesian Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also called Bayesian Networks or Probabilistic Networks, and are known as belief networks.
 A Bayesian Belief Network allows class conditional independencies to be defined between subsets of variables.
 A Bayesian Belief Network provides a graphical model of causal relationships on which learning can be performed.
 We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network:
 A directed acyclic graph
 A set of conditional probability tables
Directed Acyclic Graph
 Each node in the directed acyclic graph represents a random variable.
 These variables may be discrete or continuous valued.
 These variables may correspond to actual attributes given in the data.
Directed Acyclic Graph
The arcs in the diagram allow the representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker.
It is worth noting that the variable Positive X-ray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable Lung Cancer (LC) shows each possible combination of the values of its parent nodes, Family History (FH) and Smoker (S).
Example: Let us find the probability of taking off, given a network in which Windy (W) and Cloudy (C) are parents of Rain (R), and Rain is the parent of Wet grass (WG) and Take off (OFF), with the following tables:
P(W) = 0.001, P(C) = 0.002
P(R | W,C): (T,T) = 0.95, (T,F) = 0.95, (F,T) = 0.29, (F,F) = 0.001
P(OFF | R): T = 0.7, F = 0.01
P(WG | R): T = 0.9, F = 0.05
P(OFF) = P(OFF|R)*P(R) + P(OFF|R')*P(R') = 0.7*P(R) + 0.01*P(R')
P(R) = P(R|W,C)*P(W∧C) + P(R|W',C)*P(W'∧C) + P(R|W,C')*P(W∧C') + P(R|W',C')*P(W'∧C') = 0.002526
P(R') = 0.9975
P(OFF) = 0.7*0.002526 + 0.01*0.9975 = 0.01174
From this, we estimate that the probability of taking off is 0.01174.
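The same inference can be scripted directly from the tables above; this sketch enumerates the four (W, C) combinations to marginalize out Rain.

```python
# P(OFF) by enumeration over the conditional probability tables above
p_w, p_c = 0.001, 0.002
p_r_given_wc = {(True, True): 0.95, (True, False): 0.95,
                (False, True): 0.29, (False, False): 0.001}
p_off_given_r = {True: 0.7, False: 0.01}

# Marginalize Rain over all Windy/Cloudy combinations
p_r = sum(p_r_given_wc[(w, c)]
          * (p_w if w else 1 - p_w)
          * (p_c if c else 1 - p_c)
          for w in (True, False) for c in (True, False))

p_off = p_off_given_r[True] * p_r + p_off_given_r[False] * (1 - p_r)
print(f"P(Rain) = {p_r:.6f}")   # ~0.002526
print(f"P(OFF)  = {p_off:.5f}") # ~0.01174
```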
Infer the value of P(John Calls | Burglary) with an accuracy
up to two decimal places
Example
 In the above figure, we have an alarm 'A' – a node, say installed in the house of a person 'gfg', which rings upon two events, burglary 'B' and fire 'F', which are the parent nodes of the alarm node. The alarm is the parent node of two nodes, 'P1 calls' and 'P2 calls'.
 Upon the instance of burglary or fire, 'P1' and 'P2' call the person 'gfg', respectively. But there are a few complications in this case, as sometimes 'P1' may forget to call the person 'gfg' even after hearing the alarm, as he has a tendency to forget things quickly. Similarly, 'P2' sometimes fails to call the person 'gfg', as he is only able to hear the alarm from a certain distance.
Advantages
 Suitable for small and incomplete data sets
 Structural learning is possible
 Combining different sources of knowledge
 Explicit treatment of uncertainty and support for decision analysis
 Fast responses
Disadvantages
 Computationally expensive.
 Forces random variables to be in a cause-effect relationship. As a result, it does not depict variables which are merely correlated. A Bayesian network only encodes directional relationships, not bi-directional ones, and a BN does not provide any guarantee of depicting the cause-and-effect relationship.
 Adding to the previous point, a BN is a DAG; if the data was generated from a model in which at least 3 variables are correlated with each other in a cyclic relationship, then Bayesian networks will not be able to model this relationship.
 One of the most important issues with BNs is that some of the sophisticated scoring functions require reliable priors in order to find a structure close to the original model.
Kernel Function
In machine learning, a “kernel” usually refers to the kernel trick, a method of using a linear classifier to solve a non-linear problem.
 The kernel function is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
SVM Kernel Functions
 SVM algorithms use a set of mathematical functions that are defined as the kernel.
 The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions.
 These functions can be of different types, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
 Kernel functions can be introduced for sequence data, graphs, text, images, as well as vectors.
 The most used type of kernel function is RBF, because it has a localized and finite response along the entire x-axis.
The kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity with little computational cost, even in very high-dimensional spaces.
Kernel Rules
 Define a kernel or window function K(z) as follows: the value of this function is 1 inside the closed ball of radius 1 centered at the origin, and 0 otherwise, as shown in Figure 1.
 For a fixed xi, the function K((z - xi)/h) is 1 inside the closed ball of radius h centered at xi, and 0 otherwise, as shown in Figure 2.
 So, by choosing the argument of K(·), we have moved the window to be centered at the point xi and to be of radius h.
Examples of SVM Kernels
Polynomial kernel: It is popular in image processing. K(Xa, Xb) = (Xa · Xb + 1)^d, where d is the degree of the polynomial.
Let's say our original X space is 2-dimensional, such that
Xa = (a1, a2)
Xb = (b1, b2)
Now, if we want to map our data into a higher dimension, say a Z space which is six-dimensional, the mapping may look like
Z = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2).
In order to solve the dual SVM we would require the dot product Zaᵀ Zb.
Method 1: Traditionally, we would solve this by explicitly mapping each data point to Z space, which takes a lot of time, as we would have to perform the mapping for every point and then compute the dot product, which requires many multiplications.
Method 2 (kernel trick): In this method, we can simply calculate the dot product in the original space and raise it to the power d, since (Xa · Xb + 1)² equals Zaᵀ Zb for the mapping above (a short check of this identity follows).
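A short numeric check of the kernel trick for the degree-2 polynomial kernel: computing (xa · xb + 1)² in the original 2-D space gives the same value as an explicit dot product in the 6-D feature space. The example vectors are arbitrary.

```python
# Kernel trick check for the degree-2 polynomial kernel (illustrative vectors)
import numpy as np

def phi(x):
    """Explicit 6-D feature map for 2-D input under the (x.y + 1)^2 kernel."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

xa, xb = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = phi(xa) @ phi(xb)          # Method 1: map to Z space, then dot product
kernel   = (xa @ xb + 1.0) ** 2       # Method 2: kernel trick in the original space
print(explicit, kernel)               # both equal 4.0
```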
Gaussian kernel
It is a general-purpose kernel, used when there is no prior knowledge about the data. Its equation is
K(x, y) = exp(−||x − y||² / (2σ²))
Gaussian radial basis function (RBF):
The Gaussian RBF (Radial Basis Function) is another popular kernel method used in SVM models. The RBF kernel is a function whose value depends on the distance from the origin or from some point. The Gaussian kernel is often written in the parametric form
K(x, y) = exp(−γ ||x − y||²), for γ > 0.
Laplace RBF kernel: It is a general-purpose kernel, used when there is no prior knowledge about the data: K(x, y) = exp(−||x − y|| / σ).
Hyperbolic tangent kernel: It can be used in neural networks: K(x, y) = tanh(k x · y + c), for some (not every) k > 0 and c < 0.
Sigmoid Kernel
The sigmoid kernel comes from the neural networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons. It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network.
K(x, y) = tanh(α x · y + c), where α is the slope of the sigmoid kernel and c is a constant.
Bessel Function of the First Kind Kernel
The Bessel kernel is well known in the theory of function spaces of fractional smoothness.
ANOVA Radial Basis Kernel
The ANOVA kernel is also a radial basis function kernel, just like the Gaussian and Laplacian kernels. It is said to perform well in multidimensional regression problems.
Linear Splines Kernel in One Dimension
It is useful when dealing with large sparse data vectors. It is often used in text categorization. The splines kernel also performs well in regression problems.
Non-Linear SVM
The implementation of a non-linear SVM is very similar to that of a linear or simple SVM. The difference is selecting a kernel function such as RBF (Gaussian), polynomial, sigmoid, etc., instead of a linear, 1-degree model.
The kernel function takes as its inputs vectors in the original space and returns the dot product of the vectors in the feature space; this is what is called a kernel function.
A non-linear transformation maps a dataset to a higher-dimensional space, and this is also the foundation of a non-linear system.
The graph below shows a non-linear dataset and why a linear kernel cannot be used for it, whereas a Gaussian kernel can.
In geometry, a hyperplane is a subspace whose
dimension is one less than that of its ambient
space.
If space is 3-dimensional then its hyperplanes are
the 2-dimensional planes, while if space is 2-
dimensional, its hyperplanes are the 1-
dimensional lines.
This notion can be used in any general space in
which the concept of the dimension of a subspace
is defined.
Example
Given an arbitrary dataset, we don't know which kernel may work best. For that, we should examine whether the data is linearly separable or not.
So, the linear kernel works fine if the dataset is linearly separable; however, if the dataset isn't linearly separable, a linear kernel isn't going to cut it.
 For simplicity (and visualization purposes), let's assume our dataset consists of 2 dimensions only, as shown below.
This works perfectly fine. And here comes the RBF kernel SVM.
Now, it looks like both the linear and the RBF kernel SVM would work equally well on this dataset. So, why prefer the simpler, linear hypothesis?
A linear SVM is a parametric model; an RBF kernel SVM isn't, and the complexity of the latter grows with the size of the training set. Not only is it more expensive to train an RBF kernel SVM, but you also have to keep the kernel matrix around, and the projection into this "infinite" higher-dimensional space where the data becomes linearly separable is more expensive as well during prediction.
 Furthermore, there are more hyperparameters to tune, so model selection is more expensive as well! And finally, it's much easier to overfit a complex model!
Whether to use RBF really depends on the dataset. E.g., if your data is not linearly separable, it doesn't make sense to use a linear classifier. In this case, an RBF kernel would make much more sense.
 In practice, the polynomial kernel is less useful for efficiency (computational as well as predictive) performance reasons. So, the rule of thumb is: use linear SVMs (or logistic regression) for linear problems, and nonlinear kernels such as the Radial Basis Function kernel for non-linear problems.
 The RBF kernel SVM decision region is actually also a linear decision region. What the RBF kernel SVM actually does is create non-linear combinations of your features to lift the samples onto a higher-dimensional feature space where a linear decision boundary can separate your classes.
 We can visualize our data in 2 dimensions; for datasets with more than 2 dimensions we keep an eye on our objective function to minimize the hinge loss.
 We would set up a hyperparameter search and compare different kernels to each other, as in the sketch below.
 Based on the loss function (or a performance metric such as accuracy, F1, MCC, ROC AUC, etc.) we could determine which kernel is "appropriate" for the given task.
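A minimal sketch of such a kernel comparison with scikit-learn's GridSearchCV on a non-linearly-separable toy dataset; the dataset and parameter grid are chosen purely for illustration.

```python
# Comparing linear vs RBF kernels with a hyperparameter search (illustrative)
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy").fit(X, y)

print("Best parameters:", search.best_params_)   # the RBF kernel is expected to win on circles
print("Best CV accuracy:", round(search.best_score_, 3))
```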
Thank you
Any Queries
