DA 5230 – Statistical & Machine Learning
Lecture 11 – KNN and Clustering
Maninda Edirisooriya
manindaw@uom.lk
Minkowski Distance
• Distance between two data points can be measured by the Minkowski
Distance
• Given by: 𝑑(𝑖, 𝑗) = (|𝑥𝑖1 − 𝑥𝑗1|^𝑞 + |𝑥𝑖2 − 𝑥𝑗2|^𝑞 + . . . + |𝑥𝑖𝑝 − 𝑥𝑗𝑝|^𝑞)^(1/𝑞)
• When q=1 ⇒ 𝑑(𝑖, 𝑗) = |𝑥𝑖1 − 𝑥𝑗1| + |𝑥𝑖2 − 𝑥𝑗2| + . . . + |𝑥𝑖𝑝 − 𝑥𝑗𝑝|
• i.e. Manhattan Distance
• When q=2 ⇒ 𝑑(𝑖, 𝑗) = √(|𝑥𝑖1 − 𝑥𝑗1|² + |𝑥𝑖2 − 𝑥𝑗2|² + . . . + |𝑥𝑖𝑝 − 𝑥𝑗𝑝|²)
• i.e. Euclidean Distance, which is the straight-line distance between two points
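As a minimal illustration (not from the slides), the Minkowski Distance could be computed in Python as follows; the function name and the example points are illustrative assumptions:

```python
import numpy as np

def minkowski_distance(x_i, x_j, q=2):
    """Minkowski Distance; q=1 gives Manhattan Distance, q=2 gives Euclidean Distance."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski_distance(a, b, q=1))  # 7.0  (Manhattan)
print(minkowski_distance(a, b, q=2))  # 5.0  (Euclidean)
```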
K-Nearest Neighbors (KNN) Algorithm
• KNN is a very simple Instance-based Learning algorithm used for
both Regression and Classification
• This is a lazy algorithm (most calculations are done at prediction time)
compared to the model-based algorithms discussed before
• This algorithm assumes nearby (with less distance) data points belong
to the same class and far away data points belong to different classes
• In KNN all the data points are kept in memory
• Hyperparameter K has to be defined at the beginning
• When a prediction is to be done for a given data point, distances from
all the data points from the given data point are calculated
K-Nearest Neighbors (KNN) Algorithm
• Then the closest (with least distance) K data points are selected
• E.g. Euclidean Distance can be used
• For Classification problems, the Y class is found by majority voting
among the Y classes of the selected K data points
• For Regression problems, the Y value is found by averaging (or distance-
weighted averaging in Weighted KNN) the Y values of the selected K data
points
• For lower K values model will have higher variance due to overfitting
• For higher K values model will have higher bias due to underfitting
• Optimum value for K can be found with Cross-validation
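Below is a minimal from-scratch sketch of KNN classification with Euclidean distance and majority voting, assuming a small NumPy dataset; names such as knn_predict are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # distance to every stored data point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])                      # count the class labels of those neighbours
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```

For regression, the last line of knn_predict would instead return the (possibly distance-weighted) mean of y_train[nearest].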
Characteristics of KNN
• Each feature is given equal weight. Therefore, scale the features
• An increased number of features creates the Curse of Dimensionality
• The higher the K, the less sensitive the model becomes to noisy data points
• When KNN voting gets tied (equal votes for the majority
class), random selection or distance weighting can be used for
selecting the class
• As there is no model to be trained, KNN can be used for Online
Learning where the predictions have to be updated with continuously
added new data points
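Assuming scikit-learn is available, a sketch of scaling the features before KNN (since every feature gets equal weight in the distance) could look like this; the dataset and hyperparameters are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first so that no single feature dominates the distance calculation
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```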
KNN Example
Source: https://medium.com/analytics-vidhya/k-nearest-neighbor-the-maths-behind-it-how-it-works-and-an-example-f1de1208546c
Unsupervised Learning
• Labeled data is expensive – i.e. previous data needs to be collected
accurately together with Y values
• The accuracy of Supervised Machine Learning models is dependent on
the accuracy of the dataset given
• However, there is plenty of unlabeled data available in most cases
• So, extracting information/patterns out of unlabeled data is valuable
whenever possible
• Extracting insights from unlabeled data is known as Unsupervised
Learning
Clustering
• Clustering is one of the most widely used Unsupervised Learning techniques
• In Clustering we assume that the data points are naturally organized
into categories/classes known as Clusters
• In Clustering the main assumption is that similar datapoints belong to
the same cluster and different data points belong to different clusters
Why Clustering?
• Clustering is used to identify the high level concepts associated with
the data
• For example, when you need to identify the customer segments who
are visiting your online shopping website, clustering will be helpful
• As the human genome is large, it is impossible for a human to visually
analyze common gene patterns, but clustering algorithms can
• When you want to identify urbanized areas in a country using
satellite images, clustering can help to identify these areas using the
light density in night-time images
Clustering - Example
Source: From text book PPTs by Prof. Jiawei Han
Clustering Algorithm Types
There are several approaches to extracting clusters out of data
1. Distance-based Methods
2. Density-based Methods
3. Model-based Methods
Let’s understand each of these approaches
Distance-based Methods
• In the multidimensional feature space (e.g. the area in a 2D graph
between X1 and X2), nearby data points are grouped into one cluster
and distant points are grouped into other clusters
• Here we assume that the similar data points in a cluster are near to
each other and different clusters are distant from other clusters
• Measuring distance (difference) or closeness (similarity) is one of
the important decisions in Distance-based Methods
• K-Means Clustering is an example of a Single-Level Distance-based
Clustering Method (Multi-Level Methods are discussed later)
Measure of Distance
• Measuring the distance (or closeness) is a key factor in designing
Distance-based (Partitioning) clustering
• In general, the distance along each feature is assumed to be equally
weighted while clustering
• Therefore, all the features used for clustering should be scaled
• Standardization is usually used for scaling
• One of the popular distance measures is the Minkowski
Distance
• The Minkowski Distance formula can be used to derive popular distance
measures like Euclidean Distance and Manhattan (City-Block)
Distance, as explained before
K-Means Clustering
• The Feature Space gets partitioned into K distinct partitions, one for
each cluster
• Each cluster has its own Centroid, a point representing the cluster
which has the least total distance from each data point in the cluster
• We have to provide the hyperparameter K, the number of clusters
• The iterative K-Means Clustering Algorithm described next is generally
used to find these clusters
K-Means Clustering Algorithm
• Initialize with random K centroids in the space (K random data points
from the dataset are taken in general)
• Until the centroids (and hence the cluster assignments) stop changing:
• Assign each data point in the dataset to the nearest (e.g. with the least Euclidean
Distance) centroid
• Recalculate the centroid of each cluster (i.e. the mean of the cluster’s data
points, which minimizes the sum of squared Euclidean Distances)
• This algorithm always converges to an optimum point
• However, this may not be the Global Optimum point and can be a
Local Optimum. Therefore, we have to run the algorithm several
times with different initialization points and select the best model
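A minimal NumPy sketch of the algorithm above (a single run; in practice it would be repeated with several random initializations, as noted); the function name and the simplified stopping logic are assumptions:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # K random data points as initial centroids
    for _ in range(n_iter):
        # Assignment step: each data point goes to its nearest centroid (Euclidean Distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: the new centroid is the mean of the points assigned to the cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # stop when centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch does not handle the rare case of a cluster losing all of its points.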
Find best K for K-Means Clustering
• Finding the optimum cluster count, K, is important
• One rule of thumb is K ≈ √(n/2), where n is the number of data points in the
dataset
• Another well-known technique is known as Elbow Method which is
not practical in many cases (hence, not explained here)
• The best way to find K is K-fold Cross-Validation, using the total
squared distance from data points to their cluster centroids
as the cost
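Assuming scikit-learn, this total squared distance (exposed as inertia_ by KMeans) could be compared across candidate K values as below; the synthetic dataset and the range of K are illustrative, and in a Cross-Validation setting the same cost would be evaluated on held-out folds:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total squared distance from each point to its cluster centroid
    print(k, round(km.inertia_, 1))
```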
K-Means Clustering Algorithm
Source: https://www.reddit.com/r/learnmachinelearning/comments/qiid2e/kmeans_clustering_algorithm/?onetap_auto=true
Problems with K-Means Clustering
• Although the K-Means Clustering Algorithm is fast (efficient at
computation), it has some limitations
• For example, K-Means Clustering works well when the clusters are,
• Well-separated
• Circular and
• Having the same size
• When any of these assumptions is violated, K-Means may not
cluster the data as we expect
When Classes are Not Well Separated
True class separation (left) vs. K-Means Clustering class separation (right)
Source: https://www.youtube.com/watch?v=BaZWcSq3IuI
When Classes are Not Circular
K-Means Clustering class separation
Source: https://www.youtube.com/watch?v=BaZWcSq3IuI
When Classes are Not Same Sized
K-Means Clustering when class radii are different (left) and when class data
counts are different (right)
Source: https://www.youtube.com/watch?v=BaZWcSq3IuI
Distance-based Hierarchical Methods
• Methods like K-Means and K-Medoid are Single Level clustering
methods. i.e. There are no clusters inside the clusters
• There is another type of Clustering Method, known as Hierarchical
Clustering, where clusters are defined inside other clusters as a
hierarchy of clusters
• There are two approaches to creating hierarchical clusters (see the sketch
after the diagram below)
1. Agglomerative Clustering, where the algorithm starts by treating each
data point as a cluster and keeps merging them until the whole dataset is
considered a single cluster. E.g.: AGNES algorithm
2. Divisive Clustering, where the algorithm starts by treating the whole dataset
as a single cluster and keeps dividing clusters until each data point is
considered a cluster. E.g.: DIANA algorithm
Distance-based Hierarchical Methods
Diagram: points a, b, c, d and e are merged step by step (Step 0 → Step 4) by
agglomerative clustering (AGNES), while divisive clustering (DIANA) splits the
same hierarchy in the reverse direction (Step 4 → Step 0)
Source: From text book PPTs by Prof. Jiawei Han
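As a sketch of the agglomerative (AGNES-style) approach, scikit-learn's AgglomerativeClustering could be used as follows; the synthetic dataset and the linkage choice are illustrative assumptions:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up clustering: repeatedly merge the two closest clusters until 3 remain
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```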
Density-based Methods
• Distance-based Methods have drawbacks
• Noise/outlier data points affect the clustering, as all data points are
considered when clustering
• It becomes impossible to cluster highly non-circular shapes, such as
population densities across a country
• Number of clusters, K, has to be provided as a hyperparameter
• Instead of using distances to centroids to represent the clusters,
Density-based Methods use the local density of data points to decide the clusters
• Depending on the density distribution throughout the space, clusters
can have highly complex boundaries and an arbitrary number of clusters
• DBSCAN and OPTICS are examples of Density-based Methods
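A minimal illustration with scikit-learn's DBSCAN on a non-circular (two-moons) dataset; eps and min_samples are illustrative hyperparameters:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius and min_samples the density threshold for a core point
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))  # label -1 marks noise/outlier points; no K had to be provided
```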
Model-based Methods
• In Model-based methods, each data point is assumed to be generated
by a mixture of probability distributions
• A Model-based method tries to estimate the parameters of
these probability distributions using the generated data points
• Expectation-Maximization (EM) Algorithm is a popular approach
• In Expectation (E) step the algorithm estimates the probability that each data
point belongs to each cluster based on the current model parameters
• In Maximization (M) step the algorithm updates the model parameters to
maximize the likelihood of the data given these probabilities
• Iterate the above 2 steps until convergence (this is analogous to the K-
Means Clustering algorithm)
Gaussian Mixture Model (GMM)
• GMM is a popular Model-based Method assuming the distributions
are Gaussian. Parameters to be predicted for each distribution:
• Mean Vector
• Covariance Matrix
• Proportion for the distribution (or weight)
• Often the parameters are randomly initialized
• E step: finds the proportion of each data point that should be assigned to
each distribution
• M step: re-estimates the parameters and updates the Gaussian
distributions with them
Gaussian Mixture Model (GMM)
• GMM converges, but may converge to a Local Optimum
• Once converged, data points are assigned to the clusters
represented by the Gaussian distributions
• In some cases each datapoint is assigned to the cluster with the
highest probability (known as Maximum Probability Rule)
• In some other cases each datapoint may be assigned to multiple
clusters based on the probabilities related to each of the distribution
(known as Soft Assignment)
• A heatmap can be used to visualize the datapoints with the soft assignments
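Assuming scikit-learn, a GMM with both the Maximum Probability Rule (predict) and Soft Assignment (predict_proba) could be sketched as follows; the synthetic dataset and the number of components are illustrative:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)        # Maximum Probability Rule: the most probable cluster
soft_labels = gmm.predict_proba(X)  # Soft Assignment: a probability for every cluster
print(hard_labels[:5])
print(soft_labels[:5].round(3))
```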
Gaussian Mixture Model (GMM)
Source: https://prateekvjoshi.com/2013/06/29/gaussian-mixture-models
Gaussian models
Gaussian mixture model
Evaluating Clustering
• If there is no labeled data to test the performance measures of the
model, it is not possible to accurately evaluate the clustering model
• However, there are other evaluations we can do related to clustering
• First, the dataset should have a non-random distribution (a non-
uniform distribution in the hyperspace)
• In other words, the data should be non-uniformly distributed in the
space, forming clusters
• This property, the departure from Spatial Randomness, can be
measured by the Hopkins Statistic
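A rough sketch of one common formulation of the Hopkins Statistic (values near 0.5 suggest spatial randomness, values closer to 1 suggest a clustering tendency); this illustrative implementation assumes NumPy and scikit-learn and is not a library call:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    m = m or max(1, n // 10)                      # size of the sample used for the test
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: nearest-data distances of m uniform random points in the bounding box
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: nearest-neighbour distances of m sampled real points (2nd neighbour skips the point itself)
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())
```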
Measure Clustering Quality
• When a clustering is done its quality has to be measured
• Extrinsic Methods: possible when data with real cluster labels
(Ground Truth) are available
• E.g.: BCubed Precision and Recall
• Intrinsic Methods: possible when data with real cluster labels
(Ground Truth) are not available
• Good clusters should have lower intra-cluster distances (distance inside the
cluster) and higher inter-cluster distances (distances between the clusters)
• These measures are considered in Intrinsic Methods
• E.g.: Silhouette coefficient
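Assuming scikit-learn, the Silhouette coefficient could be used to compare clusterings with different K values; the dataset and the K range are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette compares intra-cluster distance with distance to the nearest other cluster (higher is better)
    print(k, round(silhouette_score(X, labels), 3))
```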
One Hour Homework
• Officially we have one more hour to do after the end of the lecture
• Therefore, for this week’s extra hour you have a homework
• Learn about the applications of clustering
• Research what type of clustering has to be used in each of the clustering
application
• Find the modified versions of the given clustering algorithms and their usages
• Good Luck!
Questions?