DA 5230 – Statistical & Machine Learning
Lecture 11 – KNN and Clustering
Maninda Edirisooriya
manindaw@uom.lk
Minkowski Distance
• Distance between two data points can be measured by the Minkowski
Distance
• Given by: 𝑑(𝑖, 𝑗) = (|𝑥𝑖1 − 𝑥𝑗1|^𝑞 + |𝑥𝑖2 − 𝑥𝑗2|^𝑞 + . . . + |𝑥𝑖𝑝 − 𝑥𝑗𝑝|^𝑞)^(1/𝑞)
• When q=1 ⇒ 𝑑(𝑖, 𝑗) = |𝑥𝑖1 − 𝑥𝑗1| + |𝑥𝑖2 − 𝑥𝑗2| + . . . + |𝑥𝑖𝑝 − 𝑥𝑗𝑝|
• i.e. Manhattan Distance
• When q=2 ⇒ 𝑑(𝑖, 𝑗) = √(|𝑥𝑖1 − 𝑥𝑗1|² + |𝑥𝑖2 − 𝑥𝑗2|² + . . . + |𝑥𝑖𝑝 − 𝑥𝑗𝑝|²)
• i.e. Euclidean Distance, which is the straight-line distance between two points
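As a minimal illustration (not from the slides), the Minkowski Distance could be computed in Python as follows; the function name and the example points are illustrative assumptions:

```python
import numpy as np

def minkowski_distance(x_i, x_j, q=2):
    """Minkowski Distance; q=1 gives Manhattan Distance, q=2 gives Euclidean Distance."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski_distance(a, b, q=1))  # 7.0  (Manhattan)
print(minkowski_distance(a, b, q=2))  # 5.0  (Euclidean)
```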
K-Nearest Neighbors (KNN) Algorithm
• KNN is a very simple Instance-based Learning algorithm used for
both Regression and Classification
• This is a lazy algorithm (most calculations are done at prediction time)
compared to the model-based algorithms discussed before
• This algorithm assumes nearby (with less distance) data points belong
to the same class and far away data points belong to different classes
• In KNN all the data points are kept in memory
• Hyperparameter K has to be defined at the beginning
• When a prediction is to be done for a given data point, distances from
all the data points from the given data point are calculated
K-Nearest Neighbors (KNN) Algorithm
• Then the closest (with least distance) K data points are selected
• E.g. Euclidean Distance can be used
• For Classification problems, the Y class is found by majority voting
among the Y classes of the selected K data points
• For Regression problems, the Y value is found by averaging (or distance-
weighted averaging in Weighted KNN) the Y values of the selected K data
points
• For lower K values model will have higher variance due to overfitting
• For higher K values model will have higher bias due to underfitting
• Optimum value for K can be found with Cross-validation
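Below is a minimal from-scratch sketch of KNN classification with Euclidean distance and majority voting, assuming a small NumPy dataset; names such as knn_predict are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # distance to every stored data point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])                      # count the class labels of those neighbours
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```

For regression, the last line of knn_predict would instead return the (possibly distance-weighted) mean of y_train[nearest].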
Characteristics of KNN
• Each feature is given equal weight. Therefore, scale the features
• An increased number of features creates the Curse of Dimensionality
• The higher the K, the less sensitive the model becomes to noisy data points
• When KNN voting gets tied (equal votes for the majority
class), random selection or distance weighting can be used for
selecting the class
• As there is no model to be trained, KNN can be used for Online
Learning where the predictions have to be updated with continuously
added new data points
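Assuming scikit-learn is available, a sketch of scaling the features before KNN (since every feature gets equal weight in the distance) could look like this; the dataset and hyperparameters are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first so that no single feature dominates the distance calculation
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```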
KNN Example
Source: https://medium.com/analytics-vidhya/k-nearest-neighbor-the-maths-behind-it-how-it-works-and-an-example-f1de1208546c
Unsupervised Learning
• Labeled data is expensive – i.e. previous data needs to be collected
accurately together with Y values
• The accuracy of Supervised Machine Learning models is dependent on
the accuracy of the dataset given
• However, there is plenty of unlabeled data available in most cases
• So, extracting information/patterns out of unlabeled data is valuable
whenever possible
• Extracting insights from unlabeled data is known as Unsupervised
Learning
Clustering
• Clustering is one of the most widely used Unsupervised Learning techniques
• In Clustering we assume that the data points are naturally organized
into categories/classes known as Clusters
• In Clustering the main assumption is that similar datapoints belong to
the same cluster and different data points belong to different clusters
Why Clustering?
• Clustering is used to identify the high level concepts associated with
the data
• For example, when you need to identify the customer segments who
are visiting your online shopping website, clustering will be helpful
• As the human genome is large, it is impossible for a human to visually
analyze common gene patterns, but clustering algorithms can
• When you want to identify urbanized areas in a country using
satellite images, clustering can help to identify these areas using the
light density in night-time images
Clustering - Example
Source: From text book PPTs by Prof. Jiawei Han
Clustering Algorithm Types
There are several approaches to extracting clusters out of data
1. Distance-based Methods
2. Density-based Methods
3. Model-based Methods
Let’s understand each of these approaches
Distance-based Methods
• In the multidimensional feature space (e.g. the area in a 2D graph
between X1 and X2), nearby data points are grouped into one cluster
and distant points are grouped into other clusters
• Here we assume that the similar data points in a cluster are near to
each other and different clusters are distant from other clusters
• Measuring distance (difference) or closeness (similarity) is one of
the important decisions in Distance-based Methods
• K-Means Clustering is an example of a Single-Level Distance-based
Clustering Method (Multi-Level Methods are discussed later)
Measure of Distance
• Measuring the distance (or closeness) is a key factor in designing
Distance-based (Partitioning) clustering
• In general, the distance along each feature is assumed to be equally
weighted while clustering
• Therefore, all the features used for clustering should be scaled
• Standardization is usually used for scaling
• One of the popular distance measures is the Minkowski
Distance
• The Minkowski Distance formula can be used to derive popular distance
measures like Euclidean Distance and Manhattan (City-Block)
Distance, as explained before
K-Means Clustering
• The Feature Space gets partitioned into K distinct partitions, one for
each cluster
• Each cluster has its own Centroid, a point representing the cluster
which has the least total distance from each data point in the cluster
• We have to provide the hyperparameter K, the number of clusters
• The iterative K-Means Clustering Algorithm described next is generally
used to find these clusters
K-Means Clustering Algorithm
• Initialize with random K centroids in the space (K random data points
from the dataset are taken in general)
• Until the centroids (and hence the cluster assignments) stop changing:
• Assign each data point in the dataset to the nearest (e.g. with the least Euclidean
Distance) centroid
• Recalculate the centroid of each cluster (i.e. the mean of the cluster’s data
points, which minimizes the sum of squared Euclidean Distances)
• This algorithm always converges to an optimum point
• However, this may not be the Global Optimum point and can be a
Local Optimum. Therefore, we have to run the algorithm several
times with different initialization points and select the best model
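A minimal NumPy sketch of the algorithm above (a single run; in practice it would be repeated with several random initializations, as noted); the function name and the simplified stopping logic are assumptions:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # K random data points as initial centroids
    for _ in range(n_iter):
        # Assignment step: each data point goes to its nearest centroid (Euclidean Distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: the new centroid is the mean of the points assigned to the cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # stop when centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch does not handle the rare case of a cluster losing all of its points.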
Find best K for K-Means Clustering
• Finding the optimum cluster count, K, is important
• One rule of thumb is K ≈ √(n/2), where n is the number of data points in the
dataset
• Another well-known technique is known as Elbow Method which is
not practical in many cases (hence, not explained here)
• The best way to find K is K-fold Cross-Validation, using the total
squared distance from data points to their cluster centroids
as the cost
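Assuming scikit-learn, this total squared distance (exposed as inertia_ by KMeans) could be compared across candidate K values as below; the synthetic dataset and the range of K are illustrative, and in a Cross-Validation setting the same cost would be evaluated on held-out folds:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total squared distance from each point to its cluster centroid
    print(k, round(km.inertia_, 1))
```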
K-Means Clustering Algorithm
Source: https://www.reddit.com/r/learnmachinelearning/comments/qiid2e/kmeans_clustering_algorithm/?onetap_auto=true
Problems with K-Means Clustering
• Although the K-Means Clustering Algorithm is fast (efficient at
computation), it has some limitations
• For example, K-Means Clustering works well when the clusters are,
• Well-separated
• Circular and
• Having the same size
• When any of these assumptions is violated, K-Means may not
cluster the data as we expect
When Classes are Not Well Separated
True class separation (left) vs. K-Means Clustering class separation (right)
Source: https://www.youtube.com/watch?v=BaZWcSq3IuI
When Classes are Not Circular
K-Means Clustering class separation
Source: https://www.youtube.com/watch?v=BaZWcSq3IuI
When Classes are Not Same Sized
K-Means Clustering when class radii are different (left) and when class data
counts are different (right)
Source: https://www.youtube.com/watch?v=BaZWcSq3IuI
Distance-based Hierarchical Methods
• Methods like K-Means and K-Medoid are Single Level clustering
methods. i.e. There are no clusters inside the clusters
• There is another type of Clustering Method, known as Hierarchical
Clustering, where clusters are defined inside other clusters as a
hierarchy of clusters
• There are two approaches to creating hierarchical clusters (see the sketch
after the diagram below)
1. Agglomerative Clustering, where the algorithm starts by treating each
data point as a cluster and keeps merging them until the whole dataset is
considered a single cluster. E.g.: AGNES algorithm
2. Divisive Clustering, where the algorithm starts by treating the whole dataset
as a single cluster and keeps dividing clusters until each data point is
considered a cluster. E.g.: DIANA algorithm
Distance-based Hierarchical Methods
Diagram: points a, b, c, d and e are merged step by step (Step 0 → Step 4) by
agglomerative clustering (AGNES), while divisive clustering (DIANA) splits the
same hierarchy in the reverse direction (Step 4 → Step 0)
Source: From text book PPTs by Prof. Jiawei Han
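As a sketch of the agglomerative (AGNES-style) approach, scikit-learn's AgglomerativeClustering could be used as follows; the synthetic dataset and the linkage choice are illustrative assumptions:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up clustering: repeatedly merge the two closest clusters until 3 remain
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```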
Density-based Methods
• Distance-based Methods have drawbacks
• Noise/outlier data points affect the clustering, as all data points are
considered when clustering
• It becomes impossible to cluster highly non-circular shapes, such as
population densities across a country
• Number of clusters, K, has to be provided as a hyperparameter
• Instead of using distances to centroids to represent the clusters,
Density-based Methods use the local density of data points to decide the clusters
• Depending on the density distribution throughout the space, clusters
can have highly complex boundaries and an arbitrary number of clusters
• DBSCAN and OPTICS are examples of Density-based Methods
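A minimal illustration with scikit-learn's DBSCAN on a non-circular (two-moons) dataset; eps and min_samples are illustrative hyperparameters:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius and min_samples the density threshold for a core point
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))  # label -1 marks noise/outlier points; no K had to be provided
```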
Model-based Methods
• In Model-based methods, each data point is assumed to be generated
by a mixture of probability distributions
• A Model-based method tries to estimate the parameters of
these probability distributions using the generated data points
• Expectation-Maximization (EM) Algorithm is a popular approach
• In Expectation (E) step the algorithm estimates the probability that each data
point belongs to each cluster based on the current model parameters
• In Maximization (M) step the algorithm updates the model parameters to
maximize the likelihood of the data given these probabilities
• Iterate the above 2 steps until convergence (this is analogous to the K-
Means Clustering algorithm)
Gaussian Mixture Model (GMM)
• GMM is a popular Model-based Method assuming the distributions
are Gaussian. Parameters to be predicted for each distribution:
• Mean Vector
• Covariance Matrix
• Proportion for the distribution (or weight)
• Often the parameters are randomly initialized
• E step: finds the proportion of each data point that should be assigned to
each distribution
• M step: re-estimates the parameters and updates the Gaussian
distributions with them
Gaussian Mixture Model (GMM)
• GMM converges, but may converge to a Local Optimum
• Once converged, data points are assigned to the clusters
represented by the Gaussian distributions
• In some cases each datapoint is assigned to the cluster with the
highest probability (known as Maximum Probability Rule)
• In some other cases each datapoint may be assigned to multiple
clusters based on the probabilities related to each of the distribution
(known as Soft Assignment)
• A heatmap can be used to visualize the datapoints with the soft assignments
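Assuming scikit-learn, a GMM with both the Maximum Probability Rule (predict) and Soft Assignment (predict_proba) could be sketched as follows; the synthetic dataset and the number of components are illustrative:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)        # Maximum Probability Rule: the most probable cluster
soft_labels = gmm.predict_proba(X)  # Soft Assignment: a probability for every cluster
print(hard_labels[:5])
print(soft_labels[:5].round(3))
```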
Gaussian Mixture Model (GMM)
Source: https://prateekvjoshi.com/2013/06/29/gaussian-mixture-models
Gaussian models
Gaussian mixture model
Evaluating Clustering
• If there is no labeled data to test the performance measures of the
model, it is not possible to accurately evaluate the clustering model
• However, there are other evaluations we can do related to clustering
• First, the dataset should have a non-random distribution (a non-
uniform distribution in the hyperspace)
• In other words, the data should be non-uniformly distributed in the
space, forming clusters
• This property, the departure from Spatial Randomness, can be
measured by the Hopkins Statistic
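A rough sketch of one common formulation of the Hopkins Statistic (values near 0.5 suggest spatial randomness, values closer to 1 suggest a clustering tendency); this illustrative implementation assumes NumPy and scikit-learn and is not a library call:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    m = m or max(1, n // 10)                      # size of the sample used for the test
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: nearest-data distances of m uniform random points in the bounding box
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: nearest-neighbour distances of m sampled real points (2nd neighbour skips the point itself)
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())
```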
Measure Clustering Quality
• When a clustering is done its quality has to be measured
• Extrinsic Methods: possible when data with real cluster labels
(Ground Truth) are available
• E.g.: BCubed Precision and Recall
• Intrinsic Methods: possible when data with real cluster labels
(Ground Truth) are not available
• Good clusters should have lower intra-cluster distances (distance inside the
cluster) and higher inter-cluster distances (distances between the clusters)
• These measures are considered in Intrinsic Methods
• E.g.: Silhouette coefficient
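Assuming scikit-learn, the Silhouette coefficient could be used to compare clusterings with different K values; the dataset and the K range are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette compares intra-cluster distance with distance to the nearest other cluster (higher is better)
    print(k, round(silhouette_score(X, labels), 3))
```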
One Hour Homework
• Officially we have one more hour to do after the end of the lecture
• Therefore, for this week’s extra hour you have a homework
• Learn about the applications of clustering
• Research what type of clustering has to be used in each of the clustering
application
• Find the modified versions of the given clustering algorithms and their usages
• Good Luck!
Questions?