
Machine Learning

Clustering

Source:
https://www.kdnuggets.com/2023/05/clustering-scikitlearn-tutorial-unsupervised-learning.html
https://www.projectpro.io/article/clustering-algorithms-in-machine-learning/842
What is Clustering
> Organizing data into clusters such that there is
● high intra-cluster similarity
● low inter-cluster similarity

> Informally, finding natural groupings among objects based on
similarity (pattern matching)
> A form of unsupervised learning: it works with unlabelled data
> Used as
● a standalone tool for identifying patterns within datasets
● a pre-processing step for various machine learning algorithms
Source: https://www.advancinganalytics.co.uk/blog/2022/6/13/get-started-with-clustering-the-easy-way
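As a concrete sketch of the unsupervised setting (illustrative, not from the slides): scikit-learn's KMeans fits on the feature matrix alone, never on labels. The dataset and parameters below are assumptions for demonstration.

```python
# Minimal sketch of clustering unlabelled data with scikit-learn's KMeans.
# make_blobs returns ground-truth labels, but they are discarded here:
# the clusterer is fitted on the features alone.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)   # one cluster id (0, 1 or 2) per point
```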
Clustering Applications
Customer segmentation: Clustering is often used to group customers based on their
demographics or purchasing behaviour. In the retail industry, for example, customers
are segmented so that their behaviours are well understood; once the segmentation is
performed, a particular segment can be targeted with group-specific actions such as
promotions.
Image segmentation: This is a process of partitioning an image into multiple distinct
regions containing sets of pixels with similar attributes. Often, this technique is used
in medical research to identify underlying patterns and in the automotive industry, in
particular in autonomous vehicles to identify objects.
Market segmentation: This helps businesses increase the chances of people engaging
with advertisements or content, resulting in more efficient campaigns and improved
return on investment. Similar to customer segmentation, clustering is widely used in
many businesses to perform market segmentation.
Anomaly detection: Widely used in businesses, particularly in social media, finance,
healthcare and manufacturing. Identification of fake news, fraudulent transactions, or
defective mechanical components are popular areas where clustering is often used to
detect anomalies. For example, in the finance sector, clustering is often used to
identify fraudulent transactions using historical fraudulent transaction data.
Document sorting: Clustering can be used to organise and categorise documents based
on certain key features, for example, category, keywords, word frequency or content.
This is popular within many businesses, where managing risk and compliance is
essential.

Pricing: In the retail industry, clustering new products based on a set of features
is often used to price them accurately.
Customer services: With more emphasis being put on customer care, clustering is
widely used to group customer complaints into tiers of importance. This then provides
the ability to understand, prioritise and focus on the most important issues to make
the most significant impact.

Genome analysis: Clustering is often used to determine similarities between genomes.
In fact, clustering algorithms were utilised during the Covid-19 pandemic to detect
and analyse distinct strains of Coronavirus to help establish similarities to origin
hosts.
Data compression: By reducing the number of data points that must be examined,
various clustering techniques can be used to compress huge datasets. Data analysis
may become quicker and more effective as a result.

Recommendation systems: Clustering can be used in recommendation systems to group
people or items that share characteristics. This can enhance the user experience
and help to increase the accuracy of recommendations.
Types of Clustering Algorithms
Distribution models – Clusters in this model belong to a
distribution: data points are grouped based on the probability of
belonging to a particular distribution, typically a normal
(Gaussian) distribution. The expectation-maximisation algorithm,
which uses multivariate normal distributions, is a popular
example of this approach.

Centroid models – This is an iterative approach in which data is
organised into clusters based on how close data points are to the
centre of a cluster, also known as its centroid.
An example of a centroid model is the K-means algorithm.
Source: https://www.advancinganalytics.co.uk/blog/2022/6/13/10-incredibly-useful-clustering-algorithms-you-need-to-know
Connectivity models – Similar to centroid models, but these seek
to build a hierarchy of clusters based on the connectivity
(distance) between data points. An example of a connectivity
model is the hierarchical clustering algorithm.

Density models – Clusters are defined by areas of concentrated
density. The algorithm searches for areas of densely packed data
points and assigns each such area to the same cluster. DBSCAN and
OPTICS are two popular density-based clustering models.

Clustering Algorithms
Affinity Propagation: It takes as input measures of similarity between pairs of
data points and simultaneously considers all data points as potential exemplars.
Real-valued messages are exchanged between data points until a high-quality set of
exemplars and corresponding clusters gradually emerges.

Agglomerative Hierarchical Clustering: This clustering technique uses a hierarchical
“bottom-up” approach: the algorithm begins with each data point as its own cluster
and merges clusters depending on the distance between them, continuing until one
large cluster remains.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): This technique
is very useful when clustering large datasets as it begins by first generating a more
compact summary that retains as much distribution information as possible and then
clustering the data summary instead of the original large dataset.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a
well-known density-based clustering algorithm. It determines clusters based on how
dense regions are, and it can find irregularly shaped clusters and outliers very
well.

OPTICS (Ordering Points To Identify the Clustering Structure): This is also a
density-based clustering algorithm. It is very similar to DBSCAN, but it overcomes
one of DBSCAN’s limitations: the problem of detecting meaningful clusters in data
of varying density.
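A hedged sketch of that difference: two blobs of very different density, the case where a single DBSCAN `eps` tends to fail. The data and the `min_samples` value are illustrative assumptions.

```python
# Sketch: OPTICS on two blobs of very different spread. OPTICS orders
# points by reachability, so it can extract clusters of varying density
# without a single global eps threshold.
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [10, 10]],
                  cluster_std=[0.3, 2.0], random_state=0)

opt = OPTICS(min_samples=10).fit(X)
# -1 marks noise; the remaining labels are the detected clusters.
n_clusters = len(set(opt.labels_)) - (1 if -1 in opt.labels_ else 0)
```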
K-Means: This algorithm is one of the most popular and commonly used clustering
techniques. It works by assigning data points to clusters based on the shortest
distance to the centroid, or centre, of the cluster. The algorithm’s main goal is to
minimise the sum of distances between data points and their respective cluster
centroids.

Mini-Batch K-Means: This is a k-means version in which cluster centroids are updated
in small batches rather than the entire dataset. When working with a large dataset, the
mini-batch k-means technique can be used to minimise computing time.

Mean Shift Clustering: The mean shift clustering algorithm is a centroid-based
algorithm that works by iteratively shifting candidate centroids towards the mean
of the points within their neighbourhood in the feature space.
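The mini-batch trade-off above can be sketched as follows; the dataset, batch size and cluster count are illustrative assumptions.

```python
# Sketch: MiniBatchKMeans updates centroids from small random batches
# instead of the full dataset, trading a little accuracy for speed.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=1)

full = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                       random_state=1).fit(X)

# inertia_ is the within-cluster sum of squares; the mini-batch result
# is usually only slightly worse than full K-means on the same data.
```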
Spectral Clustering: Spectral clustering is a graph-based algorithm where the
approach is used to identify communities of nodes based on the edges. Because of its
ease of implementation and promising performance, spectral clustering has grown in
popularity.

Gaussian Mixture Models (GMM): Gaussian mixture models can be seen as an extension
of the k-means clustering algorithm, based on the idea that each cluster may be
described by its own Gaussian distribution. In contrast with K-means’ hard
assignment of data points to clusters, GMM uses soft (probabilistic) assignment.
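The soft-assignment contrast can be sketched with scikit-learn's GaussianMixture (illustrative data): `predict_proba` gives a probability per component, whereas `predict` collapses that to a hard label.

```python
# Sketch of GMM soft assignment: each row of predict_proba is a
# probability distribution over the mixture components.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, random_state=7).fit(X)
proba = gmm.predict_proba(X)   # shape (300, 3); each row sums to 1
hard = gmm.predict(X)          # the argmax of the soft assignments
```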
Example datasets for comparing clustering algorithms:
● Circles – two circles, one circumscribed by the other.
● Moons – two interleaving half circles.
● Varied-variance blobs – blobs that have different variances.
● Anisotropically distributed blobs – blobs with unequal widths and lengths.
● Regular blobs – just three regular blobs.
● Homogeneous data – a ‘null’ situation for clustering.
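These shapes can be generated with scikit-learn's dataset helpers; a sketch in which sample sizes, noise levels and the stretching matrix are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs, make_circles, make_moons

n = 500
circles, _ = make_circles(n_samples=n, factor=0.5, noise=0.05, random_state=0)
moons, _ = make_moons(n_samples=n, noise=0.05, random_state=0)
varied, _ = make_blobs(n_samples=n, cluster_std=[1.0, 2.5, 0.5],
                       random_state=170)
blobs, _ = make_blobs(n_samples=n, random_state=170)   # regular blobs

# Anisotropic blobs: stretch the regular blobs with a linear transform.
aniso = blobs @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# Homogeneous data: uniform noise, the 'null' case with no clusters.
no_structure = np.random.RandomState(170).rand(n, 2)
```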
Distance Measuring Technique – Euclidean vs Manhattan
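A sketch of the two measures on a pair of 2-D points (illustrative values): Euclidean is the straight-line distance, Manhattan the sum of absolute coordinate differences.

```python
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # sqrt(9 + 16) = 5.0
manhattan = np.abs(a - b).sum()             # 3 + 4 = 7.0
```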
Distance Measuring Technique – Correlation-based
Pearson correlation
measures the degree of a linear relationship between two profiles.

Spearman correlation
computes the correlation between the rank of x and the rank of y variables.
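Correlation-based distance is commonly taken as 1 minus the correlation, so strongly correlated profiles end up close together; a sketch with SciPy on illustrative profiles:

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                 # perfectly linear in x

pearson_d = 1 - pearsonr(x, y)[0]    # ~0: perfect linear relationship
spearman_d = 1 - spearmanr(x, y)[0]  # ~0: ranks of x and y match exactly
```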
Cluster Distance Measuring Techniques
Ward’s method: In this method all possible pairs of clusters are combined and the sum of the
squared distances within each cluster is calculated. This is then summed over all clusters. The
combination that gives the lowest sum of squares is chosen.
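Ward's criterion is available through SciPy's `linkage`; a hedged sketch on four illustrative points, where each merge picks the pair of clusters giving the smallest increase in the total within-cluster sum of squares:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

Z = linkage(X, method="ward")                     # (n-1) x 4 merge history
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
```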
Agglomerative Hierarchical Clustering – Dendrogram
Agglomerative Hierarchical Clustering
Ward’s Hierarchical Clustering
Hierarchical Clustering Implementation
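A minimal implementation sketch with scikit-learn (dataset and parameters are assumptions): agglomerative clustering with Ward linkage, cut at three clusters. The merge history for a dendrogram can be computed separately with `scipy.cluster.hierarchy.linkage`.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=3)

agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)   # one cluster id per point
```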
Partitional Clustering – K-Means Algorithm

K-Means++
K-Means Clustering – Steps 1–5: assign each object to its nearest centre;
after moving centers, re-assign the objects; repeat.
K-Means Clustering
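The iterate-until-stable loop above can be sketched from scratch (a hedged illustration, not a production implementation; data and initialisation are assumptions):

```python
# From-scratch K-means sketch: assign each point to its nearest centroid,
# move each centroid to the mean of its members, and repeat until the
# centroids stop moving.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its members
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
```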
K-Means Clustering – Right # of clusters
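One common heuristic for choosing k (the elbow method, named here as an assumption) is to fit K-means for a range of k and inspect how the inertia falls; the bend where the improvement flattens suggests the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=10)

inertias = {k: KMeans(n_clusters=k, n_init=10,
                      random_state=10).fit(X).inertia_
            for k in range(1, 8)}
# Inertia always decreases as k grows; look for the elbow, not the minimum.
```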
K-Means Clustering – Right # of clusters - Silhouette Coefficient
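A sketch of silhouette-based selection: the mean silhouette coefficient lies in [-1, 1] (higher means points sit well inside their own cluster), and the k that maximises it is chosen. The cluster centres and spreads below are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean coefficient in [-1, 1]

best_k = max(scores, key=scores.get)   # 4 for these well-separated blobs
```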
DBSCAN – Density-Based Spatial Clustering of Applications with Noise
Source: www.sthda.com
Why DBSCAN
Parameters
Terms
DBSCAN Algorithm
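A hedged sketch of DBSCAN on the two-moons shape, where K-means fails; the `eps` and `min_samples` values are assumptions tuned for this synthetic data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# -1 marks noise points; other labels are density-connected clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```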
Other Clustering Algorithms
Optics:
https://www.madrasresearch.org/post/optics-clustering
https://github.com/christianversloot/machine-learning-articles/blob/main/performing
-optics-clustering-with-python-and-scikit-learn.md

Mean Shift:
https://ml-explained.com/blog/mean-shift-explained
https://aitechtrend.com/simplifying-data-clustering-with-mean-shift-algorithm-in-pyt
hon/
