UNIT-5

Unsupervised Machine Learning

Unsupervised Learning:

Unsupervised learning is a type of machine learning that analyzes and models data without
labelled responses or predefined categories. Unlike supervised learning, where the
algorithm learns from input-output pairs, unsupervised learning algorithms work solely with
input data and aim to discover hidden patterns, structures or relationships within the
dataset independently, without any human intervention or prior knowledge of the data's
meaning.

The image shows a set of animals such as elephants, camels and cows, which represents the raw data that the unsupervised learning algorithm will process.

The "Interpretation" stage signifies that the algorithm doesn't have predefined labels or
categories for the data. It needs to figure out how to group or organize the data based on
inherent patterns.

The algorithm stage represents the unsupervised learning process, which can be clustering, dimensionality reduction or anomaly detection, used to identify patterns in the data.

The processing stage shows the algorithm working on the data.

The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).
Working of Unsupervised Learning:

The working of unsupervised machine learning can be explained in these steps:

1. Collect Unlabeled Data

Gather a dataset without predefined labels or categories.

Example: Images of various animals without any tags.

2. Select an Algorithm

Choose a suitable unsupervised algorithm based on the goal, such as clustering (e.g., K-Means), association rule learning (e.g., Apriori) or dimensionality reduction (e.g., PCA).

3. Train the Model on Raw Data

Feed the entire unlabeled dataset to the algorithm.

The algorithm looks for similarities, relationships or hidden structures within the data.

4. Group or Transform Data

The algorithm organizes data into groups (clusters), rules or lower-dimensional forms
without human input.

Example: It may group similar animals together or extract key patterns from large datasets.

5. Interpret and Use Results

Analyze the discovered groups, rules or features to gain insights or use them for further
tasks like visualization, anomaly detection or as input for other models.
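
To make these steps concrete, here is a minimal sketch of the whole workflow using scikit-learn on synthetic data; the dataset and the choice of K-Means with three clusters are purely illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Step 1: unlabeled data (synthetic points stand in for untagged animal images)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Steps 2-3: choose an algorithm and train it on the raw data
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Steps 4-5: the discovered groups can now be inspected and used downstream
print("Cluster sizes:", np.bincount(model.labels_))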

Unsupervised Learning Algorithms:

There are three main types of unsupervised learning algorithms:

1. Clustering Algorithms

Clustering is an unsupervised machine learning technique that groups unlabeled data into
clusters based on similarity. Its goal is to discover patterns or relationships within the data
without any prior knowledge of categories or labels.

Groups data points that share similar features or characteristics.

Helps find natural groupings in raw, unclassified data.


Commonly used for customer segmentation, anomaly detection and data organization.

Works purely from the input data without any output labels.

Enables understanding of data structure for further analysis or decision-making.

Some common clustering algorithms:

K-means Clustering: Groups data into K clusters based on how close the points are to each
other.

Hierarchical Clustering: Creates clusters by building a tree step-by-step, either merging or splitting groups.

Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats scattered points
as noise.

Mean-Shift Clustering: Discovers clusters by moving points toward the most crowded areas.

Spectral Clustering: Groups data by analyzing connections between points using graphs.

2. Association Rule Learning

Association rule learning is a rule-based unsupervised learning technique used to discover interesting relationships between variables in large datasets. It identifies patterns in the form of “if-then” rules, showing how the presence of some items in the data implies the presence of others.

Finds frequent item combinations and the rules connecting them.

Commonly used in market basket analysis to understand product purchase relationships.

Helps retailers design promotions and cross-selling strategies.

Some common Association Rule Learning algorithms:

Apriori Algorithm: Finds patterns by exploring frequent item combinations step-by-step.

FP-Growth Algorithm: An efficient alternative to Apriori that quickly identifies frequent patterns without generating candidate sets.

Eclat Algorithm: Uses intersections of itemsets to efficiently find frequent patterns.

Efficient Tree-based Algorithms: Scale to large datasets by organizing data in tree structures.
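
As a rough illustration of the "if-then" idea (not an implementation of any of the algorithms above), the following pure-Python sketch counts frequent item pairs in a handful of made-up transactions and prints the rules that meet a support threshold; the transactions and thresholds are invented for the example.

from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions (invented for illustration)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

min_support = 0.6          # a pair must appear in at least 60% of transactions
n = len(transactions)

# Count how often every pair of items occurs together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.items():
    support = count / n
    if support >= min_support:
        support_a = sum(1 for basket in transactions if a in basket) / n
        confidence = support / support_a   # confidence of the rule "if a then b"
        print(f"{{{a}}} -> {{{b}}}: support={support:.2f}, confidence={confidence:.2f}")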

3. Dimensionality Reduction
Dimensionality reduction is the process of decreasing the number of features or variables in
a dataset while retaining as much of the original information as possible. This technique
helps simplify complex data making it easier to analyze and visualize. It also improves the
efficiency and performance of machine learning algorithms by reducing noise and
computational cost.

It reduces the dataset’s feature space from many dimensions to fewer, more meaningful
ones.

Helps focus on the most important traits or patterns in the data.

Commonly used to improve model speed and reduce overfitting.

Here are some popular Dimensionality Reduction algorithms:

Principal Component Analysis (PCA): Reduces dimensions by transforming data into uncorrelated principal components.

Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class separability
for classification tasks.

Non-negative Matrix Factorization (NMF): Breaks data into non-negative parts to simplify
representation.

Locally Linear Embedding (LLE): Reduces dimensions while preserving the relationships
between nearby points.

Isomap: Captures global data structure by preserving distances along a manifold.
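
As a small illustration, the sketch below (assuming scikit-learn and its built-in Iris dataset) uses PCA to project four features down to two while reporting how much of the original variance is retained.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the 2 principal components

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Variance retained:", round(pca.explained_variance_ratio_.sum(), 3))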

Applications of Unsupervised learning:

Unsupervised learning has diverse applications across industries and domains. Key
applications include:

Customer Segmentation: Algorithms cluster customers based on purchasing behavior or demographics, enabling targeted marketing strategies.

Anomaly Detection: Identifies unusual patterns in data, aiding fraud detection, cybersecurity
and equipment failure prevention.

Recommendation Systems: Suggests products, movies or music by analyzing user behavior and preferences.

Image and Text Clustering: Groups similar images or documents for tasks like organization,
classification or content recommendation.
Social Network Analysis: Detects communities or trends in user interactions on social media
platforms.

Advantages:

No need for labeled data: Works with raw, unlabeled data, saving time and effort on data annotation.

Discovers hidden patterns: Finds natural groupings and structures that might be missed by
humans.

Handles complex and large datasets: Effective for high-dimensional or vast amounts of data.

Clustering in Machine Learning

Clustering is an unsupervised machine learning technique that groups similar data points
together into clusters based on their characteristics, without using any labeled data. The
objective is to ensure that data points within the same cluster are more similar to each other
than to those in different clusters, enabling the discovery of natural groupings and hidden
patterns in complex datasets.

 Goal: Discover the natural grouping or structure in unlabeled data without predefined categories.

 How: Data points are assigned to clusters based on similarity or distance measures.

 Similarity Measures: Can include Euclidean distance, cosine similarity or other metrics depending on data type and clustering method.

 Output: Each group is assigned a cluster ID, representing shared characteristics within the cluster.

For example, if we have customer purchase data, clustering can group customers with
similar shopping habits. These clusters can then be used for targeted marketing,
personalized recommendations or customer segmentation.

Types of Clustering

Let's see the types of clustering,

1. Hard Clustering: In hard clustering, each data point strictly belongs to exactly one cluster,
no overlap is allowed. This approach assigns a clear membership, making it easier to
interpret and use for definitive segmentation tasks.

 Example: If clustering customer data into 2 segments, each customer belongs fully to
either Cluster 1 or Cluster 2 without partial memberships.

 Use cases: Market segmentation, customer grouping, document clustering.

 Limitations: Cannot represent ambiguity or overlap between groups; boundaries are crisp.

Let's see an example of the difference between hard and soft clustering using a membership distribution:

Data Point    Hard Clustering    Soft Clustering
A             Cluster 1          Cluster 1: 0.91, Cluster 2: 0.09
B             Cluster 2          Cluster 1: 0.30, Cluster 2: 0.70
C             Cluster 2          Cluster 1: 0.17, Cluster 2: 0.83
D             Cluster 1          Cluster 1: 1.00, Cluster 2: 0.00

2. Soft Clustering: Soft clustering assigns each data point a probability or degree of
membership to multiple clusters simultaneously, allowing data points to partially belong to
several groups.

 Example: A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2,
reflecting uncertainty or overlap in group characteristics.
 Use cases: Situations with overlapping class boundaries, fuzzy categories like
customer personas or medical diagnosis.

 Benefits: Captures ambiguity in data, models gradual transitions between clusters.
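
The contrast in the table above can be reproduced in spirit with scikit-learn: a hedged sketch comparing hard K-Means labels with soft membership probabilities from a Gaussian Mixture Model on synthetic data (the data and parameter choices are illustrative).

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic 2-cluster data (illustrative)
X, _ = make_blobs(n_samples=200, centers=2, random_state=42)

# Hard assignments: every point gets exactly one cluster ID
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Soft assignments: every point gets a probability for each cluster
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
soft_probs = gmm.predict_proba(X)

for i in range(3):
    print(f"Point {i}: hard -> Cluster {hard_labels[i] + 1}, "
          f"soft -> Cluster 1: {soft_probs[i, 0]:.2f}, Cluster 2: {soft_probs[i, 1]:.2f}")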

Types of Clustering Methods

Clustering methods can be classified on the basis of how they form clusters:

1. Centroid-based Clustering (Partitioning Methods)

Centroid-based clustering organizes data points around central prototypes called centroids,
where each cluster is represented by the mean (or medoid) of its members. The number of
clusters is specified in advance and the algorithm allocates points to the nearest centroid,
making this technique efficient for spherical and similarly sized clusters but sensitive to
outliers and initialization.

Algorithms:

 K-means: Iteratively assigns points to the nearest centroid and recalculates centroids to minimize intra-cluster variance.

 K-medoids: Similar to K-means but uses actual data points (medoids) as centers,
robust to outliers.

Pros:

 Fast and scalable for large datasets.

 Simple to implement and interpret.

Cons:

 Requires the number of clusters k to be known in advance.

 Sensitive to initialization and outliers.

 Not suitable for non-spherical clusters.

2. Density-based Clustering

Density-based clustering defines clusters as contiguous regions of high data density separated by areas of lower density. This approach can identify clusters of arbitrary shapes, handles noise well and does not require predefining the number of clusters, though its effectiveness depends on chosen density parameters.

Algorithms:

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points with sufficient neighbors; labels sparse points as noise.
 OPTICS (Ordering Points To Identify Clustering Structure): Extends DBSCAN to handle
varying densities.

Pros:

 Handles clusters of varying shapes and sizes.

 Does not require cluster count upfront.

 Effective in noisy datasets.

Cons:

 Difficult to choose parameters like epsilon and min points.

 Less effective for varying density clusters (except OPTICS).
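
A brief DBSCAN sketch on synthetic crescent-shaped data follows; the eps and min_samples values are illustrative choices, not prescriptions.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that centroid-based methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # -1 marks points treated as noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))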

3. Connectivity-based Clustering (Hierarchical Clustering)

Connectivity-based (or hierarchical) clustering builds nested groupings of data by evaluating how data points are connected to their neighbors. It creates a dendrogram—a tree-like structure—that reflects relationships at various granularity levels and does not require specifying cluster numbers in advance, but can be computationally intensive.

Approaches:

 Agglomerative (Bottom-up): Start with each point as a cluster; iteratively merge closest clusters.

 Divisive (Top-down): Start with one cluster; iteratively split into smaller clusters.

Pros:

 Provides a full hierarchy, easy to visualize.

 No need to specify number of clusters upfront.

Cons:

 Computationally intensive for large datasets.

 Merging/splitting decisions are irreversible.

4. Distribution-based Clustering

Distribution-based clustering assumes data is generated from a mixture of probability distributions, such as Gaussian distributions, and assigns points to clusters based on statistical likelihood. This method supports clusters with flexible shapes and overlaps, but usually requires specifying the number of distributions.

Algorithm:
 Gaussian Mixture Model (GMM): Fits data as a weighted mixture of Gaussian
distributions; assigns data points based on likelihood.

Pros:

 Flexible cluster shapes.

 Provides probabilistic memberships.

 Suitable for overlapping clusters.

Cons:

 Requires specifying number of components.

 Computationally more expensive.

 Sensitive to initialization.

5. Fuzzy Clustering

Fuzzy clustering extends traditional methods by allowing each data point to belong to
multiple clusters with varying degrees of membership. This approach captures ambiguity
and soft boundaries in data and is particularly useful when the clusters overlap or
boundaries are not clear-cut.

Algorithm:

 Fuzzy C-Means: Similar to K-means but with fuzzy memberships updated iteratively.

Pros:

 Models data ambiguity explicitly.

 Useful for complex or imprecise data.

Cons:

 Choosing fuzziness parameter can be tricky.

 Computational overhead compared to hard clustering.
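
Since Fuzzy C-Means is not part of scikit-learn, the following is a minimal NumPy sketch of the iterative membership/centroid updates described above; the function name, the fuzziness value m = 2 and the iteration count are illustrative assumptions.

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch: returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random membership matrix U (n x c); each row sums to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Centers are membership-weighted means of the data
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance of every point to every center (small epsilon avoids division by zero)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Membership update: inverse-distance ratios raised to 2/(m-1)
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    return centers, U

# Tiny illustrative dataset: two loose groups plus one in-between point
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9], [3.0, 3.0]])
centers, U = fuzzy_c_means(X, c=2)
print(np.round(U, 2))  # the middle point gets split membership between both clusters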

Use Cases

 Customer Segmentation: Grouping customers based on behavior or demographics for targeted marketing and personalized services.

 Anomaly Detection: Identifying outliers or fraudulent activities in finance, network security and sensor data.

 Image Segmentation: Dividing images into meaningful parts for object detection, medical diagnostics or computer vision tasks.

 Recommendation Systems: Clustering user preferences to recommend movies, products or content tailored to different groups.

 Market Basket Analysis: Discovering products frequently bought together to optimize store layouts and promotions.

Similarity and Distance Measures:

Distance measures are the backbone of clustering algorithms. They are mathematical functions that determine how similar or different two data points are. The choice of distance measure can significantly impact the clustering results, as it influences the shape and structure of the clusters.

The choice of distance measure significantly impacts the quality of the clusters formed and
the insights derived from them. A well-chosen distance measure can lead to meaningful
clusters that reveal hidden patterns in the data, while a poorly chosen measure can result in
clusters that are misleading or irrelevant.

 Distance measurements specify how similarity between data points is assessed, which makes them essential for grouping.

 The performance of the clustering method and its results can be strongly impacted by the choice of distance measure.

 It affects the formation of clusters and may have an impact on the validity and
interpretability of the clusters.

Common Distance Measures

There are several types of distance measures, each with its strengths and weaknesses. Here
are some of the most commonly used distance measures in clustering:

1. Euclidean Distance

The Euclidean distance is the most widely used distance measure in clustering. It calculates
the straight-line distance between two points in n-dimensional space. The formula for
Euclidean distance is:

d(p, q) = √( Σ_{i=1..n} (p_i − q_i)² )

where,

 p and q are two data points

 and n is the number of dimensions.

Utilizing Euclidean Distance

import numpy as np
import matplotlib.pyplot as plt

# Calculate Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2)) ** 2))

point1 = [2, 3]
point2 = [5, 7]

distance = euclidean_distance(point1, point2)
print(f"Euclidean Distance: {distance}")

# Plotting the points and the Euclidean distance
plt.figure()
plt.scatter(*zip(*[point1, point2]), color=['red', 'blue'])
plt.plot([point1[0], point2[0]], [point1[1], point2[1]], color='black')
plt.text((point1[0] + point2[0]) / 2, (point1[1] + point2[1]) / 2, f'{distance:.2f}', color='black')
plt.title('Euclidean Distance')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Output: Euclidean Distance: 5.0


The red and blue dots in the figure represent the two points between which we are computing the Euclidean distance. The black line connecting them represents the Euclidean distance, i.e., the straight-line distance between the points.

2. Manhattan Distance

The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the total of the absolute differences between the Cartesian coordinates of two points. Envision maneuvering across a city grid in which your only directions are horizontal and vertical. The Manhattan distance, which computes the total distance traveled along each dimension to reach another data point, represents this movement. This metric is less susceptible to outliers than the Euclidean distance and can be more effective for categorical data. The formula is:

d(p, q) = Σ_{i=1..n} |p_i − q_i|

Implementation in Python

# Calculate Manhattan distance
def manhattan_distance(point1, point2):
    return np.sum(np.abs(np.array(point1) - np.array(point2)))

distance = manhattan_distance(point1, point2)
print(f"Manhattan Distance: {distance}")

# Plotting the points and the Manhattan distance
plt.figure()
plt.scatter(*zip(*[point1, point2]), color=['red', 'blue'])
plt.plot([point1[0], point1[0]], [point1[1], point2[1]], color='black', linestyle='--')
plt.plot([point1[0], point2[0]], [point2[1], point2[1]], color='black', linestyle='--')
plt.text((point1[0] + point2[0]) / 2, (point1[1] + point2[1]) / 2, f'{distance:.2f}', color='black')
plt.title('Manhattan Distance')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Output:

Manhattan Distance: 7

The two points are represented by the red and blue points in the plot. The grid-line-based
method used to determine the Manhattan distance is depicted by the dashed black lines.

3. Cosine Similarity

Rather than concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data points, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis, where the order of features (words in a sentence) might not be as crucial as their presence. It is used to determine how similar the vectors are, irrespective of their magnitude.

similarity(A, B) = (A · B) / (||A|| · ||B||)

Example in Python

# Calculate Cosine Similarity
def cosine_similarity(point1, point2):
    dot_product = np.dot(point1, point2)
    norm1 = np.linalg.norm(point1)
    norm2 = np.linalg.norm(point2)
    return dot_product / (norm1 * norm2)

distance = cosine_similarity(point1, point2)
print(f"Cosine Similarity: {distance}")

# Plotting the points and the Cosine similarity
# For Cosine Similarity, we plot the vectors originating from the origin
origin = [0, 0]

plt.figure()
plt.quiver(*origin, *point1, angles='xy', scale_units='xy', scale=1, color='red')
plt.quiver(*origin, *point2, angles='xy', scale_units='xy', scale=1, color='blue')
plt.xlim(0, max(point1[0], point2[0]) + 1)
plt.ylim(0, max(point1[1], point2[1]) + 1)
plt.title('Cosine Similarity')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Output:

Cosine Similarity: 0.9994801143396996


In the plot, the red and blue arrows represent the vectors of the two points from the origin.
The cosine similarity is related to the angle between these vectors.

4. Minkowski Distance

Minkowski distance is a generalized form of both Euclidean and Manhattan distances, controlled by a parameter p. The Minkowski distance allows adjusting the power parameter (p). When p=1, it is equivalent to the Manhattan distance; when p=2, it is the Euclidean distance.

d(x, y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

Utilizing Minkowski Distance

# Calculate Minkowski distance
def minkowski_distance(point1, point2, p):
    return np.power(np.sum(np.abs(np.array(point1) - np.array(point2)) ** p), 1 / p)

p = 3
distance = minkowski_distance(point1, point2, p)
print(f"Minkowski Distance (p={p}): {distance}")

# Plotting the points and the Minkowski distance
plt.figure()
plt.scatter(*zip(*[point1, point2]), color=['red', 'blue'])

# For Minkowski with p=3, the visualization isn't as straightforward as Euclidean or Manhattan,
# so we plot the same straight line for illustration purposes
plt.plot([point1[0], point2[0]], [point1[1], point2[1]], color='black')
plt.text((point1[0] + point2[0]) / 2, (point1[1] + point2[1]) / 2, f'{distance:.2f}', color='black')
plt.title('Minkowski Distance')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Output:

Minkowski Distance (p=3): 4.497941445275415

In the plot, the visualization is similar to Euclidean distance when p=3, but the distance
calculation formula changes.

5. Jaccard Index

This measure is ideal for binary data, where features can only take values of 0 or 1. It
calculates the ratio of the number of features shared by two data points to the total number
of features. Jaccard Index measures the similarity between two sets by comparing the size of
their intersection and union.

J(A, B) = |A ∩ B| / |A ∪ B|
Jaccard Index Example in Python

import numpy as np
from sklearn.metrics import jaccard_score

# Define two binary vectors
vector1 = np.array([1, 1, 0, 0])
vector2 = np.array([1, 1, 1, 0])

# Calculate Jaccard index
jaccard_index = jaccard_score(vector1, vector2)
print("Jaccard Index:", jaccard_index)

Output:

Jaccard Index: 0.6666666666666666

Choosing the Optimal Distance Metric for Clustering: Key Considerations

The type of data and the particulars of the clustering operation will determine which
distance metric is best. Here are some things to think about:

 Data Type: Different distance metrics may be needed for binary, categorical or numerical data.

 Scale Sensitivity: The scale of the data affects some distance measures, such as the Euclidean distance. Data standardization can aid in resolving this problem.

 Interpretability: For the specified application, the selected measure should yield findings that are both meaningful and comprehensible.

 Computational Efficiency: Take into account the computational complexity, particularly when working with big datasets.

 Existence of Outliers: Outliers have a big impact on distance-based metrics. If outliers are an issue, use metrics that are less susceptible to them.

 Clustering Algorithm: Various clustering methods may require a particular distance metric.

Choosing the Right Distance Measure

The choice of distance measure depends on the nature of the data and the clustering
algorithm being used. Here are some general guidelines:

 Euclidean distance is suitable for continuous data with a Gaussian distribution.


 Manhattan distance is suitable for data with a uniform distribution or when the
dimensions are not equally important.

 Minkowski distance is suitable when you want to generalize the Euclidean and
Manhattan distances.

 Cosine similarity is suitable for text data or when the angle between vectors is more
important than the magnitude.

 Jaccard similarity is suitable for categorical data or when the intersection and union
of sets are more important than the individual elements.


Clustering Metrics in Machine Learning

Clustering is a technique in Machine Learning that is used to group similar data points. While
the algorithm performs its job, helping uncover the patterns and structures in the data, it is
important to judge how well it functions. Several metrics have been designed to evaluate
the performance of these clustering algorithms.

Clustering:

Clustering is an unsupervised machine-learning approach that is used to group comparable data points based on specific traits or attributes. Clustering algorithms do not require labelled data, which makes them ideal for finding patterns in large datasets. It is a widely used technique in applications like customer segmentation, image recognition, anomaly detection, etc.

There are multiple clustering algorithms, and each has its way of grouping data points.
Clustering metrics are used to evaluate all these algorithms. Let us take a look at some of the
most commonly used clustering metrics:

1. Silhouette Score

The Silhouette Score is a way to measure how good the clusters are in a dataset. It helps us
understand how well the data points have been grouped. The score ranges from -1 to 1.

 A score close to 1 means a point fits really well in its group (cluster) and is far from
other groups.

 A score close to 0 means the point is on the border between two clusters.

 A score close to -1 means the point might be in the wrong cluster.

Silhouette Score (S) for a data point i is calculated as:

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where,

 a(i) is the average distance from i to other data points in the same cluster.
 b(i) is the smallest average distance from i to data points in a different cluster.
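
A quick numeric check of the formula with made-up distances (purely illustrative values):

# A toy check of the formula (the distances are invented for illustration)
a_i = 1.2   # average distance from point i to the other members of its own cluster
b_i = 3.4   # smallest average distance from point i to the points of another cluster
s_i = (b_i - a_i) / max(a_i, b_i)
print(f"Silhouette of point i: {s_i:.2f}")   # 0.65 -> the point fits its cluster well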

2. Davies-Bouldin Index

The Davies-Bouldin Index (DBI) helps us measure how good the clustering is in a dataset. It
looks at how tight each cluster is (compactness), and how far apart the clusters are
(separation).

 Lower DBI = better, clearer clusters

 Higher DBI = messy, overlapping clusters

A lower score is better, because it means:

 Points in the same cluster are close to each other.

 Different clusters are far apart from one another.

Davies-Bouldin Index (DB) is calculated as:

DB = (1/k) · Σ_{i=1..k} max_{j≠i} ( (R_ii + R_jj) / R_ij )

where,

 k is the total number of clusters.

 R_ii is the compactness of cluster i.

 R_jj is the compactness of cluster j.

 R_ij is the dissimilarity (distance) between cluster i and cluster j.

3. Calinski-Harabasz Index (Variance Ratio Criterion)

The Calinski-Harabasz Index measures how good the clusters are in a dataset.

It looks at:

 How close the points are inside each cluster?

 How far apart the clusters are?

A higher score is better, as it means the clusters are tight and well-separated. It helps
determine the ideal number of clusters.

Calinski-Harabasz Index (CH) is calculated as:


CH = (B / W) × ((N − K) / (K − 1))

where,

 B is the sum of squares between clusters.


 W is the sum of squares within clusters.

 N is the total number of data points.

 K is the number of clusters.

Calculating between group sum of squares (B)


B = Σ_{k=1..K} n_k · ||C_k − C||²

where,

 n_k is the number of observations in cluster k

 C_k is the centroid of cluster k

 C is the centroid of the dataset

 K is the number of clusters

Calculating within the group sum of squares (W)


W = Σ_{i=1..n_k} ||X_ik − C_k||²

where,

 n_k is the number of observations in cluster k

 X_ik is the i-th observation of cluster k

 C_k is the centroid of cluster k

4. Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) helps us measure how accurate a clustering result is by
comparing it to the true labels (ground truth).

It checks how well the pairs of points are grouped:

 Are the same pairs together in both the real and predicted clusters?

 Are different pairs also kept apart correctly?

The score ranges from -1 to 1:

 1 means perfect match - the clustering is exactly right.

 0 means random guess - no better than chance.

 Below 0 means worse than random - very poor clustering.

Adjusted Rand Index (ARI) is calculated as:


ARI = (RI − Expected_RI) / (max(RI) − Expected_RI)

where,

 RI is the Rand Index.

 Expected_RI is the expected value of the Rand Index.

5. Mutual Information (MI)

Mutual Information measures how much two variables are related or connected. In
clustering, it compares how much the true cluster labels match with the predicted labels. It
shows how much knowing about one variable helps us predict the other. The more
agreement there is, the higher the score.

 Higher values mean better agreement between the clusters.

 Zero means no agreement at all.

MI between true labels Y and predicted labels Z is calculated as:

MI(Y, Z) = Σ_i Σ_j p(y_i, z_j) · log( p(y_i, z_j) / (p(y_i) · p(z_j)) )

where,

 y_i is a true label.

 z_j is a predicted label.

 p(y_i, z_j) is the joint probability of y_i and z_j.

 p(y_i) and p(z_j) are the marginal probabilities.

These clustering metrics help in evaluating the quality and performance of clustering
algorithms, allowing for informed decisions when selecting the most suitable clustering
solution for a given dataset.

Steps to Evaluate Clustering Using Sklearn:

Let's consider an example using the Iris dataset and the K-Means clustering algorithm. We
will calculate the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, and
Adjusted Rand Index to evaluate the clustering.

Import Libraries

Import the necessary libraries, including scikit-learn (sklearn).


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import mutual_info_score, adjusted_rand_score

Load Your Data

Load or generate your dataset for clustering. The Iris dataset consists of 150 samples of iris flowers from three species (setosa, versicolor and virginica), each with four features: sepal length, sepal width, petal length and petal width.

# Example using a built-in dataset (e.g., Iris dataset)

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

Perform Clustering

Choose a clustering algorithm, such as K-Means, and fit it to your data.

K-means is an unsupervised technique used for creating clusters based on similarity. It iteratively assigns data points to the nearest cluster center and updates the centroids until convergence.

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

Calculate Clustering Metrics

Use the appropriate clustering metrics to evaluate the clustering results.

# Calculate clustering metrics

silhouette = silhouette_score(X, kmeans.labels_)

db_index = davies_bouldin_score(X, kmeans.labels_)

ch_index = calinski_harabasz_score(X, kmeans.labels_)

ari = adjusted_rand_score(iris.target, kmeans.labels_)

mi = mutual_info_score(iris.target, kmeans.labels_)

# Print the metric scores

print(f"Silhouette Score: {silhouette:.2f}")


print(f"Davies-Bouldin Index: {db_index:.2f}")

print(f"Calinski-Harabasz Index: {ch_index:.2f}")

print(f"Adjusted Rand Index: {ari:.2f}")

print(f"Mutual Information (MI): {mi:.2f}")

Output:

Silhouette Score: 0.55


Davies-Bouldin Index: 0.66
Calinski-Harabasz Index: 561.63
Adjusted Rand Index: 0.73
Mutual Information (MI): 0.83

K-Means Clustering Algorithm:

The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.

In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the squared distances between the data points and the centroid is minimized. It is to be understood that less variation within a cluster leads to more similar data points within the same cluster.

Working of K-Means Algorithm:

We can understand the working of K-Means clustering algorithm with the help of following
steps −

 Step 1 − First, we need to specify the number of clusters, K, to be generated by this algorithm.

 Step 2 − Next, randomly select K data points and assign each data point to a cluster.
In simple words, classify the data based on the number of data points.

 Step 3 − Now it will compute the cluster centroids.

 Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e., until the assignment of data points to clusters no longer changes −

4.1 − First, the sum of squared distances between data points and centroids is computed.

4.2 − Now, each data point is assigned to its closest cluster (centroid).

4.3 − At last, compute the centroid of each cluster by taking the average of all data points in that cluster.

K-means follows an Expectation-Maximization approach to solve the problem. The Expectation step is used for assigning the data points to the closest cluster and the Maximization step is used for computing the centroid of each cluster.
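
A minimal NumPy sketch of this Expectation-Maximization loop is shown below; the function and variable names are illustrative, and no special handling of empty clusters is included.

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation step: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, labels

X = np.random.rand(300, 2)
centroids, labels = simple_kmeans(X, k=3)
print(centroids)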

While working with K-means algorithm we need to take care of the following things −

 While working with clustering algorithms including K-Means, it is recommended to standardize the data because such algorithms use distance-based measurement to determine the similarity between data points.

 Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to use different initializations of centroids.

The K-Means algorithm is a straightforward and efficient algorithm, and it can handle large
datasets. However, it has some limitations, such as its sensitivity to the initial centroids, its
tendency to converge to local optima, and its assumption of equal variance for all clusters.

Objective of K-means Clustering

The main goals of cluster analysis are −

 To get a meaningful intuition from the data we are working with.

 Cluster-then-predict where different models will be built for different subgroups.

Implementation of K-Means Algorithm Using Python

Python has several libraries that provide implementations of various machine learning
algorithms, including K-Means clustering. Let's see how to implement the K-Means
algorithm in Python using the scikit-learn library.

Example :

This is a simple example to understand how K-means works. In this example, we generate 300 random data points with two features and apply the K-means algorithm to generate clusters.

Step 1 − Import Required Libraries


To implement the K-Means algorithm in Python, we first need to import the required
libraries. We will use the numpy and matplotlib libraries for data processing and
visualization, respectively, and the scikit-learn library for the K-Means algorithm.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Step 2 − Generate Data

To test the K-Means algorithm, we need to generate some sample data. In this example, we
will generate 300 random data points with two features. We will visualize the data also.

X = np.random.rand(300, 2)

plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], s=20, cmap='summer')
plt.show()

Output:

Step 3 − Initialize K-Means

Next, we need to initialize the K-Means algorithm by specifying the number of clusters (K)
and the maximum number of iterations.

kmeans = KMeans(n_clusters=3, max_iter=100)


Step 4 − Train the Model

After initializing the K-Means algorithm, we can train the model by fitting the data to the
algorithm.

kmeans.fit(X)

Step 5 − Visualize the Clusters

To visualize the clusters, we can plot the data points and color them based on their assigned
cluster.

plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=20, cmap='summer')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', c='r', s=50, alpha=0.9)
plt.show()

Output:

The output of the above code will be a plot with the data points colored based on their
assigned cluster, and the centroids marked with an 'x' symbol in red color.

Spectral Clustering

Spectral Clustering is a variant of the clustering algorithm that uses the connectivity between the data points to form the clustering. It uses eigenvalues and eigenvectors of the data matrix to project the data into a lower-dimensional space in which the data points are clustered. It is based on the idea of a graph representation of data, where the data points are represented as nodes and the similarity between the data points is represented by an edge.

Steps performed for spectral Clustering


Building the Similarity Graph Of The Data: This step builds the Similarity Graph in the form
of an adjacency matrix which is represented by A. The adjacency matrix can be built in the
following manners:

 Epsilon-neighbourhood Graph: A parameter epsilon is fixed beforehand. Then, each point is connected to all the points which lie in its epsilon-radius. If all the distances between any two points are similar in scale, then typically the weights of the edges, i.e. the distances between the two points, are not stored since they do not provide any additional information. Thus, in this case, the graph built is an undirected and unweighted graph.

 K-Nearest Neighbours: A parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbours of u. Note that this leads to the formation of a weighted and directed graph because it is not always the case that for each u having v as one of the k-nearest neighbours, it will be the same case for v having u among its k-nearest neighbours. To make this graph undirected, one of the following approaches is followed:

1. Direct an edge from u to v and from v to u if either v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.

2. Direct an edge from u to v and from v to u if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v.

3. Fully-Connected Graph: To build this graph, each point is connected to every other point with an undirected edge weighted by the distance between the two points. Since this approach is used to model the local neighbourhood relationships, typically the Gaussian similarity metric is used to calculate the distance.
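
As a small sketch of the k-nearest-neighbour construction (using scikit-learn's kneighbors_graph; the dataset and k = 5 are illustrative), made undirected with approach 1 above:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=100, noise=0.05, random_state=42)

# Directed k-nearest-neighbour graph (0/1 connectivity, no edge weights stored)
knn = kneighbors_graph(X, n_neighbors=5, mode='connectivity')

# Approach 1: keep an edge if it exists in either direction (makes the graph undirected)
A = np.maximum(knn.toarray(), knn.toarray().T)
print("Adjacency matrix shape:", A.shape)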

Projecting the data onto a lower Dimensional Space: This step is done to account for the
possibility that members of the same cluster may be far away in the given dimensional
space. Thus the dimensional space is reduced so that those points are closer in the reduced
dimensional space and thus can be clustered together by a traditional clustering algorithm.
It is done by computing the Graph Laplacian Matrix.

Python Code For Graph Laplacian Matrix

To compute it, first the degree of a node needs to be defined. The degree of the i-th node is given by

d_i = Σ_{j: (i, j) ∈ E} w_ij

Note that w_ij is the weight of the edge between nodes i and j, as defined in the adjacency matrix above.

# Defining the adjacency matrix
import numpy as np

A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

The degree matrix is defined as D_ij = d_i if i = j, and 0 if i ≠ j.

# The degree matrix: a diagonal matrix of the row sums of A
D = np.diag(A.sum(axis=1))
print(D)

Thus the Graph Laplacian matrix is defined as L = D − A.

# The (unnormalized) Graph Laplacian
L = D - A
print(L)
This matrix is then normalized for mathematical efficiency. To reduce the dimensions, first the eigenvalues and the respective eigenvectors are calculated. If the number of clusters is k, then the first k eigenvalues and their eigenvectors are taken and stacked into a matrix such that the eigenvectors are the columns.

Code For Calculating eigenvalues and eigenvectors of the matrix in Python

# find eigenvalues and eigenvectors of the Graph Laplacian defined above
vals, vecs = np.linalg.eig(L)
Clustering the Data: This process mainly involves clustering the reduced data by using any traditional clustering technique, typically K-Means Clustering. First, each node is assigned a row of the normalized Graph Laplacian matrix. Then this data is clustered using any traditional technique. To transform the clustering result, the node identifier is retained.
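
Continuing the sketch above (reusing vals and vecs, and assuming an illustrative k = 3), the final step stacks the eigenvectors of the k smallest eigenvalues as columns and clusters the rows with K-Means:

from sklearn.cluster import KMeans

k = 3  # illustrative number of clusters
order = np.argsort(vals.real)     # eigenvalues from smallest to largest
U = vecs[:, order[:k]].real       # columns = eigenvectors of the k smallest eigenvalues

# Each node corresponds to one row of U; cluster those rows with K-Means
node_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(node_labels)
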
Properties:
1. Assumption-Less: Unlike many traditional techniques, this clustering technique does not assume the data to follow any particular property, which allows it to answer a more generic class of clustering problems.
2. Ease of implementation and Speed: This algorithm is easier to implement than other
clustering algorithms and is also very fast as it mainly consists of mathematical
computations.
3. Not-Scalable: Since it involves the building of matrices and computation of
eigenvalues and eigenvectors it is time-consuming for dense datasets.
4. Dimensionality Reduction: The algorithm uses eigenvalue decomposition to reduce
the dimensionality of the data, making it easier to visualize and analyze.
5. Cluster Shape: This technique can handle non-linear cluster shapes, making it
suitable for a wide range of applications.
6. Noise Sensitivity: It is sensitive to noise and outliers, which may affect the quality of
the resulting clusters.
7. Number of Clusters: The algorithm requires the user to specify the number of
clusters beforehand, which can be challenging in some cases.
8. Memory Requirements: The algorithm requires significant memory to store the
similarity matrix, which can be a limitation for large datasets.

Example:
Credit Card Data Clustering Using Spectral Clustering
The steps below demonstrate how to implement Spectral Clustering using Sklearn. The data for the following steps is the Credit Card Data, which can be downloaded from Kaggle.
Step 1: Importing the required libraries
We will first import all the libraries that are needed for this project
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
Step 2: Loading and Cleaning the Data
# Changing the working location to the location of the data
cd "C:\Users\Dev\Desktop\Kaggle\Credit_Card"
# Loading the data
X = pd.read_csv('CC_GENERAL.csv')
# Dropping the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)
# Handling the missing values if any
X.fillna(method='ffill', inplace=True)
X.head()
Output:

Step 3: Preprocessing the data to make the data visualizable

# Preprocessing the data to make it visualizable

# Scaling the Data

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Normalizing the Data

X_normalized = normalize(X_scaled)

# Converting the numpy array into a pandas DataFrame

X_normalized = pd.DataFrame(X_normalized)

# Reducing the dimensions of the data

pca = PCA(n_components = 2)

X_principal = pca.fit_transform(X_normalized)

X_principal = pd.DataFrame(X_principal)

X_principal.columns = ['P1', 'P2']


X_principal.head()

Step 4: Building the Clustering models and Visualizing the Clustering

In the steps below, two different Spectral Clustering models are built with different values for the parameter 'affinity'. Refer to the scikit-learn documentation of the SpectralClustering class for details.

a) affinity = 'rbf'

# Building the clustering model

spectral_model_rbf = SpectralClustering(n_clusters = 2, affinity ='rbf')

# Training the model and Storing the predicted cluster labels

labels_rbf = spectral_model_rbf.fit_predict(X_principal)

# Building the label to colour mapping

colours = {}

colours[0] = 'b'

colours[1] = 'y'

# Building the colour vector for each data point

cvec = [colours[label] for label in labels_rbf]

# Plotting the clustered scatter plot


b = plt.scatter(X_principal['P1'], X_principal['P2'], color='b')
y = plt.scatter(X_principal['P1'], X_principal['P2'], color='y')

plt.figure(figsize=(9, 9))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.legend((b, y), ('Label 0', 'Label 1'))
plt.show()

Output:

Step 5: Evaluating the performances

# List of different values of affinity
affinity = ['rbf', 'nearest_neighbors']

# Building the second model (affinity = 'nearest_neighbors'), mirroring the rbf model above
spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')
labels_nn = spectral_model_nn.fit_predict(X_principal)

# List of Silhouette Scores
s_scores = []

# Evaluating the performance
s_scores.append(silhouette_score(X, labels_rbf))
s_scores.append(silhouette_score(X, labels_nn))

print(s_scores)
Step 6: Comparing the performances

# Plotting a Bar Graph to compare the models

plt.bar(affinity, s_scores)
plt.xlabel('Affinity')
plt.ylabel('Silhouette Score')
plt.title('Comparison of different Clustering Models')
plt.show()

Output:

Spectral Clustering is a type of clustering algorithm in machine learning that uses eigenvectors of a similarity matrix to divide a set of data points into clusters. The basic idea behind spectral clustering is to use the eigenvectors of the Laplacian matrix of a graph to represent the data points and find clusters by applying k-means or another clustering algorithm to the eigenvectors.

Advantages of Spectral Clustering:

1. Scalability: Spectral clustering can handle large datasets and high-dimensional data,
as it reduces the dimensionality of the data before clustering.

2. Flexibility: Spectral clustering can be applied to non-linearly separable data, as it does not rely on traditional distance-based clustering methods.

3. Robustness: Spectral clustering can be more robust to noise and outliers in the data,
as it considers the global structure of the data, rather than just local distances
between data points.

Disadvantages of Spectral Clustering:

1. Complexity: Spectral clustering can be computationally expensive, especially for large datasets, as it requires the calculation of eigenvectors and eigenvalues.
2. Model selection: Choosing the right number of clusters and the right similarity matrix
can be challenging and may require expert knowledge or trial and error.

Hierarchical Clustering in Machine Learning

Hierarchical clustering is an unsupervised learning technique used to group similar data points into clusters by building a hierarchy (tree-like structure). Unlike flat clustering methods such as k-means, hierarchical clustering does not require specifying the number of clusters in advance.

The algorithm builds clusters step by step either by progressively merging smaller clusters or
by splitting a large cluster into smaller ones. The process is often visualized using
a dendrogram, which helps to understand data similarity.

Imagine we have four fruits with different weights: an apple (100g), a banana (120g), a
cherry (50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its own
group.

 Start with each fruit as its own cluster.

 Merge the closest items: grape (30g) and cherry (50g) are grouped first.

 Next, apple (100g) and banana (120g) are grouped.

 Finally, these two clusters merge into one.

Finally all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.

Dendrogram

A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group and as we move
up, similar groups are combined. The lower the merge point, the more similar the groups
are. It helps us see how things are grouped step by step.

 At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
 As we move up, the closest points are merged into a single group.

 The lines connecting the points show how they are progressively merged based on
similarity.

 The height at which they are connected shows how similar the points are to each other; the shorter the line, the more similar they are.
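
For the fruit example above, a short sketch with SciPy draws the corresponding dendrogram (single linkage on the weights; the data and labels are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Fruit weights (grams) from the example above
weights = np.array([[100], [120], [50], [30]])
names = ['apple', 'banana', 'cherry', 'grape']

# Single linkage merges the closest groups first: grape + cherry, then apple + banana
Z = linkage(weights, method='single')
dendrogram(Z, labels=names)
plt.title('Dendrogram of fruit weights')
plt.show()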

Types of Hierarchical Clustering

Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.

1. Agglomerative Clustering

2. Divisive clustering

1. Hierarchical Agglomerative Clustering

It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.

Workflow for Hierarchical Agglomerative clustering

1. Start with individual points: Each data point is its own cluster. For example if we
have 5 data points we start with 5 clusters each containing just one data point.

2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the
two data points.

3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.
4. Update distance matrix: After merging we now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.

5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until we have only one cluster left.

6. Create a dendrogram: As the process continues we can visualize the merging of clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how clusters are merged.

Implementation

Let's see the implementation of Agglomerative Clustering,

 Start with each data point as its own cluster.

 Compute distances between all clusters.

 Merge the two closest clusters based on a linkage method.

 Update the distances to reflect the new cluster.

 Repeat merging until the desired number of clusters or one cluster remains.

 The dendrogram visualizes these merges as a tree, showing cluster relationships and
distances.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, cluster_std=10, random_state=42)

clustering = AgglomerativeClustering(n_clusters=3)
labels = clustering.fit_predict(X)

# A second model that keeps all merge distances so the dendrogram can be drawn
agg = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
agg.fit(X)

def plot_dendrogram(model, **kwargs):
    # Count the number of samples under each merge in the tree
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

ax1.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=70)
ax1.set_title("Agglomerative Clustering")
ax1.set_xlabel("Feature 1")
ax1.set_ylabel("Feature 2")

plt.sca(ax2)
plot_dendrogram(agg, truncate_mode='level', p=5)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample index")
plt.ylabel("Distance")

plt.tight_layout()
plt.show()

Output :

2. Hierarchical Divisive clustering

Divisive clustering is also known as a top-down approach. Top-down clustering requires a method for splitting a cluster that contains the whole data and proceeds by splitting clusters recursively until individual data points have been split into singleton clusters.

Workflow for Hierarchical Divisive clustering :

1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.

2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically done by finding the two most dissimilar points in the cluster and using them to separate the data into two parts.

3. Repeat the process: For each of the new clusters, repeat the splitting process: choose the cluster with the most dissimilar points and split it again into two smaller clusters.

4. Stop when each data point is in its own cluster: Continue this process until every data point is its own cluster or a stopping condition (such as a predefined number of clusters) is met.
Implementation

Let's see the implementation of Divisive Clustering,

 Starts with all data points as one big cluster.


 Finds the largest cluster and splits it into two using KMeans.
 Repeats splitting the largest cluster until reaching the desired number of clusters.
 Assigns cluster labels to each data point based on the splits.
 Returns history of clusters at each step and final labels.
 Visualizes data points colored by their final cluster.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

X, _ = make_blobs(n_samples=30, centers=5, cluster_std=10, random_state=42)

def divisive_clustering(data, max_clusters=3):
    # Start with all data points in a single cluster
    clusters = [data]
    while len(clusters) < max_clusters:
        # Pick the largest cluster (by index) and split it into two with KMeans
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        cluster_to_split = clusters.pop(idx)
        kmeans = KMeans(n_clusters=2, random_state=42).fit(cluster_to_split)
        cluster1 = cluster_to_split[kmeans.labels_ == 0]
        cluster2 = cluster_to_split[kmeans.labels_ == 1]
        clusters.extend([cluster1, cluster2])
    return clusters

clusters = divisive_clustering(X, max_clusters=3)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
colors = ['r', 'g', 'b', 'c', 'm', 'y']
for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1], s=50,
                c=colors[i], label=f'Cluster {i+1}')
plt.title('Divisive Clustering Result')
plt.legend()

linked = linkage(X, method='ward')
plt.subplot(1, 2, 2)
dendrogram(linked, orientation='top',
           distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')

plt.tight_layout()
plt.show()

Output:
