MachineLearning Unit IV.pptx

The document provides an overview of K-means clustering and hierarchical clustering methods, detailing the steps involved in the K-means algorithm, its advantages and disadvantages, and challenges in unsupervised learning. It also introduces DBSCAN, a density-based clustering algorithm that identifies arbitrary shaped clusters and outliers, explaining its parameters and classification of points. Additionally, it discusses evaluation metrics for clustering algorithms and compares the pros and cons of DBSCAN.

K-means Clustering

K-means:
• The K-means algorithm clusters n objects into k partitions based on their attributes, where k < n.
• K-Means clustering is an unsupervised clustering technique.
• It is a partition-based clustering algorithm.
• A cluster is defined as a group of objects that belong to the same class.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following steps-
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as cluster centers.
• Select the cluster centers so that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each
cluster center.
• The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.
K-Means Clustering Algorithm
Step-04:
• Assign each data point to some cluster.
A data point is assigned to that cluster whose center is nearest to
that data point.
Step-05:
• Re-compute the center of each newly formed cluster.
The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
• Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met:
• The centers of the newly formed clusters do not change
• Data points remain in the same cluster
• The maximum number of iterations is reached
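A minimal NumPy sketch of Steps 01 to 06 (illustrative only; the function name, the convergence test, and the handling of empty clusters are assumptions, not part of the slides):

```python
import numpy as np

def kmeans(points, centers, max_iters=100):
    """Minimal K-means sketch following Steps 01-06 above (illustrative only)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iters):                       # Step-06: cap on iterations
        # Step-03/04: distance of every point to every center, assign to the nearest
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-05: new center = mean of the points assigned to that cluster
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        # Step-06: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```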
Squared Error Criterion
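The squared-error criterion that K-means minimizes can be written as follows (standard formulation, stated here since the slide does not reproduce it):

E = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the i-th cluster and \mu_i is its center, i.e. the mean of the points in C_i. Each pass through Steps 03 to 05 does not increase E, which is why the algorithm converges.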
Flowchart
Example
Use the K-Means Algorithm to create two clusters from the data points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5).
Solution-
• We follow the K-Means Clustering Algorithm discussed above.
• Assume A(2, 2) and C(1, 1) are the initial centers of the two clusters.
Iteration-01:
• We calculate the distance of each point from each of the centers of the two clusters.
• The distance is calculated by using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2, 2) and each of the centers of the two clusters:
Calculating Distance Between A(2, 2) and C1(2, 2):
ρ(A, C1)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (2 – 2)² + (2 – 2)² ]
= sqrt [ 0 + 0 ]
= 0
Calculating Distance Between A(2, 2) and C2(1, 1):
ρ(A, C2)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (1 – 2)² + (1 – 2)² ]
= sqrt [ 1 + 1 ]
= sqrt [ 2 ]
= 1.41
• In a similar manner, we calculate the distances of the other points from each of the two cluster centers.
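Carrying out the same calculation for the remaining points gives the Iteration-01 table below (these distances are recomputed here with the same Euclidean formula; the resulting assignments match the clusters listed next):

Given points   Distance from C1(2, 2)   Distance from C2(1, 1)   Belongs to cluster
A(2, 2)        0.00                     1.41                     C1
B(3, 2)        1.00                     2.24                     C1
C(1, 1)        1.41                     0.00                     C2
D(3, 1)        1.41                     2.00                     C1
E(1.5, 0.5)    1.58                     0.71                     C2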

From here, the new clusters are:

Cluster-01:
• The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
• The second cluster contains the points C(1, 1), E(1.5, 0.5)
Now,
• We re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
• Center of Cluster-01
• = ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
• = (2.67, 1.67)

For Cluster-02:
• Center of Cluster-02
• = ((1 + 1.5)/2, (1 + 0.5)/2)
• = (1.25, 0.75)
This completes Iteration-01.
Next,
• We proceed to Iteration-02, Iteration-03, and so on until the centers no longer change.
Iteration-02:

Given points   Distance from C1(2.67, 1.67)   Distance from C2(1.25, 0.75)   Belongs to cluster
A(2, 2)        0.73                           1.45                           C1
B(3, 2)        0.44                           2.14                           C1
C(1, 1)        1.79                           0.34                           C2
D(3, 1)        0.54                           1.76                           C1
E(1.5, 0.5)    1.45                           0.34                           C2

From here, the new clusters are:

Cluster-01:
• The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
• The second cluster contains the points C(1, 1), E(1.5, 0.5)
Here,
Since the cluster elements are the same as in the previous iteration, we stop the process.
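The same example can be reproduced with scikit-learn (a sketch assuming scikit-learn is installed; passing the initial centers via init with n_init=1 mimics the choice of A and C above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # points A, B, C, D, E
init_centers = np.array([[2.0, 2.0], [1.0, 1.0]])           # A and C as initial centers

km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # final cluster centers
```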
K-means Advantages
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence (to a local optimum).
• Easily adapts to new examples.
• Generalizes to clusters of different shapes and
sizes, such as elliptical clusters.
K-means Disadvantages
• It requires the number of clusters (k) to be specified in advance.
• It cannot handle noisy data and outliers.
• It is not suitable for identifying clusters with non-convex shapes.
Exercise Problem
Challenges in Unsupervised Learning
• The number of clusters is normally not known a priori.
• For clustering algorithms such as K-means, different initial centers may lead to different clustering results; moreover, K is unknown.
• Time complexity: partitional clustering algorithms are O(N), whereas hierarchical algorithms are O(N²).
• The similarity criterion is not clear: should we use Euclidean, Manhattan, or Hamming distance?
• In hierarchical clustering, at what stage should we stop?
• Evaluating clustering results is difficult because labels are not available at the beginning.
Hierarchical clustering
• The hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing
employees of a university in different departments, first the
employees are grouped under the different departments in the
university, and then within each department, the employees
can be grouped according to their roles such as professors,
assistant professors, supervisors, lab assistants, etc. This
creates a hierarchical structure of the employee data and eases
visualization and analysis.
Types of Hierarchical Clustering
There are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering
Types of Hierarchical Clustering
• Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
• Points in the same cluster are closer to each other.
• Points in different clusters are far apart.
• On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
• The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
Some pros and cons of Hierarchical
Clustering
Pros
• No assumption of a particular number of clusters (unlike k-means)
• May correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two clusters, it cannot be undone
• Too slow for large data sets: O(n² log n)
Agglomerative Clustering: It uses a bottom-up approach. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either
• When a certain clustering condition imposed by the user is achieved, or
• When all clusters merge into a single cluster
Variants of agglomerative methods:
1. Agglomerative Algorithm: Single Link
• Single-nearest distance or single linkage is the
agglomerative method that uses the distance
between the closest members of the two
clusters.
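Formally, the single-link (minimum) distance between two clusters C_i and C_j is the smallest distance over all pairs of their members (standard definition, added here for reference):

d_{single}(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d(x, y)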
Question: Find the clusters using the single-link technique. Use Euclidean distance and draw the dendrogram.
Sample No.   X      Y
P1           0.40   0.53
P2           0.22   0.38
P3           0.35   0.32
P4           0.26   0.19
P5           0.08   0.41
P6           0.45   0.30
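Step 1 of the procedure is to compute the initial pairwise Euclidean distance matrix for these six points. A short Python sketch of that step (illustrative only; it does not reproduce the slide's own matrix, which is not included in the text):

```python
import numpy as np

# The six sample points from the table above
points = {"P1": (0.40, 0.53), "P2": (0.22, 0.38), "P3": (0.35, 0.32),
          "P4": (0.26, 0.19), "P5": (0.08, 0.41), "P6": (0.45, 0.30)}
names = list(points)
X = np.array([points[n] for n in names])

# Pairwise Euclidean distance matrix used by the single-link method
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
for name, row in zip(names, np.round(D, 2)):
    print(name, row)
```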
Step 2: Merge the two closest members of the two clusters by finding the minimum element in the distance matrix. Here the minimum value is 0.10, and hence we combine P3 and P6 (as 0.10 appears in the P6 row and P3 column).
Now, form clusters of elements corresponding to the
minimum value and update the distance matrix. To update
the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28
Now we repeat the same process: merge the two closest members of the two clusters and find the minimum element in the distance matrix. The minimum value is 0.13, and hence we combine (P3, P6) and P4.
Now, form the clusters of elements corresponding to the minimum value and update the distance matrix. To find what we have to update in the distance matrix:
min (((P3,P6),P4), P1) = min (((P3,P6),P1), (P4,P1)) = min (0.22, 0.37) = 0.22
min (((P3,P6),P4), P2) = min (((P3,P6),P2), (P4,P2)) = min (0.14, 0.19) = 0.14
min (((P3,P6),P4), P5) = min (((P3,P6),P5), (P4,P5)) = min (0.28, 0.23) = 0.23
Again repeating the same process: the minimum value is 0.14, and hence we combine P2 and P5. Now, form the cluster of elements corresponding to the minimum value and update the distance matrix. To update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2,(P3,P6,P4)), (P5,(P3,P6,P4))) = min (0.14, 0.23) = 0.14
Again repeating the same process: the minimum value is 0.14, and hence we combine (P2, P5) and (P3, P6, P4). Now, form the cluster of elements corresponding to the minimum value and update the distance matrix. To update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1)) = min (0.23, 0.22) = 0.22
We have now reached the final solution; the dendrogram for this question is as follows:
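The same single-link clustering and dendrogram can be obtained with SciPy (a sketch assuming scipy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# 'single' linkage merges clusters by the minimum pairwise distance
Z = linkage(X, method="single", metric="euclidean")
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.show()
```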
DBSCAN Clustering
• There are different approaches and algorithms to
perform clustering tasks which can be divided
into three sub-categories:
1. Partition-based clustering: E.g. k-means,
k-median
2. Hierarchical clustering: E.g. Agglomerative,
Divisive
3. Density-based clustering: E.g. DBSCAN
Density-based clustering
• Partition-based and hierarchical clustering techniques are
highly efficient with normal shaped clusters. However,
when it comes to arbitrary shaped clusters or detecting
outliers, density-based techniques are more efficient.
• For example, the dataset in the figure below can easily be divided into three clusters using the k-means algorithm.

[Figure: k-means clustering]
Consider the following figures:
The data points in these figures are grouped in arbitrary shapes or include outliers. Density-based clustering algorithms are very efficient at finding high-density regions and outliers. It is very important to detect outliers for some tasks, e.g. anomaly detection.
DBSCAN Algorithm
• DBSCAN stands for Density-Based Spatial Clustering
of Applications with Noise. It is able to find arbitrary shaped
clusters and clusters with noise (i.e. outliers).
• In DBSCAN, instead of guessing the number of clusters, we define two hyperparameters, epsilon and minPoints, to arrive at clusters.
• Epsilon (ε): The distance that specifies the neighborhoods. Two points are considered to be neighbors if the distance between them is less than or equal to epsilon.
• minPoints (n): The minimum number of data points required to define a cluster.
DBSCAN Algorithm
Based on Epsilon (ε) and minPoints(n) parameters, points
are classified as core, border, and outlier or noise points:
• Core point: A point is a core point if there are at least
minPoints number of points (including the point itself) in
its surrounding area with radius epsilon.
• Border point: A point is a border point if it is reachable from a core point and there are fewer than minPoints points within its surrounding area.
• Outlier or Noise point: A point is an outlier if it is not a
core point and not reachable from any core points.
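A small from-scratch sketch of this classification (illustrative only; the function name is an assumption, and a full DBSCAN implementation would additionally expand clusters outward from the core points):

```python
import numpy as np

def classify_points(X, eps, min_points):
    """Label each point as 'core', 'border', or 'noise' per the definitions above."""
    X = np.asarray(X, dtype=float)
    # Pairwise distances; a point's epsilon-neighborhood includes the point itself
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_points
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & neighbors[i]):   # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```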
DBSCAN Algorithm
• These points may be better explained with visualizations.
Density connected
Three terms are necessary in order to understand DBSCAN:
• Direct density reachable: A point is called direct density
reachable if it has a core point in its neighbourhood.
• Density Connected: Two points are called density
connected if there is a core point which is density
reachable from both the points.
• Density Reachable: A point is called density reachable
from another point if they are connected through a series
of core points.
Evaluation Metrics of DBSCAN
• We will use the Silhouette score and the Adjusted Rand score for evaluating clustering algorithms.
• The Silhouette score is in the range of -1 to 1. A score near 1 is best, meaning that the data point is very compact within the cluster to which it belongs and far away from the other clusters. Values near 0 denote overlapping clusters.
• The Adjusted Rand score is bounded above by 1. More than 0.9 denotes excellent cluster recovery, above 0.8 is a good recovery, and less than 0.5 is considered to be poor recovery.
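A short scikit-learn sketch of computing both scores (the toy dataset and its ground-truth labels are assumptions purely for illustration; the Adjusted Rand score needs true labels, which real clustering problems often lack):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Toy data with known ground-truth labels (for illustration only)
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("Silhouette score   :", silhouette_score(X, labels))          # -1 to 1, near 1 is best
print("Adjusted Rand score:", adjusted_rand_score(y_true, labels))  # 1 = perfect recovery
```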
DBSCAN
Pros
• DBSCAN is better than other clustering algorithms in that it does not require a pre-set number of clusters.
• It identifies outliers as noise, unlike the Mean-Shift method, which forces such points into a cluster despite their different characteristics.
• It finds arbitrarily shaped and sized clusters quite well.
Cons
• It is not very effective when you have clusters of varying densities.
• If you have high-dimensional data, determining the distance threshold ε becomes a challenging task.
DBSCAN Algorithm
Step 1: Label core points and noise points
▪ Select a random starting point, say x
▪ Identify the neighborhood of point x using the radius ε
▪ Count the number of points, say k, in this neighborhood, including point x
▪ If k >= MinPts, then mark x as a core point; otherwise mark x as a noise point
▪ Select a new unvisited point and repeat the above steps
Step 2: Check whether a noise point can become a border point
▪ If a noise point is directly density reachable (that is, within the radius ε of a core point), mark it as a border point; it will form part of the cluster
▪ A point which is neither a core point nor a border point is marked as a noise point
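A minimal scikit-learn sketch of DBSCAN on an arbitrarily shaped dataset (assuming scikit-learn is installed; the eps and min_samples values here are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two half-moon shaped clusters that partition-based methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = ε, min_samples = MinPts
labels = db.labels_                          # label -1 marks noise/outlier points

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", np.sum(labels == -1))
```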
Thank you
