Machine Learning Unit IV
K-means:
• The K-means algorithm clusters n objects into k partitions based on their attributes, where k < n.
• K-means clustering is an unsupervised clustering technique.
• It is a partition-based clustering algorithm.
• A cluster is a group of objects that belong to the same class, i.e. objects that are similar to one another.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following steps-
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as cluster centres.
• Select the cluster centers so that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each
cluster center.
• The distance may be calculated using a given distance function or the Euclidean distance formula.
Step-04:
• Assign each data point to some cluster.
A data point is assigned to the cluster whose center is nearest to it.
Step-05:
• Re-compute the center of newly formed clusters.
The center of a cluster is computed by taking mean of all the data
points contained in that cluster.
Step-06:
• Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met-
• The centers of the newly formed clusters do not change
• The data points remain in the same clusters
• The maximum number of iterations is reached
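To make the procedure concrete, here is a minimal NumPy sketch of Steps 01 to 06; the function name, the random initialization, and the convergence check are illustrative assumptions, not part of the slides.

```python
import numpy as np

def k_means(points, k, max_iterations=100):
    """A minimal k-means following Steps 01-06 above (empty clusters are not handled)."""
    rng = np.random.default_rng(0)
    # Step-01/02: choose K and randomly pick K data points as initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):                    # Step-06: iteration limit
        # Step-03: Euclidean distance from every point to every center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to its nearest center
        assignments = distances.argmin(axis=1)
        # Step-05: recompute each center as the mean of its assigned points
        new_centers = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
        # Step-06: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments
```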
Squared Error Criterion
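The criterion usually meant here is the sum of squared errors (SSE) between each data point and the center of its cluster, which k-means tries to minimize:

E = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2}

where \mu_i is the mean (center) of cluster C_i; a clustering with a smaller E is considered better.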
Flowchart
Example
Use the K-Means algorithm to create two clusters-
Solution-
• We follow the K-Means clustering algorithm discussed above.
• Assume A(2, 2) and C(1, 1) are the centers of the two clusters.
Iteration-01:
• We calculate the distance of each point from each of the two cluster centers.
• The distance is calculated using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2, 2) and each of the two cluster centers-
Calculating Distance Between A(2, 2) and C1(2, 2)-
ρ(A, C1)
= sqrt[(x2 − x1)² + (y2 − y1)²]
= sqrt[(2 − 2)² + (2 − 2)²]
= sqrt[0 + 0]
= 0
Calculating Distance Between A(2, 2) and C2(1, 1)-
ρ(A, C2)
= sqrt[(x2 − x1)² + (y2 − y1)²]
= sqrt[(1 − 2)² + (1 − 2)²]
= sqrt[1 + 1]
= sqrt[2]
= 1.41
• In a similar manner, we calculate the distance of the other points from each of the two cluster centers.
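These hand calculations are easy to check in code; a small sketch assuming only the points already shown above:

```python
from math import dist

A, C1, C2 = (2, 2), (2, 2), (1, 1)
print(dist(A, C1))            # 0.0  -> A stays in Cluster-01
print(round(dist(A, C2), 2))  # 1.41
```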
For Cluster-02:
• Center of Cluster-02
• = ((1 + 1.5)/2, (1 + 0.5)/2)
• = (1.25, 0.75)
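The recomputed center can be verified the same way; the two member points (1, 1) and (1.5, 0.5) are taken from the calculation above:

```python
import numpy as np

cluster_2 = np.array([[1.0, 1.0],
                      [1.5, 0.5]])
print(cluster_2.mean(axis=0))  # [1.25 0.75]
```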
This completes Iteration-01.
Next, we go to Iteration-02, Iteration-03, and so on until the centers no longer change.
Iteration-02:
Given points | Distance of data points from cluster center (2.67, 1.67) | Distance of data points from cluster center (1.25, 0.75) | Point belongs to cluster
Hierarchical Clustering
• Hierarchical clustering divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
• The agglomerative method starts with each object in its own cluster and merges clusters bottom-up.
• On the other hand, the divisive method starts with one cluster with all
given objects and then splits it iteratively to form smaller clusters
• The agglomerative hierarchical clustering method uses
the bottom-up strategy. It starts with each object forming
its own cluster and then iteratively merges the clusters
according to their similarity to form larger clusters. It
terminates either when a certain clustering condition
imposed by the user is achieved or all the clusters merge
into a single cluster.
Some pros and cons of Hierarchical
Clustering
Pros
• No assumption of a particular number of
clusters (unlike k-means)
• May correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two
clusters, it can’t be undone
• Too slow for large data sets: O(n² log n)
Agglomerative Clustering: It uses a bottom-up approach. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either
• when a certain clustering condition imposed by the user is achieved, or
• when all clusters merge into a single cluster
Variants of agglomerative methods:
1. Agglomerative Algorithm: Single Link
• Single-nearest distance or single linkage is the
agglomerative method that uses the distance
between the closest members of the two
clusters.
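A minimal sketch of this definition, assuming each cluster is given as a list of 2-D points (the function name is illustrative):

```python
import numpy as np

def single_link_distance(cluster_a, cluster_b):
    # Single linkage: distance between the closest members of the two clusters
    return min(np.linalg.norm(np.array(a) - np.array(b))
               for a in cluster_a for b in cluster_b)
```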
Question. Find the clusters using a single link
technique. Use Euclidean distance and draw
the dendrogram.
Sample No.   X      Y
P1           0.40   0.53
P2           0.22   0.38
P3           0.35   0.32
P4           0.26   0.19
P5           0.08   0.41
P6           0.45   0.30
Step 2: Merge the two closest members of the two clusters by finding the minimum element in the distance matrix. Here the minimum value is 0.10, so we combine P3 and P6 (since 0.10 appears in the P6 row and P3 column).
Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28
Now we repeat the same process: merge the two closest members of the two clusters and find the minimum element in the distance matrix. The minimum value is 0.13, so we combine (P3,P6) and P4.
Now, form the cluster of elements corresponding to the minimum value and update the distance matrix. To find what has to be updated in the distance matrix:
min (((P3,P6), P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22, 0.37) = 0.22
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14,0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28,0.23) = 0.23
Again repeating the same process: the minimum value is 0.14, so we combine P2 and P5. Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2,(P3,P6,P4)), (P5,(P3,P6,P4)))
= min (0.14, 0.23) = 0.14
Again repeating the same process: the minimum value is 0.14, so we combine (P2,P5) and (P3,P6,P4). Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1))
= min (0.23, 0.22) = 0.22
We have now reached the solution; the dendrogram for this question is as follows:
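The same clustering and dendrogram can be reproduced programmatically; a minimal sketch assuming SciPy and Matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Points P1..P6 from the question above
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Single-linkage agglomerative clustering with Euclidean distances
Z = linkage(points, method="single", metric="euclidean")

# The merge heights correspond to the minima found by hand (0.10, 0.13, 0.14, ...)
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.show()
```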
DBSCAN Clustering
• There are different approaches and algorithms for performing clustering tasks, which can be divided into three sub-categories:
1. Partition-based clustering: E.g. k-means,
k-median
2. Hierarchical clustering: E.g. Agglomerative,
Divisive
3. Density-based clustering: E.g. DBSCAN
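A minimal scikit-learn sketch showing one algorithm from each sub-category; the synthetic dataset and the parameter values are illustrative assumptions:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# 1. Partition-based clustering
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 2. Hierarchical (agglomerative) clustering
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# 3. Density-based clustering
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```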
Density-based clustering
• Partition-based and hierarchical clustering techniques are
highly efficient with normal shaped clusters. However,
when it comes to arbitrary shaped clusters or detecting
outliers, density-based techniques are more efficient.
• For example, the dataset in the figure below can easily be divided into three clusters using the k-means algorithm.
[Figure: k-means clustering]
Consider the following figures: