K Means Clustering Problem Solved

The k-means clustering algorithm involves the following steps:
1. Randomly select k objects as the initial cluster centroids.
2. Assign each object to the closest centroid to form k clusters.
3. Recalculate the centroid of each cluster.
4. Repeat steps 2-3 until the centroids stop changing or the maximum number of iterations is reached.


 Assumes Euclidean space/distance
 Start by picking k, the number of clusters
 Initialize clusters by picking one point per cluster
 Example: pick one point at random, then pick k-1 other points, each as far away as possible from the previously chosen points (a sketch of this initialization is given below)

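The farthest-point initialization described in the example above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides; the helper names (euclidean, farthest_point_init) are placeholders:

```python
import math
import random

def euclidean(p, q):
    # L2 distance between two points of equal dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def farthest_point_init(points, k, seed=0):
    """Pick one point at random, then k-1 more points,
    each as far as possible from the points already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # for each candidate, look at its distance to the nearest chosen centroid
        next_pt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(next_pt)
    return centroids

# e.g. farthest_point_init(list_of_points, k) returns k well-spread starting centroids
```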
 1) For each point, place it in the cluster whose current centroid is nearest
 2) After all points are assigned, update the locations of the centroids of the k clusters
 3) Reassign all points to their closest centroid
 Sometimes this moves points between clusters
 Repeat steps 2 and 3 until convergence
 Convergence: points no longer move between clusters and the centroids stabilize
Algorithm 16.1: k-Means clustering
Input: D is a dataset containing n objects, k is the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D do
• Compute the distance between the current object and the k cluster centroids
• Assign the current object to the cluster whose centroid is closest.
3. Compute the “cluster centers” of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied
5. Stop

Note:
1) Objects are defined in terms of a set of attributes, where each attribute is of a continuous data type.
2) Distance computation: any distance measure, such as Euclidean distance, or cosine similarity.
3) Minimum distance is the measure of closeness between an object and a centroid.
4) Mean calculation: the new centroid is the mean of each attribute value over all objects in the cluster.
5) Convergence criteria: any one of the following can serve as the termination condition of the algorithm.
• The maximum permissible number of iterations is reached.
• No change of centroid values in any cluster.
• Zero (or no significant) movement of objects from one cluster to another.
• Cluster quality reaches a certain acceptable level.

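A minimal Python sketch of Algorithm 16.1, under the assumptions listed in the note above (continuous attributes, Euclidean distance, mean-based centroids, and "no change of centroid values" as the convergence criterion). The helper names are illustrative, not from any particular library:

```python
import math
import random

def euclidean(p, q):
    # L2 distance between two objects represented as tuples of attribute values
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(cluster):
    # attribute-wise mean of all objects in the cluster
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def k_means(D, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly choose k objects from D as the initial centroids
    centroids = rng.sample(D, k)
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for obj in D:
            nearest = min(range(k), key=lambda i: euclidean(obj, centroids[i]))
            clusters[nearest].append(obj)
        # Step 3: recompute the centroid of each cluster (empty clusters keep the old one)
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids
```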
[Figure: example clusters after round 1 (x … data point, the remaining marker … centroid)]

[Figure: clusters after round 2]

[Figure: clusters at the end]
Fig 16.1: Plotting data of Table 16.1 (A1 on the horizontal axis, A2 on the vertical axis)

Table 16.1: 16 objects with two attributes A1 and A2

A1     A2
6.8    12.6
0.8    9.8
1.2    11.6
2.8    9.6
3.8    9.9
4.4    6.5
4.8    1.1
6.0    19.9
6.2    18.5
7.6    17.4
7.8    12.2
6.6    7.7
8.2    4.5
8.4    6.9
9.0    3.4
9.6    11.1
• Suppose k = 3. Three objects are chosen at random, shown circled in Fig 16.1. These three centroids are listed below.

Initial centroids chosen randomly

Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5

• Let us use the Euclidean distance measure (L2 norm) as the distance measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3 respectively. The distance calculations are shown in Table 16.2.
• The assignment of each object to its nearest centroid is shown in the right-most column, and the clustering so obtained is shown in Fig 16.2.

Table 16.2: Distance calculation
Fig 16.2: Initial clusters with respect to Table 16.2

A1    A2     d1     d2     d3     cluster
6.8   12.6   4.0    1.1    5.9    2
0.8   9.8    3.0    7.4    10.2   1
1.2   11.6   3.1    6.6    8.5    1
2.8   9.6    1.0    5.6    9.5    1
3.8   9.9    0.0    4.6    8.9    1
4.4   6.5    3.5    6.6    12.1   1
4.8   1.1    8.9    11.5   17.5   1
6.0   19.9   10.2   7.9    1.4    3
6.2   18.5   8.9    6.5    0.0    3
7.6   17.4   8.4    5.2    1.8    3
7.8   12.2   4.6    0.0    6.5    2
6.6   7.7    3.6    4.7    10.8   1
8.2   4.5    7.0    7.7    14.1   1
8.4   6.9    5.5    5.3    11.8   2
9.0   3.4    8.3    8.9    15.4   1
9.6   11.1   5.9    2.1    8.1    2
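The distances and assignments of Table 16.2 can be reproduced with a short script. This is only a verification sketch, assuming the 16 objects of Table 16.1 as (A1, A2) tuples and the three initial centroids chosen above:

```python
import math

data = [(6.8, 12.6), (0.8, 9.8), (1.2, 11.6), (2.8, 9.6), (3.8, 9.9),
        (4.4, 6.5), (4.8, 1.1), (6.0, 19.9), (6.2, 18.5), (7.6, 17.4),
        (7.8, 12.2), (6.6, 7.7), (8.2, 4.5), (8.4, 6.9), (9.0, 3.4),
        (9.6, 11.1)]
centroids = [(3.8, 9.9), (7.8, 12.2), (6.2, 18.5)]   # c1, c2, c3

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# print one row of Table 16.2 per object: d1, d2, d3 and the assigned cluster
for a1, a2 in data:
    d = [euclidean((a1, a2), c) for c in centroids]
    cluster = d.index(min(d)) + 1
    print(f"{a1:4.1f} {a2:5.1f}  " + "  ".join(f"{x:4.1f}" for x in d) + f"  {cluster}")
```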
The calculation of the new centroids of the three clusters, using the mean of the attribute values A1 and A2 of their member objects, is shown in the table below. The clusters with the new centroids are shown in Fig 16.3.

Calculation of new centroids

New Centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6

Fig 16.3: Initial clusters with the new centroids

We next reassign the 16 objects to three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 16.4.
Note that point p moves from cluster C2 to cluster C1.

Fig 16.4: Cluster after first iteration

• The newly obtained centroids after the second iteration are given in the table below. Note that the centroid c3 remains unchanged, while c1 and c2 change a little.
• With respect to the newly obtained cluster centres, the 16 points are reassigned again. These are the same clusters as before; hence, their centroids also remain unchanged.
• Taking this as the termination criterion, the k-means algorithm stops here. Hence, the final clusters in Fig 16.5 are the same as in Fig 16.4.
Fig 16.5: Clusters after the second iteration

Cluster centres after the second iteration

Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6

How to select k?
 Try different values of k, looking at the change in the average distance to centroid as k increases
 The average falls rapidly until the right k is reached, then changes little (a code sketch of this "elbow" check follows the figure below)

[Figure: average distance to centroid plotted against k; the elbow of the curve marks the best value of k]

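The average-distance curve sketched above can be produced by running k-means for several values of k and recording the mean distance of each point to its assigned centroid. The rough sketch below is not standalone: it reuses the illustrative k_means and euclidean helpers from the sketch after Algorithm 16.1 and the data list from the Table 16.2 check.

```python
def average_distance_to_centroid(D, clusters, centroids):
    # mean distance of every point to the centroid of its own cluster
    total = sum(euclidean(p, centroids[i])
                for i, cluster in enumerate(clusters) for p in cluster)
    return total / len(D)

# Try increasing k and look for the "elbow" where the average stops falling quickly
for k in range(1, 10):
    clusters, centroids = k_means(data, k)
    print(k, round(average_distance_to_centroid(data, clusters, centroids), 2))
```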
[Figure: too few clusters, so many long distances to centroid]

[Figure: just the right number of clusters, distances rather short]

[Figure: too many clusters, little improvement in average distance]
 Group the medicines below into two groups based on two features, weight and pH

Medicine   Weight (X)   pH (Y)
A          1            1
B          2            1
C          4            3
D          5            4

 Given k = 2, take the initial centroids as A and B

Medicine   Weight (X)   pH (Y)   Assignment
A          1            1        Cluster 1
B          2            1        Cluster 2
C          4            3        ?
D          5            4        ?

 The cluster points can also be represented as A(1,1), B(2,1), C(4,3), D(5,4)
 Euclidean distance formula:
 d[(x,y),(a,b)] = √((x−a)² + (y−b)²)

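Plugging the two unassigned medicines into this formula gives the numbers used on the next slide. A quick check in Python (the point names follow the table above):

```python
import math

def d(p, q):
    # Euclidean distance: sqrt((x-a)^2 + (y-b)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

A, B, C, D = (1, 1), (2, 1), (4, 3), (5, 4)

print(round(d(C, A), 2), round(d(C, B), 2))   # 3.61 2.83 -> C joins cluster 2
print(round(d(D, A), 2), round(d(D, B), 2))   # 5.0 4.24  -> D joins cluster 2
```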
 d[(4,3),(1,1)] = √(3² + 2²) = √13 = 3.61
 d[(5,4),(1,1)] = √(4² + 3²) = √25 = 5
 d[(4,3),(2,1)] = √(2² + 2²) = √8 = 2.83
 d[(5,4),(2,1)] = √(3² + 3²) = √18 = 4.24

Point    d(1,1)   d(2,1)   Cluster
A(1,1)   0        x        1
B(2,1)   x        0        2
C(4,3)   ?        ?        ?
D(5,4)   ?        ?        ?

 d[(4,3),(1,1)] = √(3² + 2²) = √13 = 3.61
 d[(5,4),(1,1)] = √(4² + 3²) = √25 = 5
 d[(4,3),(2,1)] = √(2² + 2²) = √8 = 2.83
 d[(5,4),(2,1)] = √(3² + 3²) = √18 = 4.24

Point    d(1,1)   d(2,1)   Cluster
A(1,1)   0        x        1
B(2,1)   x        0        2
C(4,3)   3.61     2.83     2
D(5,4)   5        4.24     2

 Centroid Calculation

Cluster   Points                    Centroid
1         A(1,1)                    (1,1)
2         B(2,1), C(4,3), D(5,4)    ((2+4+5)/3, (1+3+4)/3) = (3.67,2.67)

 d[(2,1),(1,1)] = √(1² + 0²) = 1
 d[(4,3),(1,1)] = √(3² + 2²) = √13 = 3.61
 d[(5,4),(1,1)] = √(4² + 3²) = √25 = 5
 d[(2,1),(3.67,2.67)] = √(1.67² + 1.67²) = 2.36
 d[(4,3),(3.67,2.67)] = √(0.33² + 0.33²) = 0.47
 d[(5,4),(3.67,2.67)] = √(1.33² + 1.33²) ≈ 1.89

Point    d(1,1)   d(3.67,2.67)   Cluster
A(1,1)   0        x              1
B(2,1)   ?        ?              ?
C(4,3)   ?        ?              ?
D(5,4)   ?        ?              ?

 d[(2,1),(1,1)] = √(1² + 0²) = 1
 d[(4,3),(1,1)] = √(3² + 2²) = √13 = 3.61
 d[(5,4),(1,1)] = √(4² + 3²) = √25 = 5
 d[(2,1),(3.67,2.67)] = √(1.67² + 1.67²) = 2.36
 d[(4,3),(3.67,2.67)] = √(0.33² + 0.33²) = 0.47
 d[(5,4),(3.67,2.67)] = √(1.33² + 1.33²) ≈ 1.89

Point    d(1,1)   d(3.67,2.67)   Cluster
A(1,1)   0        x              1
B(2,1)   1        2.36           1
C(4,3)   3.61     0.47           2
D(5,4)   5        1.89           2

 Centroid Calculation

Cluster   Points              Centroid
1         A(1,1), B(2,1)      ((1+2)/2, (1+1)/2) = (1.5,1)
2         C(4,3), D(5,4)      ((4+5)/2, (3+4)/2) = (4.5,3.5)

 d[(1,1),(1.5,1)] = √(0.5² + 0²) = 0.5
 d[(2,1),(1.5,1)] = √(0.5² + 0²) = 0.5
 d[(4,3),(1.5,1)] = √(2.5² + 2²) = 3.20
 d[(5,4),(1.5,1)] = √(3.5² + 3²) = 4.61
 d[(1,1),(4.5,3.5)] = √(3.5² + 2.5²) = 4.30
 d[(2,1),(4.5,3.5)] = √(2.5² + 2.5²) = 3.54
 d[(4,3),(4.5,3.5)] = √(0.5² + 0.5²) = 0.71
 d[(5,4),(4.5,3.5)] = √(0.5² + 0.5²) = 0.71

Point    d(1.5,1)   d(4.5,3.5)   Cluster
A(1,1)   ?          ?            ?
B(2,1)   ?          ?            ?
C(4,3)   ?          ?            ?
D(5,4)   ?          ?            ?

 d[(1,1),(1.5,1)] = √(0.5² + 0²) = 0.5
 d[(2,1),(1.5,1)] = √(0.5² + 0²) = 0.5
 d[(4,3),(1.5,1)] = √(2.5² + 2²) = 3.20
 d[(5,4),(1.5,1)] = √(3.5² + 3²) = 4.61
 d[(1,1),(4.5,3.5)] = √(3.5² + 2.5²) = 4.30
 d[(2,1),(4.5,3.5)] = √(2.5² + 2.5²) = 3.54
 d[(4,3),(4.5,3.5)] = √(0.5² + 0.5²) = 0.71
 d[(5,4),(4.5,3.5)] = √(0.5² + 0.5²) = 0.71

Point    d(1.5,1)   d(4.5,3.5)   Cluster
A(1,1)   0.5        4.30         1
B(2,1)   0.5        3.54         1
C(4,3)   3.20       0.71         2
D(5,4)   4.61       0.71         2

 Centroid Calculation: no point has changed its cluster, so the centroids remain (1.5,1) and (4.5,3.5)

 As the clusters have not changed, the clustering has reached stability.

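The whole worked example can be replayed end to end with a small standalone script. This is a sketch for this specific dataset (it assumes no cluster ever becomes empty), forcing A and B as the starting centroids as in the example rather than choosing them at random:

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(cluster):
    # attribute-wise mean of the points in the cluster
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

# start from A and B as the two initial centroids, as in the example
centroids = [points["A"], points["B"]]
assignment = None
while True:
    new_assignment = {name: min((0, 1), key=lambda i: euclidean(p, centroids[i]))
                      for name, p in points.items()}
    if new_assignment == assignment:      # no point changed its cluster: stable
        break
    assignment = new_assignment
    centroids = [mean_point([points[n] for n, c in assignment.items() if c == i])
                 for i in (0, 1)]

print(assignment)   # expected: A and B in cluster 0, C and D in cluster 1
print(centroids)    # expected: (1.5, 1.0) and (4.5, 3.5)
```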
