P-3 1 2-Kmeans
Course Objectives
• CO-3:
• CO-4:
Partitional Clustering
• Partitional clustering divides the data into non-hierarchical groups; it is also known as the centroid-based method. The most common example of partitional clustering is the K-Means algorithm.
• In this type, the dataset is divided into a set of k groups, where k is the pre-defined number of groups. Each cluster center is chosen so that data points are closer to their own cluster centroid than to the centroid of any other cluster.
Given
• A data set of n objects
• K, the number of clusters to form
• Organize the objects into k partitions (k ≤ n), where each partition represents a cluster
• The clusters are formed to optimize an objective partitioning criterion (one common criterion is shown below)
• Objects within a cluster are similar
• Objects in different clusters are dissimilar
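A common partitioning criterion, referred to later in these slides as SSE, is the within-cluster sum of squared errors; assuming numeric data and Euclidean distance, it can be written as

\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x

where C_i is the i-th cluster and \mu_i is its centroid (mean).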
K-means Clustering
• The basic algorithm is very simple (a minimal sketch follows below)
• The number of clusters, K, must be specified
• Each cluster is associated with a centroid (mean or center point)
• Each point is assigned to the cluster with the closest centroid
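A minimal Python sketch of this basic algorithm (an illustration written for these notes, not the exact code behind the slides; it assumes the data is a 2-D NumPy array of numeric points):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: pick k random points as initial centroids,
    then alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    sse = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squared errors
    return centroids, labels, sse

Calling kmeans(X, 3) on 2-D data reproduces the behaviour illustrated in the figures that follow: points are repeatedly reassigned and centroids recomputed until nothing changes.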
K-means clustering
[Figure: scatter plots (x vs. y) of the original points and the cluster assignments over Iterations 1–6.]
Importance of Choosing Initial Centroids (Case ii)
[Figure: scatter plots (x vs. y) of the cluster assignments over Iterations 1–5 for a different choice of initial centroids.]
Problems with Selecting Initial Points
• If there are K ‘real’ clusters, then the chance of randomly selecting one initial centroid from each cluster is small.
• The chance is especially small when K is large.
• If the clusters are all of the same size, n, then the probability of selecting one initial centroid from each cluster is
P = \frac{K!\, n^{K}}{(Kn)^{K}} = \frac{K!}{K^{K}}
• For example, with K = 10 this probability is 10!/10^{10} ≈ 0.00036, so a random start rarely places one centroid in each real cluster.
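A quick way to see how fast this probability collapses as K grows (a small illustrative snippet, not part of the original slides):

from math import factorial

# Probability of picking one initial centroid from each of the K equal-size
# clusters when K centroids are chosen at random: K! / K^K
for K in (2, 3, 5, 10):
    p = factorial(K) / K**K
    print(f"K = {K:2d}: P = {p:.6f}")
# K = 10 gives P ≈ 0.000363, i.e. a purely random start almost never
# places one centroid in each true cluster.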
K-means example, step 1
[Figure: points in the X–Y plane; pick 3 initial cluster centers k1, k2, k3 at random.]
K-means example, step 2
[Figure: assign each point to the closest cluster center k1, k2, or k3.]
K-means example, step 3
[Figure: move each cluster center to the mean of the points in its cluster; k1, k2, k3 shift from their previous positions.]
K-means example, step 4
[Figure: reassign the points that are now closest to a different cluster center. Q: Which points are reassigned?]
K-means example, step 4 …
[Figure: A: three points are reassigned (highlighted in the original animation).]
K-means example, step 4b
[Figure: re-compute the cluster means after the reassignment.]
K-means example, step 5
[Figure: move the cluster centers k1, k2, k3 to the new cluster means.]
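The same sequence of steps can be reproduced with an off-the-shelf implementation. The snippet below uses scikit-learn's KMeans on toy 2-D data (an illustrative example, not taken from the slides; the blob locations are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: three loose blobs of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(3, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(1.5, 2.5), scale=0.3, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids (analogous to k1, k2, k3)
print(km.labels_[:10])       # cluster index of the first 10 points
print(km.inertia_)           # within-cluster SSE at convergence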
An example distance function
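The specific distance function shown on the original slide is not recoverable from this extract; a common choice for numeric data, assumed here for illustration, is the Euclidean distance between a point x and a centroid \mu in d dimensions:

d(x, \mu) = \lVert x - \mu \rVert_2 = \sqrt{\sum_{j=1}^{d} (x_j - \mu_j)^2}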
Strengths of k-means
• Simple: easy to understand and to implement.
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
• Since both k and t are usually small, k-means is considered a linear-time algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE (sum of squared errors) is used as the objective; the global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable when the mean is defined.
• For categorical data, k-modes can be used: the centroid is represented by the most frequent value of each attribute (a minimal sketch follows this list).
• The user needs to specify k.
• The algorithm is sensitive to outliers.
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or special data points with very different values.
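As a rough illustration of the k-modes idea mentioned above (a sketch written for these notes, not code from the slides), the centroid update replaces the mean with the per-attribute mode:

from collections import Counter

def mode_centroid(records):
    """Centroid of categorical records: the most frequent value of each attribute."""
    n_attrs = len(records[0])
    return tuple(
        Counter(rec[j] for rec in records).most_common(1)[0][0]
        for j in range(n_attrs)
    )

# Example: three categorical records with two attributes each.
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_centroid(cluster))  # ('red', 'small')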
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (a minimal sketch follows this list).
• To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
• The remaining data points are then assigned to the clusters by distance or similarity comparison, or by classification.
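A minimal sketch of the first idea, removing points that are much further from their assigned centroid than typical points (the 3× median-distance threshold is an assumption chosen for illustration):

import numpy as np

def drop_far_points(X, centroids, labels, factor=3.0):
    """Remove points whose distance to their assigned centroid is more than
    `factor` times the median point-to-centroid distance."""
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    keep = dists <= factor * np.median(dists)
    return X[keep], labels[keep]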
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
• If we use different seeds: good results.
• There are some methods to help choose good seeds (one well-known seeding scheme is sketched below).
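One widely used seeding method is k-means++ (the slide does not name a specific method, so this is an assumed example): the first centroid is chosen uniformly at random, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest centroid already picked.

import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ initialization: spread the initial centroids out so that
    each new seed is likely to land in a not-yet-covered region."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]           # first seed: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing seed.
        d2 = np.min(
            ((X[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()                   # sample proportional to D(x)^2
        seeds.append(X[rng.choice(len(X), p=probs)])
    return np.array(seeds)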
Pros and cons of K-Means
Advantages:
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages:
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency; other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
• Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!
K-means variations
• Video links:
• https://www.youtube.com/watch?v=SeswFFdH03U
• https://www.youtube.com/watch?v=iNlZ3IU5Ffw
• Web links:
• https://en.wikipedia.org/wiki/K-means_clustering
• https://link.springer.com/10.1007/978-0-387-30164-8_425
• https://towardsdatascience.com/a-practical-guide-on-k-means-clustering-ca3bef3c853d
THANK YOU