
University Institute of Engineering
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Bachelor of Engineering (Computer Science & Engineering)
Artificial Intelligence and Machine Learning (21CSH-316)
Prepared by: Sitaram Patel (E13285)
Topic: K-means Clustering

DISCOVER . LEARN . EMPOWER


Course Outcomes
CO-1: Understand the fundamental concepts and techniques of artificial intelligence and machine learning.

CO-2: Apply the basics of Python programming to various problems related to AI & ML.

CO-3:

CO-4:

CO-5: Apply various unsupervised machine learning models and evaluate their performance.
Course Objectives

• To understand the history and development of Machine Learning.
• To study learning processes: supervised and unsupervised, deterministic and statistical learners, and ensemble learning.
• To provide a comprehensive foundation to Machine Learning and Optimization methodology, with applications.
• To understand modern techniques and practical knowledge of Machine Learning, with trends of Machine Learning.
Partitional Clustering
• It is a type of clustering that divides the data into non-hierarchical groups. It is
also known as the centroid-based method. The most common example of
partitioning clustering is the K-Means Clustering algorithm.
• In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. The cluster centers are chosen so that each data point is closer to the centroid of its own cluster than to the centroid of any other cluster.
Given
• A data set of n objects
• K, the number of clusters to form
• Organize the objects into K partitions (K ≤ n), where each partition represents a
cluster
• The clusters are formed to optimize an objective partitioning criterion
• Objects within a cluster are similar
• Objects of different clusters are dissimilar
K-means Clustering
• The basic algorithm is very simple
• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (mean or center point)
• Each point is assigned to the cluster with the closest centroid
K-means clustering

Works with numeric data only.

1) Pick a number (K) of cluster centers at random.
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance).
3) Move each cluster center to the mean of its assigned items.
4) Repeat steps 2–3 until convergence (change in cluster assignments less than a threshold).
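
As an illustration of these four steps, here is a minimal NumPy sketch (not from the slides); the data X, the choice k = 3, and the convergence test are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k cluster centers at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the cluster assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # guard against an empty cluster
                centers[j] = members.mean(axis=0)
    return centers, labels

# Illustrative usage on random 2-D data:
X = np.random.default_rng(1).normal(size=(300, 2))
centers, labels = kmeans(X, k=3)
```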
K-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation,
etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points
change clusters’ or some measure of clustering doesn’t change.
• Complexity is O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster
• To get SSE, we square these errors and sum them.
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}^2(m_i, x)$$

• Here x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$.
• One can show that $m_i$ corresponds to the center (mean) of the cluster.
• Given two sets of clusters, we can choose the one with the smallest error.
• One easy way to reduce SSE is to increase K, the number of clusters.
• However, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
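
As a concrete sketch, SSE takes only a couple of lines; X, labels, and centers are assumed to come from a finished run of the kmeans sketch above.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its centroid m_i."""
    diffs = X - centers[labels]       # per-point error vector
    return float(np.sum(diffs ** 2))  # square the errors and sum them
```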
Two different K-means Clusterings
[Figure: the original points (top) and two different K-means clusterings of them — an optimal clustering (bottom left) and a sub-optimal clustering (bottom right).]


Importance of Choosing Initial Centroids (Case i)

Iteration 6
1
2
3
4
5
3

2.5

1.5

y
1

0.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2


x
Importance of Choosing Initial Centroids (Case i)
[Figure: the same run shown frame by frame, Iteration 1 through Iteration 6.]
Importance of Choosing Initial Centroids (Case ii)

[Figure: animation of Iterations 1–5 for a different choice of initial centroids.]
Importance of Choosing Initial Centroids (Case ii)
[Figure: the same run shown frame by frame, Iteration 1 through Iteration 5.]
Problems with Selecting Initial Points
• If there are K ‘real’ clusters, the chance of selecting one centroid from each cluster is small.
• The chance is especially small when K is large.
• If the clusters are the same size, n, then

$$P = \frac{\text{ways to pick one centroid per cluster}}{\text{ways to pick } K \text{ centroids}} = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$$

• For example, if K = 10, the probability is 10!/10^10 ≈ 0.00036 (verified in the snippet below).
• Sometimes the initial centroids readjust themselves in the ‘right’ way, and sometimes they don’t.
• Consider an example of five pairs of clusters.
• Initial centers drawn from different clusters tend to produce good clusters.
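
The 0.00036 figure is easy to verify:

```python
import math

K = 10
p = math.factorial(K) / K ** K  # K!/K^K
print(p)                        # 0.00036288
```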
Another example

[Figures: a further example of initialization, with the initial centroids marked +; continued on the next slide.]
K-means example, step 1

Pick 3 initial cluster centers k1, k2, k3 (randomly).

[Figure: the data points in the X–Y plane with the three random centers k1, k2, k3.]
K-means example, step 2

Assign each point to the closest cluster center.

[Figure: points partitioned among k1, k2, k3.]
K-means example, step 3

Move each cluster center to the mean of its assigned points.

[Figure: k1, k2, k3 shift from their old positions to the cluster means.]
K-means example, step 4
Reassign the points that are now closest to a different cluster center.

Q: Which points are reassigned?

[Figure: the updated centers k1, k2, k3.]
K-means example, step 4 …

A: three points are reassigned (shown with animation).

[Figure: the three reassigned points.]
K-means example, step 4b

Re-compute the cluster means.

[Figure: new means for k1, k2, k3.]
K-means example, step 5

Move the cluster centers to the cluster means.

[Figure: final positions of k1, k2, k3.]
An example distance function

For numeric data, the usual choice is Euclidean distance:

$$d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$$
Strengths of k-means
• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
• Since both k and t are usually small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable when the mean is defined.
• For categorical data, use k-modes, where the centroid is represented by the most frequent values (see the sketch after this list).
• The user needs to specify k.
• The algorithm is sensitive to outliers.
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording, or special data points with very different values.
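
A minimal sketch of the k-modes centroid idea, using only the standard library; the example records are hypothetical.

```python
from collections import Counter

def mode_centroid(rows):
    """k-modes centroid: the most frequent value in each categorical attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Hypothetical categorical cluster:
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_centroid(cluster))  # ('red', 'small')
```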
Weaknesses of k-means: Problems with outliers

[Figure: how outliers distort the resulting clusters.]
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (a rough sketch follows below).
• To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
• The remaining data points are then assigned to the clusters by distance or similarity comparison, or by classification.
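
A rough sketch of the first method, assuming a simple rule (drop points more than `factor` times the mean point-to-centroid distance); the threshold is an assumption, not from the slides.

```python
import numpy as np

def drop_far_points(X, labels, centers, factor=3.0):
    """Remove points much further from their centroid than is typical.
    The `factor` threshold is an illustrative assumption."""
    d = np.linalg.norm(X - centers[labels], axis=1)  # distance to own centroid
    keep = d <= factor * d.mean()                    # flag the far-away points
    return X[keep], labels[keep]
```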
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

Weaknesses of k-means (cont …)
• If we use different seeds, we can get good results.
• There are methods to help choose good seeds; one simple approach is sketched below.
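
One simple seed-selection method is to restart k-means from several random seeds and keep the run with the lowest SSE; this sketch reuses the hypothetical kmeans and sse helpers from earlier. (scikit-learn's KMeans does the same thing via its n_init parameter.)

```python
def best_of_n_runs(X, k, n_runs=10):
    """Restart k-means from different seeds; keep the lowest-SSE result."""
    best = None
    for seed in range(n_runs):
        centers, labels = kmeans(X, k, seed=seed)
        err = sse(X, labels, centers)
        if best is None or err < best[0]:
            best = (err, centers, labels)
    return best  # (sse, centers, labels) of the best run
```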
Pros and cons of K-Means
K-means clustering summary

Advantages:
• Simple and understandable.
• Items are automatically assigned to clusters.

Disadvantages:
• The number of clusters must be picked beforehand.
• All items are forced into a cluster.
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
• Other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
• Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!
K-means variations

• K-medoids – instead of the mean, use the median/medoid of each cluster.
• Mean of 1, 3, 5, 7, 9 is 5.
• Mean of 1, 3, 5, 7, 1009 is 205.
• Median of 1, 3, 5, 7, 1009 is 5.
• Median advantage: it is not affected by extreme values.
• For large databases, use sampling.
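
The arithmetic above can be checked with the standard library:

```python
from statistics import mean, median

print(mean([1, 3, 5, 7, 9]))        # 5
print(mean([1, 3, 5, 7, 1009]))     # 205
print(median([1, 3, 5, 7, 1009]))   # 5 — unaffected by the extreme value
```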
k-Medoids
The k-Medoids Algorithm
Evaluating Cost of Swapping Medoids
Four Cases
Total Cost of Swap
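
The slide titles above outline the PAM-style k-medoids procedure. As a hedged sketch of the core idea (not the slides' own derivation): the total cost is the sum of distances from every point to its nearest medoid, and a swap of medoid m for non-medoid o is accepted when it lowers that total. PAM's "four cases" break the difference into incremental updates; the brute-force version below simply recomputes the total.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from each point to its nearest medoid (a data point)."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def swap_cost(X, medoids, m, o):
    """Change in total cost if medoid m is replaced by point o.
    Negative means the swap improves the clustering."""
    new_medoids = [o if idx == m else idx for idx in medoids]
    return total_cost(X, new_medoids) - total_cost(X, medoids)
```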
Summary
• K-means – simple, and often useful.
• K-medoids is less sensitive to outliers.
• K-means clustering is an unsupervised machine learning algorithm that is part of a much deeper pool of techniques in Data Science.
• It is a fast and efficient algorithm for categorizing data points into groups, even when very little information is available about the data.
References
• Books and Journals
• Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
• Osman Omer, Introduction to Machine Learning – the Wikipedia Guide.

• Video Links
• https://www.youtube.com/watch?v=SeswFFdH03U
• https://www.youtube.com/watch?v=iNlZ3IU5Ffw

• Web Links
• https://en.wikipedia.org/wiki/K-means_clustering
• https://link.springer.com/10.1007/978-0-387-30164-8_425
• https://towardsdatascience.com/a-practical-guide-on-k-means-clustering-ca3bef3c853d
THANK YOU
