P-3 1 2-Kmeans
Course Objectives
• CO-3:
• CO-4:
Partitional Clustering
• Partitional clustering divides the data into non-hierarchical groups; it is also known as the centroid-based method. The most common example of partitional clustering is the K-Means algorithm.
• In this type, the dataset is divided into a set of k groups, where k is the pre-defined number of groups. Each cluster center is chosen so that data points are closer to their own cluster centroid than to the centroid of any other cluster.
Given
• A data set of n objects
• K, the number of clusters to form
• Organize the objects into k partitions (k ≤ n), where each partition represents a cluster
• The clusters are formed to optimize an objective partitioning criterion (one common criterion is shown below)
• Objects within a cluster are similar
• Objects in different clusters are dissimilar
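A common partitioning criterion, referred to later in these slides as SSE, is the within-cluster sum of squared errors; assuming numeric data and Euclidean distance, it can be written as

\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x

where C_i is the i-th cluster and \mu_i is its centroid (mean).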
K-means Clustering
• The basic algorithm is very simple (a minimal sketch follows below)
• The number of clusters, K, must be specified
• Each cluster is associated with a centroid (mean or center point)
• Each point is assigned to the cluster with the closest centroid
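A minimal Python sketch of this basic algorithm (an illustration written for these notes, not the exact code behind the slides; it assumes the data is a 2-D NumPy array of numeric points):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: pick k random points as initial centroids,
    then alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: move each centroid to the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    sse = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squared errors
    return centroids, labels, sse

Calling kmeans(X, 3) on 2-D data reproduces the behaviour illustrated in the figures that follow: points are repeatedly reassigned and centroids recomputed until nothing changes.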
K-means clustering
[Figure: scatter plots (x vs. y) of the original points and the cluster assignments over Iterations 1–6.]
Importance of Choosing Initial Centroids (Case ii)
[Figure: scatter plots (x vs. y) of the cluster assignments over Iterations 1–5 for a different choice of initial centroids.]
Problems with Selecting Initial Points
• If there are K ‘real’ clusters, then the chance of randomly selecting one initial centroid from each cluster is small.
• The chance is especially small when K is large.
• If the clusters are all of the same size, n, then the probability of selecting one initial centroid from each cluster is
P = \frac{K!\, n^{K}}{(Kn)^{K}} = \frac{K!}{K^{K}}
• For example, with K = 10 this probability is 10!/10^{10} ≈ 0.00036, so a random start rarely places one centroid in each real cluster.
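A quick way to see how fast this probability collapses as K grows (a small illustrative snippet, not part of the original slides):

from math import factorial

# Probability of picking one initial centroid from each of the K equal-size
# clusters when K centroids are chosen at random: K! / K^K
for K in (2, 3, 5, 10):
    p = factorial(K) / K**K
    print(f"K = {K:2d}: P = {p:.6f}")
# K = 10 gives P ≈ 0.000363, i.e. a purely random start almost never
# places one centroid in each true cluster.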
K-means example, step 1
[Figure: points in the X–Y plane; pick 3 initial cluster centers k1, k2, k3 at random.]
K-means example, step 2
[Figure: assign each point to the closest cluster center k1, k2, or k3.]
K-means example, step 3
[Figure: move each cluster center to the mean of the points in its cluster; k1, k2, k3 shift from their previous positions.]
K-means example, step 4
[Figure: reassign the points that are now closest to a different cluster center. Q: Which points are reassigned?]
K-means example, step 4 …
[Figure: A: three points are reassigned (highlighted in the original animation).]
K-means example, step 4b
[Figure: re-compute the cluster means after the reassignment.]
K-means example, step 5
[Figure: move the cluster centers k1, k2, k3 to the new cluster means.]
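The same sequence of steps can be reproduced with an off-the-shelf implementation. The snippet below uses scikit-learn's KMeans on toy 2-D data (an illustrative example, not taken from the slides; the blob locations are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: three loose blobs of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(3, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(1.5, 2.5), scale=0.3, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids (analogous to k1, k2, k3)
print(km.labels_[:10])       # cluster index of the first 10 points
print(km.inertia_)           # within-cluster SSE at convergence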
An example distance function
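The specific distance function shown on the original slide is not recoverable from this extract; a common choice for numeric data, assumed here for illustration, is the Euclidean distance between a point x and a centroid \mu in d dimensions:

d(x, \mu) = \lVert x - \mu \rVert_2 = \sqrt{\sum_{j=1}^{d} (x_j - \mu_j)^2}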
Strengths of k-means
• Simple: easy to understand and to implement.
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
• Since both k and t are usually small, k-means is considered a linear-time algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE (sum of squared errors) is used as the objective; the global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable when the mean is defined.
• For categorical data, k-modes can be used: the centroid is represented by the most frequent value of each attribute (a minimal sketch follows this list).
• The user needs to specify k.
• The algorithm is sensitive to outliers.
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or special data points with very different values.
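As a rough illustration of the k-modes idea mentioned above (a sketch written for these notes, not code from the slides), the centroid update replaces the mean with the per-attribute mode:

from collections import Counter

def mode_centroid(records):
    """Centroid of categorical records: the most frequent value of each attribute."""
    n_attrs = len(records[0])
    return tuple(
        Counter(rec[j] for rec in records).most_common(1)[0][0]
        for j in range(n_attrs)
    )

# Example: three categorical records with two attributes each.
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_centroid(cluster))  # ('red', 'small')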
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (a minimal sketch follows this list).
• To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
• The remaining data points are then assigned to the clusters by distance or similarity comparison, or by classification.
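A minimal sketch of the first idea, removing points that are much further from their assigned centroid than typical points (the 3× median-distance threshold is an assumption chosen for illustration):

import numpy as np

def drop_far_points(X, centroids, labels, factor=3.0):
    """Remove points whose distance to their assigned centroid is more than
    `factor` times the median point-to-centroid distance."""
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    keep = dists <= factor * np.median(dists)
    return X[keep], labels[keep]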
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
• If we use different seeds: good results.
• There are some methods to help choose good seeds (one well-known seeding scheme is sketched below).
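One widely used seeding method is k-means++ (the slide does not name a specific method, so this is an assumed example): the first centroid is chosen uniformly at random, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest centroid already picked.

import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ initialization: spread the initial centroids out so that
    each new seed is likely to land in a not-yet-covered region."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]           # first seed: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing seed.
        d2 = np.min(
            ((X[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()                   # sample proportional to D(x)^2
        seeds.append(X[rng.choice(len(X), p=probs)])
    return np.array(seeds)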
Pros and cons of K-Means
Advantages:
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages:
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency; other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
• Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!
K-means variations
• Video links:
• https://www.youtube.com/watch?v=SeswFFdH03U
• https://www.youtube.com/watch?v=iNlZ3IU5Ffw
• Web links:
• https://en.wikipedia.org/wiki/K-means_clustering
• https://link.springer.com/10.1007/978-0-387-30164-8_425
• https://towardsdatascience.com/a-practical-guide-on-k-means-clustering-ca3bef3c853d
THANK YOU