Chandigarh School of Business, Jhanjeri
Department of Computer Application
Program Name: BCA
Course Code: UGCA1950
Course Name: Machine Learning
Prepared by: Dr. Gurwinder Singh
Outlines
• PTU Syllabus of Unit-I
• CO Introduction
• Topic Overview
• Brief description of what the presentation will cover
• Importance or relevance of the topic
• Key objectives or learning outcomes
• Summary
• References
PTU Syllabus of Unit-I
Clustering: What is Clustering & its Use Cases, K-means Clustering, How does the K-means
algorithm work, C-means Clustering, Hierarchical Clustering, How Hierarchical
Clustering works.
CO Introduction
CO NUMBER: CO3
TOPIC: Design solutions for basic problems using machine learning algorithms
LEVEL: PO(1,2,3,4,9) & PSO(1)
Topic Overview
K-means Clustering
K-means Clustering is a popular unsupervised machine learning algorithm used for partitioning data into distinct groups or
clusters.
It aims to group data points in such a way that points within the same cluster are more similar to each other than to those in
other clusters.
How It Works:
Define the Number of Clusters (K):
The user specifies the desired number of clusters (K).
Random Initialization:
K initial "centeroids" (cluster centers) are randomly placed in the data space.
Assignment Step:
Each data point is assigned to the nearest centeroid based on a distance metric (e.g., Euclidean distance).
Update Step:
The centeroid of each cluster is recalculated as the mean of all points assigned to it.
Iterative Process:
Steps 3 and 4 are repeated until centeroids stabilize or a predefined stopping condition is met.
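These five steps map directly onto scikit-learn's KMeans (a minimal sketch; the blob data below is made up for illustration):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data: 300 points scattered around 3 centers (hypothetical example data).
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Step 1: choose K. Steps 2-5 (initialize, assign, update, iterate)
    # all happen inside fit().
    km = KMeans(n_clusters=3, random_state=42).fit(X)
    print(km.labels_[:10])        # cluster assignments of the first 10 points
    print(km.cluster_centers_)    # final centroids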
Brief description of what the presentation will cover
What is Clustering & Its Use Cases
Introduction to clustering and its applications.
K-means Clustering
Explanation of the K-means algorithm and its working process.
C-means Clustering
Overview of C-means clustering.
Hierarchical Clustering
Description of hierarchical clustering and its working mechanism.
Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K-means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
K-Means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:
    μ(c) = (1/|c|) Σ_{x ∈ c} x
• Reassignment of instances to clusters is based on distance to the current cluster
centroids.
– (Or one can equivalently phrase it in terms of similarities)
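As a quick illustration of this formula (a minimal NumPy sketch with made-up points), the centroid is just the coordinate-wise mean of the cluster's members:

    import numpy as np

    # mu(c) = (1/|c|) * sum of x over x in c
    cluster = np.array([[1.0, 2.0],
                        [3.0, 4.0],
                        [5.0, 0.0]])
    mu = cluster.mean(axis=0)   # same as cluster.sum(axis=0) / len(cluster)
    print(mu)                   # [3. 2.]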
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
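A from-scratch sketch of this algorithm (assuming NumPy, with X an N-by-M array of document vectors; the function name and defaults are my own):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Select K random docs as the initial seeds.
        seeds = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment: each doc goes to the cluster with the nearest seed.
            dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update: each seed moves to the centroid mu(c_j) of its cluster.
            new_seeds = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else seeds[j]
                                  for j in range(k)])
            if np.allclose(new_seeds, seeds):   # converged: centroids stable
                break
            seeds = new_seeds
        return labels, seeds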
K Means Example (K = 2)
[Figure: a worked K-means run with K = 2 — pick seeds, reassign clusters, compute centroids, then repeat the reassign/recompute steps until the clustering stabilizes.]
Termination conditions
• Several possibilities, e.g.,
– A fixed number of iterations.
– Doc partition unchanged.
– Centroid positions don’t change.
Does this mean that the docs in a cluster are unchanged?
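In scikit-learn's KMeans, for instance, the first and third conditions correspond to the max_iter and tol parameters (a sketch on made-up data):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

    # max_iter caps the iteration count (fixed-number-of-iterations stop);
    # tol stops early once centroid movement falls below a threshold.
    km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, random_state=0).fit(X)
    print(km.n_iter_)   # iterations actually performed before stopping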
Convergence
• Why should the K-means algorithm ever reach a fixed point?
– A state in which clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation
Maximization (EM) algorithm.
– EM is known to converge.
– The number of iterations could be large.
– But in practice it usually isn’t.
Convergence of K-Means
• Recomputation monotonically decreases each Gk (the sum of squared distances within cluster k), since, with mk the number of members in cluster k:
– Σi (di – a)² reaches its minimum where
– Σi –2(di – a) = 0, i.e.
– Σi di = Σi a = mk a, hence
– a = (1/mk) Σi di = ck, the centroid of cluster k.
• K-means typically converges quickly
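The derivation can be checked numerically (a small sketch with random points): the sum of squared distances is never smaller at a perturbed center than at the mean:

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.normal(size=(100, 2))      # a toy cluster of 2-D points
    centroid = points.mean(axis=0)          # the candidate minimizer a = c_k

    def sse(a):
        # Sum of squared distances from every point to candidate center a.
        return np.sum((points - a) ** 2)

    for _ in range(5):
        candidate = centroid + rng.normal(scale=0.5, size=2)
        assert sse(centroid) <= sse(candidate)   # the mean is always at least as good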
Time Complexity
• Computing distance between two docs is O(M) where M is the dimensionality of the
vectors.
• Reassigning clusters: O(KN) distance computations, or O(KNM).
• Computing centroids: Each doc gets added once to some centroid: O(NM).
• Assume these two steps are each done once for I iterations: O(IKNM).
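For example (illustrative numbers only): with K = 10 clusters, N = 100,000 docs, M = 1,000 dimensions, and I = 25 iterations, reassignment alone costs on the order of IKNM = 25 × 10 × 100,000 × 1,000 = 2.5 × 10¹⁰ elementary operations, while the O(INM) centroid updates cost roughly a tenth of that.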
Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
– Select good seeds using a heuristic (e.g., the doc least similar to any existing mean).
– Try out multiple starting points.
– Initialize with the results of another method.
[Figure: example showing sensitivity to seeds. If you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.]
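The first two remedies are one-liners in scikit-learn (a sketch on toy data; k-means++ is a seeding heuristic that spreads the initial centroids apart, and n_init keeps the best of several random restarts):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=2, random_state=1)  # toy data

    # init="k-means++" picks well-spread seeds; n_init=10 runs the algorithm
    # from 10 different seeds and keeps the lowest-SSE clustering.
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
    print(km.inertia_)   # SSE (inertia) of the best of the 10 runs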
Key objectives or learning outcomes
Understand the Concept of K-means Clustering
Learn the definition and purpose of K-means clustering as a data segmentation tool.
Explore the Applications of K-means Clustering
Identify real-world scenarios where K-means clustering is commonly applied, such as customer
segmentation, market research, and image segmentation.
Learn How the K-means Algorithm Works
Gain a step-by-step understanding of the K-means clustering process, including initialization,
assignment, and updating steps.
Understand the Role of Distance Metrics
Understand how distance measures (e.g., Euclidean distance) are used to assign data points to clusters.
Appreciate the Strengths and Limitations of K-means
Recognize the advantages of K-means, such as simplicity and efficiency, and its limitations, such as
sensitivity to outliers and reliance on the predefined number of clusters.
Apply K-means in Practice
Gain hands-on experience applying the K-means algorithm to practical datasets.
THANK YOU