K-Means Clustering
Group-2
Thejaswi S
Samir K
Swathi G
Vatsalya K
Sruthi N
Overview of Clustering
• The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis.
• Defined under Unsupervised Learning, which derives insights from unlabeled data
without a target variable.
• Forms groups of homogeneous data points from a heterogeneous dataset.
• Evaluates similarity between points using metrics such as Euclidean Distance, Cosine
Similarity, and Manhattan Distance.
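As a quick illustration, the three distance metrics above can be computed in plain Python (the two sample points are hypothetical):

```python
import math

# Two sample points (hypothetical 2-D feature vectors).
a = [1.0, 2.0]
b = [4.0, 6.0]

# Euclidean distance: straight-line distance between the points.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = sum(abs(x - y) for x, y in zip(a, b))

# Cosine similarity: cosine of the angle between the two vectors.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.hypot(*a) * math.hypot(*b))

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

Note that cosine measures similarity (higher means more alike), while the other two measure dissimilarity.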
Types of Clustering
1.Centroid-based Clustering (Partitioning methods):
⚬ Groups data based on proximity, using metrics like Euclidean Distance.
⚬ Example algorithms: K-Means, K-Medoids.
2.Density-based Clustering:
⚬ Finds clusters as dense regions of points, determining the number of clusters
automatically and handling arbitrarily shaped clusters.
⚬ Example algorithm: DBSCAN.
3.Connectivity-based Clustering (Hierarchical clustering):
⚬ Builds clusters hierarchically, creating a dendrogram (tree structure).
⚬ Two approaches: Agglomerative (Bottom-Up) and Divisive (Top-Down).
4.Distribution-based Clustering:
⚬ Groups data points based on statistical probability distributions.
⚬ Example: Gaussian Mixture Model
K-Means clustering:
• K-means clustering is an unsupervised machine learning algorithm used to partition a
dataset into K clusters, where each data point belongs to the cluster with the nearest
mean.
• It iteratively assigns each point to the closest cluster center, recalculates the cluster
centers, and repeats the process until convergence.
• The goal of clustering is to divide a dataset into groups (clusters) such that:
⚬ Data points within the same group are more similar to each other.
⚬ Data points from different groups are more different from each other.
• It’s about grouping data based on similarity and difference to reveal patterns or
insights in the data.
Key Concepts
• Centroids: Central points that represent the center of each cluster. They are
calculated as the mean of all points assigned to a cluster.
• Clusters: Groups of data points that are similar to each other based on proximity to a
centroid. The number of clusters is defined as K.
• Distance Metrics: Methods to calculate the similarity or dissimilarity between
points.
⚬ Euclidean Distance: A popular distance metric, calculated as the straight-line
distance between two points in space.
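A minimal sketch of these two concepts, using hypothetical points:

```python
import math

# Points currently assigned to one cluster (hypothetical 2-D data).
cluster = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]

# The centroid is the coordinate-wise mean of the assigned points.
centroid = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
print(centroid)  # (2.0, 2.0)

# Euclidean distance from a new point to that centroid.
d = math.dist((4.0, 2.0), centroid)
print(d)  # 2.0
```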
Algorithm Workflow:
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points as the initial centroids (these need not be points
from the input dataset).
• Step-3: Assign each data point to its closest centroid, forming the K clusters.
• Step-4: Compute the mean of each cluster and place its new centroid there.
• Step-5: Repeat Step-3, reassigning each data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; otherwise, finish.
• Step-7: The model is ready.
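The steps above can be sketched as a small, self-contained Python function (the toy dataset and fixed seed are assumptions for illustration; here the initial centroids are sampled from the data):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step-2: choose K random data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Steps 3/5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step-6: stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            if members:
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return centroids, labels

# Two visibly separated groups of points.
pts = [(1, 1), (1.5, 2), (2, 1), (8, 8), (8.5, 9), (9, 8)]
centroids, labels = kmeans(pts, k=2)
print(labels)  # points 0-2 share one label, points 3-5 the other
```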
• Suppose we have two variables, M1 and M2, represented in the scatter plot on the right.
We aim to divide the dataset into K=2 clusters.
• To start, we randomly select two points as centroids, which are not part of the dataset.
Next, we assign each data point to its nearest centroid by calculating the distance
between the points.
• A perpendicular line drawn between the two centroids (the median line) helps
visualize this assignment.
• The center of gravity of the assigned data points is calculated to determine new
centroids.
• The assignment process is repeated, and new centroids are found.
• Data points are reassigned to the closest centroid.
• The process continues until no data points switch clusters, forming the final clusters.
• The assumed centroids are removed, and the two final clusters are formed.
Choosing the Number of Clusters (K)
• Elbow Method:
• Objective: Find the optimal number of clusters (K) by evaluating how well the clusters
fit the data.
• WCSS (Within-Cluster Sum of Squares) measures the total variation within the
clusters.
• For three clusters, it is the sum of squared distances between each data point (p)
and its cluster's centroid (C₁, C₂, C₃):
WCSS = Σ_{p in Cluster1} distance(p, C₁)² + Σ_{p in Cluster2} distance(p, C₂)² +
Σ_{p in Cluster3} distance(p, C₃)²
Steps:
• Perform K-means clustering on the dataset for
different K values (typically from 1 to 10).
• Calculate WCSS for each K value.
• Plot WCSS values against the number of clusters
(K).
• Identify the "elbow" point in the graph (the sharp bend). The K value
corresponding to the elbow is considered optimal.
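The steps above can be sketched on toy data. This sketch uses a deterministic farthest-first initialization (an assumption made here for reproducibility; standard K-means initializes centroids randomly):

```python
import math

def farthest_first(points, k):
    """Deterministic init: the first point, then repeatedly the point
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def wcss_for_k(points, k, iters=50):
    """Run basic K-means and return the WCSS of the final clustering."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[j].append(p)
        # Recompute centroids as cluster means (keep old one if empty).
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    # WCSS: sum of squared distances to the nearest final centroid.
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Three visibly separated groups (toy data).
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (15, 1), (16, 2), (15, 2)]
for k in range(1, 6):
    print(k, round(wcss_for_k(data, k), 1))
# WCSS drops sharply up to K=3, then flattens: the "elbow" is at K=3.
```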
Advantages
1.Simplicity and Efficiency:
⚬ Easy to implement
⚬ Computationally efficient for large datasets
2.Scalability:
⚬ Handles large datasets well
3.Versatility:
⚬ Suitable for various data types.
⚬ Works well with well-separated clusters.
4.Flexibility:
⚬ Can be used for market segmentation, anomaly detection, etc.
Disadvantages
1.Choosing the Right K:
⚬ The optimal number of clusters (K) is hard to determine.
2.Sensitive to Initial Centroids:
⚬ The algorithm can converge to different solutions depending on the initial
centroids.
3.Assumes Spherical Clusters:
⚬ Performs poorly when clusters are non-spherical or have different sizes and
densities.
4.Sensitive to Outliers:
⚬ Outliers can significantly affect the clustering results.
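The sensitivity to initial centroids can be shown directly: the sketch below runs the assign/recompute iterations from two hand-picked initializations on toy data and converges to different local optima. (In practice, libraries mitigate this by running several random restarts and keeping the lowest-WCSS result.)

```python
import math

def lloyd_wcss(points, centroids, iters=50):
    """Run K-means iterations from given initial centroids; return final WCSS."""
    k = len(centroids)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[j].append(p)
        # Recompute centroids as cluster means (keep old one if empty).
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (15, 1), (16, 2), (15, 2)]

# Two initializations for K=2: one spread out, one with both
# centroids inside the same group of points.
good = lloyd_wcss(data, [(1, 1), (15, 2)])
bad = lloyd_wcss(data, [(8, 8), (8, 9)])
print(good < bad)  # the poor initialization converges to a worse local optimum
```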
Applications
1.Customer Segmentation:
■ Grouping customers based on purchasing behavior for targeted marketing.
■ E-commerce platforms like Amazon or Flipkart use K-Means clustering to segment
customers into categories such as "frequent buyers," "occasional shoppers," and
"high-value customers."
■ Marketing teams design personalized ads, product recommendations, and
discount strategies for each cluster.
2. Image Compression:
⚬ Image compression tools (e.g., TinyPNG) use K-Means to compress images without
losing much quality: reducing the total number of colors in an image saves
memory and computational resources.
⚬ In medical imaging, images (like X-rays) are segmented into different regions
for efficient storage and analysis.
3. Document Clustering:
⚬ Categorizes a large collection of text documents into clusters/topics for easier
retrieval, management, and understanding.
⚬ News organizations (e.g., BBC, Google News) use document clustering to group news
articles into topics like "sports," "politics," "technology," etc.
⚬ In customer support systems, K-Means is used to group support tickets based on
issue types for faster resolution.
4. Anomaly Detection:
⚬ Identifies outliers or unusual data points that do not conform to the general
pattern of the dataset.
⚬ Fraud Detection: Banks and financial institutions use K-Means to identify
fraudulent transactions by flagging transactions that deviate significantly from
normal patterns.
⚬ Example: Credit card purchases in unusual locations or at irregular times.
⚬ Cybersecurity: Detecting unusual user behavior, such as login attempts from
suspicious locations.
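A minimal sketch of distance-based anomaly flagging with pre-computed centroids (the centroids, features, and threshold here are illustrative assumptions, not a real fraud model):

```python
import math

# Cluster centroids learned from "normal" transactions
# (hypothetical 2-D features, e.g. scaled amount and hour of day).
centroids = [(2.0, 2.0), (8.0, 8.0)]

def anomaly_score(point):
    """Distance from a point to its nearest centroid."""
    return min(math.dist(point, c) for c in centroids)

# Flag points whose score exceeds a chosen threshold (assumed here).
threshold = 3.0
events = [(2.1, 1.8), (7.5, 8.2), (14.0, 1.0)]
flags = [anomaly_score(p) > threshold for p in events]
print(flags)  # [False, False, True]
```

The third event sits far from every centroid, so it is flagged as an outlier.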
Comparison with Other Clustering Methods
Conclusion
• K-means Clustering is a powerful tool for grouping data into meaningful clusters.
• It is simple, easy to implement, and widely used in practice for tasks such as
segmentation and anomaly detection.
• Choosing K (number of clusters) is a critical step; methods like the Elbow Method can
help.
• While efficient for large datasets, K-means has limitations like sensitivity to initial
centroids and assumptions about cluster shape.
• Despite its limitations, K-means remains a go-to algorithm for unsupervised learning
and exploratory data analysis.