
Unit IV: Unsupervised Learning (Clustering)

4.1 Clustering

What is Clustering?

Clustering is a task in unsupervised learning. It's about taking a group of unlabeled data points and dividing them into
different "clusters." The goal is to put similar data points into the same cluster, and points that are different into
different clusters.

Why is Clustering Used?

Imagine you run a store and want to understand your customers. You can't look at every single customer's details.
Instead, clustering can group your customers into, say, 10 groups based on their buying habits. Then, you can create
different marketing plans for each group. This helps make sense of lots of data without knowing the answers
beforehand.

Key Points:

• It's an unsupervised learning task.

• Divides unlabeled data into different groups (clusters).

• Goal: Put similar data points together.

• Helps to understand patterns in data.

• Example: Customer behavior analysis.

4.2 K-Means Clustering Algorithm

What is K-Means Clustering?

K-Means Clustering is a very popular unsupervised learning algorithm. It groups unlabeled data into a specific
number of clusters, which we call "K." For example, if K=3, it will create three clusters.

How it Works (Simple Steps):

1. Choose K: You first decide how many clusters (K) you want.

2. Pick Centers: The algorithm randomly picks K starting points called "centroids" (center points for each
cluster).

3. Assign Points: Each data point is assigned to the closest centroid.

4. Update Centers: Once all points are assigned, the centroid of each cluster is recalculated to be the actual
center of all points in that cluster.

5. Repeat: Steps 3 and 4 are repeated. Points might move to different clusters, and centroids keep shifting until
they stop moving much, meaning the clusters are stable.

The main aim is to make the sum of squared distances between data points and their cluster centers as small as
possible. It's an iterative algorithm, meaning it repeats steps until it finds the best groups.
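
A minimal sketch of these steps in Python, assuming scikit-learn is installed; the tiny 2-D dataset and the choice K=2 are made up purely for illustration:

# K-Means sketch: group 2-D points into K clusters (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Step 1: choose K; steps 2-5 (pick centroids, assign points, update centers, repeat)
# all happen inside fit().
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the final centroids
print(kmeans.inertia_)          # sum of squared distances to the centroids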

Key Points:

• An unsupervised learning algorithm.

• Groups unlabeled data into K predefined clusters.

• It's an iterative algorithm.

• Each cluster has a centroid (its center point).

• Goal: Minimize the distance between data points and their cluster centers.

4.3 Adaptive Hierarchical Clustering (HCA)

What is Hierarchical Clustering?

Hierarchical Clustering is another unsupervised ML algorithm used to group unlabeled data into clusters. It builds a
hierarchy (like a family tree) of clusters. This "tree-like" structure is called a "Dendrogram."

Difference from K-Means:

• In K-Means, you tell it how many clusters (K) you want upfront.

• In Hierarchical Clustering, you don't need to specify the number of clusters in advance. You can decide later
by cutting the dendrogram at different levels.

Two Main Approaches:

1. Agglomerative (Bottom-Up):

o Starts with each data point as its own tiny cluster.

o Then, it repeatedly merges the closest pairs of clusters together.

o This continues until all data points are merged into one big cluster.

2. Divisive (Top-Down):

o Starts with all data points in one big cluster.

o Then, it repeatedly splits the largest clusters into smaller ones.

o This continues until each data point is in its own cluster. This is the reverse of agglomerative.
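
A minimal sketch of the agglomerative (bottom-up) approach, assuming SciPy is installed; the data points and the cut into 2 clusters are illustrative choices:

# Agglomerative clustering sketch: merge the closest clusters step by step (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Repeatedly merge the closest pair of clusters until one big cluster remains.
Z = linkage(X, method='ward')

# Decide the number of clusters afterwards by cutting the tree.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) draws the tree-like structure (requires matplotlib to display).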

Key Points:

• An unsupervised ML algorithm.

• Builds a hierarchy of clusters.

• The tree-like structure is called a Dendrogram.

• No need to pre-determine the number of clusters.

• Two main approaches: Agglomerative (bottom-up merging) and Divisive (top-down splitting).

4.4 Gaussian Mixture Models (GMMs)

What are Gaussian Mixture Models (GMMs)?

GMMs are a type of ML algorithm used for clustering. They assume that your data points come from a mix of
different "Gaussian distributions" (which are like bell curves). Each bell curve represents a different cluster.

How They Work:

Instead of just finding a center for each cluster (like K-Means), GMMs try to figure out the shape (spread and
direction) of each cluster. They assume that points in a cluster are distributed according to a bell curve.

• Probabilistic: GMMs are "probabilistic" models. They estimate the probability that each data point belongs
to each cluster. This means a point can belong to a cluster with a certain probability, not just 100%.

• Robust to Outliers: They are generally good at handling unusual data points (outliers) because they can
assign them a low probability of belonging to any cluster.

Why are GMMs Needed?

• They can find clusters that are not perfectly round or equally sized, unlike K-Means.

• They give you the probability of a data point belonging to a cluster, which can be more informative.
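
A minimal sketch, assuming scikit-learn's GaussianMixture; the data and the choice of 2 components are made up for illustration:

# GMM sketch: fit a mixture of two bell curves and get soft (probabilistic) assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict(X))        # hard cluster labels
print(gmm.predict_proba(X))  # probability of each point belonging to each cluster
print(gmm.means_)            # center of each bell curve
print(gmm.covariances_)      # spread and shape of each bell curve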

Key Points:

• A type of ML algorithm used for clustering.

• Assumes data points are generated from a mixture of Gaussian distributions (bell curves).

• It's a probabilistic model.

• Can find clusters that are not clearly defined.

• Can estimate the probability of a new point belonging to each cluster.

• Relatively robust to outliers.

4.5 Optimization Using Evolutionary Techniques

What are Evolutionary Optimization Techniques?

These are methods in machine learning inspired by how nature evolves. They use ideas like "survival of the fittest" to
find the best solutions for difficult problems, especially optimization tasks. They don't just try one solution; they try
many, combine them, and keep the best ones, allowing them to "evolve" over time towards an optimal answer.

How They Work (General Idea):

1. Population: Start with a group of random possible solutions.

2. Fitness: Evaluate how "good" each solution is.


3. Selection: Keep the best solutions (like "survival of the fittest").

4. Reproduction/Mutation: Create new solutions by combining (crossing over) parts of the best ones and
adding small random changes (mutations).

5. Repeat: Go back to step 2 and keep evolving the solutions until a good answer is found.
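
A toy sketch of these five steps as a simple genetic algorithm; the fitness function, population size, and mutation size are all made-up illustrative choices:

# Toy genetic algorithm: evolve a number x to maximize the made-up fitness f(x) = -(x - 3)^2.
import random

def fitness(x):
    return -(x - 3.0) ** 2        # the best possible solution is x = 3

population = [random.uniform(-10, 10) for _ in range(20)]    # step 1: random population
for generation in range(50):
    population.sort(key=fitness, reverse=True)               # step 2: evaluate fitness
    survivors = population[:10]                              # step 3: keep the fittest
    children = []
    for _ in range(10):                                      # step 4: crossover + mutation
        a, b = random.sample(survivors, 2)
        children.append((a + b) / 2 + random.gauss(0, 0.1))
    population = survivors + children                        # step 5: repeat with new generation

print(round(max(population, key=fitness), 2))                # should be close to 3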

Key Points:

• Inspired by biological evolution (survival of the fittest).

• Used for optimization tasks in ML.

• Works by evolving a "population" of solutions over generations.

• Examples: Genetic Algorithms (a common type).

4.6 Number of Clusters

How to Choose "K" (Number of Clusters)?

In clustering, especially K-Means, you often need to decide how many clusters (K) to create. This is an important
choice because it affects how the data is grouped.

Finding the Best Number:

• Trial and Error: Sometimes, you try different values of K and see which one makes the most sense for your
data or problem.

• Application Defined: For some problems, the number of clusters is already known or makes practical sense
(e.g., if you want to group customers into 3 specific loyalty tiers).

• Evaluation Metrics: There are methods that help determine a good K by looking at how "tight" the clusters
are or how well separated they are.

o Elbow Method: You plot how the within-cluster error (the sum of squared distances to the centroids)
decreases as you add more clusters. The graph often forms an "elbow" shape, and the point where the
curve bends (the "elbow point") suggests a good K.

o Silhouette Score: This measures how similar an object is to its own cluster compared to other
clusters. A higher score is better.
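
A minimal sketch of both ideas, assuming scikit-learn; the data and the range of K values tried are illustrative:

# Sketch: compare candidate K values using inertia (for the elbow) and the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# Pick the K where inertia stops dropping sharply (the "elbow"),
# or the K with the highest silhouette score.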

Key Points:

• Deciding the number of clusters (K) is crucial.

• Can be chosen by trial and error.

• Sometimes defined by the problem itself.

• Methods like the Elbow Method and Silhouette Score help find a good K.

4.7 Advanced Discussion on Clustering (Linkage Methods)

Measuring Distance Between Clusters:

In hierarchical clustering, deciding which clusters to merge (or split) depends on how "close" they are. This
"closeness" is measured using different "linkage methods."

Common Linkage Methods:

These methods define how the distance between two clusters is calculated:
• Single Linkage: The distance between two clusters is the shortest distance between any two points in the
different clusters.

• Complete Linkage: The distance between two clusters is the longest distance between any two points in the
different clusters. This method tends to form more compact, "tighter" clusters.

• Average Linkage: The distance between two clusters is the average distance between all possible pairs of
points, where one point is from each cluster.

• Centroid Linkage: The distance between two clusters is the distance between their centroids (their center
points).

Key Points:

• Linkage Methods define how distance between clusters is measured.

• Crucial for hierarchical clustering.

• Different methods lead to different clustering results.

4.8 Expectation Maximization (EM) Algorithm

What is the EM Algorithm?

The Expectation-Maximization (EM) algorithm is a powerful method used for finding hidden (or "latent") variables in
data. It's often used to train models like Gaussian Mixture Models (GMMs) when some data is missing or when we
don't know which cluster each data point belongs to.

Why Do We Need It?

Imagine you have a bag of coins, but you don't know if they're fair or biased. If you knew which coin was which, it
would be easy to flip them and count heads/tails. But you don't know. EM helps in these situations where there's
"missing information" or "hidden variables" that make direct calculation hard. It helps estimate parameters for
models like GMMs, especially when clusters are not clearly defined.

How it Works (Two Steps that Repeat):

1. Expectation (E-step): Guess the missing information. For example, in GMMs, this step guesses the probability
that each data point belongs to each cluster, based on the current (guessed) cluster properties.

2. Maximization (M-step): Use the guessed information from the E-step to update the model. For example, in
GMMs, this step recalculates the best cluster properties (like their centers and shapes) based on the
probabilities assigned in the E-step.

These two steps are repeated over and over. With each repetition, the algorithm's guesses get better and better,
leading to a good final model.
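
A minimal hand-written sketch of these two steps for a 1-D mixture of two Gaussians, using only NumPy; the data and the starting guesses are made up for illustration:

# EM sketch for two 1-D Gaussians: repeat E-step and M-step until the estimates settle.
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])     # data drawn from two hidden groups
mu = np.array([0.0, 6.0])                         # initial guesses for the two means
sigma = np.array([1.0, 1.0])                      # initial guesses for the spreads
weight = np.array([0.5, 0.5])                     # initial guesses for the mixing weights

for _ in range(20):
    # E-step: guess the probability that each point belongs to each Gaussian.
    dens = weight * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update the means, spreads and weights using those probabilities.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    weight = nk / len(x)

print(mu)   # should end up near the true group means (about 1 and 5)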

Key Points:

• An algorithm used to find hidden variables or missing information in data.

• Often used to train models like Gaussian Mixture Models (GMMs).

• It's an iterative two-step process:

o E-step: Guess the missing information.

o M-step: Update the model based on those guesses.

• Needed when direct calculation is hard due to unknown parameters or cluster assignments.
