
Unit IV: Unsupervised Learning (Clustering)

4.1 Clustering

What is Clustering?

Clustering is a task in unsupervised learning. It's about taking a group of unlabeled data points and dividing them into
different "clusters." The goal is to put similar data points into the same cluster, and points that are different into
different clusters.

Why is Clustering Used?

Imagine you run a store and want to understand your customers. You can't look at every single customer's details.
Instead, clustering can group your customers into, say, 10 groups based on their buying habits. Then, you can create
different marketing plans for each group. This helps make sense of lots of data without knowing the answers
beforehand.

Key Points:

• It's an unsupervised learning task.

• Divides unlabeled data into different groups (clusters).

• Goal: Put similar data points together.

• Helps to understand patterns in data.

• Example: Customer behavior analysis.

4.2 K-Means Clustering Algorithm

What is K-Means Clustering?

K-Means Clustering is a very popular unsupervised learning algorithm. It groups unlabeled data into a specific
number of clusters, which we call "K." For example, if K=3, it will create three clusters.

How it Works (Simple Steps):

1. Choose K: You first decide how many clusters (K) you want.

2. Pick Centers: The algorithm randomly picks K starting points called "centroids" (center points for each
cluster).

3. Assign Points: Each data point is assigned to the closest centroid.

4. Update Centers: Once all points are assigned, the centroid of each cluster is recalculated to be the actual
center of all points in that cluster.

5. Repeat: Steps 3 and 4 are repeated. Points might move to different clusters, and centroids keep shifting until
they stop moving much, meaning the clusters are stable.

The main aim is to make the sum of squared distances between data points and their cluster centers as small as
possible. It's an iterative algorithm, meaning it repeats steps until it finds the best groups.
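
A minimal sketch of these steps in Python, assuming scikit-learn is installed; the tiny 2-D dataset and the choice K=2 are made up purely for illustration:

# K-Means sketch: group 2-D points into K clusters (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Step 1: choose K; steps 2-5 (pick centroids, assign points, update centers, repeat)
# all happen inside fit().
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the final centroids
print(kmeans.inertia_)          # sum of squared distances to the centroids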

Key Points:

• An unsupervised learning algorithm.

• Groups unlabeled data into K predefined clusters.

• It's an iterative algorithm.

• Each cluster has a centroid (its center point).

• Goal: Minimize the distance between data points and their cluster centers.

4.3 Adaptive Hierarchical Clustering (HCA)

What is Hierarchical Clustering?

Hierarchical Clustering is another unsupervised ML algorithm used to group unlabeled data into clusters. It builds a
hierarchy (like a family tree) of clusters. This "tree-like" structure is called a "Dendrogram."

Difference from K-Means:

• In K-Means, you tell it how many clusters (K) you want upfront.

• In Hierarchical Clustering, you don't need to specify the number of clusters in advance. You can decide later
by cutting the dendrogram at different levels.

Two Main Approaches:

1. Agglomerative (Bottom-Up):

o Starts with each data point as its own tiny cluster.

o Then, it repeatedly merges the closest pairs of clusters together.

o This continues until all data points are merged into one big cluster.

2. Divisive (Top-Down):

o Starts with all data points in one big cluster.

o Then, it repeatedly splits the largest clusters into smaller ones.

o This continues until each data point is in its own cluster. This is the reverse of agglomerative.
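
A minimal sketch of the agglomerative (bottom-up) approach, assuming SciPy is installed; the data points and the cut into 2 clusters are illustrative choices:

# Agglomerative clustering sketch: merge the closest clusters step by step (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Repeatedly merge the closest pair of clusters until one big cluster remains.
Z = linkage(X, method='ward')

# Decide the number of clusters afterwards by cutting the tree.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) draws the tree-like structure (requires matplotlib to display).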

Key Points:

• An unsupervised ML algorithm.

• Builds a hierarchy of clusters.

• The tree-like structure is called a Dendrogram.

• No need to pre-determine the number of clusters.

• Two main approaches: Agglomerative (bottom-up merging) and Divisive (top-down splitting).

4.4 Gaussian Mixture Models (GMMs)

What are Gaussian Mixture Models (GMMs)?

GMMs are a type of ML algorithm used for clustering. They assume that your data points come from a mix of
different "Gaussian distributions" (which are like bell curves). Each bell curve represents a different cluster.

How They Work:

Instead of just finding a center for each cluster (like K-Means), GMMs try to figure out the shape (spread and
direction) of each cluster. They assume that points in a cluster are distributed according to a bell curve.

• Probabilistic: GMMs are "probabilistic" models. They estimate the probability that each data point belongs
to each cluster. This means a point can belong to a cluster with a certain probability, not just 100%.

• Robust to Outliers: They are generally good at handling unusual data points (outliers) because they can
assign them a low probability of belonging to any cluster.

Why are GMMs Needed?

• They can find clusters that are not perfectly round or equally sized, unlike K-Means.

• They give you the probability of a data point belonging to a cluster, which can be more informative.
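
A minimal sketch, assuming scikit-learn's GaussianMixture; the data and the choice of 2 components are made up for illustration:

# GMM sketch: fit a mixture of two bell curves and get soft (probabilistic) assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict(X))        # hard cluster labels
print(gmm.predict_proba(X))  # probability of each point belonging to each cluster
print(gmm.means_)            # center of each bell curve
print(gmm.covariances_)      # spread and shape of each bell curve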

Key Points:

• A type of ML algorithm used for clustering.

• Assumes data points are generated from a mixture of Gaussian distributions (bell curves).

• It's a probabilistic model.

• Can find clusters that are not clearly defined.

• Can estimate the probability of a new point belonging to each cluster.

• Relatively robust to outliers.

4.5 Optimization Using Evolutionary Techniques

What are Evolutionary Optimization Techniques?

These are methods in machine learning inspired by how nature evolves. They use ideas like "survival of the fittest" to
find the best solutions for difficult problems, especially optimization tasks. They don't just try one solution; they try
many, combine them, and keep the best ones, allowing them to "evolve" over time towards an optimal answer.

How They Work (General Idea):

1. Population: Start with a group of random possible solutions.

2. Fitness: Evaluate how "good" each solution is.


3. Selection: Keep the best solutions (like "survival of the fittest").

4. Reproduction/Mutation: Create new solutions by combining (crossing over) parts of the best ones and
adding small random changes (mutations).

5. Repeat: Go back to step 2 and keep evolving the solutions until a good answer is found.
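
A toy sketch of these five steps as a simple genetic algorithm; the fitness function, population size, and mutation size are all made-up illustrative choices:

# Toy genetic algorithm: evolve a number x to maximize the made-up fitness f(x) = -(x - 3)^2.
import random

def fitness(x):
    return -(x - 3.0) ** 2        # the best possible solution is x = 3

population = [random.uniform(-10, 10) for _ in range(20)]    # step 1: random population
for generation in range(50):
    population.sort(key=fitness, reverse=True)               # step 2: evaluate fitness
    survivors = population[:10]                              # step 3: keep the fittest
    children = []
    for _ in range(10):                                      # step 4: crossover + mutation
        a, b = random.sample(survivors, 2)
        children.append((a + b) / 2 + random.gauss(0, 0.1))
    population = survivors + children                        # step 5: repeat with new generation

print(round(max(population, key=fitness), 2))                # should be close to 3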

Key Points:

• Inspired by biological evolution (survival of the fittest).

• Used for optimization tasks in ML.

• Works by evolving a "population" of solutions over generations.

• Examples: Genetic Algorithms (a common type).

4.6 Number of Clusters

How to Choose "K" (Number of Clusters)?

In clustering, especially K-Means, you often need to decide how many clusters (K) to create. This is an important
choice because it affects how the data is grouped.

Finding the Best Number:

• Trial and Error: Sometimes, you try different values of K and see which one makes the most sense for your
data or problem.

• Application Defined: For some problems, the number of clusters is already known or makes practical sense
(e.g., if you want to group customers into 3 specific loyalty tiers).

• Evaluation Metrics: There are methods that help determine a good K by looking at how "tight" the clusters
are or how well separated they are.

o Elbow Method: You plot how the within-cluster error (the sum of squared distances to the centroids)
decreases as you add more clusters. The graph often forms an "elbow" shape, and the point where the
curve bends (the "elbow point") suggests a good K.

o Silhouette Score: This measures how similar an object is to its own cluster compared to other
clusters. A higher score is better.
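
A minimal sketch of both ideas, assuming scikit-learn; the data and the range of K values tried are illustrative:

# Sketch: compare candidate K values using inertia (for the elbow) and the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# Pick the K where inertia stops dropping sharply (the "elbow"),
# or the K with the highest silhouette score.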

Key Points:

• Deciding the number of clusters (K) is crucial.

• Can be chosen by trial and error.

• Sometimes defined by the problem itself.

• Methods like the Elbow Method and Silhouette Score help find a good K.

4.7 Advanced Discussion on Clustering (Linkage Methods)

Measuring Distance Between Clusters:

In hierarchical clustering, deciding which clusters to merge (or split) depends on how "close" they are. This
"closeness" is measured using different "linkage methods."

Common Linkage Methods:

These methods define how the distance between two clusters is calculated:
• Single Linkage: The distance between two clusters is the shortest distance between any two points in the
different clusters.

• Complete Linkage: The distance between two clusters is the longest distance between any two points in the
different clusters. This method tends to form more compact, "tighter" clusters.

• Average Linkage: The distance between two clusters is the average distance between all possible pairs of
points, where one point is from each cluster.

• Centroid Linkage: The distance between two clusters is the distance between their centroids (their center
points).

Key Points:

• Linkage Methods define how distance between clusters is measured.

• Crucial for hierarchical clustering.

• Different methods lead to different clustering results.

4.8 Expectation Maximization (EM) Algorithm

What is the EM Algorithm?

The Expectation-Maximization (EM) algorithm is a powerful method used for finding hidden (or "latent") variables in
data. It's often used to train models like Gaussian Mixture Models (GMMs) when some data is missing or when we
don't know which cluster each data point belongs to.

Why Do We Need It?

Imagine you have a bag of coins, but you don't know if they're fair or biased. If you knew which coin was which, it
would be easy to flip them and count heads/tails. But you don't know. EM helps in these situations where there's
"missing information" or "hidden variables" that make direct calculation hard. It helps estimate parameters for
models like GMMs, especially when clusters are not clearly defined.

How it Works (Two Steps that Repeat):

1. Expectation (E-step): Guess the missing information. For example, in GMMs, this step guesses the probability
that each data point belongs to each cluster, based on the current (guessed) cluster properties.

2. Maximization (M-step): Use the guessed information from the E-step to update the model. For example, in
GMMs, this step recalculates the best cluster properties (like their centers and shapes) based on the
probabilities assigned in the E-step.

These two steps are repeated over and over. With each repetition, the algorithm's guesses get better and better,
leading to a good final model.
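
A minimal hand-written sketch of these two steps for a 1-D mixture of two Gaussians, using only NumPy; the data and the starting guesses are made up for illustration:

# EM sketch for two 1-D Gaussians: repeat E-step and M-step until the estimates settle.
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])     # data drawn from two hidden groups
mu = np.array([0.0, 6.0])                         # initial guesses for the two means
sigma = np.array([1.0, 1.0])                      # initial guesses for the spreads
weight = np.array([0.5, 0.5])                     # initial guesses for the mixing weights

for _ in range(20):
    # E-step: guess the probability that each point belongs to each Gaussian.
    dens = weight * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update the means, spreads and weights using those probabilities.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    weight = nk / len(x)

print(mu)   # should end up near the true group means (about 1 and 5)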

Key Points:

• An algorithm used to find hidden variables or missing information in data.

• Often used to train models like Gaussian Mixture Models (GMMs).

• It's an iterative two-step process:

o E-step: Guess the missing information.

o M-step: Update the model based on those guesses.

• Needed when direct calculation is hard due to unknown parameters or cluster assignments.
