
BASICS of CLUSTER ANALYSIS

Introduction to Cluster Analysis


• Introduction
• Clustering Requirements
• Data Representation
• Partitioning Methods
• K-Means Clustering
• K-Medoids Clustering
• Constrained K-Means Clustering
• PAM and CLARA
Introduction to Cluster Analysis
• The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering
• A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Formal Definition
• Cluster analysis
  – Statistical method for grouping a set of data objects into clusters
  – A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity
• Cluster: a collection of data objects
  – Intra-class similarity: objects are similar to objects in the same cluster
  – Inter-class dissimilarity: objects are dissimilar to objects in other clusters
• Clustering is unsupervised classification
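
To make intra-class similarity and inter-class dissimilarity concrete, here is a minimal sketch (the points and helper function are made up for illustration) that scores a two-cluster grouping by its average within-cluster and between-cluster distances:

import numpy as np

def avg_pairwise_distance(A, B):
    """Mean Euclidean distance over all pairs drawn from arrays A and B."""
    return np.mean([np.linalg.norm(x - y) for x in A for y in B])

# Hypothetical 2-D points already split into two clusters
cluster1 = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])
cluster2 = np.array([[8.0, 8.0], [7.8, 8.3], [8.2, 7.9]])

# Intra-class similarity: average distance within each cluster is small
# (self-pairs contribute zero distance; acceptable for a rough illustration)
intra = (avg_pairwise_distance(cluster1, cluster1) +
         avg_pairwise_distance(cluster2, cluster2)) / 2

# Inter-class dissimilarity: average distance across clusters is large
inter = avg_pairwise_distance(cluster1, cluster2)

print(f"avg intra-cluster distance: {intra:.2f}")  # small for a good clustering
print(f"avg inter-cluster distance: {inter:.2f}")  # large for a good clustering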


Supervised vs. Unsupervised Learning
• Unsupervised learning - clustering
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., establish the existence of clusters in the data
• Supervised learning - classification
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity
Clustering vs. Classification
• Clustering - learning by observations
  – Unsupervised
  – Input:
    • Clustering algorithm
    • Similarity measure
    • Number of clusters
  – No class information is given for the data objects
• Classification - learning by examples
  – Supervised
  – Consists of class-labeled training data examples
  – Build a classifier that assigns data objects to one of the classes
Clustering vs. Classification
• Class label attribute: loan_decision
• Learning of the classifier is "supervised" → it is told to which class each training tuple (sample) belongs

(Han & Kamber, Data Mining: Concepts and Techniques, Chapter 6, Page 287)
Clustering vs. Classification

• Clustering
  – Class label of each training tuple is not known
  – Number or set of classes to be learned may not be known in advance
  – e.g., if we did not have loan_decision data available, we would use clustering, NOT classification, to determine "groups of like tuples"
  – These "groups of like tuples" may eventually correspond to risk groups within the loan application data
Typical Requirements Of Clustering
• Minimal requirements for domain knowledge to determine input parameters
  – Many clustering algorithms require users to input certain parameters (such as the number of desired clusters)
  – The clustering results can be quite sensitive to input parameters
  – Parameters are often difficult to determine, especially for data sets containing high-dimensional objects
Typical Requirements Of Clustering
• Scalability
  – Many clustering algorithms work well on small data sets; a large database may contain millions of objects
  – Clustering on a sample of a given large data set may lead to biased results
  – Highly scalable clustering algorithms are needed
• Ability to deal with different types of attributes
  – Many algorithms are designed to cluster numerical data
  – Applications may require clustering other types of data: binary, categorical (nominal), and ordinal data, or mixtures of these data types
Typical Requirements Of Clustering

• Ability to deal with noisy data
  – Some clustering algorithms are sensitive to noisy data and may lead to clusters of poor quality
• Incremental clustering and insensitivity to the order of input records
• Constraint-based clustering
• Interpretability and usability
Typical Requirements Of Clustering
• Discovery of clusters with arbitrary shape
  – Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures
  – Algorithms based on such distance measures tend to find spherical clusters of similar size and density
  – A cluster could be of any shape, so it is important to develop algorithms that can detect clusters of arbitrary shape
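
For reference, a minimal sketch of the two distance measures mentioned above (the point values are made up):

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((p - q) ** 2))  # 5.0

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(p - q))          # 7.0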
Examples of Clustering Applications

• Marketing:
  – Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance:
  – Identifying groups of insurance policy holders with a high average claim cost
Examples of Clustering Applications
• City planning:
  – Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies:
  – Observed earthquake epicenters tend to be clustered along continental faults
• Fraud detection:
  – Detection of credit card fraud and monitoring of criminal activities in electronic commerce
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to step 2
• STOP when no new assignments are made
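
A minimal NumPy sketch of these steps (illustrative only; as is common in practice, initialization picks k objects as the initial seed points rather than forming an explicit first partition):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means on an (n, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # STOP: no assignment changed
        labels = new_labels
        # Step 2: recompute each centroid as the mean point of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster goes empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy usage with made-up 2-D points and k = 2
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = k_means(X, k=2)
print(labels)  # e.g. [0 0 1 1 1]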
The K-Means Clustering Method

Example (book, page 31): with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects, and repeat until no reassignment occurs.

[Figure: three scatter plots over a 0-10 grid showing the initial assignment, the updated cluster means, and the subsequent reassignments for K = 2.]
The k-Means Algorithm

The basic steps of k-means clustering are simple:

• Iterate until stable, i.e., there is no change in the clusters of objects:
  – Determine the centroid coordinates
  – Determine the distance of each object to the centroids
  – Group the objects based on minimum distance
Comments on the K-Means Method

• Strength:
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t << n
  – Compare PAM: O(k(n-k)²) per iteration, and CLARA: O(ks² + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum
  – The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Comments on the K-Means Method

• Weakness
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
Hierarchical Clustering
Hierarchical clustering: creating a hierarchical decomposition of the set of objects, using the similarity matrix as the clustering criterion

Two main algorithms: 1. Agglomerative method  2. Divisive method

Similarity matrix: linkage methods
- MIN
- MAX
- Group average
- Distance of centroids

Rokach, Lior, and Oded Maimon. "Clustering methods." Data Mining and Knowledge Discovery Handbook. Springer US, 2005. 321-352.
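
A minimal sketch of the four linkage (inter-cluster distance) rules above, for two clusters given as NumPy arrays (the helper and data layout are my own assumptions):

import numpy as np

def pairwise_distances(A, B):
    """All Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def min_linkage(A, B):        # MIN (single link): distance of the closest pair
    return pairwise_distances(A, B).min()

def max_linkage(A, B):        # MAX (complete link): distance of the farthest pair
    return pairwise_distances(A, B).max()

def group_average(A, B):      # Group average: mean distance over all pairs
    return pairwise_distances(A, B).mean()

def centroid_distance(A, B):  # Distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))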
Agglomerative Algorithm
Main steps:
1. Let each data point be a cluster
2. Initialize and compute the similarity matrix
Repeat:
3. Merge the two closest clusters
4. Update the similarity matrix
Until only a single cluster remains
5. Draw the dendrogram of the sequence of merges
6. Cut the dendrogram at a chosen level to form a particular clustering

Zhang et al. "Graph degree linkage: Agglomerative clustering on a directed graph." 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012.
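
A minimal sketch of these steps using single (MIN) linkage, tracking clusters as index sets; implementation details beyond the steps above (and the lazy recomputation of linkage distances instead of an explicit matrix update) are my own:

import numpy as np

def agglomerative_min_linkage(X):
    """Merge clusters of the rows of X until one remains; return the merge sequence."""
    # Steps 1-2: each point is a cluster; compute the pairwise distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [{i} for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Step 3: find the two closest clusters under MIN (single) linkage
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Step 4: merge the pair; linkage distances are recomputed from D next pass
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
    return merges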
Agglomerative Algorithm Example
Using the MIN linkage method.

[Figure: seven points p1-p7 in the plane; the two closest clusters are merged and the similarity matrix is updated at each step.]
Agglomerative Algorithm Example

Visualized as a dendrogram: A tree like diagram that


records the sequences of merges or splits
{P1, P2, P3, P4, P5, P6}, {P7}

{P1, P2, P3}, {P4}, {P5, P6}, {P7}

Different cuttings
generate different
clusters!
P1 P2 P3 P4 P5 P6 P7
Clustering: Hierarchical Methods
Hierarchical Clustering

• Agglomerative approach (bottom-up)
  – Initialization: each object is a cluster
  – Iteration: merge the two clusters that are most similar to each other, until all objects are merged into a single cluster

[Figure: objects a-e merged bottom-up over steps 0-4: a, b → ab; d, e → de; c, de → cde; ab, cde → abcde.]
Hierarchical Clustering

• Divisive approach (top-down)
  – Initialization: all objects stay in one cluster
  – Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object

[Figure: the same hierarchy read top-down over steps 0-4: abcde splits into ab and cde; cde splits into c and de; and so on down to singletons.]
Dendrogram
• A tree that shows how clusters are merged/split hierarchically
• Each node of the tree is a cluster; each leaf node is a singleton cluster
Dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
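
A minimal sketch of building and cutting a dendrogram with SciPy's scipy.cluster.hierarchy module (the data points are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points: two tight groups plus one distant point
X = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.3],
              [5.0, 5.0], [5.2, 5.1],
              [9.0, 0.0]])

# Build the merge hierarchy with single (MIN) linkage
Z = linkage(X, method='single')

# Cutting at different distance levels yields different clusterings
print(fcluster(Z, t=2.0, criterion='distance'))  # e.g. 3 clusters
print(fcluster(Z, t=6.0, criterion='distance'))  # fewer, coarser clusters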
Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique

• Basic algorithm is straightforward:
  1. Compute the distance matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
