Lecture 16

Data Clustering -- Introduction

Ref: Han and Kamber, Data Mining: Concepts and Techniques.

1
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

2
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis
 Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

3
Clustering: Application Examples
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data

4
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Hard (e.g., one customer belongs to only one region) vs. Soft
(e.g., one document may belong to more than one class)
 Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)

5
Clustering methods:

[Figure (two slides): taxonomy of clustering methods; diagram content not recoverable]

7
Major Clustering Approaches

 Partitioning approach:
 Construct various partitions and then evaluate them by some

criterion
 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)

using some criterion


 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

 Density-based approach:
 Based on connectivity and density functions

 Typical methods: DBSCAN, OPTICS, DENCLUE

 Grid-based approach:
Based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

8
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

9
Partitioning Algorithms: Basic Concept

 Partitioning method: Partitioning a database D of n objects into a set of k clusters C1, ..., Ck, such that the sum of squared distances

    E = Σ_{i=1..k} Σ_{p ∈ Ci} dist(p, ci)^2

is minimized (where ci is the centroid or medoid of cluster Ci)
 Global optimum: exhaustively enumerate all partitions (intractable in practice)
 Heuristic methods: the k-means and k-medoids algorithms

10
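The criterion above can be made concrete in code. A minimal sketch of the within-cluster sum of squared errors, assuming points are coordinate tuples grouped by cluster (the function name and data layout are illustrative, not from the slides):

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid:
    E = sum_i sum_{p in Ci} ||p - ci||^2, with ci taken as the mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in points)
    return total
```

For example, the clustering {(0,0), (2,0)} and {(10,0)} has centroid (1,0) for the first cluster, giving E = 1 + 1 + 0 = 2.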
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Each cluster is represented by its centroid (the mean of its members)
 Start with an initial partition into k blocks
 Refine the partition based on the criterion: reassign each object to its nearest centroid, then recompute the centroids
 Stop when no more correction is possible

11
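The steps above can be sketched as a plain-Python version of Lloyd's algorithm. This is a minimal illustration, not a production implementation; the random seeding of initial centroids and all helper names are my own choices:

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """k-means sketch: seed k centroids from the data, then alternate
    (1) assigning each point to its nearest centroid and
    (2) refining each centroid to the mean of its assigned points,
    stopping when the assignment no longer changes."""
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = list(rng.sample(points, k))
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda j, p=p: sum((p[d] - centroids[j][d]) ** 2
                                       for d in range(dim)))
            for p in points
        ]
        if new_assignment == assignment:  # no more correction possible
            break
        assignment = new_assignment
        # Refinement step: recompute each centroid as the member mean.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(p[d] for p in members) / len(members)
                                     for d in range(dim))
    return centroids, assignment
```

On two well-separated groups, e.g. `[(0, 0), (0, 1), (10, 10), (10, 11)]` with k = 2, the loop converges to the centroids (0, 0.5) and (10, 10.5).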
What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!
 An object with an extremely large value may substantially distort the distribution of the data

 K-Medoids: Instead of taking the mean value of the objects in a cluster as the reference point, use a medoid: the most centrally located object in the cluster

 Sometimes the centroid does not make sense: it need not be part of the data. E.g., if the data consist of binary vectors, the centroids are generally not binary vectors.

12
But,

 Finding a medoid is costly:
 A centroid can be found in linear time.
 Finding a medoid can take quadratic time.
 Partitioning Around Medoids (PAM) is a costly method; we have approximations.
 Finding the centroid itself might be difficult in some situations (kernel k-means).

15
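The linear-vs-quadratic contrast above can be shown directly. A sketch assuming Euclidean distance and points as tuples (a full PAM implementation would also iteratively swap medoids, which this omits):

```python
def centroid(points):
    """Coordinate-wise mean: one pass over the data, O(n) time.
    Note the result need not be one of the input points."""
    n, dim = len(points), len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def medoid(points):
    """The data point minimizing total distance to all other points.
    Requires all pairwise distances, so O(n^2) time, but the result
    is always an actual data object (e.g., stays binary for binary data)."""
    def total_dist(p):
        return sum(sum((p[d] - q[d]) ** 2 for d in range(len(p))) ** 0.5
                   for q in points)
    return min(points, key=total_dist)
```

For `[(0, 0), (1, 0), (10, 0)]` the centroid is (11/3, 0), a point not in the data, while the medoid is the actual data point (1, 0).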
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Evaluation of Clustering
 Summary

16
Hierarchical Clustering
 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: AGNES (agglomerative) merges a+b → ab and d+e → de, then c+de → cde, then ab+cde → abcde over steps 0-4; DIANA (divisive) performs the same splits in the reverse order, steps 4-0]

17
Hierarchical Agglomerative Clustering:
Linkage Methods
 The single linkage method is based on the minimum distance, or the nearest neighbor rule.

 The complete linkage method is based on the maximum distance, or the furthest neighbor approach.

 In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster.
[Figure: single linkage uses the minimum distance between Cluster 1 and Cluster 2; complete linkage the maximum distance; average linkage the average distance. A dendrogram visualizes the resulting merge hierarchy]
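The three linkage rules can be sketched in one naive AGNES-style loop. This brute-force version is for illustration only; the function and variable names are my own, and practical systems (e.g., BIRCH) use far more efficient structures:

```python
def agglomerative(points, linkage="single"):
    """Naive agglomerative (AGNES) sketch: start with singleton clusters,
    repeatedly merge the closest pair under the chosen linkage, and
    record each merge (the merge sequence defines the dendrogram)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        pairs = [dist(p, q) for p in c1 for q in c2]
        if linkage == "single":    # minimum distance / nearest neighbor
            return min(pairs)
        if linkage == "complete":  # maximum distance / furthest neighbor
            return max(pairs)
        return sum(pairs) / len(pairs)  # average linkage

    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage rule.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

On the 1-D points 0, 1, 5, 6, 7 with single linkage, the first merge joins 0 and 1, and n points always take exactly n-1 merges to reach one cluster.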
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

21
Determine the Number of Clusters
 Empirical method
 # of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10

 Other methods:
 Elbow method

 Density based methods

22
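The empirical rule above is one line of code; a sketch (the function name is mine):

```python
from math import sqrt

def rule_of_thumb_k(n):
    """Empirical estimate of the number of clusters: k ~ sqrt(n / 2).
    Only a starting point; methods like the elbow heuristic, which picks
    the k where the SSE-vs-k curve bends, should refine it."""
    return round(sqrt(n / 2))
```

For n = 200 this gives k = 10, matching the slide's example.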
Measuring Clustering Quality
 3 kinds of measures: External, internal and relative
 External: supervised, employ criteria not inherent to the dataset
 Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
 Internal: unsupervised, criteria derived from data itself
 Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
 Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm

26
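The silhouette coefficient mentioned above can be sketched in plain Python (scikit-learn also provides `silhouette_score`; this toy version assumes Euclidean distance):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient. For each point: a = mean distance to
    its own cluster, b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b). Near 1 means compact, well-separated
    clusters; near 0 or negative suggests a poor clustering."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:  # singleton cluster: silhouette 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / sum(1 for j in range(len(points)) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Two tight, well-separated groups such as {(0,0), (0,1)} and {(10,0), (10,1)} score around 0.9, reflecting both compactness and separation.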
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

27
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways

28
Important point

 Are we solving a solvable problem?


 An impossibility theorem on clustering (Kleinberg)
 Possibility arguments (Ackerman)

29
