Lecture 16
Lecture 16
1
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
2
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
Cluster analysis
Finding similarities between data according to the
3
Clustering: Application Examples
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Climate: understanding earth climate, find patterns of atmospheric
and ocean
4
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Hard (e.g., one customer belongs to only one region) vs. Soft
(e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidian) vs. connectivity-based (e.g.,
density)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
5
Clustering methods:
6
Clustering methods:
7
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
Density-based approach:
Based on connectivity and density functions
Grid-based approach:
based on a multiple-level granularity structure
8
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
9
Partitioning Algorithms: Basic Concept
10
The K-Means Clustering Method
11
What Is the Problem of the K-Means Method?
Some times centroids doesnot make sense. It need not be part of the
data. Eg: Let the data consist of binary vectors, centroids are not
binary vectors.
12
But,
13
But,
14
But,
15
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
16
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition.
Step 0 Step 1 Step 2 Step 3 Step 4 agglomerative
(AGNES)
a
ab
b
abcde
c
cde
d
de
e
divisive
(DIANA)
Step 4 Step 3 Step 2 Step 1 Step 0
17
Hierarchical Agglomerative Clustering:
Linkage Methods
The single linkage method is based on minimum
distance, or the nearest neighbor rule.
Cluster 1 Cluster 2
Complete Linkage
Maximum
Distance
Cluster 1 Cluster 2
Average Linkage
Average Distance
Cluster 1 Cluster 2
Dendrogram
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
21
Determine the Number of Clusters
Empirical method
# of clusters: k ≈√n/2 for a dataset of n points, e.g., n = 200, k = 10
Other methods:
Elbow method
22
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
23
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
24
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
Internal: unsupervised, criteria derived from data itself
Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
25
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
Internal: unsupervised, criteria derived from data itself
Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm
26
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
27
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Birch and Chameleon are interesting hierarchical clustering algorithms,
and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLU are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
28
Important point
29