Chap8-Cluster Analysis
(Figure: PAM illustration with K = 2. Select initial K medoids randomly: arbitrarily choose K objects as initial medoids and assign each remaining object to its nearest medoid. Then repeat: randomly select a non-medoid object O_random, compute the cost of swapping, and swap a medoid m with O_random if the quality is improved.)
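The steps in the figure translate directly into code. Below is a minimal NumPy sketch of a PAM-style swap loop (the function name `pam` and the greedy full-swap search are my own choices for illustration, not the slide's exact pseudocode): the quality is the total distance of every object to its nearest medoid, and a medoid is swapped with a non-medoid whenever the swap improves that quality.

```python
import numpy as np

def pam(X, k, rng=None, max_iter=100):
    """Greedy PAM-style K-medoids on an (n, d) array X (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)            # arbitrary initial medoids

    def total_cost(meds):
        # Quality = total distance of every object to its nearest medoid
        return dist[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                          # try swapping each medoid ...
            for o in range(len(X)):                 # ... with each non-medoid object
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o
                c = total_cost(candidate)
                if c < cost:                        # swap only if quality improves
                    medoids, cost, improved = candidate, c, True
        if not improved:                            # no improving swap left: stop
            break
    labels = dist[:, medoids].argmin(axis=1)        # assign objects to nearest medoid
    return medoids, labels
```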
(Figure: dendrogram over the objects a, b, c, d, e with nested clusters {a, b}, {d, e}, {c, d, e}, {a, b, c, d, e}; the divisive method (DIANA) reads it top-down, from Step 0 (the single cluster abcde) to Step 4 (singletons).)
Dendrogram: How Clusters are
Merged
• Dendrogram: Decompose a set of data objects into a tree of clusters
by multi-level nested partitioning
• A clustering of the data objects is obtained by cutting the dendrogram
at the desired level, then each connected component forms a cluster
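As a hedged illustration of cutting a dendrogram at a desired level, the sketch below uses SciPy's hierarchical clustering utilities (the toy data and the cut thresholds are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(20, 2)                 # toy data, for illustration only
Z = linkage(X, method="average")          # build the merge tree (dendrogram)

# Cutting the dendrogram: all merges below the cut height stay together,
# so each connected component below the cut becomes one cluster.
labels_by_height = fcluster(Z, t=0.5, criterion="distance")  # cut at height 0.5
labels_by_count = fcluster(Z, t=3, criterion="maxclust")     # or request 3 clusters
```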
Single Link vs. Complete Link in Hierarchical Clustering
• Average link: The average distance between an element in one cluster and an element in the other (i.e., over all pairs from the two clusters)
• Expensive to compute
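A small sketch of the three inter-cluster distances on two hypothetical point sets (the arrays A and B are invented for illustration): single link takes the minimum pairwise distance, complete link the maximum, and average link the mean over all pairs, which is why it is the most expensive of the three.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1.0, 1.0], [2.0, 1.5]])                 # hypothetical cluster 1
B = np.array([[6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])     # hypothetical cluster 2
d = cdist(A, B)               # all |A| x |B| pairwise distances
single_link = d.min()         # closest pair
complete_link = d.max()       # farthest pair
average_link = d.mean()       # mean over all pairs (the costly one in general)
```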
Divisive Clustering Is a Top-down
Approach
• The process starts at the root with all the points as one cluster
• It recursively splits the higher level clusters to build the dendrogram
• Can be considered as a global approach
• More efficient when compared with agglomerative clustering
More on Algorithm Design for
Divisive Clustering
• Choosing which cluster to split
• Check the sums of squared errors (SSE) of the clusters and choose the one with the largest value (see the sketch after this list)
• Splitting criterion: Determining how to split
• One may use Ward’s criterion: choose the split that yields the greatest reduction in SSE
• For categorical data, the Gini index can be used
• Handling the noise
• Use a threshold to determine the termination criterion (do not generate clusters that are too small because they contain mainly noise)
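As referenced above, here is a minimal sketch of the SSE-based choice of which cluster to split next (the helper names are hypothetical, not from the slides):

```python
import numpy as np

def sse(cluster):
    """Sum of squared errors of one cluster: squared distances to its centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

def choose_cluster_to_split(clusters):
    """clusters: list of (n_i, d) arrays; return the index of the largest-SSE cluster."""
    return int(np.argmax([sse(c) for c in clusters]))
```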
Extensions to Hierarchical Clustering
• Weakness of the agglomerative & divisive hierarchical clustering
methods
• No revisit: cannot undo any merge/split decisions made before
• Scalability bottleneck: Each merge/split needs to examine many possible options
• Time complexity: at least O(n²), where n is the total number of objects
• Several other hierarchical clustering algorithms
• BIRCH (1996): Use CF-tree and incrementally adjust the quality of sub-clusters
• CURE (1998): Represent a cluster using a set of well-scattered representative
points
• CHAMELEON (1999): Use graph partitioning methods on the K-nearest neighbor
graph of the data
BIRCH: A Multi-Phase Hierarchical
Clustering Method
• BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
• Developed by Zhang, Ramakrishnan & Livny (SIGMOD’96)
• Impacted many new clustering methods and applications (received the 2006 SIGMOD Test
of Time Award)
• Major innovation
• Integrating hierarchical clustering (initial micro-clustering phase) and other
clustering methods (at the later macro-clustering phase)
• Multi-phase hierarchical clustering
• Phase 1 (initial micro-clustering): Scan the DB to build an initial CF-tree, a multi-level
compression of the data that preserves its inherent clustering structure
• Phase 2 (later macro-clustering): Use an arbitrary clustering algorithm (e.g., iterative
partitioning) to flexibly cluster the leaf nodes of the CF-tree
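A hedged usage sketch with scikit-learn's Birch, which follows the same two-phase idea: the CF-tree is built in a single scan (micro-clustering), and setting n_clusters triggers a global clustering of the CF-tree leaf entries (macro-clustering). The parameter values and toy data below are illustrative only; scikit-learn's threshold bounds the subcluster radius, its analogue of the maximum-diameter parameter.

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(1000, 2)                                  # toy data, for illustration only
birch = Birch(threshold=0.05, branching_factor=50, n_clusters=10)
labels = birch.fit_predict(X)                                # phase 1 + phase 2 in one call
```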
Clustering Feature Vector
• Consider a cluster of multi-dimensional data objects/points
• The clustering feature (CF) of the cluster is a 3-D vector summarizing information about the cluster of objects
• It registers the 0-th, 1st, and 2nd moments of the cluster
• Clustering Feature (CF): CF = <n, LS, SS>, where n is the number of points, LS is the linear sum of the points, and SS is the square sum of the points
• Example (see the figure below): CF1 = <5, (16, 30), 244>
(Figure: five points (3,4), (2,6), (4,5), (4,7), (3,8).)
n = 5; LS = ((3+2+4+4+3), (4+6+5+7+8)) = (16, 30); SS = (3²+2²+4²+4²+3²) + (4²+6²+5²+7²+8²) = 54 + 190 = 244
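A quick NumPy check of the clustering feature for the five points in the figure:

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
n = len(points)              # 5
LS = points.sum(axis=0)      # linear sum  -> [16. 30.]
SS = (points ** 2).sum()     # square sum  -> 244.0
print(n, LS, SS)             # CF1 = <5, (16, 30), 244>
```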
Essential Measures of Cluster:
Centroid, Radius and Diameter
• Centroid: $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i = \mathbf{LS}/n$
• The “middle” of a cluster
• n: number of points in the cluster; $\mathbf{x}_i$ is the i-th point in the cluster
• Radius: R
• Average distance from member objects to the centroid
• The square root of the average squared distance from any point of the cluster to its centroid
$R=\sqrt{\frac{\sum_{i=1}^{n}\|\mathbf{x}_i-\bar{\mathbf{x}}\|^{2}}{n}}=\sqrt{\frac{SS}{n}-\left\|\frac{\mathbf{LS}}{n}\right\|^{2}}$
• Diameter: D
• Average pairwise distance within a cluster
• The square root of the average squared distance between all pairs of points in the cluster
$D=\sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}\|\mathbf{x}_i-\mathbf{x}_j\|^{2}}{n(n-1)}}=\sqrt{\frac{2n\cdot SS-2\|\mathbf{LS}\|^{2}}{n(n-1)}}$
Example
• Continuing the previous example:
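The worked numbers of this example slide are not in the text above, so the following sketch recomputes them from CF1 = <5, (16, 30), 244> using the formulas of the previous slide (values computed here, not copied from the slide):

```python
import numpy as np

n, LS, SS = 5, np.array([16.0, 30.0]), 244.0
centroid = LS / n                                          # (3.2, 6.0)
R = np.sqrt(SS / n - (LS / n) @ (LS / n))                  # sqrt(2.56) = 1.6
D = np.sqrt((2 * n * SS - 2 * LS @ LS) / (n * (n - 1)))    # sqrt(6.4) ~= 2.53
print(centroid, R, D)
```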
CF Tree: A Height-Balanced Tree
Storing Clustering Features for
Hierarchical Clustering
• Incremental insertion of new points (similar to B+-tree)
• For each point in the input
• Find its closest leaf entry
• Add point to leaf entry and update CF
• If entry diameter > max_diameter
• Split leaf, and possibly parents
• A CF tree has two parameters
• Branching factor: Maximum number of children
• Maximum diameter of sub-clusters stored at the leaf nodes
• A CF tree: A height-balanced tree that stores the clustering features (CFs)
• The non-leaf nodes store sums of the CFs of their children
(Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1, CF2, CF3, ..., CF6 with child pointers child1, child2, child3, ..., child6; a non-leaf node holds entries CF11, CF12, CF13, ..., CF15 with child pointers child11, child12, child13, ..., child15.)
• Density-reachable:
• A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi
• Density-connected:
• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: The Algorithm
(Figure legend: core point: dense neighborhood; border point: in a cluster but its neighborhood is not dense; outlier/noise: not in any cluster)
• Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are directly density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
• Computational complexity
• If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects
• Otherwise, the complexity is O(n²)
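A hedged usage sketch with scikit-learn's DBSCAN (its eps and min_samples play the roles of Eps and MinPts; the data and parameter values are invented for illustration). Points labeled -1 are the noise/outlier points that belong to no cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 2)                         # toy data, for illustration only
db = DBSCAN(eps=0.05, min_samples=5).fit(X)
labels = db.labels_                                # cluster id per point, -1 = noise
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True          # True for core points
border_mask = (labels != -1) & ~core_mask          # in a cluster but not dense
```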
DBSCAN Is Sensitive to the Setting
of Parameters
Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8), 1999
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
• Assessing clustering tendency
• Determining the number of clusters
• Measuring clustering quality: extrinsic methods
• Intrinsic methods
Evaluation of Clustering: Basic
Concepts
• Evaluation of clustering
• Assess the feasibility of clustering analysis on a data set
• Evaluate the quality of the results generated by a clustering method
• Major issues in clustering assessment and validation
• Clustering tendency: assessing the suitability of clustering, i.e., whether the data has any inherent grouping structure
• Determining the number of clusters: finding, for a dataset, the right number of clusters that may lead to a good-quality clustering
• Clustering quality evaluation: evaluating the quality of the clustering results
Clustering Tendency: Whether the
Data Contains Inherent Grouping
Structure
• Assess the suitability of clustering
• Whether the data has any “inherent grouping structure” — non-random structure that may
lead to meaningful clusters
• Determine clustering tendency or clusterability
• A hard task because there are so many different definitions of clusters
• Different definitions: Partitioning, hierarchical, density-based and graph-based
• Even fixing a type, still hard to define an appropriate null model for a data set
• There are some clusterability assessment methods, such as
• Spatial histogram: Contrast the histogram of the data with that generated from random
samples
• Distance distribution: Compare the pairwise point distances from the data with those from the randomly generated samples
• Hopkins Statistic: A sparse sampling test for spatial randomness
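As a sketch of the third option, the Hopkins statistic can be computed as below (my own helper, assuming the common convention in which values near 0.5 suggest spatial randomness and values near 1 suggest clusterable data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=None):
    """Hopkins statistic sketch: ~0.5 for random data, near 1 for clustered data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                       # sparse sample size
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # w: nearest-neighbor distances of m sampled data points (skip the self-distance)
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]
    # u: nearest-neighbor distances of m uniform random points in the data's bounding box
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]
    return u.sum() / (u.sum() + w.sum())
```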
Testing Clustering Tendency: A
Spatial Histogram Approach
• Spatial Histogram Approach: Contrast the d-dimensional histogram of
the input dataset D with the histogram generated from random
samples
• Dataset D is clusterable if the distributions of the two histograms are rather different
Measuring Clustering Quality: Extrinsic Methods
• Extrinsic methods examine how well the clustering results match the ground truth in partitioning the objects in the data set
• Information theory-based methods
• Compare the distribution of the clustering results and that of the ground truth
• Information theory (e.g., entropy) used to quantify the comparison
• Ex. Conditional entropy, normalized mutual information (NMI)
• Pairwise comparison-based methods
• Treat each group in the ground truth as a class, and then check the pairwise
consistency of the objects in the clustering results
• Ex. Four possibilities: TP, FN, FP, TN; Jaccard coefficient
Matching-Based Methods
(Table: ground-truth groups G1, G2, G3 vs. clusters C1, C2, C3 for an example of 11 objects)
• Consider 11 objects partitioned into ground-truth groups G1-G3 and grouped by the clustering into C1-C3
• Other methods: maximum matching; F-measure
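As an illustration of the matching idea, the sketch below computes purity, a common matching-based measure, on hypothetical labels (invented here, not the slide's 11-object example): each cluster is matched to the ground-truth group it overlaps most.

```python
import numpy as np

truth   = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # hypothetical ground-truth groups
cluster = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])   # hypothetical clustering
M = np.zeros((truth.max() + 1, cluster.max() + 1), dtype=int)
for g, c in zip(truth, cluster):
    M[g, c] += 1                                  # contingency counts |G_i ∩ C_j|
purity = M.max(axis=0).sum() / len(truth)         # best-matching group per cluster
```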
Information Theory-Based Methods (I): Conditional Entropy
(Table: the same example of 11 objects, ground-truth groups G1-G3 vs. clusters C1-C3)
• Consider 11 objects
• Note: conditional entropy cannot detect the issue that C1 splits the objects in G into two
Information Theory-Based Methods (II)
Normalized Mutual Information (NMI)
• Mutual information
• Quantify the amount of shared info between the clustering C and the ground-
truth partitioning G
• Measure the dependency between the observed joint probability of C and G,
and the expected joint probability under the independence assumption
• When C and G are independent, I(C, G) = 0
• However, there is no upper bound on the mutual information
• Normalized mutual information
• Value range of NMI: [0,1]
• Value close to 1 indicates a good clustering
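A hedged sketch computing mutual information and NMI with scikit-learn (the label arrays are hypothetical, for illustration only):

```python
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

truth   = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # hypothetical ground truth
cluster = [0, 0, 1, 1, 1, 2, 2, 2, 2]      # hypothetical clustering
mi = mutual_info_score(truth, cluster)                  # unbounded above
nmi = normalized_mutual_info_score(truth, cluster)      # rescaled into [0, 1]
```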
Pairwise Comparison-Based
Methods: Jaccard Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same cluster/group, the
assignment is regarded as positive; otherwise, negative
• Depending on assignments, we have four possible cases:
Note: total number of pairs of points: $N=\binom{n}{2}=\frac{n(n-1)}{2}$
• Jaccard coefficient: Ignoring the true negatives (thus asymmetric)
• Jaccard = TP/(TP + FN + FP) [i.e., denominator ignores TN]
• Jaccard = 1 if perfect clustering
• Many other measures are based on the pairwise comparison statistics:
• Rand statistic
• Fowlkes-Mallows measure
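A minimal sketch of the pairwise bookkeeping on hypothetical labels: every one of the n(n-1)/2 object pairs falls into exactly one of TP, FN, FP, TN, from which the Jaccard coefficient and the Rand statistic follow.

```python
from itertools import combinations

truth   = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # hypothetical labels, for illustration
cluster = [0, 0, 1, 1, 1, 2, 2, 2, 2]
tp = fn = fp = tn = 0
for i, j in combinations(range(len(truth)), 2):   # every one of the n(n-1)/2 pairs
    same_g = truth[i] == truth[j]
    same_c = cluster[i] == cluster[j]
    tp += same_g and same_c                       # together in both
    fn += same_g and not same_c                   # together in the ground truth only
    fp += same_c and not same_g                   # together in the clustering only
    tn += (not same_g) and (not same_c)
jaccard = tp / (tp + fn + fp)                     # ignores TN
rand = (tp + tn) / (tp + fn + fp + tn)            # Rand statistic uses all four
```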
Intrinsic Methods (I): Dunn Index
• Intrinsic methods (i.e., no ground truth) examine how compact clusters are
and how well clusters are separated, based on similarity/distance measure
between objects
• Dunn Index:
• The compactness of clusters: measured by the maximum distance between two points that belong to the same cluster
• The degree of separation among different clusters: measured by the minimum distance between two points that belong to different clusters
• The Dunn index is simply the ratio of this minimum inter-cluster distance to this maximum intra-cluster distance; the larger the ratio, the farther apart the clusters are compared with the compactness of the clusters
• The Dunn index uses the extreme distances to measure cluster compactness and inter-cluster separation, so it can be affected by outliers (see the sketch below)
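A sketch of the Dunn index on hypothetical clusters (the helper name and toy data are my own): the minimum inter-cluster point distance divided by the maximum intra-cluster point distance, which makes the sensitivity to single extreme points and outliers easy to see.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(clusters):
    """clusters: list of (n_i, d) arrays, one per cluster."""
    max_intra = max(pdist(c).max() for c in clusters if len(c) > 1)   # compactness
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])                        # separation
    return min_inter / max_intra

rng = np.random.default_rng(0)
toy = [rng.normal(loc=c, scale=0.2, size=(30, 2)) for c in (0.0, 3.0, 6.0)]
print(dunn_index(toy))
```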
Intrinsic Methods (II): Silhouette
Coefficient
• Suppose D is partitioned into k clusters: C1, ..., Ck
• For each object o in D, we calculate
• a(o): the average distance between o and all other objects in the cluster to which o belongs; it reflects the compactness of the cluster to which o belongs
• b(o): the minimum average distance from o to the objects of each cluster to which o does not belong; it captures the degree to which o is separated from other clusters
• Silhouette coefficient: s(o) = (b(o) - a(o)) / max{a(o), b(o)}, with value range (-1, 1)
• When the value of s(o) approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case
• When the value is negative (i.e., b(o) < a(o)), o is closer to the objects in another cluster than to the objects in the same cluster as o: a bad situation to be avoided
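A hedged sketch with scikit-learn: silhouette_samples returns one s(o) per object and silhouette_score their average (the toy data and the k-means step are illustrative only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.rand(300, 2)                       # toy data, for illustration only
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
s_per_object = silhouette_samples(X, labels)     # one s(o) per object, in (-1, 1)
s_mean = silhouette_score(X, labels)             # average silhouette coefficient
```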