
Chapter 8
Cluster Analysis: Basic Concepts and Methods
Outline
• Cluster analysis
• What is cluster analysis?
• Requirements for cluster analysis
• Overview of basic clustering methods
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
Cluster Analysis: A Quick Overview
• When flying over a city, one can easily identify fields, forests,
commercial areas, and residential areas based on their features,
without anyone’s explicit “training”—This is the power of cluster
analysis
• This chapter and the next systematically study cluster analysis
methods and help answer the following:
• What are the different proximity measures for effective clustering?
• Can we cluster a massive number of data points efficiently?
• Can we find clusters of arbitrary shape? At multiple levels of granularity?
• How can we judge the quality of the clusters discovered by our system?
What Is Cluster Analysis?
• What is a cluster?
• A cluster is a collection of data objects which are
• Similar (or related) to one another within the same group (i.e., cluster)
• Dissimilar (or unrelated) to the objects in other groups (i.e., clusters)
• Cluster analysis (or clustering, data segmentation, …)
• Given a set of data points, partition them into a set of groups (i.e., clusters) which are
as similar as possible
• Cluster analysis is unsupervised learning (i.e., no predefined classes)
• This contrasts with classification (i.e., supervised learning)
• Typical ways to use/apply cluster analysis
• As a stand-alone tool to get insight into data distribution, or
• As a preprocessing (or intermediate) step for other algorithms
Cluster Analysis: Applications
• A key intermediate step for other data mining tasks
• Generating a compact summary of data for classification, pattern discovery, hypothesis
generation and testing, etc.
• Outlier detection: Outliers—those “far away” from any cluster
• Data summarization, compression, and reduction
• Ex. Image processing: Vector quantization
• Collaborative filtering, recommendation systems, or customer segmentation
• Find like-minded users or similar products
• Dynamic trend detection
• Clustering stream data and detecting trends and patterns
• Multimedia data analysis, biological data analysis and social network analysis
• Ex. Clustering images or video/audio clips, gene/protein sequences, etc.
Considerations for Cluster Analysis
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable, e.g., grouping topical terms)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g.,
density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)
Requirements and Challenges
• Quality
• Ability to deal with different types of attributes: Numerical, categorical, text,
multimedia, networks, and mixture of multiple types
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Scalability
• Clustering all the data instead of only on samples
• High dimensionality
• Incremental or stream clustering and insensitivity to input order
• Constraint-based clustering
• User-given preferences or constraints; domain knowledge; user queries
• Interpretability and usability
Cluster Analysis: A Multi-Dimensional Categorization
• Technique-Centered
• Distance-based methods
• Density-based and grid-based methods
• Probabilistic and generative models
• Leveraging dimensionality reduction methods
• High-dimensional clustering
• Scalable techniques for cluster analysis
• Data Type-Centered
• Clustering numerical data, categorical data, text data, multimedia data, time-series
data, sequences, stream data, networked data, uncertain data
• Additional Insight-Centered
• Visual insights, semi-supervised, ensemble-based, validation-based
Typical Clustering Methodologies
• Distance-based methods
• Partitioning algorithms: K-Means, K-Medians, K-Medoids
• Hierarchical algorithms: Agglomerative vs. divisive methods
• Density-based and grid-based methods
• Density-based: Data space is explored at a high-level of granularity and then post-processing to
put together dense regions into an arbitrary shape
• Grid-based: Individual regions of the data space are formed into a grid-like structure
• Probabilistic and generative models: Modeling data from a generative process
• Assume a specific form of the generative model (e.g., mixture of Gaussians)
• Model parameters are estimated with the Expectation-Maximization (EM) algorithm (using the
available dataset, for a maximum likelihood fit)
• Then estimate the generative probability of the underlying data points
• High-dimensional clustering
High-Dimensional Clustering
• Subspace clustering: Find clusters on various subspaces
• Bottom-up, top-down, correlation-based methods vs. δ-cluster methods
• Dimensionality reduction: A vertical form (i.e., columns) of clustering
• Columns are clustered; may cluster rows and columns together (co-clustering)
• Probabilistic latent semantic indexing (PLSI) then LDA: Topic modeling of text data
• A cluster (i.e., topic) is associated with a set of words (i.e., dimensions) and a set of
documents (i.e., rows) simultaneously
• Nonnegative matrix factorization (NMF) (as one kind of co-clustering)
• A nonnegative matrix A (e.g., word frequencies in documents) can be approximately factorized into two nonnegative low-rank matrices U and V
• Spectral clustering: Use the spectrum of the similarity matrix of the data to
perform dimensionality reduction for clustering in fewer dimensions
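• As an illustration of NMF-style co-clustering, a brief scikit-learn sketch (the document–term matrix A and the number of topics are made up for the example; this is not code from the slides):
```python
import numpy as np
from sklearn.decomposition import NMF

# Toy nonnegative document-term matrix A (rows: documents, columns: word frequencies)
A = np.random.default_rng(0).poisson(1.0, size=(100, 500)).astype(float)

# Approximate A ≈ U @ V with two nonnegative low-rank factors
model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
U = model.fit_transform(A)       # document-by-topic weights
V = model.components_            # topic-by-word weights

doc_clusters = U.argmax(axis=1)  # each document assigned to its dominant topic/cluster
```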
Clustering Different Types of Data (I)
• Numerical data
• Most of the earliest clustering algorithms were designed for numerical data
• Categorical data (including binary data)
• Discrete data, no natural order (e.g., sex, race, zip-code, and market-basket)
• Text data: Popular in social media, Web, and social networks
• Features: High-dimensional, sparse, value corresponding to word frequencies
• Methods: Combination of k-means and agglomerative; topic modeling; co-clustering
• Multimedia data: Image, audio, video (e.g., on Flickr, YouTube)
• Multi-modal (often combined with text data)
• Contextual: Containing both behavioral and contextual attributes
• Images: Position of a pixel represents its context, value represents its behavior
• Video and music data: Temporal ordering of records represents its meaning
Clustering Different Types of Data
(II)
• Time-series data: Sensor data, stock markets, temporal tracking, forecasting, etc.
• Data are temporally dependent
• Time: contextual attribute; data value: behavioral attribute
• Correlation-based online analysis (e.g., online clustering of stock to find stock tickers)
• Shape-based offline analysis (e.g., cluster ECG based on overall shapes)
• Sequence data: Weblogs, biological sequences, system command sequences
• Contextual attribute: Placement (rather than time)
• Similarity functions: Hamming distance, edit distance, longest common subsequence
• Sequence clustering: Suffix tree; generative model (e.g., Hidden Markov Model)
• Stream data:
• Real-time, evolution and concept drift, single pass algorithm
• Create efficient intermediate representation, e.g., micro-clustering
Clustering Different Types of Data
(III)
• Graphs and homogeneous networks
• Every kind of data can be represented as a graph with similarity values as edges
• Methods: Generative models; combinatorial algorithms (graph cuts); spectral methods; non-
negative matrix factorization methods
• Heterogeneous networks
• A network consists of multiple typed nodes and edges (e.g., bibliographical data)
• Clustering different typed nodes/links together (e.g., NetClus)
• Uncertain data: Noise, approximate values, multiple possible values
• Incorporation of probabilistic information will improve the quality of clustering
• Big data: Modern systems may store and process very big data (e.g., weblogs)
• Ex. Google’s MapReduce framework
• Use Map function to distribute the computation across different machines
• Use Reduce function to aggregate results obtained from the Map step
User Insights and Interactions in
Clustering
• Visual insights: One picture is worth a thousand words
• Human eyes: High-speed processor linking with a rich knowledge-base
• A human can provide intuitive insights; HD-eye: visualizing HD clusters
• Semi-supervised insights: Passing user’s insights or intention to system
• User-seeding: A user provides a number of labeled examples, approximately
representing categories of interest
• Multi-view and ensemble-based insights
• Multi-view clustering: Multiple clusterings represent different perspectives
• Multiple clustering results can be ensembled to provide a more robust solution
• Validation-based insights: Evaluation of the quality of clusters generated
• May use case studies, specific measures, or pre-existing labels
Outline
• Cluster analysis
• Partitioning methods
• K-Means: a centroid-based technique
• Variations of k-means
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
Partitioning Algorithms: Basic
Concepts
• Partitioning method: Discovering the groupings in the data by optimizing a
specific objective function and iteratively improving the quality of partitions
• K-partitioning method: Partitioning a dataset D of n objects into a set of K clusters so that an objective function is optimized (e.g., the sum of squared distances is minimized, where c_i is the centroid or medoid of cluster C_i)
• A typical objective function: Sum of Squared Errors (SSE)
  • SSE(C) = Σ_{i=1..K} Σ_{x ∈ C_i} ‖x − c_i‖²
• Problem definition: Given K, find a partition of K clusters that optimizes the chosen partitioning criterion
• Global optimal: Needs to exhaustively enumerate all partitions
• Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-Medoids, etc.
The K-Means Clustering Method
• K-Means (MacQueen’67, Lloyd’57/’82)
• Each cluster is represented by the center of the cluster
• Given K, the number of clusters, the K-Means clustering algorithm is
outlined as follows
• Select K points as initial centroids
• Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
• Until convergence criterion is satisfied
• Different kinds of measures can be used
• Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
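• To make the loop above concrete, a minimal NumPy sketch of Lloyd-style K-Means (illustrative; function and variable names are not from the slides):
```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd-style K-Means. X: (n, d) array, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Select K distinct points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence criterion: assignments no longer change
        labels = new_labels
        # Re-compute the centroid (mean point) of each non-empty cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Example usage on a small synthetic dataset with two well-separated groups
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(X, k=2)
```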
Example: K-Means Clustering
• [Figure: execution of the K-Means clustering algorithm on a sample dataset, starting from K = 2 randomly selected centroids and alternating between assigning points to clusters, recomputing cluster centers, and redoing the point assignment]
• Select K points as initial centroids
• Repeat
  • Form K clusters by assigning each point to its closest centroid
  • Re-compute the centroids (i.e., mean point) of each cluster
• Until convergence criterion is satisfied
Discussion on the K-Means Method
• Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations
• Normally, K, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimum
• Initialization can be important to find high-quality clusters
• Need to specify K, the number of clusters, in advance
• There are ways to automatically determine the “best” K
• In practice, one often runs the algorithm with a range of K values and selects the “best” one
• Sensitive to noisy data and outliers
• Variations: Using K-medians, K-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional space
• Using the K-modes for categorical data
• Not suitable to discover clusters with non-convex shapes
• Using density-based clustering, kernel K-means, etc.
Variations of K-Means
• Choosing better initial centroid estimates
• K-means++, Intelligent K-Means, Genetic K-Means
• Choosing different representative prototypes for the clusters
• K-Medoids, K-Medians, K-Modes
• Applying feature transformation techniques
• Weighted K-Means, Kernel K-Means
Initialization of K-Means
• Different initializations may generate rather different clustering results (some
could be far from optimal)
• Original proposal (MacQueen’67): Select K seeds randomly
• Need to run the algorithm multiple times using different seeds
• There are many methods proposed for better initialization of the K seeds
• K-Means++ (Arthur & Vassilvitskii’07):
  • The first centroid is selected at random
  • Each subsequent centroid is selected from the remaining points with probability proportional to its squared distance from the nearest already-chosen centroid (a weighted probability score favoring points far from the existing centroids)
  • The selection continues until K centroids are obtained
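• A minimal sketch of the K-Means++ seeding step described above (illustrative code, not from the slides):
```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    """K-Means++ seeding: returns k initial centroids chosen from the rows of X."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]   # first centroid chosen uniformly at random
    for _ in range(1, k):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()               # weighted probability score
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```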
Example: Poor Initialization May Lead to Poor Clustering
• [Figure: another random selection of K centroids for the same data points, followed by the usual point-assignment and center-recomputation steps]
• Rerun of K-Means using another random set of K seeds
• This run of K-Means generates a poor-quality clustering
Handling Outliers: From K-Means to
K-Medoids
• The K-Means algorithm is sensitive to outliers
• An object with an extremely large value may substantially distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference
point, medoids can be used, which is the most centrally located object in a cluster
• The K-Medoids clustering algorithm:
• Select K points as the initial representative objects (i.e., as initial K medoids)
• Repeat
• Assign each point to the cluster with the closest medoid
• Randomly select a non-representative object o_random
• Compute the total cost S of swapping the medoid m with o_random
• If S < 0, then swap m with o_random to form the new set of medoids
• Until convergence criterion is satisfied
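• A simplified, swap-based sketch of the loop above (illustrative code in the spirit of PAM, not the textbook’s implementation):
```python
import numpy as np

def kmedoids(X, k, n_swaps=200, seed=0):
    """Simplified swap-based K-Medoids: X is (n, d); returns medoid indices and labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)                # initial K medoids

    def total_cost(meds):
        # Each point contributes its distance to the closest medoid
        return dist[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(n_swaps):
        m = rng.integers(k)        # position of the medoid to (possibly) replace
        o = rng.integers(n)        # randomly selected non-representative object
        if o in medoids:
            continue
        candidate = medoids.copy()
        candidate[m] = o
        new_cost = total_cost(candidate)
        if new_cost < cost:        # S = new_cost - cost < 0, so accept the swap
            medoids, cost = candidate, new_cost
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels
```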
PAM: A Typical K-Medoids Algorithm
• [Figure: execution of PAM on a sample 2-D dataset with K = 2]
• Select initial K medoids randomly (arbitrarily choose K objects as initial medoids)
• Repeat
  • Object re-assignment: assign each remaining object to the cluster with the nearest medoid
  • Randomly select a non-medoid object O_random
  • Compute the total cost of swapping a medoid m with O_random
  • If the clustering quality is improved, swap m with O_random
• Until convergence criterion is satisfied
Discussion on K-Medoids Clustering
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids, and
• Iteratively replaces one of the medoids by one of the non-medoids if it improves the
total sum of the squared errors (SSE) of the resulting clustering
• PAM works effectively for small data sets but does not scale well for large data sets
(due to the computational complexity)
• Computational complexity: PAM: O(K(n − K)2) (quite expensive!)
• Efficiency improvements on PAM
• CLARA (Kaufmann & Rousseeuw, 1990):
• PAM on samples; O(Ks2 + K(n − K)), s is the sample size
• CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring efficiency + quality
K-Medians: Handling Outliers by
Computing Medians
• Medians are less sensitive to outliers than means
• Think of the median salary vs. mean salary of a large firm when adding a few top executives!
• K-Medians: Instead of taking the mean value of the object in a cluster as a
reference point, medians are used (L1-norm as the distance measure)
• The criterion function for the K-Medians algorithm: S = Σ_{i=1..K} Σ_{x ∈ C_i} ‖x − med_i‖₁, i.e., the sum of Manhattan (L1) distances from each point to the median med_i of its cluster
• The K-Medians clustering algorithm:
• Select K points as the initial representative objects (i.e., as initial K medians)
• Repeat
• Assign every point to its nearest median
• Re-compute the median using the median of each individual feature
• Until convergence criterion is satisfied
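• A sketch of how one K-Medians iteration differs from K-Means (illustrative; assumes a NumPy data matrix X and the current medians):
```python
import numpy as np

def kmedians_step(X, medians):
    """One K-Medians iteration: L1 assignment followed by a component-wise median update."""
    # Assign every point to its nearest median under Manhattan (L1) distance
    l1 = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
    labels = l1.argmin(axis=1)
    # Re-compute each representative as the median of each individual feature
    new_medians = np.array([np.median(X[labels == j], axis=0) if np.any(labels == j)
                            else medians[j] for j in range(len(medians))])
    return labels, new_medians
```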
K-Modes: Clustering Categorical
Data
• K-Means cannot handle non-numerical (categorical) data
• Mapping categorical value to 1/0 cannot generate quality clusters for high-dimensional data
• K-Modes: An extension to K-Means by replacing means of clusters with modes
• Dissimilarity measure between object X and the center Z of a cluster: d(X, Z) = Σ_j φ(x_j, z_j), where the per-attribute mismatch φ(x_j, z_j) is positive when x_j ≠ z_j and 0 otherwise
• This dissimilarity measure (distance function) is frequency-based: the mismatch weight for attribute j can reflect how frequently each categorical value of attribute j appears in the cluster
• Algorithm is still based on iterative object cluster assignment and centroid update
• A fuzzy K-Modes method is proposed to calculate a fuzzy cluster membership value
for each object to each cluster
• A mixture of categorical and numerical data: Using a K-Prototype method
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
• Basic concepts of hierarchical clustering
• Agglomerative hierarchical clustering
• Divisive hierarchical clustering
• BIRCH: scalable hierarchical clustering using clustering feature trees
• Probabilistic hierarchical clustering
• Density-based and grid-based methods
• Evaluation of clustering
Hierarchical Clustering: Basic
Concepts
• Hierarchical clustering
• Generate a clustering hierarchy (drawn as a dendrogram)
• Not required to specify K, the number of clusters
• More deterministic
• No iterative refinement
• Two categories of algorithms
• Agglomerative: Start with singleton clusters, continuously merge two clusters
at a time to build a bottom-up hierarchy of clusters
• Divisive: Start with a huge macro-cluster, split it continuously into two groups,
generating a top-down hierarchy of clusters
Agglomerative vs. Divisive Clustering
• [Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds bottom-up over steps 0–4: {a}, {b}, {c}, {d}, {e} → {a, b}, {d, e} → {a, b}, {c, d, e} → {a, b, c, d, e}. Divisive clustering (DIANA) produces the same hierarchy top-down, from step 4 back to step 0]
Dendrogram: How Clusters are
Merged
• Dendrogram: Decompose a set of data objects into a tree of clusters
by multi-level nested partitioning
• A clustering of the data objects is obtained by cutting the dendrogram
at the desired level, then each connected component forms a cluster

• [Figure] Hierarchical clustering generates a dendrogram (a hierarchy of clusters)
Agglomerative Clustering Algorithm
• AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990)
• Use the single-link method and the dissimilarity matrix
• Continuously merge nodes that have the least dissimilarity
• Eventually all nodes belong to the same cluster
• Agglomerative clustering varies on different similarity measures among clusters
• Single link (nearest neighbor)
• Complete link (diameter)
• Average link (group average)
• Centroid link (centroid similarity)
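• These linkage choices map directly onto SciPy’s agglomerative clustering utilities; a brief sketch (assuming SciPy is available; the data is synthetic):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))

# Build the merge tree; method can be 'single', 'complete', 'average', or 'centroid'
Z = linkage(X, method="single")

# Cut the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")

# dendrogram(Z) would draw the hierarchy of merges with matplotlib
```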
Agglomerative Clustering Algorithm
• [Figure: AGNES applied to a sample 2-D dataset, merging the closest clusters step by step until all points belong to a single cluster]
Single Link vs. Complete Link in Hierarchical Clustering
• Single link (nearest neighbor)


• The similarity between two clusters is the similarity between their most similar (nearest
neighbor) members
• Local similarity-based: Emphasizing more on close regions, ignoring the overall
structure of the cluster
• Capable of clustering non-elliptical shaped group of objects
• Sensitive to noise and outliers
• Complete link (diameter)
• The similarity between two clusters is the similarity between their most dissimilar
members
• Merge two clusters to form one with the smallest diameter
• Nonlocal in behavior, obtaining compact shaped clusters


• Sensitive to outliers
Agglomerative Clustering: Average vs. Centroid Links
• Agglomerative clustering with average link

• Average link: The average distance between an element in one cluster and
an element in the other (i.e., all pairs in two clusters)
• Expensive to compute

• Agglomerative clustering with centroid link


• Centroid link: The distance between the centroids of two clusters
• Group Averaged Agglomerative Clustering (GAAC)
  • Let two clusters C_a and C_b be merged into C_a ∪ C_b
  • The new centroid is c = (N_a·c_a + N_b·c_b) / (N_a + N_b), where N_a and c_a are the cardinality and centroid of cluster C_a, respectively (similarly for C_b)
  • The similarity measure for GAAC is the average of the pairwise distances between the points of the merged cluster
Agglomerative Clustering with
Ward’s Criterion
• Suppose two disjoint clusters C_a and C_b are merged, and c_{a∪b} is the mean of the new cluster
• Ward’s criterion measures the increase in SSE caused by the merge: W(C_{a∪b}) − W(C_a) − W(C_b) = (N_a·N_b / (N_a + N_b)) · ‖c_a − c_b‖²
Divisive Clustering
• DIANA (Divisive Analysis) (Kaufmann and Rousseeuw,1990)
• Implemented in some statistical analysis packages, e.g., Splus
• Inverse order of AGNES: Eventually each node forms a cluster on its
own
• [Figure: DIANA applied to the same sample 2-D dataset, splitting the all-inclusive cluster step by step until each object forms its own cluster]
Divisive Clustering Is a Top-down
Approach
• The process starts at the root with all the points as one cluster
• It recursively splits the higher level clusters to build the dendrogram
• Can be considered as a global approach
• More efficient when compared with agglomerative clustering

• [Figure: top-down splitting of a sample 2-D dataset into progressively finer clusters]
More on Algorithm Design for
Divisive Clustering
• Choosing which cluster to split
• Check the sums of squared errors of the clusters and choose the one with the
largest value
• Splitting criterion: Determining how to split
• One may use Ward’s criterion to seek the greatest reduction in the SSE criterion as a result of a split
• For categorical data, the Gini index can be used
• Handling the noise
• Use a threshold to determine the termination criterion (do not generate clusters that are too small because they contain mainly noise)
Extensions to Hierarchical Clustering
• Weakness of the agglomerative & divisive hierarchical clustering
methods
• No revisit: cannot undo any merge/split decisions made before
• Scalability bottleneck: Each merge/split needs to examine many possible options
• Time complexity: at least O(n2), where n is the number of total objects
• Several other hierarchical clustering algorithms
• BIRCH (1996): Use CF-tree and incrementally adjust the quality of sub-clusters
• CURE (1998): Represent a cluster using a set of well-scattered representative
points
• CHAMELEON (1999): Use graph partitioning methods on the K-nearest neighbor
graph of the data
BIRCH: A Multi-Phase Hierarchical
Clustering Method
• BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
• Developed by Zhang, Ramakrishnan & Livny (SIGMOD’96)
• Impacted many new clustering methods and applications (received the 2006 SIGMOD Test of Time Award)
• Major innovation
• Integrating hierarchical clustering (initial micro-clustering phase) and other
clustering methods (at the later macro-clustering phase)
• Multi-phase hierarchical clustering
• Phase1 (initial micro-clustering): Scan DB to build an initial CF tree, a multi-level
compression of the data to preserve the inherent clustering structure of the data
• Phase 2 (later macro-clustering): Use an arbitrary clustering algorithm (e.g., iterative
partitioning) to cluster flexibly the leaf nodes of the CF-tree
Clustering Feature Vector
• Consider a cluster of multi-dimensional data objects/points
• The clustering feature (CF) of the cluster is a 3-D vector summarizing
info about clusters of objects
• Register the 0-th, 1st, and 2nd moments of a cluster
• Clustering Feature (CF): CF = <n, LS, SS>
  • n: number of data points
  • LS: linear sum of the n points: LS = Σ_{i=1..n} x_i
  • SS: square sum of the n points: SS = Σ_{i=1..n} ‖x_i‖²
• Example: a cluster of five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8)
  • n = 5; LS = ((3+2+4+4+3), (4+6+5+7+8)) = (16, 30)
  • SS = (3²+2²+4²+4²+3²) + (4²+6²+5²+7²+8²) = 54 + 190 = 244
  • CF1 = <5, (16,30), 244>
Clustering Feature: a Summary of
the Statistics for the Given Cluster
• Registers crucial measurements for computing cluster and utilizes
storage efficiently
• Clustering features are additive: Merging clusters C1 and C2—linear
summation of CFs

  • CF(C1 ∪ C2) = CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>
• Example (from the previous slide): CF1 = <5, (16,30), 244> for the five points (3,4), (2,6), (4,5), (4,7), (3,8), with n = 5, LS = (16, 30), and SS = 54 + 190 = 244
Essential Measures of Cluster:
Centroid, Radius and Diameter
• Centroid: the “middle” of a cluster
  • x̄ = LS / n, where n is the number of points in the cluster and x_i is the i-th point in the cluster
• Radius R: average distance from member objects to the centroid
  • The square root of the average squared distance from any point of the cluster to its centroid
  • R = √( Σ_{i=1..n} ‖x_i − x̄‖² / n ) = √( n·SS − ‖LS‖² ) / n
• Diameter D: average pairwise distance within a cluster
  • The square root of the average mean squared distance between all pairs of points in the cluster
  • D = √( Σ_{i=1..n} Σ_{j=1..n} ‖x_i − x_j‖² / (n(n − 1)) ) = √( (2n·SS − 2‖LS‖²) / (n(n − 1)) )
Example
• Continuing the previous example with CF1 = <5, (16,30), 244>:
  • Centroid: x̄ = LS/5 = (3.2, 6.0)
  • Radius: R = √(5·244 − (16² + 30²)) / 5 = √(1220 − 1156) / 5 = 8/5 = 1.6
  • Diameter: D = √((2·5·244 − 2·1156) / (5·4)) = √(128/20) = √6.4 ≈ 2.53
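• A small sketch that reproduces these numbers from a CF triple (illustrative helper functions, not BIRCH’s actual implementation):
```python
import numpy as np

def cf(points):
    """Clustering feature <n, LS, SS> of a set of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), np.sum(pts ** 2)

def merge_cf(cf1, cf2):
    """CFs are additive: CF1 + CF2 = <n1+n2, LS1+LS2, SS1+SS2>."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_radius_diameter(n, LS, SS):
    centroid = LS / n
    radius = np.sqrt(n * SS - np.dot(LS, LS)) / n
    diameter = np.sqrt((2 * n * SS - 2 * np.dot(LS, LS)) / (n * (n - 1)))
    return centroid, radius, diameter

n, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, LS, SS)                             # 5 [16. 30.] 244.0
print(centroid_radius_diameter(n, LS, SS))   # centroid (3.2, 6.0), R = 1.6, D ≈ 2.53
```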
CF Tree: A Height-Balanced Tree
Storing Clustering Features for
Hierarchical Clustering
• Incremental insertion of new points (similar to B+-tree)
• For each point in the input
• Find its closest leaf entry
• Add point to leaf entry and update CF
• If entry diameter > max_diameter
• Split leaf, and possibly parents
• A CF tree has two parameters
• Branching factor: Maximum number of children
• Maximum diameter of sub-clusters stored at the leaf nodes
• A CF tree: A height-balanced tree that stores the clustering features (CFs)
• The non-leaf nodes store sums of the CFs of their children
CF Tree: A Height-Balanced Tree Storing Clustering Features for Hierarchical Clustering (cont.)
• [Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1, CF2, CF3, …, CF6, each pointing to a child; non-leaf nodes hold entries CF11, CF12, …, CF15 with child pointers; leaf nodes hold CF entries and are chained with prev/next pointers]
BIRCH: A Scalable and Flexible
Clustering Method
• An integration of agglomerative clustering with other (flexible)
clustering methods
• Low-level micro-clustering
• Exploring CF-feature and BIRCH tree structure
• Preserving the inherent clustering structure of the data
• Higher-level macro-clustering
• Provide sufficient flexibility for integration with other clustering methods
BIRCH: Pros and Cons
• Strength: Good quality of clustering; linear scalability in large/stream
databases; effective for incremental and dynamic clustering of
incoming objects
• Weaknesses
• Due to the fixed size of leaf nodes, clusters so formed may not be very natural
• Clusters tend to be spherical given the radius and diameter measures

• Example: [Figure: elongated, intertwined clusters; images like this may give BIRCH a hard time]
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• DBSCAN: density-based clustering based on connected regions with high
density
• DENCLUE: clustering based on density distribution functions
• Grid-based methods
• Evaluation of clustering
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-
connected points
• Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan (only examine the local region to justify density)
• Need density parameters as termination condition
• Several interesting studies:
• DBSCAN: Ester, et al. (KDD’96)
• OPTICS: Ankerst, et al (SIGMOD’99)
• DENCLUE: Hinneburg & D. Keim (KDD’98)
• CLIQUE: Agrawal, et al. (SIGMOD’98) (also, grid-based)
DBSCAN: A Density-Based Spatial Clustering Algorithm
• DBSCAN (M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, KDD’96)
  • Discovers clusters of arbitrary shape: Density-Based Spatial Clustering of Applications with Noise
• A density-based notion of cluster
  • A cluster is defined as a maximal set of density-connected points
• Two parameters
  • Eps (ε): maximum radius of the neighborhood
  • MinPts: minimum number of points in the Eps-neighborhood of a point
• The Eps(ε)-neighborhood of a point q: NEps(q) = {p belongs to D | dist(p, q) ≤ Eps}
• [Figure, with Eps = 1 cm and MinPts = 5: a core point has a dense neighborhood; a border point is in a cluster but its neighborhood is not dense; an outlier/noise point is not in any cluster]
DBSCAN: Density-Reachable and Density-Connected
• Directly density-reachable
  • A point p is directly density-reachable from a point q w.r.t. Eps (ε), MinPts if
    • p belongs to NEps(q), and
    • q satisfies the core point condition: |NEps(q)| ≥ MinPts
• Density-reachable
  • A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, p2, …, pn with p1 = q and pn = p such that each p(i+1) is directly density-reachable from p(i)
• Density-connected
  • A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
• [Figure: illustrations of the three notions with MinPts = 5 and Eps = 1 cm]
DBSCAN: The Algorithm
• Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are directly density-reachable from p, and DBSCAN visits the next point of the database
  • Continue the process until all of the points have been processed
• Computational complexity
  • If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects
  • Otherwise, the complexity is O(n²)
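• In practice the algorithm above is available off the shelf; a brief scikit-learn sketch (the eps and min_samples values are arbitrary and must be tuned per dataset):
```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))

# eps plays the role of Eps (ε) and min_samples the role of MinPts
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_   # cluster id per point; -1 marks outliers/noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```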
DBSCAN Is Sensitive to the Setting
of Parameters

Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8), 1999
Outline
• Cluster analysis
• Partitioning methods
• Hierarchical methods
• Density-based and grid-based methods
• Evaluation of clustering
• Assessing clustering tendency
• Determining the number of clusters
• Measuring clustering quality: extrinsic methods
• Intrinsic methods
Evaluation of Clustering: Basic
Concepts
• Evaluation of clustering
• Assess the feasibility of clustering analysis on a data set
• Evaluate the quality of the results generated by a clustering method
• Major issues on clustering assessment and validation
• Clustering tendency: assessing the suitability of clustering: whether the data
has any inherent grouping structure
• Determining the Number of Clusters: determining for a dataset the right
number of clusters that may lead to a good quality clustering
• Clustering quality evaluation: evaluating the quality of the clustering results
Clustering Tendency: Whether the
Data Contains Inherent Grouping
Structure
• Assess the suitability of clustering
• Whether the data has any “inherent grouping structure” — non-random structure that may
lead to meaningful clusters
• Determine clustering tendency or clusterability
• A hard task because there are so many different definitions of clusters
• Different definitions: Partitioning, hierarchical, density-based and graph-based
• Even fixing a type, still hard to define an appropriate null model for a data set
• There are some clusterability assessment methods, such as
• Spatial histogram: Contrast the histogram of the data with that generated from random
samples
• Distance distribution: Compare the pairwise point distance from the data with those from
the randomly generated samples
• Hopkins Statistic: A sparse sampling test for spatial randomness
Testing Clustering Tendency: A
Spatial Histogram Approach
• Spatial Histogram Approach: Contrast the d-dimensional histogram of
the input dataset D with the histogram generated from random
samples
• Dataset D is clusterable if the distributions of two histograms are rather
different

• [Figure: (a) input dataset; (b) data generated from random samples]
Testing Clustering Tendency: A
Spatial Histogram Approach
• Method outline
• Divide each dimension into equi-width bins, count how many points lie in
each cell, and obtain the empirical joint probability mass function (EPMF)
• Do the same for the randomly sampled data
• Compute how much they differ using the Kullback-Leibler (KL)
divergence value
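• A compact sketch of this test (illustrative; the bin count and the uniform-random null model are assumptions made for the example):
```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes KL(p || q)

def epmf(X, bins=8):
    """Empirical joint probability mass function over an equi-width d-dimensional grid."""
    hist, _ = np.histogramdd(X, bins=bins)
    p = hist.ravel() + 1e-12       # small smoothing to avoid zero bins
    return p / p.sum()

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(3, 0.3, (200, 2))])
random_sample = rng.uniform(data.min(0), data.max(0), size=data.shape)

kl = entropy(epmf(data), epmf(random_sample))
# A large KL divergence means the data's histogram differs a lot from the random one,
# which suggests the dataset is clusterable
```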
Determining the Number of Clusters
• The appropriate number of clusters controls the proper granularity of
cluster analysis
• Finding a good balance between compressibility and accuracy in cluster
analysis
• Two undesirable extremes
• The whole data set is one cluster: No value of clustering
• Treating each point as a cluster: No data summarization
Determining the Number of Clusters
• The right number of clusters often depends on the distribution's
shape and scale in the data set, as well as the clustering resolution
required by the user
• Methods for determining the number of clusters
• An empirical method
• # of clusters: k ≈ √(n/2) for a dataset of n points (e.g., n = 200 gives k = 10)
• Each cluster is expected to have about √(2n) points
Finding the Number of Clusters: the
Elbow Method
• Use the turning point in the curve of the sum of within cluster
variance with respect to the # of clusters
• Increasing the # of clusters can help reduce the sum of within-cluster variance
of each cluster
• But splitting a cohesive cluster gives only a small reduction
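• A sketch of the elbow heuristic using scikit-learn’s K-Means (the synthetic dataset and the range of k are illustrative):
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (100, 2)) for c in [(0, 0), (4, 0), (2, 3)]])

sse = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # sum of within-cluster squared distances for this k

# Plotting k against sse and looking for the "elbow" (the turning point where the
# decrease slows down) suggests the appropriate number of clusters (here, 3).
```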
Finding K, the Number of Clusters: A
Cross Validation Method
• Divide a given data set into m parts, and use m – 1 parts to obtain a
clustering model
• Use the remaining part to test the quality of the clustering
• For example, for each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and their
closest centroids to measure how well the model fits the test set
• For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
Measuring Clustering Quality
• Clustering Evaluation: Evaluating how good the clustering results are
• No commonly recognized best suitable measure in practice
• Extrinsic vs. intrinsic methods: depending on whether ground truth is
used
• Ground truth: the ideal clustering built by using human experts
• Extrinsic: Supervised, employ criteria not inherent to the dataset
• Compare a clustering against prior or expert-specified knowledge (i.e., the
ground truth) using certain clustering quality measure
• Intrinsic: Unsupervised, criteria derived from data itself
• Evaluate the goodness of a clustering by considering how well the clusters are
separated and how compact the clusters are (e.g., silhouette coefficient)
General Criteria for Measuring Clustering
Quality with Extrinsic Methods
• Given the ground truth Cg, Q(C, Cg) is the quality measure for
a clustering C
• Q(C, Cg) is good if it satisfies the following four essential
criteria
• [Figure: a ground-truth partitioning into G1 and G2 compared with clusters C1 and C2]
• Cluster homogeneity: the purer, the better
• Cluster completeness: assign objects belonging to the same
category in the ground truth to the same cluster
• Rag bag better than alien: putting a heterogeneous object into a
pure cluster should be penalized more than putting it into a rag bag
(i.e., “miscellaneous” or “other” category)
• Small cluster preservation: splitting a small category into pieces is
more harmful than splitting a large category into pieces
Commonly Used Extrinsic Methods
• [Figure: ground-truth partitioning G1, G2 vs. clusters C1, C2]
• Matching-based methods
• Examine how well the clustering results match the ground truth in partitioning
the objects in the data set
• Information theory-based methods
• Compare the distribution of the clustering results and that of the ground truth
• Information theory (e.g., entropy) used to quantify the comparison
• Ex. Conditional entropy, normalized mutual information (NMI)
• Pairwise comparison-based methods
• Treat each group in the ground truth as a class, and then check the pairwise
consistency of the objects in the clustering results
• Ex. Four possibilities: TP, FN, FP, TN; Jaccard coefficient
Matching-Based Methods
• [Figure: clusters C1, C2, C3 compared with ground-truth groups G1, G2, G3]
• The matching-based methods compare clusters in the clustering results and the groups in the ground truth
• Suppose a clustering method partitions D = {o1, …, on} into m clusters C = {C1, …, Cm}, and the ground truth G partitions D into l groups G = {G1, …, Gl}
• Purity: the extent to which a cluster contains points from only one (ground-truth) partition
  • Purity for cluster C_i: purity_i = (1/|C_i|) · max_j |C_i ∩ G_j|, where the matching group G_j maximizes the overlap |C_i ∩ G_j|
  • Total purity of clustering C: purity = Σ_i (|C_i|/n) · purity_i = (1/n) Σ_i max_j |C_i ∩ G_j|
Matching-Based Methods: Example
• [Figure: two clusterings of the same 11 objects against ground-truth groups G1, G2, G3]
• Consider 11 objects
  • Purity for clustering C1 = 1/11 × (4 + 2 + 4 + 1) = 11/11 = 1
  • Purity for clustering C2 = 1/11 × (2 + 3 + 1) = 6/11
• Other methods: maximum matching; F-measure
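• A short sketch of the purity computation above (illustrative; the label vectors are made-up integer arrays of equal length):
```python
import numpy as np

def purity(cluster_labels, truth_labels):
    """Total purity: (1/n) * sum over clusters of the largest overlap with any ground-truth group."""
    cluster_labels = np.asarray(cluster_labels)
    truth_labels = np.asarray(truth_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = truth_labels[cluster_labels == c]
        # max_j |C_i ∩ G_j|: size of the dominant ground-truth group inside cluster c
        total += np.bincount(members).max()
    return total / len(cluster_labels)

# Example: 11 objects, a ground truth with 3 groups, and one candidate clustering
truth    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
clusters = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
print(purity(clusters, truth))   # (3 + 4 + 2) / 11 ≈ 0.818
```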
Information Theory-Based Methods (I): Conditional Entropy
• [Figure: clusters C1, C2, C3 compared with ground-truth groups G1, G2, G3]
• A clustering can be regarded as a compressed representation of a given set of objects
• The better the clustering results approach the ground truth, the less information is needed
• This idea leads to the use of conditional entropy
Information Theory-Based Methods (I): Conditional Entropy (cont.)
• Entropy of clustering C: H(C) = −Σ_i (|C_i|/n) log(|C_i|/n)
• Entropy of ground truth G: H(G) = −Σ_j (|G_j|/n) log(|G_j|/n)
• Conditional entropy of G given cluster C_i: H(G|C_i) = −Σ_j (|C_i ∩ G_j|/|C_i|) log(|C_i ∩ G_j|/|C_i|)
• Conditional entropy of G given clustering C: H(G|C) = Σ_i (|C_i|/n) H(G|C_i) = −Σ_i Σ_j (|C_i ∩ G_j|/n) log(|C_i ∩ G_j|/|C_i|)
Example
• [Figure: the same two clusterings of the 11 objects, against ground-truth groups G1, G2, G3]
• Consider 11 objects
  • Purity for clustering C1 = 1/11 × (4 + 2 + 4 + 1) = 11/11 = 1
  • Purity for clustering C2 = 1/11 × (2 + 3 + 1) = 6/11
• Note: conditional entropy cannot detect the issue that C1 splits the objects in G into two clusters
Information Theory-Based Methods (II)
Normalized Mutual Information (NMI)
• Mutual information: I(C, G) = Σ_i Σ_j p(C_i, G_j) log [ p(C_i, G_j) / (p(C_i) p(G_j)) ]
  • Quantifies the amount of shared information between the clustering C and the ground-truth partitioning G
  • Measures the dependency between the observed joint probability of C and G and the expected joint probability under the independence assumption
  • When C and G are independent, p(C_i, G_j) = p(C_i) p(G_j), so I(C, G) = 0
  • However, there is no upper bound on the mutual information
• Normalized mutual information: NMI(C, G) = I(C, G) / √(H(C) · H(G))
  • Value range of NMI: [0, 1]
  • A value close to 1 indicates a good clustering
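• scikit-learn provides NMI directly; a brief usage sketch (the label vectors below are made-up examples):
```python
from sklearn.metrics import normalized_mutual_info_score

truth    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
clusters = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]

# "geometric" averaging matches the sqrt(H(C) * H(G)) normalization above
nmi = normalized_mutual_info_score(truth, clusters, average_method="geometric")
print(round(nmi, 3))   # in [0, 1]; values near 1 indicate strong agreement
```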
Pairwise Comparison-Based
Methods: Jaccard Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same cluster/group, the
assignment is regarded as positive; otherwise, negative
• Depending on the assignments, we have four possible cases: TP, FN, FP, TN
  • Note: the total # of pairs of points is N = n(n − 1)/2
• Jaccard coefficient: Ignoring the true negatives (thus asymmetric)
• Jaccard = TP/(TP + FN + FP) [i.e., denominator ignores TN]
• Jaccard = 1 if perfect clustering
• Many other measures are based on the pairwise comparison statistics:
• Rand statistic
• Fowlkes-Mallows measure
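• A brute-force sketch of the pairwise counts and the Jaccard coefficient (illustrative only; fine for small n):
```python
from itertools import combinations

def pairwise_jaccard(cluster_labels, truth_labels):
    """Jaccard = TP / (TP + FN + FP), computed over all pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_group = truth_labels[i] == truth_labels[j]
        if same_cluster and same_group:
            tp += 1
        elif same_cluster and not same_group:
            fp += 1
        elif not same_cluster and same_group:
            fn += 1
        else:
            tn += 1   # true negatives are ignored by the Jaccard coefficient
    return tp / (tp + fn + fp)

truth    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
clusters = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
print(pairwise_jaccard(clusters, truth))
```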
Intrinsic Methods (I): Dunn Index
• Intrinsic methods (i.e., no ground truth) examine how compact clusters are
and how well clusters are separated, based on similarity/distance measure
between objects
• Dunn Index:
  • The compactness of clusters: the maximum distance between two points that belong to the same cluster, Δ = max{dist(o_i, o_j) : o_i, o_j in the same cluster}
  • The degree of separation among different clusters: the minimum distance between two points that belong to different clusters, δ = min{dist(o_i, o_j) : o_i, o_j in different clusters}
  • The Dunn index is simply the ratio DI = δ / Δ; the larger the ratio, the farther apart the clusters are relative to the compactness of the clusters
• Dunn index uses the extreme distances to measure the cluster compactness
and inter-cluster separation and it can be affected by outliers
Intrinsic Methods (II): Silhouette
Coefficient
• Suppose D is partitioned into k clusters: C1, …, Ck
• For each object o in D, we calculate
  • a(o): the average distance between o and all other objects in the cluster to which o belongs; reflects the compactness of the cluster to which o belongs
  • b(o): the minimum average distance from o to each cluster to which o does not belong; captures the degree to which o is separated from other clusters
• Silhouette coefficient: s(o) = (b(o) − a(o)) / max{a(o), b(o)}, value range (−1, 1)
  • When the value of s(o) approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case
  • When the value is negative (i.e., b(o) < a(o)), o is closer to the objects in another cluster than to the objects in the same cluster as o: a bad situation to be avoided
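• The silhouette coefficient is available in scikit-learn; a brief sketch on synthetic data (dataset and parameters are illustrative):
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (100, 2)) for c in [(0, 0), (4, 0)]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s_per_object = silhouette_samples(X, labels)   # s(o) for every object o
s_mean = silhouette_score(X, labels)           # average silhouette over the dataset
# Values near 1 indicate compact, well-separated clusters; negative values flag
# objects that sit closer to another cluster than to their own
```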
