Cluster Analysis
— Chapter 10 —
1
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
2
What is Cluster Analysis?
■ Cluster: A collection of data objects
■ similar (or related) to one another within the same group
■ dissimilar (or unrelated) to the objects in other groups
3
Clustering for Data Understanding and
Applications
■ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
■ Information retrieval: document clustering
■ Land use: Identification of areas of similar land use in an earth
observation database
■ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
■ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
■ Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
■ Climate: understanding Earth's climate, finding patterns in atmospheric
and ocean data
■ Economic Science: market research
4
Clustering as a Preprocessing Tool (Utility)
■ Summarization:
■ Preprocessing for regression, PCA, classification, and
association analysis
■ Compression:
■ Image processing: vector quantization
■ Finding K-nearest Neighbors
■ Localizing search to one or a small number of clusters
■ Outlier detection
■ Outliers are often viewed as those “far away” from any
cluster
5
Quality: What Is Good Clustering?
6
Measure the Quality of Clustering
■ Dissimilarity/Similarity metric
■ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
■ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables (see the sketch at the end of this slide)
■ Weights should be associated with different variables
based on applications and data semantics
■ Quality of clustering:
■ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
■ It is hard to define “similar enough” or “good enough”
■ The answer is typically highly subjective
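Following up on the distance-function bullet above, here is a minimal sketch of a weighted Minkowski distance over numeric variables; the function name, the example weights, and the parameter h are illustrative, not from the slides:
```python
import numpy as np

def weighted_minkowski(x, y, w, h=2):
    """d(i, j) = (sum_f w_f * |x_f - y_f|^h)^(1/h); h=2 gives weighted Euclidean, h=1 Manhattan."""
    x, y, w = map(np.asarray, (x, y, w))
    return float(np.sum(w * np.abs(x - y) ** h) ** (1.0 / h))

# Example: weight the second attribute more heavily than the first
print(weighted_minkowski([1.0, 2.0], [4.0, 6.0], w=[0.5, 2.0]))  # weighted Euclidean distance
```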
7
Considerations for Cluster Analysis
■ Partitioning criteria
■ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
■ Separation of clusters
■ Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
■ Similarity measure
■ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
■ Clustering space
■ Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
8
Requirements and Challenges
■ Scalability
■ Clustering all the data instead of only samples
■ Ability to deal with different types of attributes
■ Numerical, binary, categorical, ordinal, linked, and mixture of
these
■ Constraint-based clustering
■ User may give inputs on constraints
■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
■ Discovery of clusters with arbitrary shape
■ Ability to deal with noisy data
■ Incremental clustering and insensitivity to input order
■ High dimensionality
9
Major Clustering Approaches (I)
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects)
using some criterion
■ Typical methods: Diana, Agnes, BIRCH, CHAMELEON
■ Density-based approach:
■ Based on connectivity and density functions
■ Typical methods: DBSCAN, OPTICS, DenClue
■ Grid-based approach:
■ based on a multiple-level granularity structure
■ Typical methods: STING, WaveCluster, CLIQUE
10
Major Clustering Approaches (II)
■ Model-based:
■ A model is hypothesized for each of the clusters; the goal is to find
the best fit of the data to the given model
■ Typical methods: EM, SOM, COBWEB
■ Frequent pattern-based:
■ Based on the analysis of frequent patterns
■ Typical methods: p-Cluster
■ User-guided or constraint-based:
■ Clustering by considering user-specified or application-specific
constraints
■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
■ Objects are often linked together in various ways
■ Massive links can be used to cluster objects: SimRank, LinkClus
11
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
12
Partitioning Algorithms: Basic Concept
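The body of this slide is not preserved in the extracted text. For reference, the usual partitioning criterion, the sum of squared errors that k-means-style methods minimize (with c_i the center of cluster C_i), can be written as:
```latex
E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2
```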
14
The K-Means Clustering Method
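The step-by-step description on this slide is not preserved in the extracted text; as a stand-in, here is a minimal NumPy sketch of the standard k-means loop (function and parameter names are illustrative, not from the slides):
```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Minimal k-means: pick k seeds, then alternate assignment and mean-update."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
    for _ in range(n_iter):
        # Assignment step: each object goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: the means no longer move
        centers = new_centers
    return labels, centers
```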
15
An Example of K-Means Clustering
K=2
■ https://youtu.be/KzJORp8bgqs
18
Variations of the K-Means Method
■ Most of the variants of k-means differ in: the selection of the initial
k means, the dissimilarity calculation, and the strategy for computing
cluster means
19
What Is the Problem of the K-Means Method?
■ The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the data
■ K-Medoids: instead of taking the mean value of the objects in a cluster as a
reference point, medoids can be used, which are the most centrally located
objects in a cluster
20
PAM: A Typical K-Medoids Algorithm
[Figure: PAM on a 2-D example with k = 2, total cost = 20 — arbitrarily choose k
objects as initial medoids; assign each remaining object to the nearest medoid;
then, in a loop until no change: randomly select a non-medoid object Orandom,
compute the total cost of swapping a medoid with Orandom, and perform the swap
if the quality is improved]
21
The K-Medoid Clustering Method
■ PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
22
The K-Medoid Clustering Method
23
The K-Medoid Clustering Method
24
The K-Medoid Clustering Method
■ Otherwise choose some other object as Orandom and repeat the process
until there is no change.
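Putting the PAM steps from the last few slides together, here is a minimal, naive sketch of the swap-based loop, purely for illustration (function and variable names are not from the slides):
```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, rng=None):
    """Minimal PAM-style loop: try swapping each medoid with a non-medoid
    o_random; keep the swap if it lowers the total cost; stop when no swap helps."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    changed = True
    while changed:
        changed = False
        current = total_cost(X, medoids)
        for m_pos in range(k):
            for o_random in range(len(X)):
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m_pos] = o_random
                if total_cost(X, candidate) < current:
                    medoids, current = candidate, total_cost(X, candidate)
                    changed = True
    # Assign each object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)
```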
25
Example Problem Solving :
The K-Medoid Clustering Method
■ https://youtu.be/ChBxx4aR-bY
26
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
27
Hierarchical Clustering
■ Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4,
merging objects a, b, c, d, e into ab, de, cde, and finally abcde; divisive
clustering (DIANA) runs the same steps in the reverse direction]
28
AGNES (Agglomerative Nesting)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster
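As a rough illustration of AGNES-style single-link merging on a dissimilarity matrix, here is a sketch using SciPy's hierarchical clustering routines (assuming SciPy is available; the toy data is made up):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy data: two loose groups of 2-D points
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [5, 5], [5.1, 4.8], [4.9, 5.2]])

D = pdist(X)                       # condensed dissimilarity matrix
Z = linkage(D, method='single')    # AGNES-style single-link merging
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)                      # e.g., [1 1 1 2 2 2]
# Z also encodes the dendrogram: each row records one merge and its distance
```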
29
Dendrogram: Shows How Clusters are Merged
30
DIANA (Divisive Analysis)
31
Distance between Clusters
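The body of this slide is not preserved in the extracted text. For reference, the standard inter-cluster distance definitions (added here from common usage, not recovered from the slide) are:
```latex
\begin{aligned}
d_{\min}(C_i, C_j)  &= \min_{p \in C_i,\, q \in C_j} d(p, q)   && \text{(single link)} \\
d_{\max}(C_i, C_j)  &= \max_{p \in C_i,\, q \in C_j} d(p, q)   && \text{(complete link)} \\
d_{\mathrm{avg}}(C_i, C_j) &= \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q) && \text{(average)} \\
d_{\mathrm{mean}}(C_i, C_j) &= d(m_i, m_j) && \text{(centroid: } m_i \text{ is the mean of } C_i \text{)}
\end{aligned}
```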
32
Extensions to Hierarchical Clustering
■ Major weakness of agglomerative clustering methods: merge decisions can
never be undone, and the methods do not scale well to large data sets
Density-Based Clustering Methods
■ Major features:
■ Handle noise
■ One scan
■ Need density parameters as termination condition
■ Representative studies: DBSCAN, OPTICS, DENCLUE, and CLIQUE (more
grid-based)
34
Density-Based Clustering: Basic Concepts
■ Two parameters:
■ Eps: Maximum radius of the neighbourhood
■ MinPts: Minimum number of points in an
Eps-neighbourhood of that point
■ NEps(p): {q belongs to D | dist(p, q) ≤ Eps}
■ Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
■ p belongs to NEps(q)
■ core point condition: |NEps(q)| ≥ MinPts
[Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
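A minimal NumPy sketch of the two definitions above, the Eps-neighborhood and the core-point condition (function names and the toy data are illustrative):
```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """N_Eps(p): indices of all points within distance eps of point i (including i)."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(d <= eps)

def is_core_point(X, i, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

# Example: point 0 is a core point if at least min_pts points lie within eps of it
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.2, 0.1], [5, 5]])
print(is_core_point(X, 0, eps=0.3, min_pts=5))  # True: 5 points within 0.3
```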
35
Density-Reachable and Density-Connected
■ Density-reachable:
■ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi
■ Density-connected:
■ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is
a point o such that both p and q are density-reachable from o w.r.t.
Eps and MinPts
36
Grid-Based Clustering Method
37
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
38
Assessing Clustering Tendency
■ Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
■ Test spatial randomness by a statistical test: the Hopkins Statistic
■ Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in the
data space
■ Sample n points, p1, …, pn, uniformly from the data space containing D. For
each pi, find its nearest neighbor in D: xi = min{dist (pi, v)} where v in D
■ Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D and
v ≠ qi
■ Calculate the Hopkins Statistic: H = Σi yi / (Σi xi + Σi yi); values near 0.5
suggest a uniform distribution, values near 0 a skewed, clusterable one
(a small sketch follows below)
Determine the Number of Clusters
■ Cross-validation method: divide the data set into m parts; use m – 1 parts to
build a clustering model and the remaining part as a test set
■ For each point in the test set, find the closest centroid, and use the sum of
squared distances between all points in the test set and the closest centroids
to measure how well the model fits the test set
■ For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find the # of clusters that fits the data the best
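A small NumPy sketch of the Hopkins statistic as described above, sampling the p_i uniformly in the bounding box of D and the q_i from D itself (function and parameter names are illustrative):
```python
import numpy as np

def hopkins_statistic(D, n=50, rng=None):
    """H = sum(y_i) / (sum(x_i) + sum(y_i)).

    x_i: distance from a point sampled uniformly in the bounding box of D
         to its nearest neighbor in D.
    y_i: distance from a point of D to its nearest neighbor in D minus that point.
    H near 0.5 suggests uniform data; H near 0 suggests clusterable data.
    """
    rng = np.random.default_rng(rng)
    D = np.asarray(D, dtype=float)
    N = len(D)
    n = min(n, N - 1)

    # Uniform samples in the bounding box of the data space
    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.min(np.linalg.norm(D - p, axis=1)) for p in P])

    # Samples drawn from D itself, nearest neighbor excluding the point
    idx = rng.choice(N, size=n, replace=False)
    y = []
    for i in idx:
        d = np.linalg.norm(D - D[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        y.append(d.min())
    y = np.array(y)

    return y.sum() / (x.sum() + y.sum())
```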
40
Measuring Clustering Quality
41
Measuring Clustering Quality: Extrinsic Methods
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
43
Summary
■ Cluster analysis groups objects based on their similarity and has
wide applications
■ Measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
■ K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
■ Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
■ DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
■ STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
■ Quality of clustering results can be evaluated in various ways
44
CS512-Spring 2011: An Introduction
■ Coverage
■ Cluster Analysis: Chapter 11
■ Outlier Detection: Chapter 12
■ Mining Sequence Data: BK2: Chapter 8
■ Mining Graph Data: BK2: Chapter 9
■ Social and Information Network Analysis
■ BK2: Chapter 9
■ Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010
■ Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets:
Reasoning About a Highly Connected World”, Cambridge U., 2010
■ Recent research papers
■ Mining Data Streams: BK2: Chapter 8
■ Requirements
■ One research project
■ One class presentation (15 minutes)
■ Two homeworks (no programming assignment)
■ Two midterm exams (no final exam)
45
References (1)
■ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
■ M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
■ M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
■ Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
■ M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
■ M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
■ M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
■ D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
■ V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.
46
References (2)
■ S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
■ S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
■ A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,
1988.
■ G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
■ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
■ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
47
References (3)
■ G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
■ R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
■ L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
■ E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
■ G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
■ A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
■ A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
■ H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
■ W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining, VLDB’97
■ T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
■ X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous
Semantic Links”, VLDB'06
48
Slides unused in class
49
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM on a 2-D example with k = 2, total cost = 20 — arbitrarily choose k
objects as initial medoids; assign each remaining object to the nearest medoid;
then, in a loop until no change: randomly select a non-medoid object Orandom,
compute the total cost of swapping a medoid with Orandom, and perform the swap
if the quality is improved]
50
PAM (Partitioning Around Medoids) (1987)
51
PAM Clustering: Finding the Best Cluster Center
52
What Is the Problem with PAM?
53
CLARA (Clustering Large Applications) (1990)
55
ROCK: Clustering Categorical Data
■ Major ideas
■ Use links to measure similarity/proximity
■ Not distance-based
■ Experiments
■ Congressional voting, mushroom data
56
Similarity Measure in ROCK
■ Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
■ Example: Two groups (clusters) of transactions
■ C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
■ C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
■ The Jaccard coefficient may lead to a wrong clustering result
■ Within C1: from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
■ Across C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
■ Jaccard coefficient-based similarity function: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
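For instance, within C1, sim({a, b, c}, {a, b, d}) = |{a, b}| / |{a, b, c, d}| = 2/4 = 0.5, while across the two clusters sim({a, b, c}, {a, b, f}) = 2/4 = 0.5 as well, so the coefficient alone can rate transactions from different clusters as close as transactions from the same cluster.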
57
Link Measure in ROCK
■ Clusters
■ C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
■ C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
■ Neighbors
■ Two transactions are neighbors if sim(T1,T2) > threshold
■ Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
■ T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e},
{a,b,f}, {a,b,g}
■ T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
■ T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
■ Link Similarity
■ Link similarity between two transactions is the # of common neighbors
■ link(T1, T2) = 4, since they have 4 common neighbors
■ {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
■ link(T1, T3) = 3, since they have 3 common neighbors
■ {a, b, d}, {a, b, e}, {a, b, g}
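A small sketch that reproduces the link counts above, assuming Jaccard similarity with a neighbor threshold of 0.2 (the threshold value is inferred from the listed neighbor sets; it is not stated on the slide):
```python
def jaccard(t1, t2):
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

# Transactions from the two example clusters C1 and C2
transactions = [
    {'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
    {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'},
    {'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'},
]

THETA = 0.2  # two transactions are neighbors if sim(T1, T2) > THETA

def neighbors(t):
    return [frozenset(u) for u in transactions
            if u != t and jaccard(t, u) > THETA]

def link(t1, t2):
    """Link similarity: number of common neighbors of t1 and t2."""
    return len(set(neighbors(t1)) & set(neighbors(t2)))

T1, T2, T3 = {'a','b','c'}, {'c','d','e'}, {'a','b','f'}
print(link(T1, T2))  # 4 common neighbors, as on the slide
print(link(T1, T3))  # 3 common neighbors, as on the slide
```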
58
ROCK Algorithm
■ Method
■ Compute similarity matrix
59
Aggregation-Based Similarity Computation
[Figure: two SimTrees — leaf nodes n10, n11, n12 under node n4 in ST1 and leaf
nodes n13, n14 under node n5 in ST2, with inter-node similarity s(n4, n5) = 0.2]
For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their
path-based similarity simp(nk, nl) = s(nk, n4)·s(n4, n5)·s(n5, nl).
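For illustration, with hypothetical edge weights s(n10, n4) = 0.9 and s(n5, n13) = 0.9 and the slide's s(n4, n5) = 0.2, the formula gives simp(n10, n13) = 0.9 × 0.2 × 0.9 = 0.162.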
60
Computing Similarity with Aggregation
■ Average similarity and total weight: a: (0.9, 3), b: (0.95, 2)
[Figure: nodes 4 and 5, with s(4, 5) = 0.2, aggregating the similarities of
objects a and b over a hierarchy of categories (e.g., TV, DVD, camera) and words]
64
Observation 2: Distribution of Similarity
[Figure: a product hierarchy used to illustrate the distribution of similarity —
Canon A40 and Sony V3 digital cameras under Digital Cameras, within Consumer
Electronics, alongside categories such as TVs and Apparels]
66
Similarity Defined by SimTree
■ Similarity between two sibling nodes n1 and n2
[Figure: a SimTree with sibling nodes n1, n2, n3 (edge weight 0.2) above
leaf nodes n7, n8, n9]
■ Path-based node similarity
■ simp(n7,n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
■ Similarity between two nodes is the average similarity
between objects linked with them in other SimTrees
■ Adjustment ratio for x = (average similarity between x and all other nodes) /
(average similarity between x’s parent and all other nodes)
67
LinkClus: Efficient Clustering via Heterogeneous
Semantic Links
Method
■ Initialize a SimTree for objects of each type
■ Repeat until stable:
■ For each SimTree, update the similarities between its nodes using the
similarities in other SimTrees
■ Adjust the structure of each SimTree by assigning each node to the parent
node it is most similar to
■ For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient
Clustering via Heterogeneous Semantic Links”, VLDB'06
68
Initialization of SimTrees
■ Initializing a SimTree
■ Repeatedly find groups of tightly related nodes, which are then merged
into higher-level nodes
Finding Tight Groups by Freq. Pattern Mining
■ Finding tight groups is reduced to frequent pattern mining
■ The tightness of a group of nodes is the support of a frequent pattern
■ Transactions: 1 {n1}, 2 {n1, n2}, 3 {n2}, 4 {n1, n2}, 5 {n1, n2},
6 {n2, n3, n4}, 7 {n4}, 8 {n3, n4}, 9 {n3, n4}
■ Resulting tight groups: g1 = {n1, n2}, g2 = {n3, n4}
■ Procedure of initializing a tree
■ Start from leaf nodes (level-0)
■ At each level, group nodes found by frequent pattern mining into parent
nodes, assigning each node to the parent it is most similar to, under the
constraint that each parent node can have at most c children
[Figure: leaf nodes n7, n8, n9 grouped under higher-level nodes n4, n5, n6,
with example similarities 0.9 and 0.8]
71
Complexity
■ Updating similarities: time O(M (log N)²), space O(M + N)
■ Adjusting tree structures: time O(N), space O(N)
72
Experiment: Email Dataset
■ F. Nielsen. Email dataset.
www.imm.dtu.dk/~rem/data/Email-1431.zip
■ 370 emails on conferences, 272 on jobs,
and 789 spam emails
■ Accuracy: measured by manually labeled data
■ Accuracy of clustering: % of pairs of objects
in the same cluster that share a common label
■ Results (Approach: Accuracy, time in s): LinkClus: 0.8026, 1579.6;
SimRank: 0.7965, 39160; ReCom: 0.5711, 74.6;
F-SimRank: 0.3688, 479.7; CLARANS: 0.4768, 8.55
■ Approaches compared:
■ SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities
■ SimRank with FingerPrints (F-SimRank): Fogaras & Rácz, WWW 2005
■ pre-computes a large sample of random paths from each object and uses
samples of two objects to estimate SimRank similarity
■ ReCom (Wang et al. SIGIR 2003)
■ Iteratively clustering objects using cluster labels of linked objects
73
WaveCluster: Clustering by Wavelet Analysis (1998)
74
The WaveCluster Algorithm
■ How to apply wavelet transform to find clusters
■ Summarizes the data by imposing a multidimensional grid
structure onto data space
■ These multidimensional spatial data objects are represented in an
n-dimensional feature space
■ Apply wavelet transform on feature space to find the dense
regions in the feature space
■ Apply the wavelet transform multiple times, which results in clusters at
different scales from fine to coarse (see the sketch at the end of this slide)
■ Major features:
■ Complexity O(N)
■ Detect arbitrary shaped clusters at different scales
■ Not sensitive to noise, not sensitive to input order
■ Only applicable to low dimensional data
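As a very rough sketch of the quantize-then-transform idea described above (not the actual WaveCluster algorithm; it assumes the pywt and SciPy packages, and the grid size and threshold are illustrative):
```python
import numpy as np
import pywt
from scipy import ndimage

def wavecluster_sketch(X, bins=64, threshold=1.0):
    """Quantize 2-D points onto a grid, wavelet-transform the cell counts, and
    label connected dense regions of the low-frequency (approximation) band."""
    # 1) Quantize the 2-D data space into a grid of cell counts
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
    # 2) One level of a 2-D Haar transform; cA is the smoothed approximation band
    cA, (cH, cV, cD) = pywt.dwt2(counts, 'haar')
    # 3) Dense cells in the transformed (coarser) space form connected clusters
    dense = cA > threshold
    labels, n_clusters = ndimage.label(dense)
    return labels, n_clusters
```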
75
Quantization
& Transformation
■ Quantize data into m-D grid structure,
then wavelet transform
a) scale 1: high resolution
b) scale 2: medium resolution
c) scale 3: low resolution
76