Cluster Analysis Methods Guide
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering
feature spaces
detect spatial clusters and explain them in
Dissimilarity matrix (one mode):
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
Standardize data
Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
Calculate the standardized measurement (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
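The standardization step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming that m_f is the mean of variable f and that no column is constant (a constant column would give s_f = 0); the function name `standardize_mad` and the sample data are hypothetical.

```python
import numpy as np

def standardize_mad(X):
    """Standardize each column using the mean absolute deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)               # m_f: mean of variable f (assumed)
    s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
    return (X - m) / s               # z_if = (x_if - m_f) / s_f

# Hypothetical data: three objects, two variables on different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Z = standardize_mad(X)
print(Z)
```

After this step both columns are expressed in units of their own mean absolute deviation, so neither dominates a distance computation because of its scale.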
If q = 2, d is Euclidean distance:
\[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
Properties
$d(i,j) \ge 0$
$d(i,i) = 0$
$d(i,j) = d(j,i)$
$d(i,j) \le d(i,k) + d(k,j)$
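The distance and its properties can be checked numerically. The sketch below assumes the general form d(i,j) = (sum of |x_if - x_jf|^q)^(1/q), of which q = 2 is the Euclidean case; the function names and sample points are hypothetical, and the full symmetric matrix is stored for convenience even though only the lower triangle is needed.

```python
import numpy as np

def minkowski(x, y, q=2):
    """Distance of order q between two objects; q = 2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float((np.abs(x - y) ** q).sum() ** (1.0 / q))

def dissimilarity_matrix(X, q=2):
    """n-by-n matrix of pairwise distances with a zero diagonal."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            D[i, j] = D[j, i] = minkowski(X[i], X[j], q)
    return D

# Hypothetical objects with p = 2 variables each.
X = [[0, 0], [3, 4], [6, 8]]
D = dissimilarity_matrix(X)
print(D)
```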
One can also use weighted distance, the parametric
Pearson product-moment correlation, or other
dissimilarity measures.
\[ d(i,j) = \frac{p - m}{p} \]
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
and treat $z_{if}$ as interval-scaled
assignment.
categorical data?
Need to specify k, the number of clusters, in advance
shapes
December 16, 2024
Variations of the K-Means Method
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with
categorical objects
Using a frequency-based method to update modes
of clusters
A mixture of categorical and numerical data:
k-prototype method
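The k-modes idea (modes instead of means, a categorical dissimilarity, frequency-based mode updates) can be sketched as below. This is a simplified batch variant under stated assumptions, not Huang's published algorithm: dissimilarity is the number of mismatching attributes, the initial modes are the first k distinct records, and all names and data are hypothetical.

```python
from collections import Counter

def mismatches(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, max_iter=20):
    records = [tuple(r) for r in records]
    # Assumption: seed the modes with the first k distinct records.
    modes = []
    for r in records:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    labels = None
    for _ in range(max_iter):
        # Assign each record to its nearest mode.
        new_labels = [min(range(len(modes)),
                          key=lambda c: mismatches(r, modes[c]))
                      for r in records]
        if new_labels == labels:
            break
        labels = new_labels
        # Update each mode attribute-wise with the most frequent category.
        for c in range(len(modes)):
            members = [r for r, l in zip(records, labels) if l == c]
            if members:
                modes[c] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels, modes

# Hypothetical categorical records: (color, size, shape).
records = [
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
]
labels, modes = k_modes(records, k=2)
print(labels, modes)
```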
The K-Medoids Clustering Method
Find representative objects, called medoids, in
clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and
iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance
of the resulting clustering
PAM works effectively for small data sets, but
does not scale well for large data sets
CLARA (Kaufman & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
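The medoid-swapping procedure described for PAM can be sketched as follows. This is a minimal illustration rather than the 1987 implementation: it assumes a precomputed distance matrix `D`, seeds the medoids with the first k objects, and accepts any swap of a medoid with a non-medoid that lowers the total distance. The function names and sample points are hypothetical.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, max_iter=100):
    n = D.shape[0]
    medoids = list(range(k))   # assumption: first k objects as initial medoids
    best = total_cost(D, medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):            # medoid slot to replace
            for h in range(n):          # candidate non-medoid
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_cost(D, candidate)
                if cost < best - 1e-12:  # keep the swap only if it improves
                    medoids, best, improved = candidate, cost, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best

# Hypothetical data: two well-separated groups of three points.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
medoids, labels, cost = pam(D, k=2)
print(medoids, labels, cost)
```

Every pass evaluates k(n - k) candidate swaps, and each evaluation touches all n objects, which is consistent with the remark that PAM does not scale well to large data sets.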
PAM (Partitioning Around Medoids) (1987)
[Figure: four scatter-plot panels (both axes 0 to 10) with points labeled i, h, j, and t, illustrating the swap cost $C_{jih}$. Surviving panel captions: $C_{jih} = d(j, t) - d(j, i)$ and $C_{jih} = d(j, h) - d(j, t)$.]
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufman and Rousseeuw, 1990)
Built into statistical analysis packages such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
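The sampling scheme can be sketched as below: draw several samples, run a PAM-style search on each, and keep the medoids that give the lowest total distance over the whole data set. This is an illustration under stated assumptions, not the S+ implementation: the number of samples, the sample size, the seeds, and all function names are hypothetical choices.

```python
import numpy as np

def pairwise(X, Y):
    """Euclidean distances between the rows of X and the rows of Y."""
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

def pam_on_sample(S, k):
    """Small PAM-style swap search; returns the medoid rows of S."""
    D = pairwise(S, S)
    medoids = list(range(k))
    best = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for h in range(len(S)):
                if h in medoids:
                    continue
                cand = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = D[:, cand].min(axis=1).sum()
                if cost < best - 1e-12:
                    medoids, best, improved = cand, cost, True
    return S[medoids]

def clara(X, k, n_samples=5, sample_size=20, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoids = pam_on_sample(X[idx], k)
        # Judge each sample's medoids on the whole data set, not the sample.
        cost = pairwise(X, medoids).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = pairwise(X, best_medoids).argmin(axis=1)
    return best_medoids, labels, best_cost

# Hypothetical data: two Gaussian blobs of 100 points each.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 0.5, size=(100, 2)),
               data_rng.normal(10.0, 0.5, size=(100, 2))])
best_medoids, labels, best_cost = clara(X, k=2)
print(best_medoids, best_cost)
```

The swap search only ever sees `sample_size` objects, which is where the efficiency gain comes from. A sample that misses a cluster entirely cannot produce a medoid for it, which illustrates the bias weakness listed above.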
Eventually all nodes belong to the same cluster
[Figure: scatter-plot panels, both axes 0 to 10; plotted points not recoverable]
dynamic modeling
BIRCH (1996)
BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
Weakness: handles only numeric data and is sensitive to
the order of the data records.
Clustering Feature Vector
[Figure: scatter plot, both axes 0 to 10, of the five points (3,4), (2,6), (4,5), (4,7), (3,8)]
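For the five points shown in the figure, the clustering feature can be computed directly. The sketch assumes the usual BIRCH definition CF = (N, LS, SS), with N the number of points, LS their linear sum, and SS their sum of squares; keeping SS per dimension is an assumption here (some presentations use a single scalar, the total of the per-dimension values). The function names are hypothetical.

```python
import numpy as np

def clustering_feature(points):
    """Return (N, LS, SS) for a set of points."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def merge_cf(cf_a, cf_b):
    """CFs are additive: merging two subclusters adds them component-wise."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def centroid(cf):
    """The centroid is recoverable from the CF alone: LS / N."""
    n, ls, _ = cf
    return ls / n

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
n, ls, ss = clustering_feature(points)
print(n, ls, ss)   # 5, (16, 30), (54, 190)

# Merging the CFs of two parts gives the CF of the whole.
merged = merge_cf(clustering_feature(points[:2]), clustering_feature(points[2:]))
print(centroid(merged))
```

Additivity is what lets a CF tree be built incrementally: a non-leaf entry can summarize its children by adding their CFs, without revisiting the raw points.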
Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5
Density-connected
A point p is density-connected to
a point q wrt. Eps, MinPts if there
is a point o such that both p and
q are density-reachable from o
wrt. Eps and MinPts.
DBSCAN: Density Based Spatial Clustering of Applications with Noise
[Figure: a border point and a core point, with Eps = 1cm and MinPts = 5]
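The core/border distinction and the Eps and MinPts parameters can be illustrated with a compact sketch. It is a simplified, quadratic-time version under stated assumptions, not the published DBSCAN implementation: a point counts itself among its Eps-neighbors, a border point joins the first cluster that reaches it, and the function name and sample points are hypothetical.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Eps-neighborhood of each point (assumed to include the point itself).
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)          # -1 marks noise / not yet assigned
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]                  # only core points are ever expanded
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster      # core or border point joins
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels, is_core

# Hypothetical points: a dense group of five, one border point, one outlier.
X = [(0, 0), (0.4, 0), (0, 0.4), (-0.4, 0), (0, -0.4), (1.3, 0), (5, 5)]
labels, is_core = dbscan(X, eps=1.0, min_pts=5)
print(labels, is_core)
```

In this example the point (1.3, 0) has fewer than MinPts neighbors but lies within Eps of a core point, so it is a border point of the cluster; (5, 5) is reachable from no core point and stays labeled as noise.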