2002 Spring CS525 Lecture 2
2002 Spring CS525 Lecture 2
0
d(2,1)
• Dissimilarity matrix 0
d(3,1) d ( 3,2) 0
– (one mode)
: : :
d ( n,1) d ( n,2) ... ... 0
Measure the Quality of
Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Major Clustering
Approaches
• Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
Partitioning Algorithms: Basic
Concept
• Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by
the center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman
& Rousseeuw’87): Each cluster is represented by one of
the objects in the cluster
The K-Means Clustering Method
2 each
2 the 2
1
objects
1
0
cluster 1
0
0
0 1 2 3 4 5 6 7 8 9 10 to
0 1 2 3 4 5 6 7 8 9 10 means 0 1 2 3 4 5 6 7 8 9 10
most
similar reassign reassign
center 10 10
K=2 9 9
8 8
Arbitrarily choose 7 7
6 6
K object as initial 5 5
2
the 3
1 cluster 1
0
0 1 2 3 4 5 6 7 8 9 10
means 0
0 1 2 3 4 5 6 7 8 9 10
Comments on the K-Means
Method
• Strength: Relatively efficient: O(tkn), where n is #
objects, k is # clusters, and t is # iterations. Normally, k,
t << n.
• Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
– Dissimilarity calculations
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
9 9 9
8 8 8
7 7 7
6
Arbitrar 6
Assign 6
5
y 5 each 5
4 choose 4 remaini 4
3
k object 3
ng 3
2
as 2
object 2
1 1
initial to
1
0 0 0
0 1 2 3 4 5 6 7 8 9 10
medoid 0 1 2 3 4 5 6 7 8 9 10
nearest 0 1 2 3 4 5 6 7 8 9 10
s medoid
K=2 s Randomly select a
Total Cost = 26 nonmedoid
object,Oramdom
10 10
Do loop 9
8
Compute
9
8
Swapping 7 total cost 7
Until no O and 6
of 6
Oramdom
change
5 5
4
swapping 4
If quality is 3
2
3
improved. 1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built in Splus
• Use real object to represent the cluster
– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected object i,
calculate the total swapping cost TCih
– For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
– repeat steps 2-3 until there is no change
PAM Clustering: Total swapping cost
TCih=jCjih
10 10
9 9
j
8
t 8
t
7 7
5
j 6
4
i h 4
h
3
2
3
2
i
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
10
10
9
9
8
h 8
j
7
7
6
6
5
5 i
i h j
t
4
4
3
3
2
2
1
t
1
0
0
0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
Threshold of
1 2 34 5
A B C D E
MST Example
A B
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3 E C
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0 D
Agglomerative Algorithm
Single Link
• View all items with links (distances) between them.
• Finds maximal connected components in this graph.
• Two clusters are merged if there is at least one edge
which connects them.
• Uses threshold distances at each level.
• Could be agglomerative or divisive.
MST Single Link Algorithm
Single Link Clustering
AGNES (Agglomerative
Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
10 10 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
DIANA (Divisive Analysis)
10 10
10
9 9
9
8 8
8
7 7
7
6 6
6
5 5
5
4 4
4
3 3
3
2 2
2
1 1
1
0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
Readings
• CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic
Modeling. George Karypis, Eui-Hong Han, Vipin Kumar, IEEE Computer
32(8): 68-75, 1999 (http://glaros.dtc.umn.edu/gkhome/node/152)
• BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining
and Knowledge Discovery Volume 1 , Issue 2 (1997)