Clustering Part 2
Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested partitionings (a tree of
clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the
desired level; each connected component then forms one cluster (see the sketch below).
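As an illustration (not part of the original slides), here is a minimal sketch using SciPy on a small, assumed 2-D point set: linkage builds the dendrogram and fcluster cuts it at a chosen height, so that every connected component below the cut becomes one cluster.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points forming two visually separate groups.
points = np.array([[1, 2], [2, 2], [1, 1],
                   [8, 9], [9, 8], [8, 8]])

# Build the full merge tree (the dendrogram) with single-link agglomeration.
merge_tree = linkage(points, method="single")

# Cut the dendrogram at height 3: every connected component below that
# height becomes one cluster.
labels = fcluster(merge_tree, t=3, criterion="distance")
print(labels)   # e.g. [1 1 1 2 2 2], i.e. two clusters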
DIANA (DIVISIVE ANALYSIS)
• Top-down (divisive) approach: start with all objects in one cluster and split
recursively until each object forms a cluster of its own
[Figure: three scatter plots on 0-10 axes showing one cluster being split step by step]
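The slide only shows the figure; as a rough, simplified sketch of the divisive idea (my own code, not the complete DIANA procedure), a single top-down split could look like this: the most dissimilar point starts a splinter group, and points that are closer on average to the splinter group than to the rest move over.

import numpy as np

def diana_split(points):
    # Simplified DIANA-style split of one cluster into two.
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    # Seed the splinter group with the point of largest average dissimilarity.
    splinter = [int(dist.sum(axis=1).argmax())]
    rest = [i for i in range(len(pts)) if i not in splinter]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            others = [j for j in rest if j != i]
            if others and dist[i, splinter].mean() < dist[i, others].mean():
                rest.remove(i)
                splinter.append(i)
                moved = True
    return splinter, rest

# Two visually separate groups are split apart in one step.
print(diana_split([(1, 1), (2, 2), (1, 2), (8, 8), (9, 9), (8, 9)]))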
RECENT HIERARCHICAL CLUSTERING METHODS
[Figure: example sub-cluster containing the points (3,4), (2,6), (4,5), (4,7), (3,8), plotted on 0-10 axes]
PROPERTIES OF CLUSTERING FEATURE
• A CF entry is compact
• It stores significantly less information than all of the data points in
the sub-cluster
• The additivity theorem allows us to merge sub-clusters
incrementally and consistently (see the sketch below)
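To make the additivity concrete, here is a small illustrative sketch (code not from the slides). In BIRCH a clustering feature is the triple CF = (N, LS, SS): the number of points, their linear sum, and their square sum; two CFs are merged by component-wise addition. The points are the example sub-cluster shown earlier.

import numpy as np

def cf(points):
    # Clustering feature of a set of points: (N, linear sum, square sum).
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge(cf1, cf2):
    # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

a = cf([(3, 4), (2, 6), (4, 5)])   # sub-cluster 1
b = cf([(4, 7), (3, 8)])           # sub-cluster 2
print(merge(a, b))                 # (5, array([16., 30.]), array([54., 190.]))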
CF-TREE IN BIRCH
• Clustering feature:
• a summary of the statistics for a given sub-cluster
• registers the crucial measurements for computing clusters and uses storage
efficiently
• A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering
• A non-leaf node in the tree has descendants, or “children”
• The non-leaf nodes store the sums of the CFs of their children
• A CF tree has two parameters
• Branching factor: specifies the maximum number of children
• Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
THE CF TREE STRUCTURE
[Figure: CF tree structure: the root and the non-leaf nodes hold entries CF1, CF2, CF3, ..., each paired with a pointer (child1, child2, child3, ...) to a node at the next level; leaf nodes sit at the bottom]
CF-TREE
• Each non-leaf node has at most B entries
• Each leaf node has at most L CF entries, each of which satisfies threshold T
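The slides give no code, so here is a rough structural sketch of a CF-tree node under the constraints above (the names B, L, T and the class layout are my own, not BIRCH's actual implementation):

B = 3      # branching factor: max entries in a non-leaf node
L = 4      # max CF entries in a leaf node
T = 1.5    # threshold: max diameter of a sub-cluster kept in a leaf entry

class CFNode:
    # One node of a CF tree (hypothetical layout).
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []    # leaf: CF triples; non-leaf: (CF, child) pairs

    def is_full(self):
        # A non-leaf node may hold at most B entries,
        # a leaf node at most L entries (each within diameter T).
        return len(self.entries) >= (L if self.is_leaf else B)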
CF-TREE INSERTION
EXAMPLE OF BIRCH
[Figure: sub-clusters sc1-sc7 already stored in the leaf nodes of the CF tree, and a new sub-cluster sc8 to be inserted]
INSERTION OPERATION IN BIRCH
If the branching factor of a leaf node cannot exceed 3, then inserting the new
sub-cluster causes LN1 to be split.
[Figure: leaf node LN1 overflows and is split into LN1’ and LN1”; the root now points to LN1’, LN1”, LN2 and LN3]
If the branching factor of a non-leaf node cannot exceed 3, then the root is
split and the height of the CF tree increases by one.
[Figure: the root overflows and is split into non-leaf nodes NLN1 and NLN2 under a new root; NLN1 points to LN1’ and LN1”, and NLN2 points to LN2 and LN3]
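A rough sketch of this insert-then-split behaviour (my own simplification of BIRCH, reusing the hypothetical CFNode class from the earlier sketch; real BIRCH picks the two farthest entries as seeds and redistributes the rest by proximity):

def insert_entry(node, entry, max_entries=3):
    # Add a CF entry to a node; if the node overflows, split it in two.
    # Returns [node] if no split was needed, or the two halves after a split.
    node.entries.append(entry)
    if len(node.entries) <= max_entries:
        return [node]
    left, right = CFNode(node.is_leaf), CFNode(node.is_leaf)
    half = len(node.entries) // 2
    left.entries, right.entries = node.entries[:half], node.entries[half:]
    return [left, right]

# When a split propagates all the way up and the root itself overflows,
# a new root is created above the two halves and the height of the CF tree
# increases by one.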
EXPERIMENTAL RESULTS
• Input parameters:
• Memory (M): 5% of data set
• Disk space (R): 20% of M
• Distance equation: D2
• Quality equation: weighted average diameter (D)
• Initial threshold (T): 0.0
• Page size (P): 1024 bytes
EXPERIMENTAL RESULTS
KMEANS clustering
DS   Time   D      # Scan      DS    Time   D      # Scan
1    43.9   2.09   289         1o    33.8   1.97   197
2    13.2   4.43   51          2o    12.7   4.20   29
3    32.9   3.66   187         3o    36.0   4.35   241

BIRCH clustering
DS   Time   D      # Scan      DS    Time   D      # Scan
1    11.5   1.87   2           1o    13.6   1.87   2
2    10.7   1.99   2           2o    12.1   1.99   2
3    11.4   3.95   2           3o    12.2   3.99   2
CONCLUSIONS
• A CF tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering.
• Given a limited amount of main memory, BIRCH
can minimize the time required for I/O.
• BIRCH is a clustering algorithm that scales well with
the number of objects and produces good-quality
clusterings of the data.
CLUSTERING CATEGORICAL DATA: THE ROCK ALGORITHM
• Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
• Example: Two groups (clusters) of transactions
• C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Jaccard coefficient-based similarity function:
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
• Ex. Let T1 = {a, b, c}, T2 = {c, d, e}
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
• The Jaccard coefficient alone may not lead to the required results
• Between C1 and C2, the similarity can be as high as 0.5, e.g.
Sim({a, b, c}, {a, b, f}) = 2/4 = 0.5
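As a quick check of these numbers (illustrative code, not from the slides):

def jaccard(t1, t2):
    # Jaccard coefficient: size of the intersection over size of the union.
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({"a", "b", "c"}, {"c", "d", "e"}))   # 0.2, a pair from the same cluster C1
print(jaccard({"a", "b", "c"}, {"a", "b", "f"}))   # 0.5, a pair from different clusters C1 and C2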
LINK MEASURE IN ROCK
• Links: the number of common neighbors of two points (two transactions are
neighbors if their similarity is at least a given threshold)
[Figure: two-phase framework: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partition → Final Clusters]
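As an illustrative sketch of the link idea (my own code and threshold choice, not from the slides): ROCK counts common neighbors, where two transactions are neighbors if their Jaccard similarity is at least a threshold, here assumed to be 0.5. Using the transactions of C1 and C2 from above, pairs with the same pairwise similarity are separated by their link counts.

C1 = [{"a","b","c"}, {"a","b","d"}, {"a","b","e"}, {"a","c","d"}, {"a","c","e"},
      {"a","d","e"}, {"b","c","d"}, {"b","c","e"}, {"b","d","e"}, {"c","d","e"}]
C2 = [{"a","b","f"}, {"a","b","g"}, {"a","f","g"}, {"b","f","g"}]
data = C1 + C2

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

theta = 0.5   # assumed neighbor threshold

def neighbors(t):
    return [u for u in data if u != t and jaccard(t, u) >= theta]

def link(t1, t2):
    # ROCK link: number of common neighbors of t1 and t2.
    n2 = neighbors(t2)
    return sum(1 for u in neighbors(t1) if u in n2)

# Both pairs below have Jaccard similarity 0.5, but their link counts differ:
print(link({"a","b","c"}, {"a","b","d"}))   # 5 common neighbors, both from C1
print(link({"a","b","c"}, {"a","b","f"}))   # 3 common neighbors, across C1 and C2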
CHAMELEON (CLUSTERING COMPLEX OBJECTS)