Clustering Part 2

The document discusses hierarchical clustering methods, specifically agglomerative and divisive approaches, and introduces BIRCH, a scalable clustering algorithm that efficiently handles large datasets by using a CF-tree structure. It also covers the ROCK algorithm for clustering categorical data using link-based similarity measures, and CHAMELEON, which merges clusters based on dynamic modeling of interconnectivity and proximity. Overall, these methods aim to improve clustering efficiency and quality in various data contexts.

DATA MINING: CLUSTER ANALYSIS

Find all things that are similar to some extent!


HIERARCHICAL APPROACH
HIERARCHICAL CLUSTERING

Hierarchical clustering is a method of cluster analysis which seeks to build a
hierarchy of clusters. This method does not require the number of clusters k
as an input, but it does need a termination condition. Strategies for
hierarchical clustering generally fall into two types:
• Agglomerative: a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
[Figure: AGNES (agglomerative) proceeds from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; DIANA (divisive) proceeds in the reverse direction, from Step 4 back to Step 0.]
AGNES (AGGLOMERATIVE NESTING)

• Implemented in statistical analysis packages, e.g., S-Plus
• Uses the single-link method and the dissimilarity matrix
• Merges the pair of clusters with the least dissimilarity
• Continues in a non-descending fashion
• Eventually all objects belong to the same cluster

[Figure: three scatter plots (axes 0-10) showing the data objects being progressively merged into larger clusters by AGNES.]
Dendrogram: Shows How the Clusters are Merged
Decompose the data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the
desired level; each connected component then forms a cluster.
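The merging and dendrogram-cutting steps can be sketched in a few lines of Python. This assumes SciPy (not mentioned in the slides) and uses toy data purely for illustration:

```python
# A minimal sketch of AGNES-style single-link clustering and dendrogram cutting.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separated groups
X = np.array([[1, 2], [2, 2], [1, 3],
              [8, 8], [9, 8], [8, 9]], dtype=float)

# Agglomerative (bottom-up) merging with the single-link criterion:
# at each step the two clusters with the least dissimilarity are merged.
Z = linkage(X, method='single')

# "Cutting the dendrogram" at distance 3.0: every connected component
# below that height becomes one cluster.
labels = fcluster(Z, t=3.0, criterion='distance')
print(labels)  # e.g. [1 1 1 2 2 2]
```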
DIANA (DIVISIVE ANALYSIS)

• Implemented in statistical analysis packages, e.g., Splus


• Inverse order of AGNES
• Eventually each node forms a cluster on its own

[Figure: three scatter plots (axes 0-10) showing DIANA progressively splitting one large cluster into smaller ones.]
RECENT HIERARCHICAL CLUSTERING METHODS

• Major weaknesses of agglomerative clustering methods
• Do not scale well: time complexity of at least O(n²), where n is the
total number of objects
• Can never undo what was done previously
• Integration of hierarchical with distance-based clustering
• BIRCH (1996): uses a CF-tree and incrementally adjusts the quality
of sub-clusters
• ROCK (1999): clusters categorical data using neighbor and link
analysis
• CHAMELEON (1999): hierarchical clustering using dynamic
modeling
BIRCH (1996)

• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
• Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level
compression of the data that tries to preserve its inherent clustering
structure)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
• Weakness: handles only numeric data, and is sensitive to the order of the
data records
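As a hedged illustration of these two phases, the following sketch uses scikit-learn's Birch estimator (an assumption; the slides do not name a library, and this is not the original 1996 implementation). Parameter values are illustrative only.

```python
# A minimal sketch of BIRCH's two phases with scikit-learn's Birch estimator.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# Phase 1: build the in-memory CF-tree (threshold = max sub-cluster diameter,
# branching_factor = max children per node).
# Phase 2: cluster the leaf sub-clusters with a global algorithm
# (n_clusters=3 here applies agglomerative clustering to the sub-cluster centroids).
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

# partial_fit supports incremental insertion of new data without a full rescan.
birch.partial_fit(rng.normal(loc=5, scale=0.5, size=(20, 2)))
```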
Clustering Feature Vector in BIRCH

Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS (Linear Sum): the vector sum of the N data points, X1 + X2 + … + XN
SS (Squared Sum): the component-wise sum of squares, X1² + X2² + … + XN²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190)).

[Scatter plot of the five points omitted.]
PROPERTIES OF CLUSTERING FEATURE
• A CF entry is compact
• It stores significantly less information than all of the data points in
the sub-cluster
• The additivity theorem allows us to merge sub-clusters
incrementally and consistently
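The additivity property mentioned above can be sketched directly from the CF definition. This is a minimal illustration (the names make_cf and merge_cf are hypothetical, not BIRCH's API), assuming 2-D points:

```python
# CF = (N, LS, SS): count, linear sum, and component-wise squared sum.
import numpy as np

def make_cf(points):
    """Build a CF triple (N, LS, SS) from an array of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """Additivity theorem: the CF of a merged sub-cluster is the
    component-wise sum of the two CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

# The slide's example: five points giving CF = (5, (16, 30), (54, 190))
cf = make_cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # (5, array([16., 30.]), array([ 54., 190.]))
```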

CF-TREE IN BIRCH
• Clustering feature:
• a summary of the statistics for a given sub-cluster
• registers crucial measurements for computing clusters and uses storage
efficiently
• A CF-tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering
• A non-leaf node in the tree has descendants or "children"
• The non-leaf nodes store the sums of the CFs of their children
• A CF-tree has two parameters
• Branching factor: specifies the maximum number of children per node
• Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
THE CF TREE STRUCTURE

[Figure: a CF-tree with branching factor B = 6 and leaf capacity L = 6. The root holds entries CF1 … CF6, each with a pointer to a child; non-leaf nodes hold entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained together with prev/next pointers.]
BIRCH IN ACTION

• Designed for very large data sets
• Time and memory are limited
• Incremental and dynamic clustering of incoming objects
• Only one scan of the data is necessary
• Does not need the whole data set in advance
• Two key phases:
• Scans the database to build an in-memory CF-tree
• Applies a clustering algorithm to cluster the leaf nodes

CF-TREE

• Each non-leaf node has at most B entries
• Each leaf node has at most L CF entries, each of which satisfies
threshold T
CF-TREE INSERTION

• Recurse down from the root to find the appropriate leaf
• Follow the "closest"-CF path, w.r.t. the distance metrics D0 / … / D4
• Modify the leaf
• If the closest CF leaf entry cannot absorb the point, make a new CF
entry. If there is no room for the new entry, split the leaf node and,
if necessary, the parent node
• Traverse back up
• Update the CFs on the path, splitting nodes as needed
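The absorb-or-create decision above can be illustrated with a simplified, flat leaf node. This sketch is not the BIRCH paper's algorithm (there is no tree, no choice among D0–D4, and no actual node split); it only shows the threshold test computed from the CF triple, with illustrative values for T and L:

```python
import numpy as np

T = 1.5   # threshold: max diameter of a leaf sub-cluster (illustrative)
L = 3     # max CF entries per leaf node (illustrative)

def diameter(n, ls, ss):
    """Average pairwise distance of a sub-cluster, computed from its CF alone."""
    if n < 2:
        return 0.0
    return np.sqrt(max(0.0, (2 * n * ss.sum() - 2 * np.dot(ls, ls)) / (n * (n - 1))))

def insert(leaf, x):
    """Absorb x into the closest CF entry if the threshold allows, else add a new entry."""
    x = np.asarray(x, dtype=float)
    if leaf:
        # closest entry by distance from x to the entry's centroid LS / N
        i = min(range(len(leaf)), key=lambda k: np.linalg.norm(x - leaf[k][1] / leaf[k][0]))
        n, ls, ss = leaf[i]
        if diameter(n + 1, ls + x, ss + x ** 2) <= T:
            leaf[i] = (n + 1, ls + x, ss + x ** 2)   # absorb into the closest entry
            return leaf
    leaf.append((1, x, x ** 2))                      # start a new CF entry
    if len(leaf) > L:
        print("leaf overflow: a real CF-tree would split this node here")
    return leaf

leaf = []
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8), (9, 9)]:
    leaf = insert(leaf, p)
print(len(leaf), "leaf entries")  # 4 entries with these illustrative values
```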
EXAMPLE OF BIRCH

[Figure: a new sub-cluster sc8 is inserted into a CF-tree whose root points to leaf nodes LN1 (holding sc1, sc2), LN2 (holding sc3, sc4, sc5), and LN3 (holding sc6, sc7). The closest leaf for sc8 is LN1.]
INSERTION OPERATION IN BIRCH

If the branching factor of a leaf node cannot exceed 3, then LN1 is split.

[Figure: after inserting sc8, leaf node LN1 is split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.]
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF-tree increases by one.

[Figure: the root is split into non-leaf nodes NLN1 (pointing to LN1' and LN1'') and NLN2 (pointing to LN2 and LN3), with a new root above them.]
CF-TREE REBUILDING

• If we run out of space, increase the threshold T
• With a larger threshold, each CF entry absorbs more data points
• Rebuilding "pushes" existing CF entries over into the new tree; the
larger T allows different CFs to be grouped together
• Reducibility theorem
• Increasing T results in a CF-tree smaller than the original
EXPERIMENTAL RESULTS

• Input parameters:
• Memory (M): 5% of the data set size
• Disk space (R): 20% of M
• Distance metric: D2
• Quality metric: weighted average diameter (D)
• Initial threshold (T): 0.0
• Page size (P): 1024 bytes
EXPERIMENTAL RESULTS

KMEANS clustering
DS   Time   D      # Scan     DS    Time   D      # Scan
1    43.9   2.09   289        1o    33.8   1.97   197
2    13.2   4.43   51         2o    12.7   4.20   29
3    32.9   3.66   187        3o    36.0   4.35   241

BIRCH clustering
DS   Time   D      # Scan     DS    Time   D      # Scan
1    11.5   1.87   2          1o    13.6   1.87   2
2    10.7   1.99   2          2o    12.1   1.99   2
3    11.4   3.95   2          3o    12.2   3.99   2
CONCLUSIONS
• A CF-tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering.
• Given a limited amount of main memory, BIRCH
can minimize the time required for I/O.
• BIRCH is a scalable clustering algorithm with respect
to the number of objects, and it produces good quality
clustering of the data.
CLUSTERING CATEGORICAL DATA: THE ROCK ALGORITHM

• ROCK: RObust Clustering using linKs
• S. Guha, R. Rastogi & K. Shim, ICDE'99
• Major ideas
• Use links to measure similarity/proximity (the number of common
neighbors between two objects)
• Not distance-based
• Computational complexity: O(n² + n·m_m·m_a + n² log n), where m_a and
m_m are the average and maximum numbers of neighbors
• Algorithm: sampling-based clustering
• Draw a random sample
• Cluster with links
• Label the data on disk
• Experiments
• Congressional voting and mushroom data sets
SIMILARITY MEASURE IN ROCK

• Traditional measures for categorical data may not work well, e.g., the
Jaccard coefficient
• Example: two groups (clusters) of transactions
• C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Jaccard coefficient-based similarity function:
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
• Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
• The Jaccard coefficient alone may not lead to the required results
• Within C1 the similarity can be as low as 0.2 ({a, b, c} vs. {c, d, e}), while
across C1 and C2 it could be as high as 0.5 ({a, b, c} vs. {a, b, f})
LINK MEASURE IN ROCK

• Links: the number of common neighbors, where two transactions are
neighbors if their similarity meets or exceeds a threshold θ
• C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}
• C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
• link(T1, T2) = 4, since they have 4 common neighbors when the
threshold is 0.5:
• {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have 3 common neighbors:
• {a, b, d}, {a, b, e}, {a, b, g}
• Thus link is a better similarity measure than the Jaccard coefficient
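A short sketch makes the difference between the two measures concrete (the function names jaccard and links are illustrative, not ROCK's API):

```python
# Two transactions are neighbors if their Jaccard similarity meets a threshold;
# link(T1, T2) is the number of neighbors the two transactions have in common.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def links(transactions, theta):
    """Map each pair of transaction indices (i, j) to its link count."""
    n = len(transactions)
    neighbors = [
        {j for j in range(n)
         if j != i and jaccard(transactions[i], transactions[j]) >= theta}
        for i in range(n)
    ]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}

# The slide's transactions from C1 and C2
C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]

link = links(C1 + C2, theta=0.5)
print(link[(0, 9)])   # link({a,b,c}, {c,d,e}) -> 4
print(link[(0, 10)])  # link({a,b,c}, {a,b,f}) -> 3
```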


CHAMELEON: HIERARCHICAL CLUSTERING USING DYNAMIC MODELING
• CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar (1999)
• Measures similarity based on a dynamic model
• Two clusters are merged only if the interconnectivity and closeness (proximity)
between the two clusters are high relative to the internal interconnectivity of the
clusters and the closeness of items within the clusters
• ROCK ignores information about the closeness of two clusters
• A two-phase algorithm (illustrated by the sketch after this list)
1. Use a k-nearest-neighbor approach to construct a sparse graph (an edge exists
between two objects if one is among the k most similar objects of the other)
• Concept of neighborhood: dense regions have narrow neighborhoods, sparse regions
have wide neighborhoods
• Edges are weighted to reflect similarity: dense regions have heavy edges, and vice versa
2. Use a graph partitioning algorithm to cluster the objects into a large number of
relatively small sub-clusters
3. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters
by repeatedly combining these sub-clusters
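As a hedged illustration of step 1 only (steps 2 and 3 need a graph partitioner such as hMETIS and the dynamic merging criteria, which are not shown here), the k-nearest-neighbor sparse graph can be built with scikit-learn, an assumption not made by the slides:

```python
# Build a similarity-weighted k-NN sparse graph as in CHAMELEON's first step.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(1)
# A dense group around the origin and a sparser group around (3, 3)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])

# Edge (i, j) exists iff j is among the k nearest neighbors of i;
# dense regions have narrow neighborhoods (small distances).
k = 5
G = kneighbors_graph(X, n_neighbors=k, mode='distance', include_self=False)

# Re-weight edges so that dense regions carry heavy (high-similarity) edges.
G.data = 1.0 / (1.0 + G.data)

print(G.shape, G.nnz)  # (100, 100) sparse matrix with 100 * k weighted edges
```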
OVERALL FRAMEWORK OF CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
CHAMELEON (CLUSTERING COMPLEX OBJECTS)
