Clustering Part 2

The document discusses hierarchical clustering methods, specifically agglomerative and divisive approaches, and introduces BIRCH, a scalable clustering algorithm that efficiently handles large datasets by using a CF-tree structure. It also covers the ROCK algorithm for clustering categorical data using link-based similarity measures, and CHAMELEON, which merges clusters based on dynamic modeling of interconnectivity and proximity. Overall, these methods aim to improve clustering efficiency and quality in various data contexts.

DATA MINING: CLUSTER ANALYSIS

Find all things that are similar to some extent!


HIERARCHICAL APPROACH
HIERARCHICAL CLUSTERING

Hierarchical clustering is a method of cluster analysis which seeks to build a
hierarchy of clusters. This method does not require the number of clusters k
as an input, but it does need a termination condition. Strategies for
hierarchical clustering generally fall into two types:
• Agglomerative: a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
[Figure: AGNES (agglomerative) proceeds from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; DIANA (divisive) proceeds in the reverse direction, from Step 4 back to Step 0.]
AGNES (AGGLOMERATIVE NESTING)

• Implemented in statistical analysis packages, e.g., S-Plus
• Uses the single-link method and the dissimilarity matrix
• Merges the pair of clusters with the least dissimilarity
• Continues in a non-descending fashion
• Eventually all objects belong to the same cluster

[Figure: three scatter plots (axes 0-10) showing the data objects being progressively merged into larger clusters by AGNES.]
Dendrogram: Shows How the Clusters are Merged
Decompose the data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the
desired level; each connected component then forms a cluster.
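The merging and dendrogram-cutting steps can be sketched in a few lines of Python. This assumes SciPy (not mentioned in the slides) and uses toy data purely for illustration:

```python
# A minimal sketch of AGNES-style single-link clustering and dendrogram cutting.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separated groups
X = np.array([[1, 2], [2, 2], [1, 3],
              [8, 8], [9, 8], [8, 9]], dtype=float)

# Agglomerative (bottom-up) merging with the single-link criterion:
# at each step the two clusters with the least dissimilarity are merged.
Z = linkage(X, method='single')

# "Cutting the dendrogram" at distance 3.0: every connected component
# below that height becomes one cluster.
labels = fcluster(Z, t=3.0, criterion='distance')
print(labels)  # e.g. [1 1 1 2 2 2]
```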
DIANA (DIVISIVE ANALYSIS)

• Implemented in statistical analysis packages, e.g., Splus


• Inverse order of AGNES
• Eventually each node forms a cluster on its own

[Figure: three scatter plots (axes 0-10) showing DIANA progressively splitting one large cluster into smaller ones.]
RECENT HIERARCHICAL CLUSTERING METHODS

• Major weaknesses of agglomerative clustering methods
• Do not scale well: time complexity of at least O(n²), where n is the
total number of objects
• Can never undo what was done previously
• Integration of hierarchical with distance-based clustering
• BIRCH (1996): uses a CF-tree and incrementally adjusts the quality
of sub-clusters
• ROCK (1999): clusters categorical data using neighbor and link
analysis
• CHAMELEON (1999): hierarchical clustering using dynamic
modeling
BIRCH (1996)

• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
• Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level
compression of the data that tries to preserve its inherent clustering
structure)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
• Weakness: handles only numeric data, and is sensitive to the order of the
data records
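As a hedged illustration of these two phases, the following sketch uses scikit-learn's Birch estimator (an assumption; the slides do not name a library, and this is not the original 1996 implementation). Parameter values are illustrative only.

```python
# A minimal sketch of BIRCH's two phases with scikit-learn's Birch estimator.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# Phase 1: build the in-memory CF-tree (threshold = max sub-cluster diameter,
# branching_factor = max children per node).
# Phase 2: cluster the leaf sub-clusters with a global algorithm
# (n_clusters=3 here applies agglomerative clustering to the sub-cluster centroids).
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

# partial_fit supports incremental insertion of new data without a full rescan.
birch.partial_fit(rng.normal(loc=5, scale=0.5, size=(20, 2)))
```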
Clustering Feature Vector in BIRCH

Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS (Linear Sum): the vector sum of the N data points, X1 + X2 + … + XN
SS (Squared Sum): the component-wise sum of squares, X1² + X2² + … + XN²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190)).

[Scatter plot of the five points omitted.]
PROPERTIES OF CLUSTERING FEATURE
• A CF entry is compact
• It stores significantly less information than all of the data points in
the sub-cluster
• The additivity theorem allows us to merge sub-clusters
incrementally and consistently
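The additivity property mentioned above can be sketched directly from the CF definition. This is a minimal illustration (the names make_cf and merge_cf are hypothetical, not BIRCH's API), assuming 2-D points:

```python
# CF = (N, LS, SS): count, linear sum, and component-wise squared sum.
import numpy as np

def make_cf(points):
    """Build a CF triple (N, LS, SS) from an array of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """Additivity theorem: the CF of a merged sub-cluster is the
    component-wise sum of the two CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

# The slide's example: five points giving CF = (5, (16, 30), (54, 190))
cf = make_cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # (5, array([16., 30.]), array([ 54., 190.]))
```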

CF-TREE IN BIRCH
• Clustering feature:
• a summary of the statistics for a given sub-cluster
• registers crucial measurements for computing clusters and uses storage
efficiently
• A CF-tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering
• A non-leaf node in the tree has descendants or "children"
• The non-leaf nodes store the sums of the CFs of their children
• A CF-tree has two parameters
• Branching factor: specifies the maximum number of children per node
• Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
THE CF TREE STRUCTURE

[Figure: a CF-tree with branching factor B = 6 and leaf capacity L = 6. The root holds entries CF1 … CF6, each with a pointer to a child; non-leaf nodes hold entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained together with prev/next pointers.]
BIRCH IN ACTION

• Designed for very large data sets
• Time and memory are limited
• Incremental and dynamic clustering of incoming objects
• Only one scan of the data is necessary
• Does not need the whole data set in advance
• Two key phases:
• Scans the database to build an in-memory CF-tree
• Applies a clustering algorithm to cluster the leaf nodes

CF-TREE

• Each non-leaf node has at most B entries
• Each leaf node has at most L CF entries, each of which satisfies
threshold T
CF-TREE INSERTION

• Recurse down from the root to find the appropriate leaf
• Follow the "closest"-CF path, w.r.t. the distance metrics D0 / … / D4
• Modify the leaf
• If the closest CF leaf entry cannot absorb the point, make a new CF
entry. If there is no room for the new entry, split the leaf node and,
if necessary, the parent node
• Traverse back up
• Update the CFs on the path, splitting nodes as needed
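The absorb-or-create decision above can be illustrated with a simplified, flat leaf node. This sketch is not the BIRCH paper's algorithm (there is no tree, no choice among D0–D4, and no actual node split); it only shows the threshold test computed from the CF triple, with illustrative values for T and L:

```python
import numpy as np

T = 1.5   # threshold: max diameter of a leaf sub-cluster (illustrative)
L = 3     # max CF entries per leaf node (illustrative)

def diameter(n, ls, ss):
    """Average pairwise distance of a sub-cluster, computed from its CF alone."""
    if n < 2:
        return 0.0
    return np.sqrt(max(0.0, (2 * n * ss.sum() - 2 * np.dot(ls, ls)) / (n * (n - 1))))

def insert(leaf, x):
    """Absorb x into the closest CF entry if the threshold allows, else add a new entry."""
    x = np.asarray(x, dtype=float)
    if leaf:
        # closest entry by distance from x to the entry's centroid LS / N
        i = min(range(len(leaf)), key=lambda k: np.linalg.norm(x - leaf[k][1] / leaf[k][0]))
        n, ls, ss = leaf[i]
        if diameter(n + 1, ls + x, ss + x ** 2) <= T:
            leaf[i] = (n + 1, ls + x, ss + x ** 2)   # absorb into the closest entry
            return leaf
    leaf.append((1, x, x ** 2))                      # start a new CF entry
    if len(leaf) > L:
        print("leaf overflow: a real CF-tree would split this node here")
    return leaf

leaf = []
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8), (9, 9)]:
    leaf = insert(leaf, p)
print(len(leaf), "leaf entries")  # 4 entries with these illustrative values
```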
EXAMPLE OF BIRCH

[Figure: a new sub-cluster sc8 is inserted into a CF-tree whose root points to leaf nodes LN1 (holding sc1, sc2), LN2 (holding sc3, sc4, sc5), and LN3 (holding sc6, sc7). The closest leaf for sc8 is LN1.]
INSERTION OPERATION IN BIRCH

If the branching factor of a leaf node cannot exceed 3, then LN1 is split.

[Figure: after inserting sc8, leaf node LN1 is split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.]
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF-tree increases by one.

[Figure: the root is split into non-leaf nodes NLN1 (pointing to LN1' and LN1'') and NLN2 (pointing to LN2 and LN3), with a new root above them.]
CF-TREE REBUILDING

• If we run out of space, increase the threshold T
• With a larger threshold, each CF entry absorbs more data points
• Rebuilding "pushes" existing CF entries over into the new tree; the
larger T allows different CFs to be grouped together
• Reducibility theorem
• Increasing T results in a CF-tree smaller than the original
EXPERIMENTAL RESULTS

• Input parameters:
• Memory (M): 5% of the data set size
• Disk space (R): 20% of M
• Distance metric: D2
• Quality metric: weighted average diameter (D)
• Initial threshold (T): 0.0
• Page size (P): 1024 bytes
EXPERIMENTAL RESULTS

KMEANS clustering
DS   Time   D      # Scan     DS    Time   D      # Scan
1    43.9   2.09   289        1o    33.8   1.97   197
2    13.2   4.43   51         2o    12.7   4.20   29
3    32.9   3.66   187        3o    36.0   4.35   241

BIRCH clustering
DS   Time   D      # Scan     DS    Time   D      # Scan
1    11.5   1.87   2          1o    13.6   1.87   2
2    10.7   1.99   2          2o    12.1   1.99   2
3    11.4   3.95   2          3o    12.2   3.99   2
CONCLUSIONS
• A CF-tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering.
• Given a limited amount of main memory, BIRCH
can minimize the time required for I/O.
• BIRCH is a scalable clustering algorithm with respect
to the number of objects, and it produces good quality
clustering of the data.
CLUSTERING CATEGORICAL DATA: THE ROCK ALGORITHM

• ROCK: RObust Clustering using linKs
• S. Guha, R. Rastogi & K. Shim, ICDE'99
• Major ideas
• Use links to measure similarity/proximity (the number of common
neighbors between two objects)
• Not distance-based
• Computational complexity: O(n² + n·m_m·m_a + n² log n), where m_a and
m_m are the average and maximum numbers of neighbors
• Algorithm: sampling-based clustering
• Draw a random sample
• Cluster with links
• Label the data on disk
• Experiments
• Congressional voting and mushroom data sets
SIMILARITY MEASURE IN ROCK

• Traditional measures for categorical data may not work well, e.g., the
Jaccard coefficient
• Example: two groups (clusters) of transactions
• C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Jaccard coefficient-based similarity function:
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
• Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
• The Jaccard coefficient alone may not lead to the required results
• Within C1 the similarity can be as low as 0.2 ({a, b, c} vs. {c, d, e}), while
across C1 and C2 it could be as high as 0.5 ({a, b, c} vs. {a, b, f})
LINK MEASURE IN ROCK

• Links: the number of common neighbors, where two transactions are
neighbors if their similarity meets or exceeds a threshold θ
• C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}
• C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
• Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
• link(T1, T2) = 4, since they have 4 common neighbors when the
threshold is 0.5:
• {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have 3 common neighbors:
• {a, b, d}, {a, b, e}, {a, b, g}
• Thus link is a better similarity measure than the Jaccard coefficient
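A short sketch makes the difference between the two measures concrete (the function names jaccard and links are illustrative, not ROCK's API):

```python
# Two transactions are neighbors if their Jaccard similarity meets a threshold;
# link(T1, T2) is the number of neighbors the two transactions have in common.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def links(transactions, theta):
    """Map each pair of transaction indices (i, j) to its link count."""
    n = len(transactions)
    neighbors = [
        {j for j in range(n)
         if j != i and jaccard(transactions[i], transactions[j]) >= theta}
        for i in range(n)
    ]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}

# The slide's transactions from C1 and C2
C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]

link = links(C1 + C2, theta=0.5)
print(link[(0, 9)])   # link({a,b,c}, {c,d,e}) -> 4
print(link[(0, 10)])  # link({a,b,c}, {a,b,f}) -> 3
```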


CHAMELEON: HIERARCHICAL CLUSTERING USING DYNAMIC MODELING
• CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar (1999)
• Measures similarity based on a dynamic model
• Two clusters are merged only if the interconnectivity and closeness (proximity)
between the two clusters are high relative to the internal interconnectivity of the
clusters and the closeness of items within the clusters
• ROCK ignores information about the closeness of two clusters
• A two-phase algorithm (illustrated by the sketch after this list)
1. Use a k-nearest-neighbor approach to construct a sparse graph (an edge exists
between two objects if one is among the k most similar objects of the other)
• Concept of neighborhood: dense regions have narrow neighborhoods, sparse regions
have wide neighborhoods
• Edges are weighted to reflect similarity: dense regions have heavy edges, and vice versa
2. Use a graph partitioning algorithm to cluster the objects into a large number of
relatively small sub-clusters
3. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters
by repeatedly combining these sub-clusters
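As a hedged illustration of step 1 only (steps 2 and 3 need a graph partitioner such as hMETIS and the dynamic merging criteria, which are not shown here), the k-nearest-neighbor sparse graph can be built with scikit-learn, an assumption not made by the slides:

```python
# Build a similarity-weighted k-NN sparse graph as in CHAMELEON's first step.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(1)
# A dense group around the origin and a sparser group around (3, 3)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])

# Edge (i, j) exists iff j is among the k nearest neighbors of i;
# dense regions have narrow neighborhoods (small distances).
k = 5
G = kneighbors_graph(X, n_neighbors=k, mode='distance', include_self=False)

# Re-weight edges so that dense regions carry heavy (high-similarity) edges.
G.data = 1.0 / (1.0 + G.data)

print(G.shape, G.nnz)  # (100, 100) sparse matrix with 100 * k weighted edges
```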
OVERALL FRAMEWORK OF CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
CHAMELEON (CLUSTERING COMPLEX OBJECTS)
