
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

General Applications of Clustering

 Pattern Recognition
 Spatial Data Analysis
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data mining
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
 Land use: Identification of areas of similar land use in
an earth observation database
 Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost
 City-planning: Identifying groups of houses according
to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults

What Is Good Clustering?

 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured
by its ability to discover some or all of the hidden
patterns.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
Data Structures

 Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed
in terms of a distance function, which is typically
metric: d(i, j)
 There is a separate “quality” function that measures
the “goodness” of a cluster.
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical,
ordinal and ratio variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.

Type of data in clustering analysis

 Interval-scaled variables:
 Binary variables:
 Nominal, ordinal, and ratio variables:
 Variables of mixed types:

Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:
$$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where $m_f = \tfrac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
 Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
 Using mean absolute deviation is more robust than using standard deviation
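
A minimal sketch of this standardization in plain Python (function and variable names are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation.

    values: raw measurements x_1f, ..., x_nf for variable f.
    Returns the z-scores z_if = (x_if - m_f) / s_f.
    """
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

# Example: z-scores for a small sample
print(standardize([2.0, 4.0, 4.0, 6.0, 9.0]))   # [-1.5, -0.5, -0.5, 0.5, 2.0]
```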
Similarity and Dissimilarity Between Objects

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular ones include the Minkowski distance:
$$d(i,j) = \sqrt[q]{\,|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\,}$$
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$

Similarity and Dissimilarity Between Objects (Cont.)

 If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{\,|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2\,}$$
 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
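
An illustrative Python sketch of these distances (names are my own):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)

print(minkowski(i, j, 1))   # q = 1: Manhattan distance, 7.0
print(minkowski(i, j, 2))   # q = 2: Euclidean distance, 5.0
```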

Binary Variables
 A contingency table for binary data:

                 Object j
               1      0      sum
Object i   1   a      b      a+b
           0   c      d      c+d
         sum  a+c    b+d      p

 Simple matching coefficient (invariant if the binary variable is symmetric):
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
 Jaccard coefficient (noninvariant if the binary variable is asymmetric):
$$d(i,j) = \frac{b + c}{a + b + c}$$
Dissimilarity between Binary Variables

 Example
Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack     M       Y      N      P       N       N       N
Mary     F       Y      N      P       N       P       N
Jim      M       Y      P      N       N       N       N
 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
 m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$
 Method 2: use a large number of binary variables
 creating a new binary variable for each of the M nominal states

Ordinal Variables

 An ordinal variable can be discrete or continuous
 order is important, e.g., rank
 Can be treated like interval-scaled
 replace x_{if} by its rank r_{if} ∈ {1, …, M_f}
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
 compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as AeBt or Ae-Bt
 Methods:

treat them like interval-scaled variables — not a
good choice! (why?)

apply logarithmic transformation
yif = log(xif)

treat them as continuous ordinal data and treat their
ranks as interval-scaled.
Variables of Mixed Types

 A database may contain all six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
 One may use a weighted formula to combine their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
 f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, otherwise d_{ij}^{(f)} = 1
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
 compute ranks r_{if} and z_{if} = (r_{if} - 1)/(M_f - 1), and treat z_{if} as interval-scaled
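
A rough Python sketch of this weighted combination, assuming the per-variable dissimilarities d_ij^(f) and indicators δ_ij^(f) have already been computed (names and structure are illustrative):

```python
def mixed_dissimilarity(per_variable):
    """Combine per-variable dissimilarities with indicator weights.

    per_variable: list of (delta, d) pairs, one per variable f, where delta is
    0 when the comparison does not count (e.g. a missing value, or a 0-0 match
    on an asymmetric binary variable) and 1 otherwise, and d is d_ij^(f) in [0, 1].
    """
    num = sum(delta * d for delta, d in per_variable)
    den = sum(delta for delta, _ in per_variable)
    return num / den if den else 0.0

# Three variables: a nominal mismatch, a normalized interval distance, an ordinal distance
print(mixed_dissimilarity([(1, 1.0), (1, 0.25), (1, 0.5)]))  # 0.583...
```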

Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Major Clustering Approaches

 Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
 Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
 Grid-based: based on a multiple-level granularity structure
 Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data

Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids
algorithms
 k-means (MacQueen’67): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids)
(Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in 4 steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
 Assign each object to the cluster with the nearest seed point.
 Go back to Step 2; stop when no new assignments are made.
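
A minimal, illustrative k-means sketch in plain Python following these steps (2-D points, squared Euclidean distance; it seeds the clusters with k sampled points rather than an explicit initial partition, and all names are my own):

```python
import random

def kmeans(points, k, max_iter=100):
    """Naive k-means for 2-D points; returns (centroids, assignment)."""
    centroids = random.sample(points, k)              # Step 1: pick k initial seed points
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:              # Step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, assignment

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2))
```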

The K-Means Clustering Method
 Example
[Figure: four scatter plots showing successive k-means iterations: objects are assigned to the nearest seed point, cluster means are recomputed, and assignments are updated until they stabilize]

Comments on the K-Means Method
 Strength
 Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
 Weakness
 Applicable only when a mean is defined; what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
 A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with
categorical objects
 Using a frequency-based method to update modes

of clusters

A mixture of categorical and numerical data: k-
prototype method
The K-Medoids Clustering Method
 Find representative objects, called medoids, in
clusters
 PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and
iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance
of the resulting clustering

PAM works effectively for small data sets, but
does not scale well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids)
(1987)

 PAM (Kaufman and Rousseeuw, 1987), built in Splus


 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h

Then assign each non-selected object to the
most similar representative object
 repeat steps 2-3 until there is no change
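
A rough sketch of the swap evaluation at the heart of PAM, assuming a precomputed dissimilarity function over object identifiers (this is an illustration, not the Splus implementation):

```python
def total_swap_cost(dist, objects, medoids, i, h):
    """Total cost TC_ih of swapping current medoid i with non-medoid h.

    dist(a, b) is a dissimilarity over object identifiers. For each remaining
    object j, the contribution C_jih is the change in j's distance to its
    closest medoid caused by the swap.
    """
    new_medoids = (set(medoids) - {i}) | {h}
    cost = 0.0
    for j in objects:
        if j in medoids or j == h:
            continue
        before = min(dist(j, m) for m in medoids)
        after = min(dist(j, m) for m in new_medoids)
        cost += after - before          # C_jih
    return cost                          # the swap is beneficial when TC_ih < 0

# Toy usage: 1-D "objects" identified by their coordinate
objs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
d = lambda a, b: abs(a - b)
print(total_swap_cost(d, objs, medoids=[1.0, 3.0], i=3.0, h=11.0))   # negative: accept swap
```

PAM repeatedly picks the (i, h) pair with the most negative TC_ih, performs that swap, and stops when no swap improves the clustering.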
PAM Clustering: Total Swapping Cost $TC_{ih} = \sum_j C_{jih}$

[Figure: four scatter plots illustrating the four cases for the reassignment cost $C_{jih}$ of an object j when medoid i is swapped with non-medoid h (t denotes another current medoid):
 $C_{jih} = d(j, h) - d(j, i)$
 $C_{jih} = 0$
 $C_{jih} = d(j, t) - d(j, i)$
 $C_{jih} = d(j, h) - d(j, t)$]
CLARA (Clustering Large Applications)
(1990)
 CLARA (Kaufmann and Rousseeuw in 1990)

Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:

Efficiency depends on the sample size

A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased

CLARANS (“Randomized” CLARA)
(1994)
 CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
 CLARANS draws sample of neighbors dynamically
 The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)

Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Hierarchical Clustering
 Use distance matrix as clustering criteria. This
method does not require the number of clusters k
as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a, b, c, d, e into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in the reverse direction]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g.,
Splus
 Use the Single-Link method and the dissimilarity
matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing AGNES merging nearby clusters step by step]

A Dendrogram Shows How the
Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
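
A brief illustrative sketch of single-link agglomerative clustering with a dendrogram cut, assuming SciPy is available (the data is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separate groups
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# Single-link (minimum-distance) agglomerative clustering
Z = linkage(X, method='single')

# "Cut" the dendrogram: merge everything joined below distance 3
labels = fcluster(Z, t=3, criterion='distance')
print(labels)   # e.g. [1 1 1 2 2 2]
```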

DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g.,
Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
[Figure: three scatter plots showing DIANA splitting one cluster into progressively smaller clusters]

More on Hierarchical Clustering
Methods
 Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n^2), where n is the total number of objects


 can never undo what was done previously

 Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
 CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny
(SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to
the order of the data record.
Clustering Feature Vector
 Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: $\sum_{i=1}^{N} X_i$ (linear sum of the N points)
 SS: $\sum_{i=1}^{N} X_i^2$ (square sum of the N points)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

[Figure: the five points plotted on a 2-D grid]
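
A small sketch of how a CF entry could be computed and merged incrementally (illustrative code, not the BIRCH implementation; square sums are kept per dimension, matching the example above):

```python
def cf_of(points):
    """Clustering Feature (N, LS, SS) of a list of 2-D points."""
    n = len(points)
    ls = (sum(x for x, _ in points), sum(y for _, y in points))
    ss = (sum(x * x for x, _ in points), sum(y * y for _, y in points))
    return n, ls, ss

def cf_merge(cf1, cf2):
    """CFs are additive, which is what makes incremental insertion cheap."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            (ls1[0] + ls2[0], ls1[1] + ls2[1]),
            (ss1[0] + ss2[0], ss1[1] + ss2[1]))

print(cf_of([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))  # (5, (16, 30), (54, 190))
```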

CF Tree (branching factor B = 7, leaf capacity L = 6)

[Figure: a CF tree. The root and non-leaf nodes hold entries CF1, CF2, … each with a child pointer; leaf nodes hold CF entries and are chained by prev/next pointers.]

Drawbacks of Distance-Based Methods

 Drawbacks of square-error based clustering methods
 Consider only one point as representative of a cluster
 Good only for clusters that are convex-shaped and of similar size and density, and only when k can be reasonably estimated
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Density-Based Clustering
Methods
 Clustering based on density (local cluster
criterion), such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination
condition
 Several interesting studies:

DBSCAN: Ester, et al. (KDD’96)

OPTICS: Ankerst, et al (SIGMOD’99).

DENCLUE: Hinneburg & D. Keim (KDD’98)

CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering:
Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 N_Eps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
 1) p belongs to N_Eps(q)
 2) core point condition: |N_Eps(q)| >= MinPts

[Figure: p lies in the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm]
Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i

 Density-connected:
 A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
DBSCAN: Density Based Spatial
Clustering of Applications with
Noise

 Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5]

DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p wrt
Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database.
 Continue the process until all of the points have
been processed.
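
A compact, illustrative DBSCAN sketch in plain Python (2-D points, Euclidean distance; parameter values and names are my own):

```python
def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (1, 2, ...) or -1 for noise."""
    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1                      # p is a core point: a new cluster is formed
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=3))   # [1, 1, 1, 1, 2, 2, 2, -1]
```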
Summary

 Cluster analysis groups objects based on their similarity and has wide applications
 Measure of similarity can be computed for various
types of data
 Clustering algorithms can be categorized into
partitioning methods, hierarchical methods, density-
based methods, grid-based methods, and model-
based methods
 Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
 There are still lots of research issues on cluster
analysis, such as constraint-based clustering
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to
identify the clustering structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World
Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer systems, 13,
1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very
large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to
spatial data mining. VLDB'97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering
method for very large databases. SIGMOD'96.
http://www.cs.sfu.ca/~han

Thank you !!!

