
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

General Applications of Clustering

 Pattern Recognition
 Spatial Data Analysis
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data mining
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
 Land use: Identification of areas of similar land use in
an earth observation database
 Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost
 City-planning: Identifying groups of houses according
to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults

What Is Good Clustering?

 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured
by its ability to discover some or all of the hidden
patterns.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary
Data Structures

 Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed
in terms of a distance function, which is typically
metric: d(i, j)
 There is a separate “quality” function that measures
the “goodness” of a cluster.
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical,
ordinal and ratio variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.

Type of data in clustering analysis

 Interval-scaled variables:
 Binary variables:
 Nominal, ordinal, and ratio variables:
 Variables of mixed types:

Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:
$$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where $m_f = \tfrac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
 Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
 Using mean absolute deviation is more robust than using standard deviation
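
A minimal sketch of this standardization in plain Python (function and variable names are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation.

    values: raw measurements x_1f, ..., x_nf for variable f.
    Returns the z-scores z_if = (x_if - m_f) / s_f.
    """
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

# Example: z-scores for a small sample
print(standardize([2.0, 4.0, 4.0, 6.0, 9.0]))   # [-1.5, -0.5, -0.5, 0.5, 2.0]
```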
Similarity and Dissimilarity Between Objects

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular ones include the Minkowski distance:
$$d(i,j) = \sqrt[q]{\,|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\,}$$
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$

Similarity and Dissimilarity Between Objects (Cont.)

 If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{\,|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2\,}$$
 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
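
An illustrative Python sketch of these distances (names are my own):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)

print(minkowski(i, j, 1))   # q = 1: Manhattan distance, 7.0
print(minkowski(i, j, 2))   # q = 2: Euclidean distance, 5.0
```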

Binary Variables
 A contingency table for binary data:

                 Object j
               1      0      sum
Object i   1   a      b      a+b
           0   c      d      c+d
         sum  a+c    b+d      p

 Simple matching coefficient (invariant if the binary variable is symmetric):
$$d(i,j) = \frac{b + c}{a + b + c + d}$$
 Jaccard coefficient (noninvariant if the binary variable is asymmetric):
$$d(i,j) = \frac{b + c}{a + b + c}$$
Dissimilarity between Binary Variables

 Example
Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack     M       Y      N      P       N       N       N
Mary     F       Y      N      P       N       P       N
Jim      M       Y      P      N       N       N       N
 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
 m: # of matches, p: total # of variables
$$d(i,j) = \frac{p - m}{p}$$
 Method 2: use a large number of binary variables
 creating a new binary variable for each of the M nominal states

Ordinal Variables

 An ordinal variable can be discrete or continuous
 order is important, e.g., rank
 Can be treated like interval-scaled
 replace x_{if} by its rank r_{if} ∈ {1, …, M_f}
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
 compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as AeBt or Ae-Bt
 Methods:

treat them like interval-scaled variables — not a
good choice! (why?)

apply logarithmic transformation
yif = log(xif)

treat them as continuous ordinal data and treat their
ranks as interval-scaled.
Variables of Mixed Types

 A database may contain all six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
 One may use a weighted formula to combine their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
 f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, otherwise d_{ij}^{(f)} = 1
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
 compute ranks r_{if} and z_{if} = (r_{if} - 1)/(M_f - 1), and treat z_{if} as interval-scaled
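
A rough Python sketch of this weighted combination, assuming the per-variable dissimilarities d_ij^(f) and indicators δ_ij^(f) have already been computed (names and structure are illustrative):

```python
def mixed_dissimilarity(per_variable):
    """Combine per-variable dissimilarities with indicator weights.

    per_variable: list of (delta, d) pairs, one per variable f, where delta is
    0 when the comparison does not count (e.g. a missing value, or a 0-0 match
    on an asymmetric binary variable) and 1 otherwise, and d is d_ij^(f) in [0, 1].
    """
    num = sum(delta * d for delta, d in per_variable)
    den = sum(delta for delta, _ in per_variable)
    return num / den if den else 0.0

# Three variables: a nominal mismatch, a normalized interval distance, an ordinal distance
print(mixed_dissimilarity([(1, 1.0), (1, 0.25), (1, 0.5)]))  # 0.583...
```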

Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Major Clustering Approaches

 Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
 Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
 Grid-based: based on a multiple-level granularity structure
 Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data

Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids
algorithms
 k-means (MacQueen’67): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids)
(Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in 4 steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
 Assign each object to the cluster with the nearest seed point.
 Go back to Step 2; stop when no new assignments are made.
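
A minimal, illustrative k-means sketch in plain Python following these steps (2-D points, squared Euclidean distance; it seeds the clusters with k sampled points rather than an explicit initial partition, and all names are my own):

```python
import random

def kmeans(points, k, max_iter=100):
    """Naive k-means for 2-D points; returns (centroids, assignment)."""
    centroids = random.sample(points, k)              # Step 1: pick k initial seed points
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:              # Step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, assignment

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2))
```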

The K-Means Clustering Method
 Example
[Figure: four scatter plots showing successive k-means iterations: objects are assigned to the nearest seed point, cluster means are recomputed, and assignments are updated until they stabilize]

Comments on the K-Means Method
 Strength
 Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
 Weakness
 Applicable only when a mean is defined; what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
 A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with
categorical objects
 Using a frequency-based method to update modes

of clusters

A mixture of categorical and numerical data: k-
prototype method
The K-Medoids Clustering Method
 Find representative objects, called medoids, in
clusters
 PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and
iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance
of the resulting clustering

PAM works effectively for small data sets, but
does not scale well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids)
(1987)

 PAM (Kaufman and Rousseeuw, 1987), built in Splus


 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h

Then assign each non-selected object to the
most similar representative object
 repeat steps 2-3 until there is no change
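
A rough sketch of the swap evaluation at the heart of PAM, assuming a precomputed dissimilarity function over object identifiers (this is an illustration, not the Splus implementation):

```python
def total_swap_cost(dist, objects, medoids, i, h):
    """Total cost TC_ih of swapping current medoid i with non-medoid h.

    dist(a, b) is a dissimilarity over object identifiers. For each remaining
    object j, the contribution C_jih is the change in j's distance to its
    closest medoid caused by the swap.
    """
    new_medoids = (set(medoids) - {i}) | {h}
    cost = 0.0
    for j in objects:
        if j in medoids or j == h:
            continue
        before = min(dist(j, m) for m in medoids)
        after = min(dist(j, m) for m in new_medoids)
        cost += after - before          # C_jih
    return cost                          # the swap is beneficial when TC_ih < 0

# Toy usage: 1-D "objects" identified by their coordinate
objs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
d = lambda a, b: abs(a - b)
print(total_swap_cost(d, objs, medoids=[1.0, 3.0], i=3.0, h=11.0))   # negative: accept swap
```

PAM repeatedly picks the (i, h) pair with the most negative TC_ih, performs that swap, and stops when no swap improves the clustering.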
PAM Clustering: Total Swapping Cost $TC_{ih} = \sum_j C_{jih}$

[Figure: four scatter plots illustrating the four cases for the reassignment cost $C_{jih}$ of an object j when medoid i is swapped with non-medoid h (t denotes another current medoid):
 $C_{jih} = d(j, h) - d(j, i)$
 $C_{jih} = 0$
 $C_{jih} = d(j, t) - d(j, i)$
 $C_{jih} = d(j, h) - d(j, t)$]
CLARA (Clustering Large Applications)
(1990)
 CLARA (Kaufmann and Rousseeuw in 1990)

Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:

Efficiency depends on the sample size

A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased

CLARANS (“Randomized” CLARA)
(1994)
 CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
 CLARANS draws sample of neighbors dynamically
 The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)

Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Hierarchical Clustering
 Use distance matrix as clustering criteria. This
method does not require the number of clusters k
as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a, b, c, d, e into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in the reverse direction]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g.,
Splus
 Use the Single-Link method and the dissimilarity
matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing AGNES merging nearby clusters step by step]

A Dendrogram Shows How the
Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
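
A brief illustrative sketch of single-link agglomerative clustering with a dendrogram cut, assuming SciPy is available (the data is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separate groups
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# Single-link (minimum-distance) agglomerative clustering
Z = linkage(X, method='single')

# "Cut" the dendrogram: merge everything joined below distance 3
labels = fcluster(Z, t=3, criterion='distance')
print(labels)   # e.g. [1 1 1 2 2 2]
```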

DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g.,
Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
[Figure: three scatter plots showing DIANA splitting one cluster into progressively smaller clusters]

More on Hierarchical Clustering
Methods
 Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n^2), where n is the total number of objects


 can never undo what was done previously

 Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
 CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny
(SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to
the order of the data record.
Clustering Feature Vector
 Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: $\sum_{i=1}^{N} X_i$ (linear sum of the N points)
 SS: $\sum_{i=1}^{N} X_i^2$ (square sum of the N points)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

[Figure: the five points plotted on a 2-D grid]
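
A small sketch of how a CF entry could be computed and merged incrementally (illustrative code, not the BIRCH implementation; square sums are kept per dimension, matching the example above):

```python
def cf_of(points):
    """Clustering Feature (N, LS, SS) of a list of 2-D points."""
    n = len(points)
    ls = (sum(x for x, _ in points), sum(y for _, y in points))
    ss = (sum(x * x for x, _ in points), sum(y * y for _, y in points))
    return n, ls, ss

def cf_merge(cf1, cf2):
    """CFs are additive, which is what makes incremental insertion cheap."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            (ls1[0] + ls2[0], ls1[1] + ls2[1]),
            (ss1[0] + ss2[0], ss1[1] + ss2[1]))

print(cf_of([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))  # (5, (16, 30), (54, 190))
```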

CF Tree (branching factor B = 7, leaf capacity L = 6)

[Figure: a CF tree. The root and non-leaf nodes hold entries CF1, CF2, … each with a child pointer; leaf nodes hold CF entries and are chained by prev/next pointers.]

Drawbacks of Distance-Based Methods

 Drawbacks of square-error based clustering methods
 Consider only one point as representative of a cluster
 Good only for clusters that are convex-shaped and of similar size and density, and only when k can be reasonably estimated
Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Summary

Density-Based Clustering
Methods
 Clustering based on density (local cluster
criterion), such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination
condition
 Several interesting studies:

DBSCAN: Ester, et al. (KDD’96)

OPTICS: Ankerst, et al (SIGMOD’99).

DENCLUE: Hinneburg & D. Keim (KDD’98)

CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Clustering:
Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 N_Eps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
 1) p belongs to N_Eps(q)
 2) core point condition: |N_Eps(q)| >= MinPts

[Figure: p lies in the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm]
Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i

 Density-connected:
 A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
DBSCAN: Density Based Spatial
Clustering of Applications with
Noise

 Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5]

DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p wrt
Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database.
 Continue the process until all of the points have
been processed.
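
A compact, illustrative DBSCAN sketch in plain Python (2-D points, Euclidean distance; parameter values and names are my own):

```python
def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (1, 2, ...) or -1 for noise."""
    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1                      # p is a core point: a new cluster is formed
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=3))   # [1, 1, 1, 1, 2, 2, 2, -1]
```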
Summary

 Cluster analysis groups objects based on their similarity and has wide applications
 Measure of similarity can be computed for various
types of data
 Clustering algorithms can be categorized into
partitioning methods, hierarchical methods, density-
based methods, grid-based methods, and model-
based methods
 Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
 There are still lots of research issues on cluster
analysis, such as constraint-based clustering
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to
identify the clustering structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World
Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer systems, 13,
1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very
large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to
spatial data mining. VLDB'97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering
method for very large databases. SIGMOD'96.
http://www.cs.sfu.ca/~han

Thank you !!!

