DMDW UNIT V
CLUSTER ANALYSIS
Cluster Analysis: Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?, Different Types of Clustering, Different Types of Clusters; K-means: The Basic K-means Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses; Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering Algorithm; DBSCAN: Traditional Density Center-Based Approach, DBSCAN Algorithm, Strengths and Weaknesses. (Tan & Vipin)
CLUSTERING
What is a Clustering?
• In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups
(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.)
Applications of Cluster Analysis
• Understanding: e.g., group stocks with similar price fluctuations

Discovered Clusters | Industry Group
3: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN | Financial-DOWN
4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP | Oil-UP
• Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia)
Early applications of cluster analysis
• John Snow, London 1854
Notion of a Cluster Can Be Ambiguous
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
(Figure: hierarchical and partitional clusterings of points p1–p4, with the corresponding dendrograms.)
Other types of clustering
• Exclusive (or non-overlapping) versus non-exclusive (or overlapping)
• In non-exclusive clusterings, points may belong to multiple clusters, e.g., points that belong to multiple classes, or ‘border’ points
Types of Clusters: Well-Separated
• A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
(Figure: well-separated clusters.)
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
• The center of a cluster is often a centroid, the minimizer of the distances from all the points in the cluster, or a medoid, the most “representative” point of a cluster
(Figure: center-based clusters.)
Types of Clusters: Contiguity-Based
• Contiguous cluster (nearest neighbor or transitive)
• A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
(Figure: contiguous clusters.)
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points, separated from other regions of high density by regions of low density.
• Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: density-based clusters.)
Types of Clusters: Conceptual Clusters
• Shared property or conceptual clusters
• Finds clusters that share some common property or represent a particular concept.
(Figure: overlapping circles.)
Types of Clusters: Objective Function
• Clustering as an optimization problem
• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and evaluate the “goodness” of each potential set of clusters using the given objective function. (NP-hard)
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data to a parameterized model.
• The parameters for the model are determined from the data, and they determine the clustering
• E.g., mixture models assume that the data is a “mixture” of a number of statistical distributions.
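The infeasibility of the enumerate-everything approach is easy to see: the number of ways to partition n points (the Bell number) explodes combinatorially. A minimal sketch that enumerates every partition (the function name is illustrative):

```python
def partitions(points):
    """Yield every way of dividing `points` into non-empty clusters."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for part in partitions(rest):
        # place `first` into each existing cluster in turn...
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        # ...or start a new cluster containing only it
        yield part + [[first]]

# The count of partitions is the Bell number, which grows explosively:
# 15 for 4 points, 52 for 5, and already 115975 for just 10 points.
print(sum(1 for _ in partitions(list(range(4)))))  # prints 15
```

Evaluating an objective over every one of these sets of clusters is hopeless for realistic n, which is why practical algorithms search only a tiny fraction of the space.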
Algorithms covered in this unit: K-means, hierarchical clustering, and DBSCAN.
K-MEANS
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The objective is to minimize the sum of distances of the points to their respective centroids
K-means Clustering
• Problem: Given a set X of n points in a d-dimensional space and an integer K, group the points into K clusters C = {C_1, C_2, …, C_k} such that

    Cost = Σ_{i=1..k} Σ_{x∈C_i} dist(x, c_i)

is minimized, where c_i is the centroid of the points in cluster C_i.
K-means Clustering
• The most common definition uses Euclidean distance, minimizing the Sum of Squared Errors (SSE); sometimes K-means is defined directly by this objective:

    Cost (SSE) = Σ_{i=1..k} Σ_{x∈C_i} dist(x, c_i)^2
(Figure: K-means on a sample 2-D data set: the original points and iterations 1–6, showing the centroids moving until the cluster assignments stabilize.)
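The loop sketched in the figure (assign each point to its nearest centroid, recompute the centroids, repeat until nothing changes) can be written with NumPy; the initialization scheme and names here are illustrative, not the textbook's exact pseudocode:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means on an (n, d) array X; returns (labels, centroids, sse)."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):   # converged: assignments stable
            break
        centroids = new
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

# Two well-separated blobs should be recovered as two clusters.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
labels, cents, sse = kmeans(X, k=2)
```

Running it several times with different seeds and keeping the smallest-SSE result guards against bad initializations.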
Dealing with Initialization
• Do multiple runs with different random initializations and select the clustering with the smallest error
Hierarchical clustering
• Produces nested clusters
• Can be visualized as a dendrogram
• Can be either:
- Agglomerative (bottom up): initially, each point is a cluster; repeatedly combine the two “nearest” clusters into one
- Divisive (top down): start with one cluster and recursively split
Advantages of Hierarchical Clustering
● Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
● No random component (clusters will be the same from run to run)
● Clusters may correspond to meaningful taxonomies
– Especially in the biological sciences (e.g., phylogeny reconstruction)
Linkages
● Linkage: measure of dissimilarity
between clusters
● Many methods:
– Single linkage
– Complete linkage
– Average linkage
– Centroids
– Ward’s method
Single linkage (aka nearest neighbor)
● Proximity of two clusters is based on the two closest points in the different clusters
● Proximity is determined by one pair of points (i.e., one link)
● Can handle non-elliptical shapes
● Sensitive to noise and outliers
Complete linkage
● Proximity of two clusters is based on the two
most distant points in the different clusters
● Less susceptible to noise and outliers
● May break large clusters
● Biased toward globular clusters
Average linkage
● Proximity of two clusters is the average of the pairwise proximities between the points in the two clusters
● Less susceptible to noise and outliers
● Biased towards globular clusters
Ward’s method
● Similarity of two clusters is based on the increase in squared error when the two clusters are merged
● Similar to group average if the distance between points is distance squared
Agglomerative clustering exercise
● How do clusters change with different linkage methods?
∙ Single
(Figure: single-link clustering of six points and the resulting dendrogram.)
Agglomerative clustering exercise
● How do clusters change with different
linkage methods?
∙ Complete
(Figure: complete-link clustering of the same six points and the resulting dendrogram.)
Agglomerative clustering exercise
● How do clusters change with different
linkage methods?
∙ Average
(Figure: average-link clustering of the same six points and the resulting dendrogram.)
Linkage Comparison
(Figure: side-by-side comparison of the single, complete, average, and Ward’s method clusterings of the six points.)
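The agglomerative procedure compared above can be sketched in a few lines of Python; the `linkage` parameter switches between single (closest pair) and complete (farthest pair) linkage, and the point values are illustrative:

```python
import math

def agglomerative(points, k, linkage="single"):
    """Repeatedly merge the two 'nearest' clusters until only k remain."""
    clusters = [[p] for p in points]           # start: every point is its own cluster
    agg = min if linkage == "single" else max  # single = closest pair, complete = farthest

    def cluster_dist(a, b):
        return agg(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)         # merge cluster j into cluster i
    return clusters

clusters = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], k=3)
```

Average linkage and Ward's method would only change `cluster_dist`; the merge loop itself is the same for every linkage.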
DBSCAN ALGORITHM:
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrarily shaped clusters and clusters with noise (i.e., outliers).
Density-based clustering
● Assumes clusters are areas of high density separated
by areas of low density
● Core points are in areas of a certain density (at least n points within radius r of the core point)
● Border points aren’t core points, but are within r of a core point
● Noise points are all other points
(Figure: core, border, and noise points for n = 7 and radius r.)
DBSCAN Algorithm
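Using the core/border/noise definitions above, the procedure can be sketched as follows; `eps` and `min_pts` correspond to the radius r and the threshold n, and all names are illustrative rather than the textbook's exact pseudocode:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # precompute each point's eps-neighborhood (includes the point itself)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n                        # None = not yet visited
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1                     # noise, unless later reached from a core point
            continue
        labels[i] = cid                        # i is a core point: grow a new cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cid                # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neighbors[j]) >= min_pts:   # j is core too: expand through it
                frontier.extend(neighbors[j])
        cid += 1
    return labels

# Two dense blobs and one isolated outlier.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Note that border points are absorbed into a neighboring cluster but never expand it, which is what lets DBSCAN trace arbitrarily shaped dense regions while leaving outliers labeled as noise.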