

UNIT V
CLUSTER ANALYSIS
Cluster Analysis: Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters;
K-means: The Basic K-means Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses;
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering Algorithm;
DBSCAN: Traditional Density, Center-Based Approach, DBSCAN Algorithm, Strengths and Weaknesses. (Tan & Vipin)
CLUSTERING
What is a Clustering?
• In general, a clustering is a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups

(figure: a good clustering keeps intra-cluster distances small and inter-cluster distances large)
Applications of Cluster Analysis
• Understanding
• Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
• Reduce the size of large data sets (e.g., clustering precipitation in Australia)

Discovered Clusters and Industry Group (stock example):
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Early applications of cluster
analysis
• John Snow, London 1854 (mapping the Broad Street cholera outbreak)
Notion of a Cluster can be Ambiguous
• How many clusters? The same set of points can reasonably be grouped into two, four, or six clusters (figure).

Types of Clusterings
• A clustering is a set of clusters
• An important distinction is between hierarchical and partitional clusterings
• Partitional Clustering
• A division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical Clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering
(figure: the original points and one partitional clustering of them)

Hierarchical Clustering
(figure: a traditional nested-cluster view of points p1–p4 and the corresponding dendrogram)
Other types of clusterings
• Exclusive (non-overlapping) versus non-exclusive (overlapping)
• In non-exclusive clusterings, points may belong to multiple clusters, e.g., ‘border’ points that lie between classes
• Fuzzy (soft) versus non-fuzzy (hard)
• In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
• The weights usually must sum to 1 (and are often interpreted as probabilities)
• Partial versus complete
• In some cases, we only want to cluster some of the data
Types of Clusters: Well-Separated
• Well-Separated Clusters:
• A cluster is a set of points such that any point in a
cluster is closer (or more similar) to every other point
in the cluster than to any point not in the cluster.

well-separated clusters
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster
• The center of a cluster is often a centroid – the point (such as the mean) that minimizes the distances to all the points in the cluster – or a medoid, the most “representative” actual point of the cluster
center-based clusters
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
• A cluster is a set of points such that a point in a
cluster is closer (or more similar) to one or more
other points in the cluster than to any point not in
the cluster.
contiguous clusters
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points that is separated from other regions of high density by regions of low density.
• Used when the clusters are irregular or intertwined,
and when noise and outliers are present.

density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
• Finds clusters that share some common property or represent a particular concept.
(figure: overlapping circles of points)
Types of Clusters: Objective Function
• Clustering as an optimization problem
• Finds clusters that minimize or maximize an objective function.
• In principle, enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters using the given objective function (NP-hard).
• Objectives can be global or local.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global-objective approach is to fit the data to a parameterized model.
• The parameters of the model are determined from the data, and they in turn determine the clustering
• E.g., mixture models assume that the data is a ‘mixture’ of a number of statistical distributions.
Typical workflow for cluster analysis
(figure)
Clustering Algorithms
•K-means and its variants

•Hierarchical clustering

•DBSCAN
K-MEANS
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The objective is to minimize the sum of distances of the points to their respective centroids
K-means Clustering
• Problem: Given a set X of n points in a d-dimensional space and an integer K, group the points into K clusters C = {C_1, C_2, ..., C_K} such that

  Cost(C) = \sum_{i=1}^{K} \sum_{x \in C_i} dist(x, c_i)

is minimized, where c_i is the centroid of the points in cluster C_i.
• The most common definition uses Euclidean distance, minimizing the Sum of Squared Errors (SSE); K-means is sometimes defined directly this way: group the points into K clusters such that

  SSE(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2

is minimized, where c_i is the mean of the points in cluster C_i. (A short sketch of computing this cost follows below.)
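As a hedged illustration (not part of the original slides), the SSE objective can be computed with NumPy as below; the array names X, labels, and centroids are assumptions made for the example.

```python
import numpy as np

def kmeans_sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    # X: (n, d) data matrix; labels: (n,) cluster index per point; centroids: (K, d)
    diffs = X - centroids[labels]      # vector from each point to the centroid of its cluster
    return float(np.sum(diffs ** 2))   # SSE = sum_i sum_{x in C_i} ||x - c_i||^2
```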
Complexity of the K-means Problem
• NP-hard if the dimensionality of the data is at least 2 (d >= 2)
• Finding the optimal solution in polynomial time is infeasible
• For d = 1 the problem is solvable in polynomial time (by dynamic programming over the sorted points)
• A simple iterative algorithm nevertheless works quite well in practice
K-means Algorithm
• Also known as Lloyd’s algorithm
• “K-means” is sometimes used as a synonym for this particular algorithm
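A minimal sketch of Lloyd’s algorithm in Python follows (an illustration, not the textbook’s code); it assumes Euclidean distance and initializes the centroids by sampling K of the data points at random.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal Lloyd's algorithm: assign points to nearest centroid, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(max_iters):
        # assignment step: each point goes to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):               # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids
```

Running this sketch on a small 2-D data set with different seeds, and comparing the resulting kmeans_sse values, reproduces the sensitivity to initialization discussed on the next slides.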
K-means Algorithm –
Initialization
• Initial centroids are often chosen
randomly.
• Clusters produced vary from one run to
another.
Two different K-means Clusterings
(figure: the same set of original points, plotted in the x–y plane, clustered twice by K-means; one run produces the optimal clustering and the other a sub-optimal clustering)
Importance of Choosing Initial Centroids
(figures: iterations 1–6 of K-means on the example data, plotted in the x–y plane, showing how the centroids and cluster assignments change from one iteration to the next)
Importance of Choosing Initial Centroids …
(figures: iterations 1–5 of K-means on the same data but starting from a different set of initial centroids; the poorer starting positions lead to a different, sub-optimal final clustering)
Dealing with Initialization
• Do multiple runs and select the clustering with the smallest error (a small sketch of this follows below)
• Select the initial set of points by a method other than uniform random sampling, e.g., pick points that are far apart from each other as the initial centers (the idea behind the K-means++ algorithm)
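A minimal sketch of the multiple-runs idea, reusing the kmeans and kmeans_sse sketches above (the run count of 10 is an arbitrary illustrative choice):

```python
def best_of_n_runs(X, K, n_runs=10):
    """Run K-means several times with different seeds and keep the lowest-SSE clustering."""
    best = None
    for seed in range(n_runs):
        labels, centroids = kmeans(X, K, seed=seed)   # sketch defined earlier
        sse = kmeans_sse(X, labels, centroids)        # error of this particular run
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best                                       # (smallest SSE, its labels, its centroids)
```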
K-means Algorithm – Centroids
• The centroid depends on the distance function
• It is the minimizer of the chosen distance function over the cluster
• ‘Closeness’ can be measured by Euclidean distance (SSE), cosine similarity, correlation, etc.
• Centroid:
• The mean of the points in the cluster for SSE and for cosine similarity
• The median of the points for Manhattan distance
• Finding the centroid is not always easy
• It can be an NP-hard problem for some distance functions, e.g., the median over multiple dimensions
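A small worked example (with illustrative values) of how the notion of “centroid” changes with the distance function: the coordinate-wise mean minimizes the SSE, while the coordinate-wise median minimizes the sum of Manhattan distances.

```python
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])    # three points on a line
mean_centroid = pts.mean(axis=0)        # [3.67, 0.0] -> minimizes sum of squared Euclidean distances
median_centroid = np.median(pts, axis=0)  # [1.0, 0.0] -> minimizes sum of Manhattan distances
```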
K-means Algorithm – Convergence
•K-means will converge for common
similarity measures mentioned above.
• Most of the convergence happens in the first few
iterations.
• Often the stopping condition is changed
to ‘Until relatively few points change
clusters’
•Complexity is O( n * K * I * d )
• n = number of points, K = number of
clusters, I = number of iterations, d =
dimensionality
•In general a fast and efficient algorithm
Limitations of K-means
•K-means has problems when clusters
are of different
• Sizes
• Densities
• Non-globular shapes

•K-means has problems when the data


contains outliers.
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use a larger number of clusters: K-means then finds parts of the natural clusters, which have to be put back together afterwards.
Overcoming K-means Limitations

Original Points K-means Clusters


Overcoming K-means Limitations

Original Points K-means Clusters


Variations
• K-medoids: the same problem definition as in K-means, but the center of each cluster is defined to be one of the points in the cluster (the medoid).
• K-centers: the same problem definition as in K-means, but the goal now is to minimize the maximum diameter of the clusters (the diameter of a cluster is the maximum distance between any two points in the cluster).
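As a hedged illustration of the K-medoids idea (not a full K-medoids algorithm), the medoid of a cluster can be computed as the member point with the smallest total distance to the other members:

```python
import numpy as np

def medoid(points):
    """Return the cluster member with the smallest total Euclidean distance to the other members."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # pairwise distances
    return points[dists.sum(axis=1).argmin()]
```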
Bisecting K-means
● Combines K-means and hierarchical clustering
● Clusters are iteratively split in two via regular K-means with K = 2
● Stops when the desired number of clusters is reached (a minimal sketch follows below)
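A minimal sketch of bisecting K-means, reusing the kmeans sketch above; choosing the largest-SSE cluster to split is one common selection rule and is an assumption here, not something stated on the slide.

```python
import numpy as np

def bisecting_kmeans(X, K):
    """Repeatedly split one cluster in two with K-means (K=2) until K clusters exist."""
    clusters = [np.arange(len(X))]                     # start with every point in a single cluster
    while len(clusters) < K:
        # pick the cluster with the largest SSE to split (one common criterion)
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        labels, _ = kmeans(X[idx], 2)                  # ordinary K-means with K = 2 (sketch above)
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters                                    # list of index arrays, one per cluster
```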

Hierarchical clustering
Produces nested clusters; can be visualized as a dendrogram.
Can be either:
- Agglomerative (bottom up): initially, each point is a cluster; repeatedly combine the two “nearest” clusters into one
- Divisive (top down): start with one cluster containing all the points and recursively split
Advantages of Hierarchical Clustering
● Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
● No random component (the clusters are the same from run to run)
● Clusters may correspond to meaningful taxonomies
– Especially in the biological sciences (e.g., phylogeny reconstruction)

Agglomerative Clustering Algorithm
● Most popular hierarchical clustering technique
● Basic algorithm:
1) Compute the proximity matrix
2) Let each data point be a cluster
3) Repeat
4) Merge the two closest clusters
5) Update the proximity matrix
6) Until only a single cluster remains
● The key operation is the computation of the proximity between two clusters
– Different approaches to defining this distance distinguish the different algorithms (a minimal sketch of the procedure follows below)
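A minimal sketch of the basic agglomerative procedure with single-link proximity (illustrative and deliberately unoptimized; it stops once the requested number of clusters remains rather than merging all the way down to one cluster):

```python
import numpy as np

def agglomerative_single_link(X, num_clusters):
    """Basic agglomerative clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(X))]                    # each data point starts as a cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # proximity (distance) matrix
    while len(clusters) > num_clusters:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance between the closest pair of points across the two clusters
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]                # merge the two closest clusters
        del clusters[b]
    return clusters
```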
Divisive Clustering Algorithm
● Minimum spanning tree (MST)
– Start with the tree consisting of a single point
– In successive steps, look for the closest pair of points (p, q) such that p is in the tree but q is not
– Add q to the tree (add the edge between p and q)
– Once the MST is built, the divisive hierarchy is obtained by repeatedly breaking the largest remaining edge, splitting one cluster into two at each step

Linkages
● Linkage: measure of dissimilarity
between clusters
● Many methods:
– Single linkage
– Complete linkage
– Average linkage
– Centroids
– Ward’s method
Single linkage (aka nearest neighbor)
● Proximity of two clusters is based on the two closest points in the different clusters
● Proximity is determined by a single pair of points (i.e., one link)
● Can handle non-elliptical shapes
● Sensitive to noise and outliers
Complete linkage
● Proximity of two clusters is based on the two
most distant points in the different clusters
● Less susceptible to noise and outliers
● May break large clusters
● Biased toward globular clusters
Average linkage

● Proximity of two clusters is the


average of pairwise proximity
between points in the clusters
● Less susceptible to noise and outliers
● Biased towards globular clusters
Ward’s method
● Similarity of two clusters is based on the increase in squared error (SSE) when the two clusters are merged
● Similar to group average if the distance between points is the squared distance
● Less susceptible to noise and outliers
● Biased towards globular clusters
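As a hedged illustration (assuming SciPy is available), the different linkage methods can be compared on the same data with scipy.cluster.hierarchy; the random 2-D data and the cut into 4 clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((30, 2))          # illustrative 2-D points

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # full merge history (dendrogram) for this linkage
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut the dendrogram into 4 flat clusters
    print(method, labels)                             # cluster memberships differ between linkages
```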
Agglomerative clustering exercise
● How do the clusters change with different linkage methods?
(figures: the same six labelled points clustered with single, complete, and average linkage; the order in which clusters are merged differs between methods)

Linkage Comparison
(figure: side-by-side results of single linkage, complete linkage, average linkage, and Ward’s method on the six example points)
DBSCAN ALGORITHM:
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrarily shaped clusters, and it can do so in data that contains noise (i.e., outliers).

Density-based clustering
● Assumes clusters are areas of high density separated by areas of low density
● Core points lie in areas of sufficient density (at least n points within radius r of the point)
● Border points are not core points, but lie within r of a core point
● Noise points are all remaining points
(figure: core, border, and noise points illustrated for n = 7 and radius r)
DBSCAN Algorithm
● Label each point as a core, border, or noise point
● Eliminate the noise points
● Connect all pairs of core points that lie within distance r of each other
● Make each connected group of core points into a separate cluster
● Assign each border point to one of the clusters of its associated core points
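As a hedged illustration (assuming scikit-learn is available), DBSCAN can be run as below; eps plays the role of the radius r and min_samples the role of n from the slides, and the particular values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))             # illustrative 2-D data

labels = DBSCAN(eps=0.1, min_samples=7).fit_predict(X)    # eps ~ r, min_samples ~ n
# points labelled -1 are noise; every other label is a cluster id found by the algorithm
print("clusters:", len(set(labels) - {-1}), "noise points:", int(np.sum(labels == -1)))
```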
DBSCAN Advantages & Limitations
● Advantages:
● Resistant to noise
● Can handle clusters of different shapes and sizes
● The number of clusters is determined by the algorithm itself
(figure: original data and the resulting DBSCAN clustering)
● Limitations:
● Struggles to identify clusters of varying densities – the clustering is often incomplete because points in low-density regions are treated as noise and ignored
● Density can be difficult/expensive to compute in high-dimensional datasets
