Lecture 8
Clustering
Dr. Amr El-Wakeel
Lane Department of Computer
Science and Electrical Engineering
Spring 24
Clustering
The Problem of Clustering
• Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that
– Members of a cluster are close/similar to each other
– Members of different clusters are dissimilar
• Usually:
– Points are in a high-dimensional space
– Similarity is defined using a distance measure
• Euclidean, Cosine, Jaccard, edit distance, …
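For concreteness, here is a minimal Python sketch of three of the distance measures named above (edit distance is omitted for brevity); the function names are illustrative, not from any particular library.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two numeric vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # 1 - cosine similarity; small when the vectors point the same way
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norm

def jaccard_distance(s, t):
    # 1 - |intersection| / |union|; defined on sets
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)

print(euclidean([0, 0], [3, 4]))               # 5.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```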
Example: Clusters & Outliers
[Figure: scatter plot of points forming several clusters, with one cluster and one outlier point labeled]
Clustering is a hard problem!
Why is it hard?
• Clustering in two dimensions looks easy
• Clustering small amounts of data looks easy
• And in most cases, looks are not deceiving
Clustering Problem: Galaxies
• A catalog of 2 billion “sky objects” represents
objects by their radiation in 7 dimensions
(frequency bands)
• Problem: Cluster into similar objects, e.g.,
galaxies, nearby stars, quasars, etc.
• Sloan Digital Sky Survey
Clustering Problem: Music CDs
• Intuitively: Music divides into categories, and
customers prefer a few categories
– But what are categories really?
Clustering Problem: Documents
Finding topics:
• Represent a document by a vector
(x1, x2, …, xk), where xi = 1 iff the i-th word
(in some order) appears in the document
– It actually doesn’t matter if k is infinite; i.e., we
don’t limit the set of words
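A minimal sketch of this 0/1 word-presence representation, assuming the vocabulary is simply the set of words seen in the documents (so k is not fixed in advance):

```python
def word_vector(document, vocabulary):
    # x_i = 1 iff the i-th word of the vocabulary appears in the document
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

docs = ["the cat sat on the mat", "dogs chase the cat"]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [word_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```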
Overview: Methods of Clustering
• Hierarchical:
– Agglomerative (bottom up):
• Initially, each point is a cluster
• Repeatedly combine the two
“nearest” clusters into one
– Divisive (top down):
• Start with one cluster and recursively split it
• Point assignment:
– Maintain a set of clusters
– Points belong to “nearest” cluster
Hierarchical Clustering
• Key operation:
Repeatedly combine
two nearest clusters
Data: o … data point, x … centroid
[Figure: data points merged step by step around centroids, with the resulting dendrogram]
And in the Non-Euclidean Case?
• The only “locations” we can talk about are the
points themselves
– i.e., there is no “average” of two points
• Approach 1:
– (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
– (2) How do you determine the “nearness” of
clusters? Treat the clustroid as if it were the centroid
when computing inter-cluster distances
“Closest” Point?
• (1) How to represent a cluster of many points?
clustroid = point “closest” to other points
• Possible meanings of “closest”:
– Smallest maximum distance to other points
– Smallest average distance to other points
– Smallest sum of squares of distances to other points
• For a distance metric d, the clustroid c of cluster C is the point minimizing the sum of squared distances to the other points:
c = argmin_{c ∈ C} Σ_{x ∈ C} d(x, c)²
[Figure: a cluster with its clustroid (an actual data point) and its centroid (an artificial point)]
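A small sketch of this sum-of-squares choice: the clustroid is the cluster member minimizing the summed squared distance to the other points. The distance function d is assumed to be supplied by the caller.

```python
def clustroid(cluster, d):
    # Point of the cluster minimizing the sum of squared distances
    # to all other points in the cluster.
    return min(cluster, key=lambda c: sum(d(x, c) ** 2 for x in cluster))

# Example with 1-D points and absolute difference as the metric
points = [1.0, 2.0, 3.0, 10.0]
print(clustroid(points, lambda a, b: abs(a - b)))  # 3.0
```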
Cohesion
• Approach 3.1: Use the diameter of the
merged cluster = maximum distance between
points in the cluster
• Approach 3.2: Use the average distance
between points in the cluster
• Approach 3.3: Use a density-based approach
– Take, e.g., the diameter or average distance and divide
it by the number of points in the cluster
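A hedged sketch of these cohesion measures; d is any distance function, and the density variant simply divides a cohesion measure by the cluster size as in Approach 3.3.

```python
from itertools import combinations

def diameter(cluster, d):
    # Approach 3.1: maximum pairwise distance within the cluster
    return max(d(x, y) for x, y in combinations(cluster, 2))

def average_distance(cluster, d):
    # Approach 3.2: mean pairwise distance within the cluster
    pairs = list(combinations(cluster, 2))
    return sum(d(x, y) for x, y in pairs) / len(pairs)

def density(cluster, d):
    # Approach 3.3: a cohesion measure divided by the cluster size
    return diameter(cluster, d) / len(cluster)
```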
Implementation
• Naïve implementation of hierarchical
clustering:
– At each step, compute pairwise distances
between all pairs of clusters, then merge
– O(N³)
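A minimal sketch of this naïve agglomerative procedure, assuming Euclidean points and centroid distance between clusters (other inter-cluster distances work the same way). Recomputing all O(N²) pairwise distances at each of the O(N) merge steps is what gives the O(N³) cost.

```python
from itertools import combinations

def centroid(cluster):
    # Component-wise mean of the points in a cluster
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def naive_agglomerative(points, k):
    # Start with every point in its own cluster, then repeatedly
    # merge the two clusters with the nearest centroids.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(centroid(clusters[ij[0]]),
                                       centroid(clusters[ij[1]])))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

print(naive_agglomerative([(0, 0), (0, 1), (5, 5), (6, 5)], 2))
```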
Populating Clusters
• 1) For each point, place it in the cluster with the
nearest current centroid
[Figure: data points (x) assigned to the nearest centroid; clusters after round 1]
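A minimal k-means sketch corresponding to this assignment step; the centroid update and the repeated rounds (illustrated in the following figures) are included to make the loop complete. Euclidean distance and a fixed number of rounds are assumptions.

```python
import random

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def k_means(points, k, rounds=10):
    # Pick k initial centroids, then alternate assignment and update.
    centroids = random.sample(points, k)
    for _ in range(rounds):
        # 1) Assign each point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 2) Recompute each centroid as the mean of its cluster
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters
```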
Example: Assigning Clusters
[Figure: clusters after round 2]
Example: Assigning Clusters
[Figure: clusters at the end]
Getting the k right
How to select k?
• Try different k, looking at the change in the
average distance to centroid as k increases
• The average falls rapidly until the right k, then changes
little
[Figure: average distance to centroid vs. k; the curve falls steeply and then flattens near the best value of k]
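A hedged sketch of this heuristic, reusing the k_means and dist helpers from the earlier sketch: run clustering for increasing k, compute the average distance to the assigned centroid, and read the elbow off the resulting curve.

```python
def average_distance_to_centroid(points, k):
    # Average distance of every point to its assigned centroid
    centroids, clusters = k_means(points, k)
    total = sum(dist(p, centroids[i])
                for i, cl in enumerate(clusters) for p in cl)
    return total / len(points)

def pick_k(points, max_k=10):
    # Print the average for each k; the "right" k is where the
    # curve stops falling rapidly (read off by eye or a threshold).
    for k in range(1, max_k + 1):
        print(k, average_distance_to_centroid(points, k))
```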
Example: Picking k
Too few; many long distances to centroid.
[Figure: the scatter plot partitioned into too few clusters]
Example: Picking k
Just right; distances rather short.
[Figure: the scatter plot partitioned into the right number of clusters]
Example: Picking k
Too many; little improvement in average distance.
[Figure: the scatter plot partitioned into too many clusters]
The BFR Algorithm
[Figure: compressed sets, whose points are in the CS, and a cluster whose points are all in the DS, shown with its centroid]
Summarizing Points: Comments
• 2d + 1 values represent any size cluster
– d = number of dimensions
• Average in each dimension (the centroid)
can be calculated as SUMi / N
– SUMi = ith component of SUM
• Variance of a cluster’s discard set in dimension
i is: (SUMSQi / N) – (SUMi / N)²
– And standard deviation is the square root of that
• Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require
storing a full covariance matrix to summarize the cluster. So, instead of
SUMSQ being a d-dimensional vector, it would be a d × d matrix, which is too big!
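A minimal sketch of the 2d + 1 summary (N, SUM, SUMSQ) and the centroid and variance formulas above; the class name is illustrative.

```python
class ClusterSummary:
    """Summarize a cluster with 2d + 1 values: N, SUM, SUMSQ."""

    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d      # component-wise sum of the points
        self.sumsq = [0.0] * d    # component-wise sum of squares

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        # Average in each dimension: SUM_i / N
        return [s / self.n for s in self.sum]

    def variance(self):
        # Variance in dimension i: SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]
```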
The “Memory-Load” of Points
Processing the “Memory-Load” of points (1):
• 1) Find those points that are “sufficiently
close” to a cluster centroid and add those
points to that cluster and the DS
– These points are so close to the centroid that
they can be summarized and then discarded
• 2) Use any main-memory clustering algorithm
to cluster the remaining points and the old RS
– Clusters go to the CS; outlying points to the RS
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
The “Memory-Load” of Points
Processing the “Memory-Load” of points (2):
• 3) DS set: Adjust statistics of the clusters to
account for the new points
– Add Ns, SUMs, SUMSQs
How Close is Close Enough?
• Q1) We need a way to decide whether to put
a new point into a cluster (and discard)
Mahalanobis Distance
• Normalized Euclidean distance from centroid
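A sketch of this normalized distance under the axis-aligned assumption noted earlier: each coordinate difference is divided by the cluster’s standard deviation in that dimension, which BFR can read off the N/SUM/SUMSQ summary. The threshold mentioned in the comment is a common rule of thumb, not something stated on this slide.

```python
def mahalanobis(point, centroid, std):
    # Normalized Euclidean distance: each dimension is scaled by the
    # cluster's standard deviation in that dimension before summing.
    return sum(((x - c) / s) ** 2
               for x, c, s in zip(point, centroid, std)) ** 0.5

# A point is often treated as "close enough" to add to the cluster (and
# discard) when this distance is within a small multiple of the standard
# deviation, e.g. 2-3 (a common choice, assumed here).
```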
Should 2 CS clusters be combined?
The CURE Algorithm
[Figure: scatter plot of salary vs. age, with points labeled h and e]
Starting CURE
Two-pass algorithm. Pass 1:
• 0) Pick a random sample of points that fit in
main memory
• 1) Initial clusters:
– Cluster these points hierarchically – group
nearest points/clusters
• 2) Pick representative points:
– For each cluster, pick a sample of points, as
dispersed as possible
– From the sample, pick representatives by moving
them (say) 20% toward the centroid of the cluster
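A hedged sketch of step 2: the “as dispersed as possible” sampling is done here with a greedy farthest-point heuristic (an assumption; the slide does not fix the method), and each chosen point is then moved 20% of the way toward the centroid.

```python
def pick_representatives(cluster, centroid, d, num_reps=4, shrink=0.2):
    # Greedily pick points far from those already chosen ("as dispersed
    # as possible"), then move each one 20% of the way to the centroid.
    reps = [max(cluster, key=lambda p: d(p, centroid))]
    while len(reps) < min(num_reps, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(d(p, r) for r in reps)))
    return [tuple(x + shrink * (c - x) for x, c in zip(p, centroid))
            for p in reps]
```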
Example: Initial Clusters
[Figure: the salary vs. age points grouped into initial clusters]
Example: Pick Dispersed Points
Pick (say) 4 remote points for each cluster.
[Figure: dispersed representative points highlighted within each cluster of the salary vs. age plot]
Example: Pick Dispersed Points
Move points (say) 20% toward the centroid.
[Figure: the chosen representative points moved toward each cluster’s centroid]
Finishing CURE
Pass 2:
• Now, rescan the whole dataset and
visit each point p
• Place it in the “closest cluster”
– Normal definition of “closest”:
Find the closest representative to p and
assign it to that representative’s cluster
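A minimal sketch of this assignment rule, assuming each cluster is given by the list of (moved) representative points produced in Pass 1 and d is a distance function.

```python
def assign(point, clusters_reps, d):
    # clusters_reps: one list of representative points per cluster.
    # Return the index of the cluster whose closest representative
    # is nearest to the point.
    return min(range(len(clusters_reps)),
               key=lambda i: min(d(point, r) for r in clusters_reps[i]))
```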
Summary
• Clustering: Given a set of points, with a notion
of distance between points, group the points
into some number of clusters
• Algorithms:
– Agglomerative hierarchical clustering:
• Centroid and clustroid
– k-means:
• Initialization, picking k
– BFR
– CURE