Machine Learning
Lesson 10: Clustering
LIU Siyuan
School of Computer Science and Engineering,
NTU, Singapore
[email protected]
Office Hour: Tuesday & Thursday 3-4pm
Office Location: N4-02C-117a
Purpose of Machine Learning
Pattern Discovery
Imagine you are an alien from another planet. You
watch a soccer match. What do you observe?
Two groups of humans. One ball.
Behavior change when the ball goes into the net.
Supervised Learning
The examples presented to computers are pairs of
inputs and the corresponding outputs; the goal is to
“learn” a mapping or model from inputs to labels
Labeled training data
label = 𝑓(input)
Unsupervised Learning
The examples presented to computers are a set of
inputs without any outputs; the goal is to “learn”
an intrinsic structure of the examples, e.g., clusters
of examples, density of the examples
Unlabeled training data
What is Cluster Analysis?
Finding groups of data instances such that the data
instances in a group are
similar to one another (high intra-cluster similarity)
different from data instances in other groups
An example of clustering: grouping faces for different persons
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters
Partitional clustering
Divide data instances into non-overlapping
subsets (clusters) so that each data instance is in
exactly one subset
Hierarchical clustering
A set of nested clusters organized as a
hierarchical tree
Partitional Clustering
[Figure: a set of points p1–p4 and a partitional clustering of them into non-overlapping clusters]
Clustering Algorithms
𝐾-means and its variants
Hierarchical clustering
𝑲-means Clustering
Partitional clustering approach
Number of clusters, 𝐾, must be specified
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Basic algorithm:
1. Select 𝐾 data instances as the initial centroids
2. Repeat
3. Form 𝐾 clusters by assigning all data instances to the
closest centroid
4. Recompute the centroid of each cluster
5. Until The centroids do not change
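A minimal Python sketch of the basic algorithm above (the function name, the use of NumPy, and the choice of Euclidean distance as the proximity measure are illustrative assumptions, not part of the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: pick K data instances as centroids, then alternate
    assignment and centroid recomputation until the centroids stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(max_iter):
        # step 3: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):                   # step 5: converged
            break
        centroids = new_centroids
    return centroids, labels
```

For example, `centroids, labels = kmeans(X, k=3)` clusters a 2-D data matrix `X` into three clusters.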
𝑲-means Clustering – Details
Initial centroids are often chosen randomly
The centroid is (typically) the mean of the data
instances in the cluster
‘Closeness’ is measured by a proximity measure, e.g., Euclidean distance
𝑲-means Illustration
[Figure: K-means on sample 2-D data, showing the points, cluster assignments, and centroid positions over iterations 1–6]
𝑲-means Illustration (cont.)
[Figure: six panels showing the cluster assignments and centroid updates at iterations 1–6]
Two different 𝑲-means Clusterings
[Figure: the original points and two different 𝐾-means clusterings of them: an optimal clustering and a sub-optimal clustering]
Importance of Initial Centroids
[Figure: 𝐾-means iterations from one choice of initial centroids]
Importance of Initial Centroids (cont.)
[Figure: per-iteration snapshots of 𝐾-means for another choice of initial centroids]
10 Clusters Example
[Figure: 𝐾-means on data with 10 natural clusters, iterations 1–4]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example (cont.)
[Figure: iterations 1–4 of 𝐾-means on the 10-cluster data]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example (cont.)
[Figure: 𝐾-means on the 10-cluster data, iterations 1–4]
Starting with some pairs of clusters having three initial centroids, while some have only one.
10 Clusters Example (cont.)
[Figure: iterations 1–4 of 𝐾-means on the 10-cluster data]
Starting with some pairs of clusters having three initial centroids, while some have only one.
Initial Centroids Issue
Multiple runs, keeping the result with the lowest SSE (see the sketch below)
Postprocessing
Decompose a cluster with a high cluster SSE
Merge clusters with low cluster SSEs that are close to each other
Bisecting 𝐾-means
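As a sketch of the "multiple runs" idea, one can rerun 𝐾-means with different random initial centroids and keep the run with the lowest total SSE (this reuses the kmeans sketch from the earlier slide; the toy data and the number of runs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))                 # toy data; replace with the real data set
k = 3

best, best_sse = None, np.inf
for seed in range(10):                        # 10 runs with different initial centroids
    centroids, labels = kmeans(X, k, seed=seed)
    total_sse = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    if total_sse < best_sse:                  # keep the clustering with the lowest total SSE
        best, best_sse = (centroids, labels), total_sse
```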
Pre-processing and Post-processing
Pre-processing
Normalize the data
Eliminate outliers
Post-processing
Eliminate small clusters that may represent
outliers
Split ‘loose’ clusters, i.e., clusters with relatively
high SSE
Merge clusters that are ‘close’ and that have
relatively low SSE
Empty Clusters Issue
The basic 𝐾-means algorithm can yield empty
clusters
Several strategies to choose a replacement
centroid
Choose the point that contributes most to SSE
Choose a point from the cluster with the highest
Cluster SSE
If there are several empty clusters, the above can
be repeated several times
Bisecting 𝑲-means
Basic algorithm:
1. Initialize the list of clusters to be a single cluster that
contains all points
2. Repeat
3. Select a cluster with highest SSE from the list of clusters
4. For 𝑖 = 1 to 𝑇 do
5. Bisect the selected cluster using basic 𝐾-means
6. End
7. Add the two clusters from the bisection with the lowest total
SSE to the list of clusters
8. Until the list of clusters contains 𝐾 clusters
9. (After bisecting) run 𝐾-means again, using the centroids of the
clusters found so far as the initial centroids
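A rough Python sketch of the algorithm above, reusing the kmeans sketch from the earlier slide (the SSE helper, the number of trials 𝑇 = 5, and the function names are illustrative assumptions):

```python
import numpy as np

def cluster_sse(P):
    """Sum of squared distances of the points in P to their centroid (mean)."""
    return float(((P - P.mean(axis=0)) ** 2).sum()) if len(P) else 0.0

def bisecting_kmeans(X, K, trials=5):
    clusters = [np.arange(len(X))]                    # step 1: one cluster with all points
    while len(clusters) < K:                          # step 8: stop at K clusters
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(X[clusters[i]]))
        idx = clusters.pop(worst)                     # step 3: cluster with the highest SSE
        best_split, best_total = None, np.inf
        for trial in range(trials):                   # steps 4-6: try several bisections
            _, labels = kmeans(X[idx], 2, seed=trial)
            left, right = idx[labels == 0], idx[labels == 1]
            total = cluster_sse(X[left]) + cluster_sse(X[right])
            if total < best_total:                    # step 7: keep the lowest-SSE bisection
                best_total, best_split = total, (left, right)
        clusters.extend(best_split)
    return clusters                                   # list of index arrays, one per cluster
```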
Bisecting 𝑲-means (cont.)
[Figure: bisecting 𝐾-means illustrated on sample data]
Estimation of 𝑲
SSE can be used to estimate the number of clusters
[Figure: sample data (left) and SSE versus the number of clusters 𝐾 (right); the “elbow” of the SSE curve suggests the number of clusters]
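A sketch of this "elbow" heuristic using scikit-learn (the three-blob toy data and the range of 𝐾 values tried are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))   # three well-separated blobs
               for c in [(0, 0), (3, 0), (0, 3)]])

sse = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)            # inertia_ is the total within-cluster SSE

# Plotting k against sse and looking for the point where the curve flattens
# (the "elbow", here around k = 3) gives an estimate of the number of clusters.
```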
Limitations of 𝑲-means
𝐾-means has problems when clusters are of
differing
Sizes
Densities
Non-globular shapes
𝐾-means has problems when the data
contains outliers
Limitations of 𝑲-means: Differing Sizes
Limitations of 𝑲-means: Differing Density
Limitations of 𝑲-means: Non-globular Shapes
(Density-based clustering handles non-globular shapes better.)
Solution
Solution (cont.)
Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequences of merges
or splits
[Figure: a set of nested clusters over six points and the corresponding dendrogram]
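A minimal SciPy sketch that produces such a dendrogram (the six random 2-D points stand in for the slide's example; the linkage method names map to the inter-cluster proximities discussed later):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((6, 2))                               # six 2-D points

# 'single' = MIN, 'complete' = MAX, 'average' = group average
Z = linkage(X, method='single')
dendrogram(Z, labels=[f"p{i}" for i in range(1, 7)])
plt.ylabel("merge distance")
plt.show()
```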
Starting Situation
Start with clusters of individual instances and a
proximity matrix
[Figure: points p1 … p12, each starting as its own cluster, and the corresponding proximity matrix]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: clusters C1 … C5 after several merging steps, and the corresponding proximity matrix]
Intermediate Situation…
We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C1 … C5, with the two closest clusters C2 and C5 about to be merged, and the proximity matrix to be updated]
After Merging
The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C3, C4, and the merged cluster C2 ∪ C5; the proximity-matrix entries involving C2 ∪ C5 are marked “?”]
Define Inter-Cluster Proximity
[Figure: two clusters and their proximity matrix; how should the proximity between two clusters be defined?]
• MIN or Single Link
• MAX or Complete Link
• Group Average (average link)
Inter-Cluster Similarity (I)
MIN or Single Link
Defines cluster proximity as
• the proximity between the closest two data points that are
in different clusters
• or the shortest edge (single link) between two nodes in
different subsets (using graph terms)
MIN or Single Link
Similarity of two clusters is based on the two closest
points (most similar) in the different clusters
P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00
Similarity matrix
Clustering with MIN
Step 1: Merge the two closest clusters (largest
similarity)
P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00
Similarity matrix (the most similar pair is P1 and P2, with similarity 0.90)
Clustering with MIN (cont.)
Step 2: Update proximity matrix based on MIN: proximity of
two clusters is based on the two closest points in different
clusters (largest similarity)
P1∪P2 P3 P4 P5
P1∪P2 1.00 0.70 0.65 0.50
P3 0.70 1.00 0.40 0.30
P4 0.65 0.40 1.00 0.80
P5 0.50 0.30 0.80 1.00
Similarity matrix
Clustering with MIN (cont.)
Step 1: Merge the two closest clusters (largest
similarity)
P1∪P2 P3 P4 P5
P1∪P2 1.00 0.70 0.65 0.50
P3 0.70 1.00 0.40 0.30
P4 0.65 0.40 1.00 0.80
P5 0.50 0.30 0.80 1.00
Similarity matrix
Clustering with MIN (cont.)
Step 2: Update proximity matrix based on MIN: proximity of
two clusters is based on the two closest points in different
clusters (largest similarity)
P1∪P2 P3 P4∪P5
P1∪P2 1.00 0.70 0.65
P3 0.70 1.00 0.40
P4∪P5 0.65 0.40 1.00
Similarity matrix
Clustering with MIN (cont.)
Step 1: Merge the two closest clusters (largest
similarity)
P1∪P2 P3 P4∪P5
P1∪P2 1.00 0.70 0.65
P3 0.70 1.00 0.40
P4∪P5 0.65 0.40 1.00
Similarity matrix
Clustering with MIN (cont.)
Step 2: Update proximity matrix based on MIN:
proximity of two clusters is based on the two closest
points in different clusters (largest similarity)
P1∪P2∪P3 P4∪P5
P1∪P2∪P3 1.00 0.65
P4∪P5 0.65 1.00
Similarity matrix
Clustering with MIN (cont.)
Step 1: Merge the two closest clusters (largest
similarity)
P1∪P2∪P3 P4∪P5
P1∪P2∪P3 1.00 0.65
P4∪P5 0.65 1.00
Similarity matrix
(What if the matrix is a distance matrix instead of a similarity matrix? See the tutorial.)
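The whole MIN/single-link walkthrough above can be reproduced with a short Python sketch that repeatedly merges the pair of clusters with the largest single-link similarity (the similarity matrix is the one from the slides; the helper names are illustrative):

```python
import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],     # similarity matrix over P1 ... P5
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

def single_link(a, b):
    # MIN / single link on a similarity matrix: most similar pair across the two clusters
    return max(S[i, j] for i in a for j in b)

clusters = [{i} for i in range(len(S))]            # start with singleton clusters
while len(clusters) > 1:
    (ia, ib), best = max(
        (((i, j), single_link(clusters[i], clusters[j]))
         for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda t: t[1])
    print(f"merge {sorted(clusters[ia])} + {sorted(clusters[ib])} (similarity {best:.2f})")
    clusters[ia] |= clusters[ib]
    del clusters[ib]
```

Running it reproduces the merge order from the slides: {P1, P2} at 0.90, {P4, P5} at 0.80, {P1, P2, P3} at 0.70, and the final merge at 0.65 (indices in the printout are 0-based).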
Inter-Cluster Similarity (II)
MAX or Complete Link
Defines cluster proximity as
• the proximity between the farthest two points that are in
different clusters
• or the longest edge (complete link) between two nodes in
different subsets (using graph terms)
MAX or Complete Link
Similarity of two clusters is based on the two least similar
(most distant) points in the different clusters
P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00
Similarity matrix
Inter-Cluster Similarity (III)
Group Average (average link)
Defines cluster proximity as
• the average of the pairwise proximities of all pairs of points from
different clusters
• or the average length of edges between nodes in different
subsets (using graph terms)
Group Average
Proximity of two clusters is the average of pairwise
proximity between points in the two clusters
$$\mathrm{Proximity}(C_i, C_j) = \frac{\sum_{\boldsymbol{x} \in C_i,\ \boldsymbol{x}' \in C_j} \mathrm{Proximity}(\boldsymbol{x}, \boldsymbol{x}')}{|C_i| \times |C_j|}$$
Need to use average connectivity for scalability since total
proximity favors large clusters
P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00
Similarity matrix
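As a quick worked example with the similarity matrix above, the group-average similarity between the clusters {P1, P2} and {P4, P5} is
(0.65 + 0.20 + 0.60 + 0.50) / (2 × 2) = 1.95 / 4 ≈ 0.49,
i.e., the average of the four cross-cluster pairwise similarities.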
Agglomerative Clustering: Limitations
Once a decision is made to combine two clusters, it
cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more
of the following:
Sensitivity to noise and outliers
Difficulty in handling different-sized clusters
Divisive Hierarchical Clustering
Basic algorithm:
1. Generate a minimum spanning tree to connect
all data instances as a single cluster
2. Repeat
3. Create a new cluster by breaking the link
corresponding to the largest distance (smallest
similarity)
4. Until only singleton clusters remain
(A minimum spanning tree over 𝑛 nodes has 𝑛 − 1 edges.)
Divisive Hierarchical Clustering (cont.)
Minimum Spanning Tree (MST)
Start with a tree that consists of any point
In successive steps, look for the closest pair of points
(𝒙, 𝒙′) such that one point (𝒙) is in the current tree but
the other (𝒙′) is not
Add 𝒙′ to the tree and put an edge between 𝒙 and 𝒙′
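A small Python sketch of both steps: growing the MST with Prim's algorithm from a full distance matrix, then cutting the longest edges so that the remaining connected components form the clusters (the slide's loop continues until only singletons remain, producing the full hierarchy; the function names and union-find bookkeeping here are illustrative assumptions):

```python
import numpy as np

def mst_prim(D):
    """Prim's algorithm: grow a minimum spanning tree from a distance matrix D."""
    n = len(D)
    in_tree, edges = {0}, []                       # start the tree from an arbitrary point
    while len(in_tree) < n:
        # closest pair (i, j) with i already in the tree and j outside it
        d, i, j = min((D[i, j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((d, i, j))
        in_tree.add(j)
    return edges                                   # an MST over n nodes has n - 1 edges

def divisive_clustering(D, k):
    """Break the k - 1 longest MST edges; connected components become the clusters."""
    kept = sorted(mst_prim(D))[:len(D) - k]        # keep only the n - k shortest edges
    parent = list(range(len(D)))                   # union-find over the kept edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(D))]        # a component label for each point
```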
An Example of DHC
Thank you!