
CZ4041/SC4000:

Machine Learning
Lesson 10: Clustering

LIU Siyuan
School of Computer Science and Engineering,
NTU, Singapore
[email protected]
Office Hour: Tuesday & Thursday 3-4pm
Office Location: N4-02C-117a
Purpose of Machine Learning
 Pattern Discovery
Imagine you are an alien from another planet. You
watch a soccer match. What do you observe?
Two groups of humans. One ball.
Behavior changes when the ball goes into the net.

2
Supervised Learning
 The examples presented to computers are pairs of inputs and the corresponding outputs; the goal is to “learn” a mapping or model from inputs to labels
Labeled training data
Inputs: face images
Outputs: Female, Female, Male, Male
Learned mapping: label = 𝑓(input), e.g., Female or Male

3
Unsupervised Learning
 The examples presented to computers are a set of inputs without any outputs; the goal is to “learn” an intrinsic structure of the examples, e.g., clusters of examples or the density of the examples
Unlabeled training data

Inputs: Face images

Groups of similar faces

4
What is Cluster Analysis?
 Finding groups of data instances such that the data
instances in a group are
 similar to one another (intra-cluster)
 different from data instances in other groups (inter-cluster)
Intra-cluster distances are minimized; inter-cluster distances are maximized

5
An example
of clustering
Grouping faces
for different
persons

Images credit: http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html


Another example
of clustering
Customer segmentation

Based on customers’ profiles: education background, occupation, shopping behaviours, etc.
What is not Cluster Analysis?
 Supervised classification
 Have class label information
 Simple segmentation
 Dividing customers into different groups
alphabetically, by last name
Notion of a Cluster can be Ambiguous

How many clusters? [Figure: the same points interpreted as two clusters, four clusters, or six clusters]

9
Types of Clusterings
 A clustering is a set of clusters
 Important distinction between hierarchical and
partitional sets of clusters
 Partitional clustering
 Divide data instances into non-overlapping
subsets (clusters) so that each data instance is in
exactly one subset
 Hierarchical clustering
 A set of nested clusters organized as a
hierarchical tree
10
Partitional Clustering

Original instances A Partitional Clustering


11
Hierarchical Clustering
[Figure: points p1–p4 grouped as a traditional hierarchical clustering with its traditional dendrogram, and as a non-traditional hierarchical clustering with its non-traditional dendrogram]


Other Distinctions Between Clusterings
 Exclusive versus non-exclusive
 In non-exclusive clustering, instances may belong to
multiple clusters
 Fuzzy versus non-fuzzy
 In fuzzy clustering, a point belongs to every cluster with
some weight between 0 and 1
 Weights must sum to 1
 Partial versus complete
 In partial clustering, only some of the instances are
clustered

13
Clustering Algorithms
 𝐾-means and its variants
 Hierarchical clustering

14
𝑲-means Clustering
 Partitional clustering approach
 Number of clusters, 𝐾, must be specified
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Basic algorithm:
1. Select 𝐾 data instances as the initial centroids
2. Repeat
3. Form 𝐾 clusters by assigning all data instances to the
closest centroid
4. Recompute the centroid of each cluster
5. Until The centroids do not change
15
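A minimal NumPy sketch of the basic algorithm above (the function name `kmeans`, the convergence check, and the random seeding are my own choices for illustration, not part of the lecture material):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-means on an (n, d) array X with K clusters."""
    rng = np.random.default_rng(seed)
    # 1. Select K data instances as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 3. Form K clusters by assigning every instance to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. Recompute each centroid as the mean of its cluster
        #    (empty clusters are not handled here; see the later slide on that issue)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 5. Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.random.rand(200, 2), K=3)
```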
𝑲-means Clustering – Details
 Initial centroids are often chosen randomly
 The centroid is (typically) the mean of the data
instances in the cluster
 ‘Closeness’ is measured by a proximity measure
 Distance: Euclidean distance, etc


 Similarity: cosine similarity, correlation, etc
 𝐾-means will converge for common similarity
measures mentioned above
 In practice, it converges in the first few iterations
 Often the stopping condition is changed to ‘Until
relatively few instances change clusters’
16
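For illustration, two common proximity functions (a sketch assuming NumPy vectors as inputs; not the lecture's own code):

```python
import numpy as np

def euclidean_distance(a, b):
    # distance: smaller means closer
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # similarity: larger means closer (assumes non-zero vectors)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```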
𝑲-means Illustration

[Figure: K-means iterations 1–6 on a 2-D dataset, showing the centroids moving and the cluster assignments changing until convergence]
17
𝑲-means Illustration (cont.)
[Figure: the six iterations shown side by side (Iteration 1 through Iteration 6)]
18
Two different 𝑲-means Clusterings
[Figure: the same original points clustered by two different runs of K-means, one yielding the optimal clustering and the other a sub-optimal clustering]
19
Evaluating 𝑲-means Clusters
 Most common measure is Sum of Squared Error (SSE)
 For each data instance, the “error” is the distance to the
nearest cluster that is represented by a centroid
 To get an overall SSE, we square these errors and sum
them

SSE for cluster $C_i$: $\mathrm{SSE}_i = \sum_{\boldsymbol{x} \in C_i} \mathrm{dist}(\boldsymbol{c}_i, \boldsymbol{x})^2$, where $\boldsymbol{c}_i$ is the centroid of cluster $C_i$
Total SSE: $\mathrm{SSE} = \sum_{i=1}^{K} \mathrm{SSE}_i = \sum_{i=1}^{K} \sum_{\boldsymbol{x} \in C_i} \mathrm{dist}(\boldsymbol{c}_i, \boldsymbol{x})^2$
20
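A small sketch of how the cluster SSE and total SSE above could be computed with NumPy (the helper name `total_sse` is hypothetical):

```python
import numpy as np

def total_sse(X, labels, centroids):
    """Total SSE = sum over clusters of squared distances to the cluster centroid."""
    sse = 0.0
    for k, c in enumerate(centroids):
        members = X[labels == k]
        sse += ((members - c) ** 2).sum()    # cluster SSE for C_k
    return sse
```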
Evaluating 𝑲-means Clusters
 Given two different runs of 𝐾-means, we can
choose the one with the smallest Total SSE
 One easy way to reduce SSE is to increase 𝐾, the
number of clusters
 However, a good clustering with smaller 𝐾 can
have a lower SSE than a poor clustering with
higher 𝐾

21
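One way to act on this, assuming scikit-learn is available: run K-means several times with different random initial centroids and keep the run with the smallest total SSE (exposed by `KMeans` as `inertia_`). Note that scikit-learn's `n_init` parameter automates exactly this multiple-runs strategy; the explicit loop here is only to mirror the slide.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

X = np.random.rand(300, 2)           # toy 2-D data for illustration

# Run K-means 10 times with different random initializations and
# keep the run with the smallest total SSE.
runs = [KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X) for seed in range(10)]
best = min(runs, key=lambda km: km.inertia_)
print(best.inertia_, best.cluster_centers_)
```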
Importance of Initial Centroids
[Figure: K-means on the same 2-D dataset with a different random choice of initial centroids, shown at the final iteration]

22
Importance of Initial Centroids (cont.)
[Figure: iterations 1–5 for this choice of initial centroids, shown side by side]
23
10 Clusters Example
[Figure: a 2-D dataset with 10 clusters arranged in pairs, showing the K-means result at the final iteration]
24
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example (cont.)
[Figure: iterations 1–4 of K-means for this initialization]
25
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example (cont.)
[Figure: the K-means result at the final iteration for a different initialization]
26
Starting with some pairs of clusters having three initial centroids, while some have only one.
10 Clusters Example (cont.)
[Figure: iterations 1–4 of K-means for this initialization]

Starting with some pairs of clusters having three initial centroids, while some have only one.
Initial Centroids Issue
 Multiple runs
 Postprocessing
 Decompose a cluster with high Cluster SSE
 Merge clusters with low Cluster SSE that are close to each other
 Bisecting 𝐾-means

28
Pre-processing and Post-processing
 Pre-processing
 Normalize the data
 Eliminate outliers
 Post-processing
 Eliminate small clusters that may represent
outliers
 Split ‘loose’ clusters, i.e., clusters with relatively
high SSE
 Merge clusters that are ‘close’ and that have
relatively low SSE
29
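A sketch of the normalization step, assuming z-score standardization (the slides do not say which normalization to use; the helper name is my own):

```python
import numpy as np

def zscore_normalize(X):
    """Standardize each feature to zero mean and unit variance so that no
    single feature dominates the Euclidean distance used by K-means."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0      # guard against constant features
    return (X - mu) / sigma
```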
Empty Clusters Issue
 Basic 𝐾-means algorithm can yield empty
clusters
 Several strategies to choose a replacement
centroid
 Choose the point that contributes most to SSE
 Choose a point from the cluster with the highest
Cluster SSE
 If there are several empty clusters, the above can
be repeated several times

30
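A sketch of the first replacement strategy above: give an empty cluster the point that currently contributes most to the SSE. The helper name `fix_empty_clusters` is my own, and this is one possible implementation rather than the canonical one.

```python
import numpy as np

def fix_empty_clusters(X, labels, centroids):
    """Give each empty cluster the point that currently contributes
    the most to the SSE (the point farthest from its centroid)."""
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    for k in range(len(centroids)):
        if not np.any(labels == k):          # cluster k ended up empty
            worst = int(dists.argmax())      # largest contribution to SSE
            centroids[k] = X[worst]
            labels[worst] = k
            dists[worst] = 0.0               # avoid reusing the same point
    return labels, centroids
```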
Bisecting 𝑲-means
 Basic algorithm:
1. Initialize the list of clusters to be a single cluster that
contains all points
2. Repeat
3. Select a cluster with highest SSE from the list of clusters
4. For 𝑖 = 1 to 𝑇 do
5. Bisect the selected cluster using basic 𝐾-means
6. End
7. Add the two clusters from the bisection with lowest
SSE to the list of clusters
8. Until the list of clusters contains 𝐾 clusters
9. (After bisecting) run standard 𝐾-means using the centroids of the 𝐾 clusters found so far as the initial centroids
31
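A sketch of the bisecting procedure above, assuming scikit-learn's `KMeans` for the 2-way splits. The SSE-based cluster selection follows the pseudocode, but details such as the number of `trials` are my own choices.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is available

def bisecting_kmeans(X, K, trials=10):
    """Sketch of bisecting K-means: repeatedly split the cluster with the highest SSE."""
    clusters = [np.arange(len(X))]                      # step 1: one cluster with all points
    while len(clusters) < K:
        # step 3: select the cluster with the highest SSE
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        # steps 4-6: try several 2-means bisections, keep the one with the lowest SSE
        best = min((KMeans(n_clusters=2, n_init=1, random_state=t).fit(X[idx])
                    for t in range(trials)), key=lambda km: km.inertia_)
        # step 7: add the two resulting clusters back to the list
        clusters += [idx[best.labels_ == 0], idx[best.labels_ == 1]]
    return clusters
```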
Bisecting 𝑲-means

32
Estimation of 𝑲
 SSE can be used to estimate the number of clusters

[Figure: a clustered 2-D dataset and the corresponding plot of SSE versus the number of clusters K; the “elbow” where the SSE curve stops dropping sharply suggests a suitable K]

33
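A sketch of this elbow heuristic, assuming scikit-learn and matplotlib are available: run K-means for a range of K values, record the total SSE (`inertia_`), and look for the point where the curve flattens.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans     # assumes scikit-learn and matplotlib

X = np.random.rand(500, 2)             # toy data; replace with the dataset of interest
Ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in Ks]

plt.plot(list(Ks), sse, marker='o')    # look for the "elbow" where the curve flattens
plt.xlabel('Number of clusters K')
plt.ylabel('SSE')
plt.show()
```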
Limitations of 𝑲-means
 𝐾-means has problems when clusters are of
differing
 Sizes
 Densities
 Non-globular shapes
 𝐾-means has problems when the data
contains outliers

34
Limitations of 𝑲-means: Differing Sizes

Original Points 𝐾-means (3 Clusters)

35
Limitations of 𝑲-means: Differing Density

Original Points 𝐾-means (3 Clusters)

36
Limitations of 𝑲-means: Non-globular Shapes

Original Points 𝐾-means (2 Clusters)

37
Solution

Original Points 𝐾-means Clusters


One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.
38
Solution (cont.)

Original Points 𝐾-means Clusters

39
Solution (cont.)

Original Points 𝐾-means Clusters

40
Hierarchical Clustering
 Produces a set of nested clusters organized as a
hierarchical tree
 Can be visualized as a dendrogram
 A tree like diagram that records the sequences of merges
or splits

[Figure: a dendrogram over six points (1–6), with merge heights on the vertical axis, shown next to the corresponding nested cluster diagram]
41


Strengths of Hierarchical Clustering
 Do not have to assume any particular
number of clusters
 Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
 They may correspond to meaningful
taxonomies
 Examples in document organization, biological
sciences
[Figure: an example taxonomy of animals: mammal (e.g., monkey, dog), fish, bird]
42
Hierarchical Clustering (cont.)
 Two main types of hierarchical clustering
 Agglomerative (bottom-up):
 Start with the instances as individual clusters
 At each step, merge the closest pair of clusters until
only one cluster (or 𝐾 clusters) left
 Divisive (top-down):
 Start with one, all-inclusive cluster
 At each step, split a cluster until each cluster contains a single point (or there are 𝐾 clusters)
 Traditional hierarchical algorithms use a proximity
matrix (similarity or distance) to merge or split one
cluster at a time
43
Agglomerative Clustering Algorithm
 Basic algorithm:
1. Compute the proximity matrix (distance or similarity)
2. Let each data instance be a cluster
3. Repeat
4. Merge the two closest clusters (smallest distance or largest similarity)
5. Update the proximity matrix
6. Until only a single cluster remains
 Key operation is to compute the proximity of two clusters
 Different approaches to defining the proximity between
clusters lead to different clustering results

44
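In practice this algorithm is rarely hand-coded; a sketch using SciPy's hierarchical clustering routines (assuming SciPy and matplotlib are available) looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram   # assumes SciPy

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method='single')                  # 'single' = MIN, 'complete' = MAX, 'average' = group average
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
dendrogram(Z)                                    # visualize the sequence of merges
plt.show()
```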
Starting Situation
 Start with clusters of individual instances and a
proximity matrix
[Figure: an empty proximity matrix over points P1, P2, P3, P4, P5, ..., with each point p1–p12 starting as its own cluster]
45
Intermediate Situation
 After some merging steps, we have some clusters
[Figure: the points p1–p12 now grouped into five clusters C1–C5, with the proximity matrix defined over C1–C5]
46
Intermediate Situation…
 We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C2 and C5 are the closest pair and will be merged]
47
After Merging
 The question is “How do we update the proximity matrix?”
[Figure: after merging C2 and C5, the row and column for the new cluster C2 ∪ C5 in the proximity matrix are marked “?”, since its proximity to C1, C3, and C4 must be recomputed]
48
Define Inter-Cluster Proximity
[Figure: two clusters of points, with the question of how to define the proximity between them]
• MIN or Single Link
• MAX or Complete Link
• Group Average
49
Inter-Cluster Similarity (I)
MIN or Single Link defines cluster proximity as
• the proximity between the closest two data points that are in different clusters
• or the shortest edge (single link) between two nodes in different subsets (using graph terms)
50
MIN or Single Link
 Similarity of two clusters is based on the two closest
points (most similar) in the different clusters

       P1    P2    P3    P4    P5
P1    1.00  0.90  0.10  0.65  0.20
P2    0.90  1.00  0.70  0.60  0.50
P3    0.10  0.70  1.00  0.40  0.30
P4    0.65  0.60  0.40  1.00  0.80
P5    0.20  0.50  0.30  0.80  1.00
Similarity matrix

51
Clustering with MIN
Step 1: Merge the two closest clusters (largest
similarity)

       P1    P2    P3    P4    P5
P1    1.00  0.90  0.10  0.65  0.20
P2    0.90  1.00  0.70  0.60  0.50
P3    0.10  0.70  1.00  0.40  0.30
P4    0.65  0.60  0.40  1.00  0.80
P5    0.20  0.50  0.30  0.80  1.00
Similarity matrix
P1 and P2 have the largest similarity (0.90), so they are merged first.

52
Clustering with MIN (cont.)
 Step 2: Update proximity matrix based on MIN: proximity of
two clusters is based on the two closest points in different
clusters (largest similarity)
        P1∪P2   P3    P4    P5
P1∪P2   1.00   0.70  0.65  0.50
P3      0.70   1.00  0.40  0.30
P4      0.65   0.40  1.00  0.80
P5      0.50   0.30  0.80  1.00
Similarity matrix

53
Clustering with MIN (cont.)
Step 1: Merge the two closest clusters (largest
similarity)

P1∪P2 P3 P4 P5
P1∪P2 1.00 0.70 0.65 0.50
P3 0.70 1.00 0.40 0.30
P4 0.65 0.40 1.00 0.80
P5 0.50 0.30 0.80 1.00
Similarity matrix
P4 and P5 have the largest similarity (0.80), so they are merged next.

54
Clustering with MIN (cont.)
 Step 2: Update proximity matrix based on MIN: proximity of
two clusters is based on the two closest points in different
clusters (largest similarity)

        P1∪P2   P3    P4∪P5
P1∪P2   1.00   0.70   0.65
P3      0.70   1.00   0.40
P4∪P5   0.65   0.40   1.00
Similarity matrix

55
Clustering with MIN (cont.)
Step 1: Merge the two closest clusters (largest
similarity)

P1∪P2 P3 P4∪P5
P1∪P2 1.00 0.70 0.65
P3 0.70 1.00 0.40
P4∪P5 0.65 0.40 1.00
Similarity matrix
P1∪P2 and P3 have the largest similarity (0.70), so they are merged next.

56
Clustering with MIN (cont.)
 Step 2: Update proximity matrix based on MIN:
proximity of two clusters is based on the two closest
points in different clusters (largest similarity)

            P1∪P2∪P3   P4∪P5
P1∪P2∪P3      1.00      0.65
P4∪P5         0.65      1.00
Similarity matrix

57
Clustering with MIN (cont.)
Step 1: Merge the two closest clusters (largest
similarity)

P1∪P2∪P3 P4∪P5
P1∪P2∪P3 1.00 0.65
P4∪P5 0.65 1.00
Similarity matrix

What if the proximity matrix is a distance matrix instead of a similarity matrix? (Tutorial)
58
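The MIN walkthrough above can be reproduced programmatically. A small sketch that repeatedly merges the pair of clusters with the largest single-link similarity, using the similarity matrix from the slides (the loop structure is my own):

```python
import numpy as np

# Similarity matrix from the slides (P1..P5); higher = more similar.
S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

clusters = [{i} for i in range(5)]
while len(clusters) > 1:
    # Find the pair of clusters with the largest single-link (MIN) similarity,
    # i.e. the maximum similarity over all cross-cluster pairs of points.
    a, b = max(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
               key=lambda ab: max(S[i, j] for i in clusters[ab[0]] for j in clusters[ab[1]]))
    merged = clusters[a] | clusters[b]
    clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    print(clusters)   # reproduces the merge order on the slides: {P1,P2}, then {P4,P5}, ...
```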
Inter-Cluster Similarity II
MAX or Complete Link defines cluster proximity as
• the proximity between the farthest two points that are in different clusters
• or the longest edge (complete link) between two nodes in different subsets (using graph terms)
59
MAX or Complete Link
 Similarity of two clusters is based on the two least similar
(most distant) points in the different clusters

P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00 1 2 3 4 5
Similarity matrix
Tutorial
60
Inter-Cluster Similarity (III)
Group Average (average link) defines cluster proximity as
• the average pairwise proximity of all pairs of points from different clusters
• or the average length of edges between nodes in different subsets (using graph terms)
61
Group Average
 Proximity of two clusters is the average of pairwise
proximity between points in the two clusters
$\mathrm{Proximity}(C_i, C_j) = \dfrac{\sum_{\boldsymbol{x}_i \in C_i,\ \boldsymbol{x}_j \in C_j} \mathrm{Proximity}(\boldsymbol{x}_i, \boldsymbol{x}_j)}{|C_i| \times |C_j|}$
 Need to use average connectivity for scalability since total
proximity favors large clusters

P1 P2 P3 P4 P5
       P1    P2    P3    P4    P5
P1    1.00  0.90  0.10  0.65  0.20
P2    0.90  1.00  0.70  0.60  0.50
P3    0.10  0.70  1.00  0.40  0.30
P4    0.65  0.60  0.40  1.00  0.80
P5    0.20  0.50  0.30  0.80  1.00
62
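A small sketch of the group-average computation on the same similarity matrix (the helper name `group_average` is my own):

```python
import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

def group_average(S, Ci, Cj):
    # average pairwise similarity over all cross-cluster pairs of points
    return S[np.ix_(Ci, Cj)].mean()

# e.g. proximity between {P1, P2} and {P4, P5}:
print(group_average(S, [0, 1], [3, 4]))   # (0.65 + 0.20 + 0.60 + 0.50) / 4 = 0.4875
```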
Agglomerative Clustering: Limitations
 Once a decision is made to combine two clusters, it
cannot be undone
 No objective function is directly minimized
 Different schemes have problems with one or more
of the following:
 Sensitivity to noise and outliers
 Difficulty in handling different-sized clusters

63
Divisive Hierarchical Clustering
 Basic algorithm:
1. Generate a minimum spanning tree to connect
all data instances as a single cluster
2. Repeat
3. Create a new cluster by breaking the link
corresponding to the largest distance (smallest
similarity)
4. Until only singleton clusters remain
(A minimum spanning tree over 𝑛 nodes has 𝑛 − 1 edges, so breaking 𝐾 − 1 of them yields 𝐾 clusters.)
64
Divisive Hierarchical Clustering (cont.)
 Minimum Spanning Tree (MST)
 Start with a tree that consists of any point
 In successive steps, look for the closest pair of points (𝒙_𝑖, 𝒙_𝑗) such that one point 𝒙_𝑖 is in the current tree but the other 𝒙_𝑗 is not
 Add 𝒙_𝑗 to the tree and put an edge between 𝒙_𝑖 and 𝒙_𝑗

65
An Example of DHC

The distance matrix between 5 points:
       P1   P2   P3   P4   P5
P1      0   90   10   65   20
P2     90    0   70   60   50
P3     10   70    0   40   30
P4     65   60   40    0   80
P5     20   50   30   80    0
[Figure: the five points drawn as a graph whose edge weights are these pairwise distances]
Suppose 𝐾 = 2, and P3 is chosen at the beginning for constructing the MST

66
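A sketch that reproduces this example: build the MST with a Prim-style construction starting from P3, then break the longest edge to obtain 𝐾 = 2 clusters. Only the distance matrix and the starting point come from the slide; the code layout is my own.

```python
import numpy as np

# Distance matrix from the slide (P1..P5); smaller values mean closer points.
D = np.array([[ 0, 90, 10, 65, 20],
              [90,  0, 70, 60, 50],
              [10, 70,  0, 40, 30],
              [65, 60, 40,  0, 80],
              [20, 50, 30, 80,  0]], dtype=float)

# Build the MST with a Prim-style construction, starting from P3 (index 2) as on the slide.
in_tree, edges = {2}, []
while len(in_tree) < len(D):
    i, j = min(((i, j) for i in in_tree for j in range(len(D)) if j not in in_tree),
               key=lambda ij: D[ij])
    edges.append((i, j, float(D[i, j])))
    in_tree.add(j)
print(edges)   # [(2, 0, 10.0), (0, 4, 20.0), (2, 3, 40.0), (4, 1, 50.0)]

# Divisive step: for K = 2, break the longest MST edge (P5-P2, distance 50),
# which leaves the clusters {P2} and {P1, P3, P4, P5}.
edges.sort(key=lambda e: e[2])
kept = edges[:-1]              # in general, drop the K-1 longest edges for K clusters
```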
Thank you!
