
CSE 319

Pattern Recognition

Clustering

Md. Abu Sayeed Mondol


Computer Science and Engineering Department
1
Clustering
• Clustering refers to the process of grouping samples so that the samples are similar within each group.
• The groups are called clusters.

2
Hierarchical Clustering
• Hierarchical clustering refers to a clustering process that organizes the data into large groups, which contain smaller groups, and so on.
• A hierarchical clustering may be represented by a tree, or dendrogram.

3
Hierarchical Clustering
[Figure: an example hierarchy – Employee divides into General and Technical; General into Marketing (Sales, Distribution) and Finance; Technical into Hardware and Software]

Dendrogram levels for samples 1–5:
Level 0 – {1}, {2}, {3}, {4}, {5}
Level 1 – {1,2}, {3}, {4}, {5}
Level 2 – {1,2}, {3}, {4,5}
Level 3 – {1,2,3}, {4,5}
Level 4 – {1,2,3,4,5}

4
Hierarchical Clustering
• Two types of hierarchical clustering
– Agglomerative: build the dendrogram from the bottom up
– Divisive: build the dendrogram from the top down

5
Agglomerative Clustering
Agglomerative Clustering Algorithm
1. Begin with n clusters, each consisting of one sample.
2. Repeat step 3 a total of n-1 times.
3. Find the most similar clusters Ci and Cj, and merge Ci and Cj into one cluster.
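As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this loop. All names are illustrative; `cluster_dist` is a placeholder for the linkage functions defined on the following slides.

```python
def agglomerate(samples, cluster_dist, num_clusters=1):
    """Minimal sketch of agglomerative clustering."""
    # Step 1: begin with n clusters, each consisting of one sample.
    clusters = [[s] for s in samples]
    # Steps 2-3: repeatedly merge the most similar pair of clusters.
    while len(clusters) > num_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters.pop(j))  # merge Cj into Ci
    return clusters
```

Calling `agglomerate(samples, single_linkage, num_clusters=2)` with the single-linkage distance sketched later reproduces the two-cluster solution worked out in the example below.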

6
Agglomerative Clustering
Different hierarchical clustering algorithms arise from different methods of determining the similarity of clusters.

As with nearest-neighbor techniques, the most popular distance measures in cluster analysis are Euclidean distance and city-block distance.
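As a quick sketch (assuming samples are 2-D feature vectors stored as tuples), both measures are one-liners in Python:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of summed squared differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def city_block_distance(a, b):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(euclidean_distance((4, 4), (8, 4)))    # 4.0
print(city_block_distance((4, 4), (15, 8)))  # 15
```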

7
Example
Sample    x    y
  1       4    4
  2       8    4
  3      15    8
  4      24    4
  5      24   12

[Scatter plot of the five samples in the x-y plane]

8
Single-Linkage Algorithm
• Also known as the minimum method and the nearest neighbor method.
• The distance between two clusters is the smallest distance between two points such that one point is in each cluster.
• If Ci and Cj are the clusters, the distance between them is defined by

$D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$

where d(a,b) is the distance between two samples a and b.
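A direct sketch of this definition, using Euclidean d as in the worked example:

```python
import math

def single_linkage(ci, cj):
    # Smallest Euclidean distance over all cross-cluster pairs (a, b).
    return min(math.dist(a, b) for a in ci for b in cj)

# D_SL({1,2}, {3}) = min(11.7, 8.1) = 8.1 in the example that follows:
print(round(single_linkage([(4, 4), (8, 4)], [(15, 8)]), 1))  # 8.1
```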


9
Single-Linkage Algorithm

Distance Matrix
        1     2     3     4     5
1       -    4.0  11.7  20.0  21.5
2      4.0    -    8.1  16.0  17.9
3     11.7   8.1    -    9.8   9.8
4     20.0  16.0   9.8    -    8.0
5     21.5  17.9   9.8   8.0    -
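The matrix above can be reproduced in a few lines (a sketch; values rounded to one decimal as on the slide):

```python
import math

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
for a in samples:
    # 0.0 on the diagonal stands in for the slide's '-' placeholder.
    print([round(math.dist(a, b), 1) for b in samples])
```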

10
Single-Linkage Algorithm
Distance Matrix
        1     2     3     4     5
1       -    4.0  11.7  20.0  21.5
2      4.0    -    8.1  16.0  17.9
3     11.7   8.1    -    9.8   9.8
4     20.0  16.0   9.8    -    8.0
5     21.5  17.9   9.8   8.0    -

The smallest entry is 4.0, between samples 1 and 2, so clusters {1} and {2} are merged into {1,2}.

11
Single-Linkage Algorithm
Distance Matrix
        {1,2}    3     4     5
{1,2}     -     8.1  16.0  17.9
3        8.1     -    9.8   9.8
4       16.0    9.8    -    8.0
5       17.9    9.8   8.0    -

Here D_SL({1,2}, 3) = min(d(1,3), d(2,3)) = min(11.7, 8.1) = 8.1, and similarly for the other merged-cluster entries.

12
Single-Linkage Algorithm
[Scatter plot of the samples: 1 and 2 now grouped as cluster {1,2}]

13
Single-Linkage Algorithm
Distance Matrix
        {1,2}    3     4     5
{1,2}     -     8.1  16.0  17.9
3        8.1     -    9.8   9.8
4       16.0    9.8    -    8.0
5       17.9    9.8   8.0    -

The smallest entry is now 8.0, between samples 4 and 5, so {4} and {5} are merged into {4,5}.

14
Single-Linkage Algorithm
Distance Matrix
        {1,2}    3    {4,5}
{1,2}     -     8.1   16.0
3        8.1     -     9.8
{4,5}   16.0    9.8     -

15
Single-Linkage Algorithm
[Scatter plot of the samples: clusters {1,2} and {4,5}]

16
Single-Linkage Algorithm
Distance Matrix
        {1,2}    3    {4,5}
{1,2}     -     8.1   16.0
3        8.1     -     9.8
{4,5}   16.0    9.8     -

The smallest entry is 8.1, between {1,2} and sample 3, so they are merged into {1,2,3}.

17
Single-Linkage Algorithm
[Scatter plot of the samples: clusters {1,2,3} and {4,5}]

18
Single-Linkage Algorithm
Distance Matrix

          {1,2,3}  {4,5}
{1,2,3}      -      ??
{4,5}       ??       -

No need to calculate. Why? Because only two clusters remain, the final merge is forced regardless of this distance.

Final cluster – {1,2,3,4,5}
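Assuming SciPy is available, the whole run can be cross-checked: `scipy.cluster.hierarchy.linkage` with `method='single'` reports one merge per row, and the merge order matches the steps above ({1,2} first, then {4,5}, then {1,2} with 3, then the final forced merge).

```python
from scipy.cluster.hierarchy import linkage

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
# Each row: (cluster id, cluster id, merge distance, new cluster size).
print(linkage(samples, method='single'))
```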

19
Single-Linkage Algorithm
[Scatter plot of the samples: all five merged into the single cluster {1,2,3,4,5}]

20
Complete-Linkage Algorithm
• Also known as the maximum method and the farthest neighbor method.
• The distance between two clusters is the largest distance between two points such that one point is in each cluster.
• If Ci and Cj are the clusters, the distance between them is defined by

$D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$

where d(a,b) is the distance between two samples a and b.
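The corresponding sketch only swaps min for max:

```python
import math

def complete_linkage(ci, cj):
    # Largest Euclidean distance over all cross-cluster pairs (a, b).
    return max(math.dist(a, b) for a in ci for b in cj)

# D_CL({1,2}, {3}) = max(11.7, 8.1) = 11.7 in the worked example:
print(round(complete_linkage([(4, 4), (8, 4)], [(15, 8)]), 1))  # 11.7
```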


21
Complete-Linkage Algorithm

Distance Matrix
        1     2     3     4     5
1       -    4.0  11.7  20.0  21.5
2      4.0    -    8.1  16.0  17.9
3     11.7   8.1    -    9.8   9.8
4     20.0  16.0   9.8    -    8.0
5     21.5  17.9   9.8   8.0    -

22
Complete-Linkage Algorithm
Distance Matrix
        1     2     3     4     5
1       -    4.0  11.7  20.0  21.5
2      4.0    -    8.1  16.0  17.9
3     11.7   8.1    -    9.8   9.8
4     20.0  16.0   9.8    -    8.0
5     21.5  17.9   9.8   8.0    -

As with single linkage, the smallest entry is 4.0, so {1} and {2} are merged first.

23
Complete-Linkage Algorithm
Distance Matrix (Complete-Linkage)
        {1,2}    3     4     5
{1,2}     -    11.7  20.0  21.5
3       11.7    -    9.8   9.8
4       20.0   9.8    -    8.0
5       21.5   9.8   8.0    -

Here D_CL({1,2}, 3) = max(d(1,3), d(2,3)) = max(11.7, 8.1) = 11.7.

Distance Matrix (Single-Linkage)
        {1,2}    3     4     5
{1,2}     -     8.1  16.0  17.9
3        8.1     -    9.8   9.8
4       16.0    9.8    -    8.0
5       17.9    9.8   8.0    -
24
Complete-Linkage Algorithm
Distance Matrix (Complete-Linkage)
        {1,2}    3    {4,5}
{1,2}     -    11.7   21.5
3       11.7    -     9.8
{4,5}   21.5   9.8     -

The smallest complete-linkage entry is now 9.8, between 3 and {4,5}, so they merge into {3,4,5}; under single linkage the smallest entry was 8.1, which merged {1,2} with 3 instead.

Distance Matrix (Single-Linkage)
        {1,2}    3    {4,5}
{1,2}     -     8.1   16.0
3        8.1     -     9.8
{4,5}   16.0    9.8     -

25
Complete-Linkage Algorithm
Distance Matrix (Complete-Linkage)
          {1,2}  {3,4,5}
{1,2}       -      ??
{3,4,5}    ??       -

Distance Matrix (Single-Linkage)
          {1,2,3}  {4,5}
{1,2,3}      -      ??
{4,5}       ??       -

With only two clusters left, the final merge is forced in both cases, but the two-cluster solutions differ: complete linkage gives {1,2} and {3,4,5}, while single linkage gives {1,2,3} and {4,5}.

26
Complete-Linkage Algorithm
[Scatter plot of the samples: complete-linkage clusters {1,2} and {3,4,5}]

27
Single-Linkage Algorithm

[Scatter plot of the samples: single-linkage clusters {1,2,3} and {4,5}, for comparison]

28
Average-Linkage Algorithm
• Also known as the unweighted pair-group method using arithmetic averages (UPGMA).
• The distance between two clusters is the average distance between a point in one cluster and a point in the other cluster.
• If Ci and Cj are clusters with ni and nj members respectively, the distance between them is defined by

$D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\, b \in C_j} d(a, b)$

where d(a,b) is the distance between two samples a and b.
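A sketch of this definition, with Euclidean d again:

```python
import math

def average_linkage(ci, cj):
    # Mean Euclidean distance over all n_i * n_j cross-cluster pairs.
    return sum(math.dist(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

# D_AL({1,2}, {3}) = (11.7 + 8.1) / 2 = 9.9 in the worked example:
print(round(average_linkage([(4, 4), (8, 4)], [(15, 8)]), 1))  # 9.9
```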

29
Average-Linkage Algorithm

Distance Matrix
        1     2     3     4     5
1       -    4.0  11.7  20.0  21.5
2      4.0    -    8.1  16.0  17.9
3     11.7   8.1    -    9.8   9.8
4     20.0  16.0   9.8    -    8.0
5     21.5  17.9   9.8   8.0    -

30
Average-Linkage Algorithm
Distance Matrix
        1     2     3     4     5
1       -    4.0  11.7  20.0  21.5
2      4.0    -    8.1  16.0  17.9
3     11.7   8.1    -    9.8   9.8
4     20.0  16.0   9.8    -    8.0
5     21.5  17.9   9.8   8.0    -

The smallest entry is again 4.0, so {1} and {2} are merged first.

31
Average-Linkage Algorithm
Distance Matrix (Average-Linkage)
        {1,2}    3     4     5
{1,2}     -     9.9  18.0  19.7
3        9.9     -    9.8   9.8
4       18.0    9.8    -    8.0
5       19.7    9.8   8.0    -

Here D_AL({1,2}, 3) = (11.7 + 8.1)/2 = 9.9.

Distance Matrix (Single-Linkage)
        {1,2}    3     4     5
{1,2}     -     8.1  16.0  17.9
3        8.1     -    9.8   9.8
4       16.0    9.8    -    8.0
5       17.9    9.8   8.0    -
32
Average-Linkage Algorithm
Distance Matrix (Average-Linkage)
        {1,2}    3    {4,5}
{1,2}     -     9.9   18.9
3        9.9     -     9.8
{4,5}   18.9    9.8     -

The smallest average-linkage entry is now 9.8, between 3 and {4,5}, so they merge into {3,4,5}.

Distance Matrix (Single-Linkage)
        {1,2}    3    {4,5}
{1,2}     -     8.1   16.0
3        8.1     -     9.8
{4,5}   16.0    9.8     -

33
Average-Linkage Algorithm
Distance Matrix (Average-Linkage)
          {1,2}  {3,4,5}
{1,2}       -      ??
{3,4,5}    ??       -

Distance Matrix (Single-Linkage)
          {1,2,3}  {4,5}
{1,2,3}      -      ??
{4,5}       ??       -

Average linkage therefore yields the same two-cluster solution as complete linkage, {1,2} and {3,4,5}, while single linkage yields {1,2,3} and {4,5}.

34
Average-Linkage Algorithm
[Scatter plot of the samples: average-linkage clusters {1,2} and {3,4,5}]

35
Partitional Clustering
• Agglomerative clustering creates a series of nested clusters.
• Partitional clustering (divisive clustering) divides the data set into two groups, then each of these groups is divided into two parts, and so on.
• Top-down approach.
• The number of clusters to be constructed is specified in advance.

36
Forgy’s Algorithm
• One of the simplest partitional clustering algorithms
• Input:
  – k, the number of clusters to be constructed
  – k samples, called seed points

Seed points can be chosen randomly or with some knowledge of the data.

37
Forgy’s Algorithm
1. Initialize the cluster centroids to the seed points.
2. For each sample, find the cluster centroid nearest to it. Put the sample in the cluster identified with this nearest centroid.
3. If no samples changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go to step 2.
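A minimal Python sketch of these four steps (names are illustrative; empty clusters simply keep their old centroid):

```python
import math

def forgy(samples, seeds):
    centroids = list(seeds)  # step 1: centroids start at the seed points
    assignment = None
    while True:
        # Step 2: assign each sample to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(s, centroids[c]))
            for s in samples
        ]
        if new_assignment == assignment:
            return centroids, new_assignment  # step 3: nothing changed, stop
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(forgy(samples, seeds=[(4, 4), (8, 4)]))
# Converges to centroids (6, 4) and (21, 8), i.e. the clusters
# {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} derived on the next slides.
```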

38
Example
Sample    x    y
  1       4    4
  2       8    4
  3      15    8
  4      24    4
  5      24   12

[Scatter plot of the five samples]

39
Forgy’s Algorithm
• Each sample is denoted by a feature vector.
• Let k = 2, i.e. two clusters should be constructed.
• So, 2 seed points are required. What are these points?
• Let the first two samples, (4,4) and (8,4), be the seed points.

40
Forgy’s Algorithm
• Compute the centroids of the clusters

Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)

Centroids: (4,4) and ((8+15+24+24)/4, (4+8+4+12)/4) = (17.75, 7)

First iteration of Forgy's algorithm

41
Forgy’s Algorithm
• Find the nearest cluster centroid for each sample

Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75, 7)
(24,4)     (17.75, 7)
(24,12)    (17.75, 7)

Centroids: ((4+8)/2, (4+4)/2) = (6, 4) and ((15+24+24)/3, (8+4+12)/3) = (21, 8)

Second iteration of Forgy's algorithm

42
Forgy’s Algorithm
• Find the nearest cluster centroid for each sample

Sample     Nearest Cluster Centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21, 8)
(24,4)     (21, 8)
(24,12)    (21, 8)

Centroids are unchanged: (6, 4) and (21, 8)

Third iteration of Forgy's algorithm

43
Forgy’s Algorithm
• Since no samples changed clusters, the algorithm terminates.

• So, the resulting clusters are:
  – {(4,4), (8,4)}
  – {(15,8), (24,4), (24,12)}

44
k-means Algorithm
• Similar to Forgy's algorithm.
• Input to the algorithm is k, the number of clusters to be constructed.
• Differences from Forgy's algorithm:
  – Centroids of clusters are recomputed as soon as a sample joins a cluster.
  – Forgy's algorithm is iterative, while the k-means algorithm makes only two passes through the data set.

45
k-means Algorithm
1. Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n-k samples, find the centroid nearest to it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.
2. Go through the data a second time. For each sample, find the centroid nearest to it. Put the sample in the cluster identified with this nearest centroid. (Do not recompute any centroid during this step.)
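A sketch of this two-pass procedure (illustrative names; 2-D tuples and Euclidean distance assumed):

```python
import math

def kmeans_two_pass(samples, k):
    # Pass 1: seed the k clusters with the first k samples, assign the
    # rest one at a time, recomputing the affected centroid immediately.
    clusters = [[s] for s in samples[:k]]
    centroids = list(samples[:k])
    for s in samples[k:]:
        j = min(range(k), key=lambda c: math.dist(s, centroids[c]))
        clusters[j].append(s)
        centroids[j] = tuple(sum(x) / len(clusters[j]) for x in zip(*clusters[j]))
    # Pass 2: reassign every sample to its nearest centroid,
    # without recomputing any centroid.
    final = [[] for _ in range(k)]
    for s in samples:
        j = min(range(k), key=lambda c: math.dist(s, centroids[c]))
        final[j].append(s)
    return centroids, final

samples = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
print(kmeans_two_pass(samples, 2))
# Centroids (9, 5.33) and (24, 8); clusters {(8,4), (15,8), (4,4)}
# and {(24,4), (24,12)}, matching the worked example on the next slides.
```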

46
Example
Sample    x    y
  1       8    4
  2      24    4
  3      15    8
  4       4    4
  5      24   12

[Scatter plot of the five samples; note that the sample numbering differs from the earlier example]

47
k-means Algorithm
• Each sample is denoted by a feature vector.
• Let k = 2, i.e. two clusters should be constructed.

48
k-means Algorithm
• Begin with two clusters, {(8,4)} and {(24,4)}.
• For each of the remaining three samples:
  – Find the centroid nearest to it
  – Put the sample in that cluster
  – Recompute the centroid of this cluster

49
k-means Algorithm
• Begin with two clusters, {(8,4)} and {(24,4)}
• Assign each of the remaining three samples in turn:

Sample     Nearest Cluster Centroid
(8,4)      (8,4)
(24,4)     (24,4)
(15,8)
(4,4)
(24,12)

Centroids: (8,4) and (24,4)

50
k-means Algorithm
Nearest centroid to (15,8) is (8,4)

Sample     Nearest Cluster Centroid
(8,4)      (8,4)
(24,4)     (24,4)
(15,8)     (8,4)
(4,4)
(24,12)

Recomputed centroid: ((8+15)/2, (4+8)/2) = (11.5, 6)

Centroids: (11.5, 6) and (24,4)

51
k-means Algorithm
Nearest centroid to (4,4) is (11.5, 6)

Sample     Nearest Cluster Centroid
(8,4)      (8,4)
(24,4)     (24,4)
(15,8)     (8,4)
(4,4)      (11.5, 6)
(24,12)

Recomputed centroid: ((8+15+4)/3, (4+8+4)/3) = (9, 5.3)

Centroids: (9, 5.3) and (24,4)

52
k-means Algorithm
Nearest centroid to (24,12) is (24,4)

Sample     Nearest Cluster Centroid
(8,4)      (8,4)
(24,4)     (24,4)
(15,8)     (8,4)
(4,4)      (11.5, 6)
(24,12)    (24,4)

Recomputed centroid: ((24+24)/2, (4+12)/2) = (24, 8)

Centroids: (9, 5.3) and (24, 8)

53
k-means Algorithm
First step completed.
Two centroids: (9, 5.3) and (24, 8)

Samples to be reassigned in the second step: (8,4), (24,4), (15,8), (4,4), (24,12)

54
k-means Algorithm
Second Step:
Two centroids: (9, 5.3) and (24, 8)

Sample     Distance to (9, 5.3)   Distance to (24, 8)
(8,4)             1.6                  16.5
(24,4)           15.1                   4.0
(15,8)            6.6                   9.0
(4,4)             5.2                  20.4
(24,12)          16.4                   4.0

55
k-means Algorithm
Second Step:
Two centroids: (9, 5.3) and (24, 8)

Sample     Distance to (9, 5.3)   Distance to (24, 8)
(8,4)             1.6                  16.5
(24,4)           15.1                   4.0
(15,8)            6.6                   9.0
(4,4)             5.2                  20.4
(24,12)          16.4                   4.0

Each sample joins the cluster of the nearer centroid, so the resulting clusters are: {(8,4), (15,8), (4,4)} and {(24,4), (24,12)}

56
k-means Algorithm
Alternative version of the k-means algorithm (iterative):
1. Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n-k samples, find the centroid nearest to it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.
2. For each sample, find the centroid nearest to it. Put the sample in the cluster identified with this nearest centroid.
3. If no samples changed clusters, stop.
4. Compute the centroids of the resulting clusters and go to step 2.

57
The End

58
