Lecture 16

Data Clustering -- Introduction

Ref: Han and Kamber, Data Mining: Concepts and Techniques.

1
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

2
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis
 Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

3
Clustering: Application Examples
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data

4
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Hard (e.g., one customer belongs to only one region) vs. Soft
(e.g., one document may belong to more than one class)
 Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)

5
Clustering methods:

[Figure (two slides): taxonomy of clustering methods; diagram content not recoverable]

7
Major Clustering Approaches

 Partitioning approach:
 Construct various partitions and then evaluate them by some

criterion
 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)

using some criterion


 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

 Density-based approach:
 Based on connectivity and density functions

 Typical methods: DBSCAN, OPTICS, DENCLUE

 Grid-based approach:
Based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

8
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

9
Partitioning Algorithms: Basic Concept

 Partitioning method: Partitioning a database D of n objects into a set of k clusters C1, ..., Ck, such that the sum of squared distances

    E = Σ_{i=1..k} Σ_{p ∈ Ci} dist(p, ci)^2

is minimized (where ci is the centroid or medoid of cluster Ci)
 Global optimum: exhaustively enumerate all partitions (intractable in practice)
 Heuristic methods: the k-means and k-medoids algorithms

10
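The criterion above can be made concrete in code. A minimal sketch of the within-cluster sum of squared errors, assuming points are coordinate tuples grouped by cluster (the function name and data layout are illustrative, not from the slides):

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid:
    E = sum_i sum_{p in Ci} ||p - ci||^2, with ci taken as the mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in points)
    return total
```

For example, the clustering {(0,0), (2,0)} and {(10,0)} has centroid (1,0) for the first cluster, giving E = 1 + 1 + 0 = 2.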
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Each cluster is represented by its centroid (the mean of its members)
 Start with an initial partition into k blocks
 Refine the partition based on the criterion: reassign each object to its nearest centroid, then recompute the centroids
 Stop when no more correction is possible

11
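The steps above can be sketched as a plain-Python version of Lloyd's algorithm. This is a minimal illustration, not a production implementation; the random seeding of initial centroids and all helper names are my own choices:

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """k-means sketch: seed k centroids from the data, then alternate
    (1) assigning each point to its nearest centroid and
    (2) refining each centroid to the mean of its assigned points,
    stopping when the assignment no longer changes."""
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = list(rng.sample(points, k))
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda j, p=p: sum((p[d] - centroids[j][d]) ** 2
                                       for d in range(dim)))
            for p in points
        ]
        if new_assignment == assignment:  # no more correction possible
            break
        assignment = new_assignment
        # Refinement step: recompute each centroid as the member mean.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(p[d] for p in members) / len(members)
                                     for d in range(dim))
    return centroids, assignment
```

On two well-separated groups, e.g. `[(0, 0), (0, 1), (10, 10), (10, 11)]` with k = 2, the loop converges to the centroids (0, 0.5) and (10, 10.5).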
What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!
 An object with an extremely large value may substantially distort the distribution of the data

 K-Medoids: Instead of taking the mean value of the objects in a cluster as the reference point, use a medoid: the most centrally located object in the cluster

 Sometimes the centroid does not make sense: it need not be part of the data. E.g., if the data consist of binary vectors, the centroids are generally not binary vectors.

12
But,

 Finding a medoid is costly:
 A centroid can be found in linear time.
 Finding a medoid can take quadratic time.
 Partitioning Around Medoids (PAM) is a costly method; we have approximations.
 Finding the centroid itself might be difficult in some situations (kernel k-means).

15
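The linear-vs-quadratic contrast above can be shown directly. A sketch assuming Euclidean distance and points as tuples (a full PAM implementation would also iteratively swap medoids, which this omits):

```python
def centroid(points):
    """Coordinate-wise mean: one pass over the data, O(n) time.
    Note the result need not be one of the input points."""
    n, dim = len(points), len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def medoid(points):
    """The data point minimizing total distance to all other points.
    Requires all pairwise distances, so O(n^2) time, but the result
    is always an actual data object (e.g., stays binary for binary data)."""
    def total_dist(p):
        return sum(sum((p[d] - q[d]) ** 2 for d in range(len(p))) ** 0.5
                   for q in points)
    return min(points, key=total_dist)
```

For `[(0, 0), (1, 0), (10, 0)]` the centroid is (11/3, 0), a point not in the data, while the medoid is the actual data point (1, 0).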
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Evaluation of Clustering
 Summary

16
Hierarchical Clustering
 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: AGNES (agglomerative) merges a+b → ab and d+e → de, then c+de → cde, then ab+cde → abcde over steps 0-4; DIANA (divisive) performs the same splits in the reverse order, steps 4-0]

17
Hierarchical Agglomerative Clustering:
Linkage Methods
 The single linkage method is based on the minimum distance, or the nearest neighbor rule.

 The complete linkage method is based on the maximum distance, or the furthest neighbor approach.

 In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster.
[Figure: single linkage uses the minimum distance between Cluster 1 and Cluster 2; complete linkage the maximum distance; average linkage the average distance. A dendrogram visualizes the resulting merge hierarchy]
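The three linkage rules can be sketched in one naive AGNES-style loop. This brute-force version is for illustration only; the function and variable names are my own, and practical systems (e.g., BIRCH) use far more efficient structures:

```python
def agglomerative(points, linkage="single"):
    """Naive agglomerative (AGNES) sketch: start with singleton clusters,
    repeatedly merge the closest pair under the chosen linkage, and
    record each merge (the merge sequence defines the dendrogram)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        pairs = [dist(p, q) for p in c1 for q in c2]
        if linkage == "single":    # minimum distance / nearest neighbor
            return min(pairs)
        if linkage == "complete":  # maximum distance / furthest neighbor
            return max(pairs)
        return sum(pairs) / len(pairs)  # average linkage

    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage rule.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

On the 1-D points 0, 1, 5, 6, 7 with single linkage, the first merge joins 0 and 1, and n points always take exactly n-1 merges to reach one cluster.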
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

21
Determine the Number of Clusters
 Empirical method
 # of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10

 Other methods:
 Elbow method

 Density based methods

22
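The empirical rule above is one line of code; a sketch (the function name is mine):

```python
from math import sqrt

def rule_of_thumb_k(n):
    """Empirical estimate of the number of clusters: k ~ sqrt(n / 2).
    Only a starting point; methods like the elbow heuristic, which picks
    the k where the SSE-vs-k curve bends, should refine it."""
    return round(sqrt(n / 2))
```

For n = 200 this gives k = 10, matching the slide's example.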
Measuring Clustering Quality
 3 kinds of measures: External, internal and relative
 External: supervised, employ criteria not inherent to the dataset
 Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
 Internal: unsupervised, criteria derived from data itself
 Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
 Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm

26
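The silhouette coefficient mentioned above can be sketched in plain Python (scikit-learn also provides `silhouette_score`; this toy version assumes Euclidean distance):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient. For each point: a = mean distance to
    its own cluster, b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b). Near 1 means compact, well-separated
    clusters; near 0 or negative suggests a poor clustering."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:  # singleton cluster: silhouette 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / sum(1 for j in range(len(points)) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Two tight, well-separated groups such as {(0,0), (0,1)} and {(10,0), (10,1)} score around 0.9, reflecting both compactness and separation.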
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Evaluation of Clustering

 Summary

27
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways

28
Important point

 Are we solving a solvable problem?


 An impossibility theorem on clustering (Kleinberg)
 Possibility arguments (Ackerman)

29
