
BASICS of CLUSTER ANALYSIS

Introduction to Cluster Analysis


• Introduction
• Clustering Requirements
• Data Representation
• Partitioning Methods
• K-Means Clustering
• K-Medoids Clustering
• Constrained K-Means Clustering
• PAM and CLARA
Introduction to Cluster Analysis
• The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering
• A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Formal Definition
• Cluster analysis
  – Statistical method for grouping a set of data objects into clusters
  – A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity
• Cluster: a collection of data objects
  – Intra-class similarity: objects are similar to objects in the same cluster
  – Inter-class dissimilarity: objects are dissimilar to objects in other clusters
• Clustering is unsupervised classification
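
To make intra-class similarity and inter-class dissimilarity concrete, here is a minimal sketch (the points and helper function are made up for illustration) that scores a two-cluster grouping by its average within-cluster and between-cluster distances:

import numpy as np

def avg_pairwise_distance(A, B):
    """Mean Euclidean distance over all pairs drawn from arrays A and B."""
    return np.mean([np.linalg.norm(x - y) for x in A for y in B])

# Hypothetical 2-D points already split into two clusters
cluster1 = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])
cluster2 = np.array([[8.0, 8.0], [7.8, 8.3], [8.2, 7.9]])

# Intra-class similarity: average distance within each cluster is small
# (self-pairs contribute zero distance; acceptable for a rough illustration)
intra = (avg_pairwise_distance(cluster1, cluster1) +
         avg_pairwise_distance(cluster2, cluster2)) / 2

# Inter-class dissimilarity: average distance across clusters is large
inter = avg_pairwise_distance(cluster1, cluster2)

print(f"avg intra-cluster distance: {intra:.2f}")  # small for a good clustering
print(f"avg inter-cluster distance: {inter:.2f}")  # large for a good clustering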


Supervised vs. Unsupervised Learning
• Unsupervised learning - clustering
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., establish the existence of clusters in the data
• Supervised learning - classification
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity
Clustering vs. Classification
• Clustering - learning by observations
  – Unsupervised
  – Input:
    • Clustering algorithm
    • Similarity measure
    • Number of clusters
  – No class information is given for the data objects
• Classification - learning by examples
  – Supervised
  – Consists of class-labeled training data examples
  – Build a classifier that assigns data objects to one of the classes
Clustering vs. Classification
• Class label attribute: loan_decision
• Learning of the classifier is "supervised" → it is told to which class each training tuple (sample) belongs

(Han & Kamber, Data Mining: Concepts and Techniques, Chapter 6, Page 287)
Clustering vs. Classification

• Clustering
  – Class label of each training tuple is not known
  – Number or set of classes to be learned may not be known in advance
  – e.g., if we did not have loan_decision data available, we would use clustering, NOT classification, to determine "groups of like tuples"
  – These "groups of like tuples" may eventually correspond to risk groups within the loan application data
Typical Requirements Of Clustering
• Minimal requirements for domain knowledge to determine input parameters
  – Many clustering algorithms require users to input certain parameters (such as the number of desired clusters)
  – The clustering results can be quite sensitive to input parameters
  – Parameters are often difficult to determine, especially for data sets containing high-dimensional objects
Typical Requirements Of Clustering
• Scalability
  – Many clustering algorithms work well on small data sets; a large database may contain millions of objects
  – Clustering on a sample of a given large data set may lead to biased results
  – Highly scalable clustering algorithms are needed
• Ability to deal with different types of attributes
  – Many algorithms are designed to cluster numerical data
  – Applications may require clustering other types of data: binary, categorical (nominal), and ordinal data, or mixtures of these data types
Typical Requirements Of Clustering

• Ability to deal with noisy data
  – Some clustering algorithms are sensitive to noisy data and may lead to clusters of poor quality
• Incremental clustering and insensitivity to the order of input records
• Constraint-based clustering
• Interpretability and usability
Typical Requirements Of Clustering
• Discovery of clusters with arbitrary shape
  – Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures
  – Algorithms based on such distance measures tend to find spherical clusters of similar size and density
  – A cluster could be of any shape, so it is important to develop algorithms that can detect clusters of arbitrary shape
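
For reference, a minimal sketch of the two distance measures mentioned above (the point values are made up):

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((p - q) ** 2))  # 5.0

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(p - q))          # 7.0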
Examples of Clustering Applications

• Marketing:
  – Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance:
  – Identifying groups of insurance policy holders with a high average claim cost
Examples of Clustering Applications
• City planning:
  – Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies:
  – Observed earthquake epicenters tend to be clustered along continental faults
• Fraud detection:
  – Detection of credit card fraud and monitoring of criminal activities in electronic commerce
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to step 2
• STOP when no new assignments are made
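
A minimal NumPy sketch of these steps (illustrative only; as is common in practice, initialization picks k objects as the initial seed points rather than forming an explicit first partition):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means on an (n, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # STOP: no assignment changed
        labels = new_labels
        # Step 2: recompute each centroid as the mean point of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster goes empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy usage with made-up 2-D points and k = 2
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = k_means(X, k=2)
print(labels)  # e.g. [0 0 1 1 1]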
The K-Means Clustering Method

Example (book, page 31): with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects, and repeat until no reassignment occurs.

[Figure: three scatter plots over a 0-10 grid showing the initial assignment, the updated cluster means, and the subsequent reassignments for K = 2.]
The k-Means Algorithm

The basic steps of k-means clustering are simple:

• Iterate until stable, i.e., there is no change in the clusters of objects:
  – Determine the centroid coordinates
  – Determine the distance of each object to the centroids
  – Group the objects based on minimum distance
Comments on the K-Means Method

• Strength:
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t << n
  – Compare PAM: O(k(n-k)²) per iteration, and CLARA: O(ks² + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum
  – The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Comments on the K-Means Method

• Weakness
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
Hierarchical Clustering
Hierarchical clustering: creating a hierarchical decomposition of the set of objects, using the similarity matrix as the clustering criterion

Two main algorithms: 1. Agglomerative method  2. Divisive method

Similarity matrix: linkage methods
- MIN
- MAX
- Group average
- Distance of centroids

Rokach, Lior, and Oded Maimon. "Clustering methods." Data Mining and Knowledge Discovery Handbook. Springer US, 2005. 321-352.
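
A minimal sketch of the four linkage (inter-cluster distance) rules above, for two clusters given as NumPy arrays (the helper and data layout are my own assumptions):

import numpy as np

def pairwise_distances(A, B):
    """All Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def min_linkage(A, B):        # MIN (single link): distance of the closest pair
    return pairwise_distances(A, B).min()

def max_linkage(A, B):        # MAX (complete link): distance of the farthest pair
    return pairwise_distances(A, B).max()

def group_average(A, B):      # Group average: mean distance over all pairs
    return pairwise_distances(A, B).mean()

def centroid_distance(A, B):  # Distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))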
Agglomerative Algorithm
Main steps:
1. Let each data point be a cluster
2. Initialize and compute the similarity matrix
Repeat:
3. Merge the two closest clusters
4. Update the similarity matrix
Until only a single cluster remains
5. Draw the dendrogram of the sequence of merges
6. Cut the dendrogram at a chosen level to form a particular clustering

Zhang et al. "Graph degree linkage: Agglomerative clustering on a directed graph." 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012.
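
A minimal sketch of these steps using single (MIN) linkage, tracking clusters as index sets; implementation details beyond the steps above (and the lazy recomputation of linkage distances instead of an explicit matrix update) are my own:

import numpy as np

def agglomerative_min_linkage(X):
    """Merge clusters of the rows of X until one remains; return the merge sequence."""
    # Steps 1-2: each point is a cluster; compute the pairwise distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [{i} for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Step 3: find the two closest clusters under MIN (single) linkage
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Step 4: merge the pair; linkage distances are recomputed from D next pass
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
    return merges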
Agglomerative Algorithm Example
Using the MIN linkage method.

[Figure: seven points p1-p7 in the plane; the two closest clusters are merged and the similarity matrix is updated at each step.]
Agglomerative Algorithm Example

Visualized as a dendrogram: A tree like diagram that


records the sequences of merges or splits
{P1, P2, P3, P4, P5, P6}, {P7}

{P1, P2, P3}, {P4}, {P5, P6}, {P7}

Different cuttings
generate different
clusters!
P1 P2 P3 P4 P5 P6 P7
Clustering: Hierarchical Methods
Hierarchical Clustering

• Agglomerative approach (bottom-up)
  – Initialization: each object is a cluster
  – Iteration: merge the two clusters that are most similar to each other, until all objects are merged into a single cluster

[Figure: objects a-e merged bottom-up over steps 0-4: a, b → ab; d, e → de; c, de → cde; ab, cde → abcde.]
Hierarchical Clustering

• Divisive approach (top-down)
  – Initialization: all objects stay in one cluster
  – Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object

[Figure: the same hierarchy read top-down over steps 0-4: abcde splits into ab and cde; cde splits into c and de; and so on down to singletons.]
Dendrogram
• A tree that shows how clusters are merged/split hierarchically
• Each node of the tree is a cluster; each leaf node is a singleton cluster
Dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
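
A minimal sketch of building and cutting a dendrogram with SciPy's scipy.cluster.hierarchy module (the data points are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points: two tight groups plus one distant point
X = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.3],
              [5.0, 5.0], [5.2, 5.1],
              [9.0, 0.0]])

# Build the merge hierarchy with single (MIN) linkage
Z = linkage(X, method='single')

# Cutting at different distance levels yields different clusterings
print(fcluster(Z, t=2.0, criterion='distance'))  # e.g. 3 clusters
print(fcluster(Z, t=6.0, criterion='distance'))  # fewer, coarser clusters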
Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique

• Basic algorithm is straightforward:
  1. Compute the distance matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
