Clustering
Data Mining: Concepts and Techniques (Chapter 6, Page 287)
Clustering vs. Classification
• Clustering
– the class label of each training tuple is not known
– the number or set of classes to be learned may not be known in advance
– e.g., if we did not have loan_decision data available, we would use clustering, NOT classification, to determine "groups of like tuples"
– these "groups of like tuples" may eventually correspond to risk groups within the loan application data
Typical Requirements of Clustering
• Minimal requirements for domain knowledge to determine input parameters
Examples of Clustering Applications
• Marketing:
• Helping marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance:
• Identifying groups of insurance policy holders with a high average claim cost
• City-planning:
• Identifying groups of houses according to their
house type, value, and geographical location
• Earthquake studies:
• Observing earthquake epicenters clustered along continental faults
• Fraud detection:
• Detecting credit card fraud and monitoring criminal activities in electronic commerce
The K-Means Clustering Method
[Figure: K-means example with K = 2. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign objects; repeat until assignments no longer change.]
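The assign/update loop described above can be sketched in plain Python. This is a minimal illustration, not the textbook's code; the function name and the toy data are our own, and `math.dist` computes Euclidean distance:

```python
import math

def kmeans(points, k, initial_centers, max_iter=100):
    """Basic k-means: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: each object goes to the most similar center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: recompute the mean of each cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centers.append(centers[i])  # keep an empty cluster's old center
        if new_centers == centers:  # no center moved: assignments are stable
            break
        centers = new_centers
    return centers, clusters
```

For example, six points forming two obvious groups converge in two iterations: `kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)], 2, [(0, 0), (10, 10)])` places three points in each cluster.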
The k-Means Algorithm
• Strength:
• Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n
• Compare PAM: O(k(n-k)²), and CLARA: O(ks² + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum
• The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Comments on the K-Means Method
• Weakness
– Applicable only when a mean is defined; then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers well
– Not suited to discovering clusters with non-convex shapes
Hierarchical Clustering
Hierarchical clustering: creating a hierarchical decomposition of the set of objects, using a similarity matrix as the clustering criterion
Zhang, et al. "Graph degree linkage: Agglomerative clustering on a directed graph." 12th European Conference on Computer Vision, Florence, Italy,
October 7–13, 2012.
Agglomerative Algorithm Example
Using the MIN (single) linkage method
[Figure: seven points p1-p7; at each step, merge the two clusters with the smallest MIN-linkage distance and update the distance matrix.]
[Figure: the resulting dendrogram over P1-P7; different cuttings generate different clusters!]
Hierarchical Methods
Hierarchical Clustering
• Agglomerative approach
Initialization: each object is a cluster
Iteration: merge the two clusters which are most similar to each other, until all objects are merged into a single cluster
[Figure: objects a-e merged bottom-up: a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde.]
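The agglomerative loop with MIN (single) linkage can be sketched as follows. This is a minimal illustration under our own assumptions: the function name is ours, and the pairwise distances in the usage example are hypothetical values chosen to reproduce the a-e merge order from the figure:

```python
def single_linkage(items, dist):
    """Agglomerative clustering with MIN (single) linkage: start from
    singleton clusters and repeatedly merge the closest pair of clusters
    until only one remains. Returns the merge steps in order."""
    def d(a, b):  # symmetric lookup in the pairwise-distance dict
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    clusters = [{x} for x in items]   # initialization: each object is a cluster
    merges = []
    while len(clusters) > 1:
        # MIN linkage: cluster distance = smallest pairwise member distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), dij))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] \
                   + [clusters[i] | clusters[j]]
    return merges
```

With distances where a-b and d-e are the closest pairs, the merge order matches the figure: a with b, then d with e, then c with {d, e}, and finally {a, b} with {c, d, e}.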
Dendrogram
• A tree that shows how clusters are merged/split hierarchically
• Each node of the tree is a cluster; each leaf node is a singleton cluster
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
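Cutting a dendrogram at a level can be sketched as follows: only merges below the cut height are applied, and each surviving group is one cluster. This is a minimal illustration under our own assumptions; the function name is ours, and the merge list in the example is a hypothetical dendrogram over objects a-e:

```python
def cut_dendrogram(leaves, merges, height):
    """Cut a dendrogram at the given height: apply only the merges whose
    height is below the cut. `merges` is a list of (leaf_a, leaf_b, h)
    triples in increasing height, each naming one leaf from either side."""
    clusters = {leaf: {leaf} for leaf in leaves}   # representative -> members
    rep = {leaf: leaf for leaf in leaves}          # leaf -> its representative
    for a, b, h in merges:
        if h >= height:        # merges are sorted, so we can stop here
            break
        ra, rb = rep[a], rep[b]
        merged = clusters[ra] | clusters[rb]
        for leaf in merged:    # re-point every member at one representative
            rep[leaf] = ra
        clusters[ra] = merged
        del clusters[rb]
    return sorted(sorted(c) for c in clusters.values())
```

Different cut heights give different clusterings of the same dendrogram: a low cut leaves many small clusters, a high cut few large ones.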
Agglomerative Clustering Algorithm