IT426
Data Mining: Concepts & Techniques
Raja Ram Dutta
Assistant Professor
Department of Computer Science & Engineering,
BIT Mesra, Off-Campus Deoghar
1
◼ Clustering: the process of grouping a set of
objects into classes of similar objects
◼ Documents within a cluster should be similar.
◼ Documents from different clusters should be
dissimilar.
2
3
What is Cluster Analysis?
◼ Cluster: A collection of data objects
◼ similar (or related) to one another within the same group
◼ dissimilar (or unrelated) to the objects in other groups
◼ Cluster analysis (or clustering, data segmentation, …)
◼ Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
◼ Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
◼ Typical applications
◼ As a stand-alone tool to get insight into data distribution
◼ As a preprocessing step for other algorithms
4
Clustering Applications
◼ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
◼ Information retrieval: document clustering
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
◼ Climate: understanding Earth's climate; finding patterns in atmospheric
and ocean data
◼ Economic Science: market research
5
Quality: What Is Good Clustering?
◼ A good clustering method will produce high quality
clusters
◼ high intra-class similarity: cohesive within clusters
◼ low inter-class similarity: distinctive between clusters
◼ The quality of a clustering method depends on
◼ the similarity measure used by the method
◼ its implementation, and
◼ its ability to discover some or all of the hidden patterns
6
Measure the Quality of Clustering
◼ Dissimilarity/Similarity metric
◼ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
◼ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
◼ Weights should be associated with different variables
based on applications and data semantics
◼ Quality of clustering:
◼ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
◼ It is hard to define “similar enough” or “good enough”
◼ The answer is typically highly subjective
7
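A minimal sketch of a distance-based dissimilarity d(i, j), here the Euclidean distance with optional per-variable weights (assuming numeric vector data and NumPy; the function name is illustrative, not from any particular library):

import numpy as np

def weighted_euclidean(x, y, w=None):
    """Euclidean distance d(x, y); w optionally weights each variable."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

# Example: weighting the second variable twice as heavily changes the distance
print(weighted_euclidean([1.0, 2.0], [4.0, 6.0]))            # 5.0
print(weighted_euclidean([1.0, 2.0], [4.0, 6.0], w=[1, 2]))  # ~6.4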
Considerations for Cluster Analysis
◼ Partitioning criteria
◼ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
◼ Separation of clusters
◼ Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one
class)
◼ Similarity measure
◼ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
◼ Clustering space
◼ Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
8
Major Clustering Approaches
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects)
using some criterion
◼ Typical methods: Diana, Agnes, BIRCH, CAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue
◼ Grid-based approach:
◼ based on a multiple-level granularity structure
◼ Typical methods: STING, WaveCluster, CLIQUE
9
Major Clustering Approaches
◼ Model-based:
◼ A model is hypothesized for each of the clusters, and the goal is to
find the best fit of the data to the given model
◼ Typical methods: EM, SOM, COBWEB
◼ Frequent pattern-based:
◼ Based on the analysis of frequent patterns
◼ Typical methods: p-Cluster
◼ User-guided or constraint-based:
◼ Clustering by considering user-specified or application-specific
constraints
◼ Typical methods: COD (obstacles), constrained clustering
◼ Link-based clustering:
◼ Objects are often linked together in various ways
◼ Massive links can be used to cluster objects: SimRank, LinkClus
10
Partitioning Clustering
◼ Partitioning method: Construct a partition of n
documents into a set of K clusters
◼ Given: a set of documents and the number K
◼ Find: a partition of K clusters that optimizes
the chosen partitioning criterion
◼ Globally optimal
◼ Intractable for many objective functions
◼ Effective heuristic methods: K-means and K-
medoids algorithms
11
[Figures: Partitioning Clustering illustrations]
Partitioning Algorithms: Basic Concept
◼ Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)
E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2
◼ Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
◼ Global optimal: exhaustively enumerate all partitions
◼ Heuristic methods: k-means and k-medoids algorithms
◼ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
◼ k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
16
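A minimal NumPy sketch of the sum-of-squared-errors criterion E above for a given assignment of points to clusters (the function name and data are illustrative, not from any particular library):

import numpy as np

def sse(points, labels, centroids):
    """Sum of squared distances of each point to its cluster centroid."""
    points, centroids = np.asarray(points, float), np.asarray(centroids, float)
    return float(sum(np.sum((points[labels == i] - c) ** 2)
                     for i, c in enumerate(centroids)))

pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
lab = np.array([0, 0, 1, 1])
cen = np.array([pts[lab == 0].mean(axis=0), pts[lab == 1].mean(axis=0)])
print(sse(pts, lab, cen))   # smaller E means a tighter partition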
The K-Means Clustering Method
◼ Assumes documents are real-valued vectors.
◼ Clusters based on centroids (aka the center of
gravity or mean) of points in a cluster, c:
\mu(c) = \frac{1}{|c|} \sum_{x \in c} x
◼ Reassignment of instances to clusters is based
on distance to the current cluster centroids.
◼ (Or one can equivalently phrase it in terms of
similarities)
17
The K-Means Clustering Method
◼ Given k, the k-means algorithm is implemented in four
steps:
◼ Partition objects into k nonempty subsets
◼ Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
◼ Assign each object to the cluster with the nearest
seed point
◼ Go back to Step 2, stop when the assignment does
not change
18
The K-Means Clustering Method
Select K random docs {s1, s2, … sK} as seeds.
Until clustering converges (or another stopping criterion is met):
    For each doc di:
        Assign di to the cluster cj such that dist(di, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
19
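A compact NumPy sketch of the K-means loop described above: random seed selection, nearest-seed assignment, and centroid update until the assignments stop changing. This is a plain illustration under those assumptions, not an optimized implementation; the data and function name are made up:

import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    X = np.asarray(X, dtype=float)
    seeds = X[rng.choice(len(X), size=k, replace=False)]    # pick K random points as seeds
    labels = None
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                     # assign each point to nearest seed
        if labels is not None and np.array_equal(labels, new_labels):
            break                                            # assignments unchanged: converged
        labels = new_labels
        seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                          for j in range(k)])                # move each seed to its cluster centroid
    return labels, seeds

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5], [1, 0.5], [8.5, 9]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)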
An Example of K-Means Clustering
K = 2
[Figure: starting from the initial data set, objects are arbitrarily partitioned into k groups; the cluster centroids are updated, objects are reassigned to the nearest centroid, and the loop repeats if needed]
◼ Partition objects into k nonempty subsets
◼ Repeat
◼ Compute centroid (i.e., mean point) for each partition
◼ Assign each object to the cluster of its nearest centroid
◼ Until no change
20
Comments on the K-Means Method
◼ Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
◼ Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
◼ Comment: Often terminates at a local optimum.
◼ Weakness
◼ Applicable only to objects in a continuous n-dimensional space
◼ Using the k-modes method for categorical data
◼ In comparison, k-medoids can be applied to a wide range of
data
◼ Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009)
◼ Sensitive to noisy data and outliers
◼ Not suitable to discover clusters with non-convex shapes
21
K-Means
◼ Let us take the number of clusters as 2, with the following input set:
{1, 2, 3, 5, 10, 12, 22, 32, 16, 18}
◼ Step 1: We randomly assign the means: m1 = 3, m2 = 5
◼ Step 2: K1 = {1,2,3}, K2 = {5,10,12,22,32,16,18}, m1 = 2, m2 = 16.43
Now redefine the clusters as per the closest mean:
◼ Step 3: K1 = {1,2,3,5}, K2 = {10,12,22,32,16,18}
◼ Calculate the means once again: m1 = 2.75, m2 = 18.33
◼ Step 4: K1 = {1,2,3,5}, K2 = {10,12,22,32,16,18}, m1 = 2.75, m2 = 18.33
Stop as the clusters with these means are the same.
22
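The reassignment steps of this 1-D example can be checked with a few lines of Python (a sketch; the assign function below is illustrative, not part of any library):

def assign(points, m1, m2):
    """Split points into two clusters by distance to the current means m1, m2."""
    k1 = [x for x in points if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in points if abs(x - m1) > abs(x - m2)]
    return k1, k2

data = [1, 2, 3, 5, 10, 12, 22, 32, 16, 18]
k1, k2 = assign(data, 3, 5)                       # Step 2: K1 = [1, 2, 3]
m1, m2 = sum(k1) / len(k1), sum(k2) / len(k2)     # means 2 and 16.43
k1, k2 = assign(data, m1, m2)                     # Step 3: K1 = [1, 2, 3, 5]
print(k1, sum(k1) / len(k1), k2, sum(k2) / len(k2))   # means 2.75 and 18.33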
Hierarchical Clustering
◼ Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: AGNES agglomerates objects a, b, c, d, e step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; DIANA (divisive) proceeds through the same steps in reverse order]
24
AGNES (Agglomerative Nesting)
◼ Introduced in Kaufmann and Rousseeuw (1990)
◼ Implemented in statistical packages, e.g., Splus
◼ Use the single-link method and the dissimilarity matrix
◼ Merge nodes that have the least dissimilarity
◼ Go on in a non-descending fashion
◼ Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing how single-link merging gradually joins nearby points into larger clusters]
25
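A minimal sketch of single-link agglomerative clustering in Python using SciPy (scipy.cluster.hierarchy is assumed to be available; the data points are made up for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage

# Small 2-D data set: two visually separated groups
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.8, 0.9],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3]])

# Single-link (minimum-dissimilarity) agglomerative clustering, as in AGNES
Z = linkage(X, method='single', metric='euclidean')
print(Z)   # each row: the two clusters merged, their dissimilarity, and the new cluster size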
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each connected
component then forms a cluster
26
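Cutting the dendrogram at a chosen level can be sketched with SciPy's fcluster (assuming the same single-link tree as above; the distance threshold 2.0 is an arbitrary illustrative value):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.8, 0.9],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3]])
Z = linkage(X, method='single')

# Cut the dendrogram where the merge distance exceeds 2.0;
# each connected component below the cut becomes one cluster
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)   # e.g., [1 1 1 2 2 2]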
DIANA (Divisive Analysis)
◼ Introduced in Kaufmann and Rousseeuw (1990)
◼ Implemented in statistical analysis packages, e.g., Splus
◼ Inverse order of AGNES
◼ Eventually each node forms a cluster on its own
[Figure: scatter plots showing the data set being split into progressively smaller clusters]
27
Density-Based Clustering Methods
◼ Clustering based on density (local cluster criterion), such
as density-connected points
◼ Major features:
◼ Discover clusters of arbitrary shape
◼ Handle noise
◼ One scan
◼ Need density parameters as termination condition
◼ Several interesting studies:
◼ DBSCAN
◼ OPTICS
◼ DENCLUE
◼ CLIQUE
28
Density-Based Clustering: Basic Concepts
◼ Two parameters:
◼ Eps: Maximum radius of the neighbourhood
◼ MinPts: Minimum number of points in an Eps-
neighbourhood of that point
◼ NEps(p): {q belongs to D | dist(p, q) ≤ Eps}
◼ Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
◼ p belongs to NEps(q)
◼ core point condition: |NEps(q)| ≥ MinPts
[Figure: point p lies in the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm]
29
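A small sketch of the Eps-neighbourhood and the core-point condition (NumPy only; the function names and data are illustrative):

import numpy as np

def eps_neighbourhood(X, p_idx, eps):
    """Indices of all points within distance eps of point p (including p itself)."""
    d = np.linalg.norm(X - X[p_idx], axis=1)
    return np.where(d <= eps)[0]

def is_core(X, p_idx, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighbourhood(X, p_idx, eps)) >= min_pts

X = np.array([[0, 0], [0.2, 0.1], [0.1, -0.2], [0.3, 0.2], [5, 5]])
print(is_core(X, 0, eps=0.5, min_pts=4))   # True: four points lie within 0.5 of point 0
print(is_core(X, 4, eps=0.5, min_pts=4))   # False: point 4 is isolated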
Density-Reachable and Density-Connected
◼ Density-reachable:
◼ A point p is density-reachable from a point q w.r.t. Eps, MinPts
if there is a chain of points p1, …, pn, with p1 = q and pn = p,
such that pi+1 is directly density-reachable from pi
◼ Density-connected
◼ A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o w.r.t. Eps and MinPts
30
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
◼ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
◼ Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier (noise) points, with Eps = 1 cm and MinPts = 5]
31
DBSCAN: The Algorithm
◼ Arbitrarily select a point p
◼ Retrieve all points density-reachable from p w.r.t. Eps
and MinPts
◼ If p is a core point, a cluster is formed
◼ If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
◼ Continue the process until all of the points have been
processed
32
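A brief usage sketch with scikit-learn's DBSCAN implementation (assuming scikit-learn is installed; eps and min_samples are arbitrary illustrative values):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should come out as noise
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.0], [1.2, 0.9],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1], [8.2, 7.9],
              [4.5, 15.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # cluster id per point; -1 marks noise/outliers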