6 - Clustering and Applications and Trends in Data Mining
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Clusters can be Ambiguous
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
• Partitional clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
[Figure: traditional hierarchical clustering of points p1–p4 and the corresponding traditional dendrogram]
[Figure: non-traditional hierarchical clustering of points p1–p4 and the corresponding non-traditional dendrogram]
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual clusters
Types of Clusters: Well-Separated
• Well-separated clusters:
• A cluster is a set of points such that any point in a cluster is closer (or more similar) to
every other point in the cluster than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that an object in a cluster is closer (more similar)
to the “center” of a cluster than to the center of any other cluster
• The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
• Contiguous clusters (nearest neighbor or transitive)
• A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points, separated from other regions of high density by regions of low density.
• Used when the clusters are irregular or intertwined, and when noise and outliers are
present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
• Finds clusters that share some common property or represent a particular concept.
2 Overlapping Circles
What is a natural grouping of these objects?
Clustering is subjective
What is Similarity?
Similarity is hard to define, but…
“We know it when we see it”
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then
evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the
set of objects using some criterion
[Figure: a hierarchical clustering (dendrogram) and a partitional clustering of the same objects]
Dendrogram: A Useful Tool for Summarizing Similarity Measurements
[Figure: anatomy of a dendrogram, labelling the root, internal nodes, internal branches, terminal branches and leaves]
The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
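That shared-node height is what SciPy calls the cophenetic distance; a minimal sketch, assuming SciPy is installed (the data and linkage method are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
Z = linkage(pdist(points), method="average")    # build the dendrogram
coph = squareform(cophenet(Z))                  # height of the lowest shared node, per pair

# Points 0 and 1 merge low in the tree; 0 and 2 only merge near the root.
print(coph[0, 1], coph[0, 2])
```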
There is only one dataset that
can be perfectly clustered
using a hierarchy…
[Figure: a demonstration of hierarchical clustering using string edit distance. The dendrogram groups variants of the same names: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peka, Peadar; Michalis, Michael, Miguel, Mick; Cristovao, Christopher, Christophe, Christoph, Crisdean, Cristobal, Cristoforo, Kristoffer, Krystof]
Hierarchical clustering can sometimes show patterns that are meaningless or spurious.
The tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful: all these countries are former UK colonies.
However, the tight grouping of Niger and India is completely spurious; there is no connection between the two.
Outlier
Hierarchical Clustering
The number of dendrograms with n leaves = (2n − 3)! / [2^(n−2) (n − 2)!]
Number of leaves → number of possible dendrograms: 2 → 1; 3 → 3; 4 → 15; 5 → 105; …; 10 → 34,459,425
Since we cannot test all possible trees, we will have to search heuristically over the possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
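A quick check of the counting formula above, as a small Python function (illustrative, not from the slides):

```python
from math import factorial

def num_dendrograms(n: int) -> int:
    """Number of possible dendrograms (rooted binary trees) with n labelled
    leaves: (2n - 3)! / [2^(n-2) * (n - 2)!]."""
    if n < 2:
        return 1
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425
```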
[Figure: the distance matrix holding every pairwise distance between the objects, e.g. D(·, ·) = 8, D(·, ·) = 1]
At each step, consider all possible merges… and choose the best. Repeat until everything has been merged into a single cluster.
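A minimal sketch of this bottom-up loop (names are illustrative; Euclidean point distances and single-link cluster distances are assumptions, since the slides do not fix a linkage):

```python
import numpy as np

def agglomerate(points, num_clusters=1):
    """Naive bottom-up (agglomerative) clustering: start with every point in
    its own cluster and repeatedly merge the closest pair of clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = (0, 1, np.inf)
        # Consider all possible merges...
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])   # single link
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        # ...and choose the best.
        a, b, _ = best
        clusters[a] += clusters.pop(b)
    return clusters

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])
print(agglomerate(pts, num_clusters=2))   # [[0, 1], [2, 3, 4]]
```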
Points to remember:
For a given number of partitions (say k), the partitioning method creates an initial partitioning. It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
Partitional Clustering
• Nonhierarchical, each instance is placed in exactly one
of K non-overlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of clusters K.
Partition Algorithm 1: k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them
to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships
found above are correct.
5. If none of the N objects changed membership in the last iteration,
exit. Otherwise go to step 3.
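A minimal NumPy sketch of these five steps (illustrative: centers are initialized from randomly chosen data points, and empty clusters are not handled here):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and initialize the k centers from random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    while True:
        # Step 3: assign each of the N objects to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: exit once no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.vstack([np.random.default_rng(1).normal(loc=m, scale=0.3, size=(20, 2))
               for m in (0.0, 3.0, 6.0)])
centers, labels = k_means(X, k=3)
print(np.round(centers, 2))
```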
K-means Clustering: Steps 1–5
[Figure: five panels tracing the k-means algorithm with Euclidean distance on a 2-D data set, axes "expression in condition 1" and "expression in condition 2". The three centers k1, k2, k3 start at arbitrary positions, points are assigned to their nearest center, the centers are re-estimated, and the process repeats until the assignments stop changing.]
Comments on k-Means
• Strengths
• Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum.
• Weakness
• Applicable only when a mean is defined; what about categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
How do we measure similarity?
[Figure: Peter vs. Piotr: is the distance between them 0.23, 3, or 342.7?]
A generic technique for measuring similarity
To measure the similarity between two objects,
transform one into the other, and measure how
much effort it took. The measure of effort
becomes the distance measure.
Example: transforming Peter into Piotr
Peter → Piter (substitution: i for e)
Piter → Pioter (insertion: o)
Pioter → Piotr (deletion: e)
It took three operations, so the edit distance between Peter and Piotr is 3.
[Figure: the name-variant dendrogram built from this edit distance (Pedro, Petros, Pietro, Pierre, Piero, Peter, Piotr, Pyotr, …)]
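A minimal dynamic-programming sketch of this edit (Levenshtein) distance, counting insertions, deletions and substitutions (illustrative, not from the slides):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions needed to
    transform string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1     # 0 when the characters already match
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

print(edit_distance("Peter", "Piotr"))   # 3
```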
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
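Because the basic algorithm is so simple, it is usually run from a library. A hedged sketch with scikit-learn (the library and parameter values here are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))          # toy 2-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster index assigned to each point
print(km.inertia_)           # SSE: sum of squared distances to the closest centroid
```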
K-means Clustering – Details
[Figure: the original points and the K-means clusterings produced from them (scatter plots of y against x)]
Handling Empty Clusters
• The basic K-means algorithm can produce empty clusters. Several strategies for choosing a replacement centroid:
• Choose the point that contributes most to SSE
• Choose a point from the cluster with the highest SSE
• If there are several empty clusters, the above can be repeated several times.
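A minimal sketch of the two SSE-based strategies above (helper names are illustrative; SSE is the sum of squared distances from each point to its cluster centroid):

```python
import numpy as np

def sse_per_cluster(X, labels, centers):
    """SSE of each cluster: sum of squared distances from its points to its centroid."""
    return np.array([((X[labels == j] - centers[j]) ** 2).sum()
                     for j in range(len(centers))])

def replacement_centroid(X, labels, centers):
    """Strategy 1: reuse the single point that contributes most to the total SSE."""
    contrib = ((X - centers[labels]) ** 2).sum(axis=1)
    return X[contrib.argmax()]

def point_from_worst_cluster(X, labels, centers, seed=0):
    """Strategy 2: pick a point from the cluster with the highest SSE."""
    worst = sse_per_cluster(X, labels, centers).argmax()
    members = np.flatnonzero(labels == worst)
    return X[np.random.default_rng(seed).choice(members)]
```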
Updating Centers Incrementally
• In the basic K-means algorithm, centroids are updated only after all points have been assigned to a centroid
• An alternative is to update each centroid incrementally, after every single assignment
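A minimal sketch of such an incremental (running-mean) update, assuming a centroid is the mean of its members (function names are illustrative):

```python
import numpy as np

def add_point(center, size, x):
    """Running-mean update when point x joins a cluster with `size` members."""
    new_size = size + 1
    return center + (np.asarray(x) - center) / new_size, new_size

def remove_point(center, size, x):
    """Inverse update when point x leaves the cluster (assumes size > 1)."""
    new_size = size - 1
    return (center * size - np.asarray(x)) / new_size, new_size
```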
• Post-processing
• Eliminate small clusters that may represent outliers
• Split ‘loose’ clusters, i.e., clusters with relatively high SSE
• Merge clusters that are ‘close’ and that have relatively low SSE
Bisecting K-means
• Bisecting K-means algorithm
• Variant of K-means that can produce a partitional or a hierarchical
clustering
Bisecting K-means Example
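A hedged sketch of the bisecting idea: repeatedly pick a cluster (here, the one with the largest SSE; the slides do not fix the selection rule) and split it in two with ordinary K-means, until K clusters remain. scikit-learn is assumed available.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]                                   # start with one big cluster
    while len(clusters) < k:
        # Pick the cluster with the largest SSE and bisect it with 2-means.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(target)
        clusters += [target[km.labels_ == 0], target[km.labels_ == 1]]
    return clusters
```

Newer scikit-learn releases also provide a ready-made sklearn.cluster.BisectingKMeans.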
Limitations of K-means
• K-means has problems when clusters are of differing
• Sizes
• Densities
• Non-globular shapes
[Figure: dendrograms of the same data set under single linkage and under average linkage; the leaf orderings and groupings differ]
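A hedged SciPy/Matplotlib sketch of how such a side-by-side comparison can be produced (libraries assumed available; the data here is random, not the data from the figure):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, method in zip(axes, ("single", "average")):
    dendrogram(linkage(X, method=method), ax=ax)   # build and draw the hierarchy
    ax.set_title(f"{method} linkage")
plt.show()
```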