
Clustering and Applications

and Trends in Data Mining


Lectures 30 to 35

Dr. Shweta Sharma


Manipal University Jaipur
Topics Covered
• Clustering:
• Basic Concepts,
• Cluster Analysis,
• K-Means,
• Partitioning Methods,
• Hierarchical Clustering,
• Expectation-Maximization,
• Density-based Clustering,
• Web Mining, Text Mining, Spatial Mining.
• Case Study:
• Case Studies on Various Data Mining Techniques with Varying Data Sets.
What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects
  • Documents within a cluster should be similar.
  • Documents from different clusters should be dissimilar.
• The most common form of unsupervised learning
  • Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
• A common and important task that finds many applications in IR and other places
Goal of Clustering
• Given a set of data points, each described by a set of attributes, find clusters such that:
  • Intra-cluster similarity is maximized
  • Inter-cluster similarity is minimized

[Scatter plot of points in a two-feature space (F1 vs. F2) forming well-separated groups.]

• Requires the definition of a similarity measure
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Illustration: intra-cluster distances are minimized, while inter-cluster distances are maximized.]
Clusters can be Ambiguous

[The same set of points can be grouped in different ways. How many clusters? Two clusters, four clusters, or six clusters are all plausible answers.]

Types of Clusterings

• A clustering is a set of clusters


• Important distinction between hierarchical and partitional sets of
clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitional Clustering

[Figure: the original points and one partitional clustering of them.]

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1, p2, p3, p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.]
Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Types of Clusters: Well-Separated

• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer (or more similar) to
every other point in the cluster than to any point not in the cluster.

3 well-separated clusters
Types of Clusters: Center-Based

• Center-based
• A cluster is a set of objects such that an object in a cluster is closer (more similar)
to the “center” of a cluster than to the center of any other cluster
• The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster

4 center-based clusters
Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)


• A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters
Types of Clusters: Density-Based

• Density-based
• A cluster is a dense region of points, which is separated from other regions of high density by low-density regions.
• Used when the clusters are irregular or intertwined, and when noise and outliers are
present.

6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
• Finds clusters that share some common property or represent a particular concept.

[Example: 2 overlapping circles]
What is a natural grouping of these objects?

Clustering is subjective.

What is Similarity?
Similarity is hard to define, but…
"We know it when we see it"
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then
evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the
set of objects using some criterion

[Illustration: a hierarchical dendrogram vs. a partitional grouping of the same objects.]
Dendrogram: A Useful Tool for Summarizing Similarity Measurements

[Diagram of a dendrogram with its parts labeled: root, internal node, internal branch, terminal branch, leaf.]

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.

There is only one dataset that can be perfectly clustered using a hierarchy…
[Dendrogram: a demonstration of hierarchical clustering using string edit distance on given names. The leaves include variants of "Peter" (Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peder, Peka, Peadar), of "Michael" (Michalis, Michael, Miguel, Mick), and of "Christopher" (Cristovao, Christopher, Christophe, Christoph, Crisdean, Cristobal, Cristoforo, Kristoffer, Krystof).]
Hierarchical clustering can sometimes show patterns that are meaningless or spurious.

The tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful: all these countries are former UK colonies.

However, the tight grouping of Niger and India is completely spurious; there is no connection between the two.

[Dendrogram over countries: Australia, Anguilla, St. Helena & Dependencies, South Georgia & South Sandwich Islands, U.K., Serbia & Montenegro (Yugoslavia), France, Niger, India, Ireland, Brazil.]
We can look at the dendrogram to determine the “correct”
number of clusters.
One potential use of a dendrogram: detecting outliers
The single isolated branch is suggestive of a data point that is very different from all others.

Outlier
Hierarchical Clustering

The number of dendrograms with $n$ leaves is $\frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$:

Number of Leaves | Number of Possible Dendrograms
2  | 1
3  | 3
4  | 15
5  | 105
…  | …
10 | 34,459,425

Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.

We begin with a distance matrix which contains the distances between every pair of objects in our database. For example, for five objects (upper triangle shown):

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

[Two entries are highlighted in the original figure as examples: D(·, ·) = 8 and D(·, ·) = 1.]
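As an illustration (not part of the original slides), a pairwise distance matrix like the one above can be computed with SciPy; the data points used here are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [1.0, 0.5],
              [0.9, 0.7],
              [5.0, 5.0],
              [5.2, 4.8]])

# pdist returns the condensed upper triangle; squareform expands it into
# the full symmetric n x n distance matrix used by hierarchical clustering.
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))
```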
[Animation frames: the agglomerative (bottom-up) procedure illustrated step by step. At each step, all possible merges are considered and the best one is chosen; this repeats until all clusters are fused together.]
Hierarchical Clustering Methods: Summary

• No need to specify the number of clusters in advance
• The hierarchical nature maps nicely onto human intuition for some domains
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects
• Like any heuristic search algorithm, local optima are a problem
• Interpretation of results is (very) subjective
Partitional Algorithms
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data. Each partition represents a cluster, and k ≤ n. This means the data is classified into k groups, which satisfy the following requirements:
• Each group contains at least one object.
• Each object must belong to exactly one group.

Points to remember:
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
Partitional Clustering
• Nonhierarchical, each instance is placed in exactly one
of K non-overlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of clusters K.
Partition Algorithm 1: k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them
to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships
found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise, go to step 3.
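These steps translate directly into code. Below is a minimal NumPy sketch (illustrative only; the function and variable names, the random initialization, and the empty-cluster guard are my own assumptions, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means on an (N, d) array X. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k cluster centers with randomly chosen points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: if no object changed membership, exit.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):   # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

For example, `kmeans(X, k=3)` on a small 2-D dataset behaves like the step-by-step figures that follow.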
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance

[Sequence of scatter plots over two features ("expression in condition 1" vs. "expression in condition 2") on a 0–5 scale: Step 1 places three initial centers k1, k2, k3; Steps 2–4 alternately assign points to the nearest center and move the centers; Step 5 shows the converged clustering.]
Comments on k-Means
• Strengths
  • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
• Weaknesses
  • Often terminates at a local optimum rather than the global optimum
  • Applicable only when a mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers well
  • Not suitable for discovering clusters with non-convex shapes
How do we measure similarity?

[Example: how similar are "Peter" and "Piotr"? Candidate numeric answers such as 0.23, 3, or 342.7 depend entirely on the distance measure chosen.]
A generic technique for measuring similarity
To measure the similarity between two objects, transform one into the other, and measure how much effort it took. The measure of effort becomes the distance measure.

The distance between Patty and Selma:
  Change dress color, 1 point
  Change earring shape, 1 point
  Change hair part, 1 point
  D(Patty, Selma) = 3

The distance between Marge and Selma:
  Change dress color, 1 point
  Add earrings, 1 point
  Decrease height, 1 point
  Take up smoking, 1 point
  Lose weight, 1 point
  D(Marge, Selma) = 5

This is called the "edit distance" or the "transformation distance".
Edit Distance Example
It is possible to transform any string Q into string C using only Substitution, Insertion, and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
Note that for now we have ignored the issue of how we can find this cheapest transformation.

How similar are the names "Peter" and "Piotr"?
Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit.
D(Peter, Piotr) is 3:

  Peter
    → Substitution (i for e) → Piter
    → Insertion (o) → Pioter
    → Deletion (e) → Piotr

[The slide also shows where "Peter" and "Piotr" sit in the names dendrogram from earlier.]
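A minimal dynamic-programming sketch of this edit (Levenshtein) distance with the unit costs assumed above (illustrative, not from the slides):

```python
def edit_distance(q: str, c: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    (each costing 1 unit) needed to transform string q into string c."""
    m, n = len(q), len(c)
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                  # delete all of q[:i]
    for j in range(n + 1):
        dp[0][j] = j                                  # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # substitution / match
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))  # prints 3, matching the slide
```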
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
K-means Clustering – Details

• Initial centroids are often chosen randomly.


• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Two different K-means Clusterings

[Figure: the same original points clustered two ways with K = 3; one run reaches the optimal clustering, the other a sub-optimal clustering corresponding to a poorer local optimum.]
Importance of Choosing Initial Centroids

[Animation, iterations 1–6: starting from one choice of initial centroids, k-means gradually converges to the optimal clustering.]
Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE)
  • For each point, the error is the distance to the nearest cluster center
  • To get SSE, we square these errors and sum them:

  $\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$

  • x is a data point in cluster Ci and mi is the representative point for cluster Ci
    • It can be shown that mi corresponds to the center (mean) of the cluster
  • Given two clusterings, we can choose the one with the smallest error
  • One easy way to reduce SSE is to increase K, the number of clusters
    • However, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K
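A small sketch (not from the slides) of computing this SSE for a clustering produced by the `kmeans` function sketched earlier:

```python
import numpy as np

def sse(X, centers, labels):
    """Sum of squared Euclidean distances from each point to the
    center of the cluster it is assigned to."""
    diffs = X - centers[labels]      # per-point offset from its own center
    return float(np.sum(diffs ** 2))
```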
Importance of Choosing Initial Centroids

[Animation, iterations 1–5: starting from a different choice of initial centroids, k-means converges to the sub-optimal clustering instead.]
Solutions to Initial Centroids Problem
• Multiple runs
• Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids and then select among these
initial centroids
• Select most widely separated
• Postprocessing
• Bisecting K-means
• Not as susceptible to initialization issues
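One way to realize the "multiple runs" idea is to keep the run with the lowest SSE. A sketch using the `kmeans` and `sse` helpers from earlier (illustrative, not from the slides):

```python
def kmeans_best_of(X, k, n_runs=10):
    """Run k-means several times from different random initializations
    and keep the clustering with the lowest SSE."""
    best = None
    for seed in range(n_runs):
        centers, labels = kmeans(X, k, seed=seed)
        err = sse(X, centers, labels)
        if best is None or err < best[0]:
            best = (err, centers, labels)
    return best  # (sse, centers, labels) of the best run
```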
Handling Empty Clusters
• Basic K-means algorithm can yield empty clusters

• Several strategies for choosing a replacement centroid:
  • Choose the point that contributes most to SSE
  • Choose a point from the cluster with the highest SSE
  • If there are several empty clusters, the above can be repeated several times.
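A sketch of the first strategy, reseeding an empty cluster with the point that currently contributes most to SSE (function and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

def reseed_empty_cluster(X, centers, labels, empty_j):
    """Replace the centroid of empty cluster `empty_j` with the point
    that currently contributes most to the SSE, and reassign that point."""
    errors = np.sum((X - centers[labels]) ** 2, axis=1)
    worst = int(np.argmax(errors))   # the worst-fitting point
    centers[empty_j] = X[worst]
    labels[worst] = empty_j
    return centers, labels
```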
Updating Centers Incrementally
• In the basic K-means algorithm, centroids are updated after all points
are assigned to a centroid

• An alternative is to update the centroids after each assignment


(incremental approach)
• Each assignment updates zero or two centroids
• More expensive
• Introduces an order dependency
• Never get an empty cluster
• Can use “weights” to change the impact
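A minimal sketch of such an incremental update, where moving a point touches only the two affected centroids via running-mean updates (function and variable names are my own; not from the slides):

```python
import numpy as np

def move_point(x, centers, counts, old_j, new_j):
    """Incrementally move point x from cluster old_j to cluster new_j,
    updating only the two affected centroids."""
    if old_j == new_j:
        return                                    # zero centroids updated
    n = counts[old_j]
    if n > 1:                                     # remove x from its old cluster
        centers[old_j] = (centers[old_j] * n - x) / (n - 1)
    counts[old_j] = n - 1
    m = counts[new_j]                             # add x to its new cluster
    centers[new_j] = (centers[new_j] * m + x) / (m + 1)
    counts[new_j] = m + 1
```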
Pre-processing and Post-processing
• Pre-processing
• Normalize the data
• Eliminate outliers

• Post-processing
• Eliminate small clusters that may represent outliers
• Split ‘loose’ clusters, i.e., clusters with relatively high SSE
• Merge clusters that are ‘close’ and that have relatively low SSE
Bisecting K-means
• Bisecting K-means algorithm
• Variant of K-means that can produce a partitional or a hierarchical
clustering
Bisecting K-means Example
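A rough sketch of the bisecting idea, reusing the `kmeans` helper sketched earlier; the choice to split the cluster with the highest SSE is one common option, used here for illustration (not taken from the slides):

```python
import numpy as np

def cluster_sse(points):
    """SSE of a set of points around their own mean."""
    center = points.mean(axis=0)
    return float(np.sum((points - center) ** 2))

def bisecting_kmeans(X, k):
    """Repeatedly split the cluster with the highest SSE using 2-means
    until k clusters are obtained. Returns a list of index arrays."""
    clusters = [np.arange(len(X))]                # start with a single cluster
    while len(clusters) < k:
        worst = max(range(len(clusters)),
                    key=lambda i: cluster_sse(X[clusters[i]]))
        idx = clusters.pop(worst)
        _, labels = kmeans(X[idx], k=2)           # bisect with basic 2-means
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters
```

Recording the 2-means splits produced along the way yields a hierarchy, which is why bisecting K-means can be viewed as either partitional or hierarchical.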
Limitations of K-means
• K-means has problems when clusters are of differing
• Sizes
• Densities
• Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.

This finds parts of the natural clusters, which then need to be put back together.
Overcoming K-means Limitations

Original Points K-means Clusters


Overcoming K-means Limitations

Original Points K-means Clusters


We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or the distance between two clusters, is not obvious.
• Single linkage (nearest neighbor): In this method, the
distance between two clusters is determined by the distance of the two closest
objects (nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the
distances between clusters are determined by the greatest distance between
any two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two
clusters is calculated as the average distance between all pairs of objects in the
two different clusters.
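For illustration (not from the slides), SciPy's hierarchical clustering implements these linkage criteria directly; the dataset used here is a hypothetical one:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))    # hypothetical 2-D data

# 'single', 'complete', and 'average' correspond to the three
# linkage criteria described above.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
    print(method, labels[:10])

# scipy.cluster.hierarchy.dendrogram(Z) can be used with matplotlib
# to draw the resulting tree.
```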
[Figure: dendrograms over the same 30 objects produced with single linkage and with average linkage, showing how the choice of linkage changes the tree.]