DATA MINING AND DATA WAREHOUSING
MODULE-5
CLUSTERING ANALYSIS
5.1 Introduction
5.2 K-Means
5.3 Agglomerative Hierarchical Clustering
5.5 DBSCAN
5.6 Cluster Evaluation
5.7 Density-Based Clustering
5.8 Graph-Based Clustering
5.9 Scalable Clustering Algorithms
5.10 Question Bank
5.1 Introduction
Cluster analysis finds groups of objects such that the objects in a group are similar (or related) to one
another and different from (or unrelated to) the objects in other groups.
The greater the similarity within a group and the greater the difference between groups, the
better or more distinct the clustering.
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.
In the context of understanding data, clusters are potential classes and cluster analysis is the
study of techniques for automatically finding classes.
Clustering: Applications
Biology: biologists have applied clustering to analyze the large amounts of genetic information
that are now available.
For example, clustering has been used to find groups of genes that have similar functions.
Information Retrieval: The World Wide Web consists of billions of Web pages, and the
results of a query to a search engine can return thousands of pages. Clustering can be used to
group these search results into a small number of clusters, each of which captures a particular
aspect of the query.
Climate: Understanding the Earth's climate requires finding patterns in the atmosphere and
ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure
of polar regions and areas of the ocean that have a significant impact on land climate.
Psychology and Medicine: An illness or condition frequently has a number of variations, and
cluster analysis can be used to identify these different subcategories.
Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional
analysis and marketing activities.
Types of Clusterings
A clustering is a set of clusters
An important distinction is between hierarchical and partitional sets of clusters.
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is
in exactly one subset
Hierarchical Clustering
– A set of nested clusters organized as a hierarchical tree
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Shared-property (conceptual) clusters
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every
other point in the cluster than to any point not in the cluster
Center-based (prototype-based)
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the
“center” of a cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a
medoid, the most “representative” point of a cluster
Contiguous Cluster (Graph based)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or
more other points in the cluster than to any point not in the cluster.
Density-based
– A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are
present.
Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept.
5.2 K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to 'Until relatively few points change clusters'.
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters, I = number of iterations, d = number of
attributes
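A minimal NumPy sketch of the basic K-means loop just described (illustrative only; the function and variable names are our own, not from the text):

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Basic K-means: random initial centroids, Euclidean distance."""
        rng = np.random.default_rng(seed)
        # Initialization: pick k distinct data points as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Assignment step: each point goes to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of the points assigned to it.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Stop when the centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids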
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE).
For each point, the error is the distance to the nearest cluster centroid. To get SSE, we square these errors
and sum them.
Here, x is a data point in cluster Ci and mi is the representative point for cluster Ci; it can be shown
that mi corresponds to the center (mean) of the cluster.
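Written out, a standard form of the SSE that follows directly from this definition is

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2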
Given two clusters, we can choose the one with the smallest error. One easy way to reduce SSE
is to increase K, the number of clusters.
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one initial centroid from each cluster is
small, and it is especially small when K is large. If the clusters are the same size, n, then the
probability of selecting one centroid from each cluster is K!/K^K, which is tiny even for moderate K
(about 0.00036 for K = 10).
K-means: Additional Issues
Handling Empty Clusters
The basic K-means algorithm can yield empty clusters. Several strategies address this:
Choose the point that contributes most to SSE
Choose a point from the cluster with the highest SSE
If there are several empty clusters, the above can be repeated several times
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach):
– Each assignment updates zero or two centroids
– More expensive
– Introduces an order dependency
– Never get an empty cluster
– Can use “weights” to change the impact
Pre-processing and Post-processing
Pre-processing
Normalize the data
Eliminate outliers
Post-processing
Eliminate small clusters that may represent outliers
Split 'loose' clusters, i.e., clusters with relatively high SSE
Merge clusters that are 'close' and that have relatively low SSE
Can use these steps during the clustering process
Strengths and Weaknesses of K-means (Limitations)
K-means is simple and can be used for a wide variety of data types.
It is also quite efficient, even though multiple runs are often performed.
K-means is not suitable for all types of data.
K-means has problems when clusters are of differing
Sizes
Densities
Non-globular shapes
K-means has problems when the data contains outliers.
Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that
is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one
of these clusters to split, and so on, until K clusters have been produced.
There are a number of different ways to choose which cluster to split. We can choose the largest
cluster at each step, choose the one with the largest SSE, or use a criterion based on both size
and SSE. Different choices result in different clusters.
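A short sketch of this idea, using scikit-learn's KMeans for the bisection step (the splitting criterion, trial count, and names here are our own choices; recent scikit-learn versions also ship a ready-made BisectingKMeans class):

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, k, n_trials=5, seed=0):
        """Repeatedly split the cluster with the largest SSE until k clusters remain."""
        clusters = [np.arange(len(X))]          # start with one cluster holding all points
        while len(clusters) < k:
            # Choose which cluster to split: here, the one with the largest SSE.
            sse = [KMeans(n_clusters=1, n_init=1, random_state=seed).fit(X[idx]).inertia_
                   for idx in clusters]
            target = clusters.pop(int(np.argmax(sse)))
            # Bisect it with 2-means, keeping the best of several trial splits.
            best = min((KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(X[target])
                        for t in range(n_trials)),
                       key=lambda km: km.inertia_)
            clusters += [target[best.labels_ == 0], target[best.labels_ == 1]]
        return clusters        # list of index arrays, one per cluster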
5.3 Agglomerative Hierarchical Clustering
More popular hierarchical clustering technique.
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram.
A tree-like diagram that records the sequences of merges or splits.
Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper
level
The basic agglomerative algorithm is straightforward: compute the proximity matrix, let each data point be a cluster, then repeatedly merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
How to Define Inter-Cluster Similarity (Proximity of two clusters)
MIN (single link)
MAX (complete link)
Group Average
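With SciPy, these three proximity definitions correspond to the 'single', 'complete', and 'average' linkage methods; a brief usage sketch (the data here is random, purely for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    X = np.random.rand(6, 2)      # six 2-D points, as in the example below
    d = pdist(X)                  # condensed matrix of pairwise Euclidean distances

    # method='single' is MIN, 'complete' is MAX, 'average' is group average
    Z = linkage(d, method='single')

    # Cut the dendrogram to obtain, say, two clusters
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)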
1) Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined
as the minimum of the distance (maximum of the similarity) between any two points in the two
different clusters.
We shall use sample data that consists of 6 two-dimensional points, which are shown in Figure
8.15. The x and y coordinates of the points and the Euclidean distances between them are shown in
Tables 8.3 and 8.4, respectively.
Figure 8.16 shows the result of applying the single link technique to our example data set of six
points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers
associated with the ellipses indicate the order of the clustering. Figure 8.16(b) shows the same
information, but as a dendrogram.
The height at which two clusters are merged in the dendrogram reflects the distance of the two
clusters.
For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the
height at which they are joined into one cluster in the dendrogram. As another example, the distance
between clusters {3,6} and {2,5} is the smallest of the four pairwise distances dist(3,2), dist(3,5),
dist(6,2), and dist(6,5) from Table 8.4.
2) Complete Link or MAX or CLIQUE
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in the
two different clusters. Using graph terminology, if you start with all points as singleton clusters and
add links between points one at a time, shortest links first, then a group of points is not a cluster
until all the points in it are completely linked, i.e., form a clique.
Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample data
set of six points. As with single link, points 3 and 6 are merged first. However, {3,6} is merged with
{4}, instead of {2,5} or {1}.
3) Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined as
the average pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches. Thus, for group
average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj,
respectively, is expressed by the following equation.
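The equation, reconstructed from the definition just given, is

    proximity(C_i, C_j) = \frac{\sum_{x \in C_i} \sum_{y \in C_j} proximity(x, y)}{m_i \, m_j}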
Figure 8.18 shows the results of applying the group average approach to the sample data set of six
points. To illustrate how group average works, we calculate the distance between some clusters.
Key Issues in Agglomerative Hierarchical Clustering (Strengths and Weaknesses)
Once a decision is made to combine two clusters, it cannot be undone
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex shapes
– Breaking large clusters
5.5 The DBSCAN Algorithm
DBSCAN is a density-based algorithm.
Density = number of points within a specified radius (Eps)
A point is a core point if it has more than a specified number of points (MinPts) within Eps
These are points that are in the interior of a cluster
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
A noise point is any point that is not a core point or a border point.
After labeling points as core, border, or noise:
Eliminate noise points.
Perform clustering on the remaining points: core points within Eps of each other are placed in the
same cluster, and each border point is assigned to a cluster of one of its associated core points.
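A minimal usage sketch with scikit-learn's DBSCAN implementation (the data and the Eps/MinPts values below are purely illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(300, 2)                  # any 2-D data set
    db = DBSCAN(eps=0.1, min_samples=5).fit(X)  # eps = Eps, min_samples = MinPts

    labels = db.labels_                  # cluster index per point; -1 marks noise points
    core_idx = db.core_sample_indices_   # indices of the core points
    print(set(labels))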
Strengths and weaknesses of DBSCAN
It is relatively resistant to noise.
It can handle clusters of different shapes and sizes.
It does not work well when the clusters have varying densities.
It does not work well with high-dimensional data.
5.6 Cluster Evaluation:
Why do we want to evaluate them?
To avoid finding patterns in noise.
To compare clustering algorithms
To compare two sets of clusters.
To compare two clusters.
Different Aspects of Cluster Validation:
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random
structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given
class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external
information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just
individual clusters.
Unsupervised Cluster Evaluation Using Cohesion and Separation
Cluster cohesion (compactness, tightness), which determines how closely related the objects in a
cluster are.
Cluster separation (isolation), which determines how distinct or well separated a cluster is from other
clusters.
Graph-Based View of Cohesion and Separation:
For graph-based clusters, the cohesion of a cluster can be defined as the sum of the weights of the
links in the proximity graph that connect points within the cluster.
The separation between two clusters can be measured by the sum of the weights of the links from
points in one cluster to points in the other cluster.
Mathematically, cohesion and separation for a graph-based cluster can be expressed using Equations
8.9 and 8.10, respectively.
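In this notation, the two equations can be written as follows (our reconstruction of Equations 8.9 and 8.10, following the definitions above):

    cohesion(C_i) = \sum_{x \in C_i,\; y \in C_i} proximity(x, y)
    separation(C_i, C_j) = \sum_{x \in C_i,\; y \in C_j} proximity(x, y)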
Prototype-Based View of Cohesion and Separation
For prototype-based clusters, the cohesion of a cluster can be defined as the sum of the proximities
with respect to the prototype (centroid or medoid) of the cluster. Similarly, the separation between
two clusters can be measured by the proximity of the two cluster prototypes
Measuring Cluster Validity Via Correlation
• Two matrices
• Similarity or Distance Matrix
• One row and one column for each data point
• An entry is the similarity or distance of the associated pair of points
• “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
• Since the matrices are symmetric, only the correlation between n(n-1) / 2 entries needs
to be calculated.
• High correlation (positive for similarity, negative for distance) indicates that points that belong
to the same cluster are close to each other.
• Not a good measure for some density or contiguity based clusters.
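A small NumPy/SciPy sketch of this computation (the function name is ours):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def clustering_correlation(X, labels):
        """Correlation between the distance matrix and the cluster incidence matrix."""
        labels = np.asarray(labels)
        dist = squareform(pdist(X))                       # n x n distance matrix
        incidence = (labels[:, None] == labels[None, :]).astype(float)
        # The matrices are symmetric, so only the n(n-1)/2 entries above the diagonal are used.
        iu = np.triu_indices(len(X), k=1)
        return np.corrcoef(dist[iu], incidence[iu])[0, 1]

    # Because a distance matrix is used here, a strongly negative correlation
    # indicates that points in the same cluster are close to each other.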
Supervised Measures of Cluster Validity (External Measures for Clustering Validity)
Two different kinds of approaches.
The first set of techniques use measures from classification, such as entropy, purity, and the F-
measure. These measures evaluate the extent to which a cluster contains objects of a single class.
The second group of methods is related to the similarity measures for binary data, such as the
Jaccard measure. These approaches measure the extent to which two objects that are in the same
class are in the same cluster and vice versa.
Classification-Oriented Measures of Cluster Validity
• Assume that the data is labeled with some class labels
• E.g., documents are classified into topics, people classified according to their income,
politicians classified according to the political party.
• This is called the “ground truth”
• In this case we want the clusters to be homogeneous with respect to classes
• Each cluster should contain elements of mostly one class
• Each class should ideally be assigned to a single cluster
Confusion matrix: for each cluster, it records the number of objects from each class.
Measures: entropy, purity, and the F-measure; the standard definitions of entropy and purity are sketched below.
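With m_ij the number of objects of class j in cluster i, m_i the size of cluster i, m the total number of objects, and p_ij = m_ij / m_i, the standard definitions are:

    e_i = -\sum_{j} p_{ij} \log_2 p_{ij}            purity_i = \max_{j} p_{ij}
    e = \sum_{i} \frac{m_i}{m} e_i                  purity = \sum_{i} \frac{m_i}{m} purity_i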
Example: a worked confusion-matrix example appears in the accompanying figure.
Similarity-Oriented Measures of Cluster Validity
The measures that we discuss in this section are all based on the premise that any two objects that
are in the same cluster should be in the same class and vice versa.
We can view this approach to cluster validity as involving the comparison of two matrices: (1) The
ideal cluster similarity matrix discussed previously, which has a 1 in the (i,j)th entry if two objects, i
and j, are in the same cluster, and 0 otherwise.
(2) An ideal class similarity matrix defined with respect to class labels, which has a 1 in the (i,j) th
entry if two objects, i and j, belong to the same class, and a 0 otherwise.
In particular, the simple matching coefficient, which is known as the Rand statistic in this context,
and the Jaccard coefficient are two of the most frequently used cluster validity measures.
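Letting f_11 be the number of pairs of objects in the same class and the same cluster, f_00 the number in a different class and a different cluster, f_01 the number in a different class but the same cluster, and f_10 the number in the same class but different clusters, these two measures are:

    Rand statistic = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}
    Jaccard coefficient = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}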
5.7 Density-Based Clustering
• Grid-Based Clustering
• Subspace Clustering
• CLIQUE
• DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
Grid-Based Clustering
The idea is to split the possible values of each attribute into a number of contiguous intervals,
creating a set of grid cells.
Objects can be assigned to grid cells in one pass through the data, and information about each cell,
such as the number of points in the cell, can also be gathered at the same time.
Defining Grid Cells: This is a key step in the process, but also the least well defined, as there are
many ways to split the possible values of each attribute into a number of contiguous intervals.
For continuous attributes, one common approach is to split the values into equal width intervals. If
this approach is applied to each attribute, then the resulting grid cells all have the same volume, and
the density of a cell is conveniently defined as the number of points in the cell.
The Density of Grid Cells: A natural way to define the density of a grid cell (or a more generally
shaped region) is as the number of points divided by the volume of the region. In other words,
density is the number of points per amount of space, regardless of the dimensionality of that space.
Example: Figure 9.10 shows two sets of two-dimensional points divided into 49 cells using a 7-by-7
grid. The first set contains 200 points generated from a uniform distribution over a circle centered at
(2, 3) of radius 2, while the second set has 100 points generated from a uniform distribution over a
circle centered at (6, 3) of radius 1. The counts for the grid cells are shown in Table 9.2.
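A minimal sketch of this kind of grid construction with NumPy (the data generated below is only a stand-in for the example's two circles):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(low=[0, 1], high=[4, 5], size=(300, 2))   # illustrative 2-D points

    # Split each attribute into 7 equal-width intervals -> a 7-by-7 grid of cells,
    # and count the points falling in each cell (one pass over the data).
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=7)
    cell_area = (xedges[1] - xedges[0]) * (yedges[1] - yedges[0])
    density = counts / cell_area        # points per unit area in each grid cell
    print(counts.astype(int))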
CLIQUE
CLIQUE (Clustering In QUEst) is a grid-based clustering algorithm that methodically finds
subspace clusters. It is impractical to check each subspace for clusters since the number of such
subspaces is exponential in the number of dimensions. Instead, CLIQUE relies on the following
property:
Monotonicity property of density-based clusters: If a set of points forms a density-based cluster in k
dimensions (attributes), then the same set of points is also part of a density-based cluster in all
possible subsets of those dimensions.
DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
DENCLUE (DENsity ClUstEring) is a density-based clustering approach that models the overall density of a set
of points as the sum of influence functions associated with each point. The resulting overall density
function will have local peaks, i.e., local density maxima, and these local peaks can be used to
define clusters in a natural way. Specifically, for each data point, a hill climbing procedure finds the
nearest peak associated with that point, and the set of all data points associated with a particular
peak (called a local density attractor) becomes a cluster.
5.8 Graph-Based Clustering
Graph-Based clustering uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Initially the proximity graph is fully connected
MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph.
In the simplest case, clusters are connected components in the graph.
Sparsification
The amount of data that needs to be processed is drastically reduced.
Sparsification can eliminate more than 99% of the entries in a proximity matrix
The amount of time required to cluster the data is drastically reduced
The size of the problems that can be handled is increased.
Clustering may work better
Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point
while breaking the connections to less similar points.
The nearest neighbors of a point tend to belong to the same class as the point itself.
This reduces the impact of noise and outliers and sharpens the distinction between clusters.
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph
partitioning).
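A brief sketch of kNN-based sparsification with scikit-learn and SciPy, using connected components as the simplest possible clustering of the sparsified graph (data and parameter values are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.neighbors import kneighbors_graph
    from scipy.sparse.csgraph import connected_components

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Keep only the edges from each point to its k nearest neighbors.
    A = kneighbors_graph(X, n_neighbors=10, mode='connectivity', include_self=False)
    A = A.maximum(A.T)       # symmetrize, so the kNN graph is undirected

    # In the simplest case, clusters are the connected components of the sparsified graph.
    n_clusters, labels = connected_components(A, directed=False)
    print(n_clusters)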
Chameleon and Hypergraph-based Clustering
Limitations of Current Merging Schemes
Existing merging schemes in hierarchical clustering algorithms are static in nature
– MIN or CURE:
merge two clusters based on their closeness (or minimum distance)
– GROUP-AVERAGE:
merge two clusters based on their average connectivity
Chameleon: Clustering Using Dynamic Modeling.
Adapt to the characteristics of the data set to find the natural clusters.
Use a dynamic model to measure the similarity between clusters.
Main property is the relative closeness and relative inter-connectivity of the cluster
Two clusters are combined if the resulting cluster shares certain properties with the
constituent clusters
The merging scheme preserves self-similarity
Steps
Preprocessing Step: Represent the Data by a Graph
Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the
relationship between a point and its k nearest neighbors
Concept of neighborhood is captured dynamically (even if region is sparse)
Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of
clusters of well-connected vertices.
Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-cluster of a
“real” cluster.
Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters
Two clusters are combined if the resulting cluster shares certain properties with the
constituent clusters
Two key properties used to model cluster similarity:
Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal
connectivity of the clusters
Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the
clusters
SNN Clustering Algorithm:
1) Compute the similarity matrix.
This corresponds to a similarity graph with data points for nodes and edges whose weights are the
similarities between data points.
2) Sparsify the similarity matrix by keeping only the k most similar neighbors.
This corresponds to keeping only the k strongest links of the similarity graph.
3) Construct the shared nearest neighbor graph from the sparsified similarity matrix.
At this point, we could apply a similarity threshold and find the connected components to obtain the
clusters (Jarvis-Patrick algorithm).
4) Find the SNN density of each point.
Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps
or greater to each point. This is the SNN density of the point.
5) Find the core points.
Using a user-specified parameter, MinPts, find the core points, i.e., all points that have an SNN
density greater than MinPts.
6) Form clusters from the core points.
If two core points are within a radius, Eps, of each other, they are placed in the same cluster.
7) Discard all noise points.
All non-core points that are not within a radius of Eps of a core point are discarded.
8) Assign all non-noise, non-core points to clusters.
This can be done by assigning such points to the nearest core point (a simplified sketch of the first
steps follows).
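A simplified sketch of steps 1-3, computing an SNN similarity by counting shared k-nearest neighbors (the full algorithm, as described above, additionally thresholds this similarity and applies the DBSCAN-style core/border/noise logic; names and defaults are ours):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def snn_similarity(X, k=10):
        """SNN similarity matrix: number of shared k-nearest neighbors for each pair."""
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nbrs.kneighbors(X)
        idx = idx[:, 1:]                       # drop each point itself from its neighbor list
        n = len(X)
        member = np.zeros((n, n))              # member[i, j] = 1 if j is a k-NN of i
        member[np.arange(n)[:, None], idx] = 1
        return member @ member.T               # entry (i, j) = number of shared neighbors

    # The SNN density of a point is then the number of points whose SNN similarity to it
    # is at least Eps; points with SNN density above MinPts become the core points.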
Limitations of SNN Clustering
Complexity of SNN clustering is high:
– O(n * time to find the number of neighbors within Eps)
– In the worst case, this is O(n^2)
5.9 Scalable Clustering Algorithms
BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a highly efficient
clustering technique for data in Euclidean vector spaces, i.e., data for which averages make sense.
BIRCH can efficiently cluster such data with one pass and can improve that clustering with
additional passes. BIRCH can also deal effectively with outliers.
CURE
CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of different
techniques to create an approach that can handle large data sets, outliers, and clusters with
non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple representative
points from the cluster. These points will, in theory, capture the geometry and shape of the cluster.
The first representative point is chosen to be the point farthest from
the center of the cluster, while the remaining points are chosen so that they are farthest from all the
previously chosen points. In this way, the representative points are naturally relatively well
distributed.
5.10 Question Bank: Clustering Analysis
1. Explain desired features of cluster analysis.
2. Explain how distance between a pair of points can be computed.
3. Write a short note on density-based methods.
4. Write and explain basic K-Means algorithm.
5. Explain DBSCAN clustering algorithm.
6. What are the limitations of the K-Means algorithm?
7. Explain cluster analysis methods briefly.
8. Explain agglomerative hierarchical clustering.
9. Explain bisecting K Means algorithm.
10. Distinguish between various types of clustering.
11. What are the unsupervised, supervised, and relative evaluation measures that are applied to judge
various aspects of cluster validity?
12. Explain different types of defining proximity between clusters.
13. Differentiate between exclusive and overlapping clustering.
14. What are the various issues considered for cluster validation? Explain different evaluation
measures used for cluster validity.
15. Explain unsupervised cluster evaluation using cohesion and separation.
16. Explain unsupervised cluster evaluation using proximity matrix.
17. List and explain classification-oriented measures of cluster validity.
18. Explain similarity-oriented measures of cluster validity.
19. Explain grid-based clustering algorithm.
20. Explain subspace clustering.
21. Write and explain CLIQUE algorithm.
22. Write and explain DENCLUE algorithm.
23. Explain different graph-based clustering algorithms.