Cluster Analysis
Basic Concepts and Algorithms
What is Cluster Analysis?
Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
Cluster analysis
– Grouping a set of data objects into clusters
Intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis
Understanding
– Information Retrieval: group related documents for browsing
– Finance: group stocks with similar price fluctuations
– Biology: group genes and proteins that have similar functionality; cluster gene expression data
– Marketing: help marketers discover distinct groups in their customer bases, and develop targeted marketing programs
Example: clusters of stocks discovered from similar price movements, with their industry group
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Applications of Cluster Analysis
Image segmentation
– Goal: Break up the image
into meaningful or
perceptually similar regions
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)
Notion of a Cluster can be Ambiguous
How many clusters? The same set of points can be grouped into two, four, or six clusters.
Clustering results are crucially dependent on the measure of similarity (or distance) between the “points” to be clustered.
Measure the Quality of Clustering
Quality of clustering:
– There is usually a separate “quality” function that
measures the “goodness” of a cluster.
– It is hard to define “similar enough” or “good
enough”
The answer is typically highly subjective
Considerations for Cluster Analysis
Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space (Partial versus complete)
– Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)
Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
Types of Clusters: Well-Separated
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every point in the cluster than to any
point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of its cluster than to the
center of any other cluster
– The center of a cluster is often a centroid
4 center-based clusters
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent
a particular concept.
2 Overlapping Circles
Characteristics of the Input Data Are Important
Type of proximity or density measure
– This is a derived measure, but central to clustering
Sparseness
– Dictates type of similarity
– Adds to efficiency
Attribute type
– Dictates type of similarity
Type of Data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
Dimensionality
Noise and Outliers
Type of Distribution
Similarity and Dissimilarity
Type of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Types of Attributes
There are different types of attributes
– Nominal
Examples: ID numbers, eye color, zip codes
– Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
Examples: temperature in Kelvin, length, time, counts
Attribute types, with descriptions, examples, and allowed operations:

Nominal
– Description: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test

Ordinal
– Description: the values of an ordinal attribute provide enough information to order objects (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
– Description: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, -)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
– Description: for ratio variables, both differences and ratios are meaningful (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation
Meaning-preserving transformations by attribute level:

Nominal
– Transformation: any permutation of values
– Comment: if all employee ID numbers were reassigned, would it make any difference?

Ordinal
– Transformation: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function
– Comment: an attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}

Interval
– Transformation: new_value = a * old_value + b, where a and b are constants
– Comment: thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree)

Ratio
– Transformation: new_value = a * old_value
– Comment: length can be measured in meters or feet
Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Distance Measures
Manhattan Distance
dist(p, q) = \sum_{k=1}^{n} |p_k - q_k|
Euclidean Distance
dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Manhattan Distance
Actual points in 2D:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Manhattan distance is the sum of the absolute values of the differences of the coordinates.
Distance between p1 and p2: d = |0 - 2| + |2 - 0| = 2 + 2 = 4
Distance between p1 and p3: d = |0 - 3| + |2 - 1| = 3 + 1 = 4
Distance between p1 and p4: d = |0 - 5| + |2 - 1| = 5 + 1 = 6
L1 distance matrix:
L1   p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0
Euclidean Distance
Actual points in 2D (the same points as above):
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Euclidean distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
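As a small illustration (not part of the original slides), both distance matrices above can be reproduced with NumPy and SciPy; the coordinates are the example points p1 through p4:

```python
import numpy as np
from scipy.spatial.distance import cdist

# The four example points p1..p4
points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

# Manhattan (L1) distance: sum of absolute coordinate differences
l1_matrix = cdist(points, points, metric='cityblock')

# Euclidean (L2) distance: square root of the sum of squared differences
l2_matrix = cdist(points, points, metric='euclidean')

print(np.round(l1_matrix, 3))
print(np.round(l2_matrix, 3))
```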
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
Partitional Clustering
Divide data objects into non-overlapping subsets (clusters) such that
each data object is in exactly one subset
Typical methods: k-means, k-medoids, CLARANS
[Figures: the original points and a partitional clustering of them]
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
[Figures: a traditional hierarchical clustering of points p1, p2, p3, p4 with its traditional dendrogram, and a non-traditional hierarchical clustering of the same points with its non-traditional dendrogram]
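As a hedged sketch (library usage assumed, not taken from the slides), SciPy can build such a nested clustering and draw the corresponding dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# A few illustrative 2-D points labelled p1..p4 (hypothetical coordinates)
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])

# Single-linkage agglomerative clustering: repeatedly merge the two
# closest clusters until only one remains
Z = linkage(points, method='single', metric='euclidean')

# The dendrogram shows the resulting tree of nested clusters
dendrogram(Z, labels=['p1', 'p2', 'p3', 'p4'])
plt.show()
```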
K-Means: Partitioning approach
An iterative clustering algorithm
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
Initialize: pick K random points as cluster centers
Repeat:
  1. Assign data points to the closest cluster center
  2. Change each cluster center to the average of its assigned points
Until the centroids don't change
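A minimal NumPy sketch of this loop (an illustration, not the slides' own code), assuming the data is a 2-D array X and using Euclidean distance; other proximity measures are possible, as noted later:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1: assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Stop when the centroids no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For example, `labels, centers = kmeans(X, k=2)` assigns each row of X to one of two clusters and returns the final centroids.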
K-means clustering Example
– Iterative Step 1: assign data points to the closest cluster center
– Iterative Step 2: change each cluster center to the average of its assigned points
– Repeat until convergence
[Figures: the cluster assignments and centroids after Iterations 1, 2, and 3]
K-means Clustering – Details
Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
– K-means will converge for common similarity measures mentioned
above.
K-means Clustering – Details
Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
Complexity is O( n * K * I )
– n = number of points,
– K = number of clusters,
– I = number of iterations
Two different K-means Clusterings
[Figures: the original points and two K-means clusterings of them: an optimal clustering and a sub-optimal clustering]
Importance of Choosing Initial Centroids
[Figures: snapshots of K-means at Iterations 1 through 6 for one choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figures: snapshots of K-means at Iterations 1 through 5 for a different choice of initial centroids]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)
– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
– It can be shown that m_i corresponds to the center (mean) of the cluster
– Given two clusters, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
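A short sketch of this computation (an illustration, not from the slides), assuming NumPy arrays X (the points), labels, and centers such as those returned by the kmeans sketch above:

```python
import numpy as np

def sse(X, labels, centers):
    # Squared Euclidean distance from each point to the centroid of its cluster
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))
```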
Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Sample and use hierarchical clustering to determine initial
centroids
Select more than k initial centroids and then select among
these initial centroids
– Select most widely separated
Postprocessing
Bisecting K-means
– Not as susceptible to initialization issues
Pre-processing and Post-processing
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
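A hedged scikit-learn sketch of the pre-processing step (the library calls and parameter values are illustrative, not the slides' own); it also uses multiple runs to address the initialization problem discussed above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data whose attributes live on very different scales
rng = np.random.default_rng(0)
X = rng.random((100, 3)) * np.array([1.0, 100.0, 1000.0])

# Normalize: zero mean and unit variance per attribute, so that no single
# attribute dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

# n_init runs K-means several times from different initial centroids and
# keeps the run with the lowest SSE (inertia_)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)  # SSE of the selected clustering
```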
Limitations of K-means
K-means has problems when clusters are of
different
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains
outliers.
Limitations of K-means: Differing Sizes
Original Points vs. K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points vs. K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points vs. K-means (2 Clusters)
Overcoming K-means Limitations
Original Points vs. K-means Clusters
Overcoming K-means Limitations
Original Points vs. K-means Clusters
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-
connected points
Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Locates regions of high density that are separated by regions of low density.
In the center-based approach, the density of a point is the number of points within a specified radius, Eps, of that point.
A cluster is defined as a maximal set of density-connected points.
In the center-based approach, we can classify a point as being
1) in the interior of a dense region (a core point)
2) on the edge of a dense region (a border point)
3) in a sparsely occupied region (a noise point)
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number
of points (MinPts) within Eps
These are points that are at the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border
point.
DBSCAN: Core, Border, and Noise Points
DBSCAN: The Algorithm
– Label all points as core, border or noise.
– Eliminate noise points.
– Put an edge between all core points that are within
Eps of each other.
– Make each group of connected core points into a
separate cluster.
– Assign each border point to one of the clusters of its associated core points.
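A hedged usage sketch with scikit-learn's DBSCAN implementation (the data and parameter values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small dense blobs plus a few scattered points that should become noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2)),
               rng.uniform(-2.0, 7.0, (10, 2))])

# eps plays the role of Eps, min_samples the role of MinPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                     # cluster index per point; -1 marks noise
core_points = db.core_sample_indices_   # indices of the core points
print(sorted(set(labels)))
```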
DBSCAN Algorithm
Eliminate noise points
Perform clustering on the remaining points
DBSCAN Algorithm
Time Complexity
– O(N x time to find points in Eps-neighbourhood)
– where N is the no of points
– Worst case O(N2)
– KD-trees, allow efficient retreivel of all points within
given distance of a specified point in O(N logN)
Space Complexity
– O(N)
DBSCAN: Core, Border and Noise Points
Original Points vs. point types: core, border, and noise
Eps = 10, MinPts = 4
When DBSCAN Works Well
Original Points vs. Clusters found by DBSCAN
• Resistant to Noise
• Can handle clusters of different shapes and sizes
DBSCAN: Sensitive to Parameters
DBSCAN online demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html