Clustering: Overview and the K-means algorithm

Informal goal
• Given a set of objects and a measure of
  similarity between them, group similar
  objects together
• What do we mean by “similar”?
• What is a good grouping?
• Computation time / quality tradeoff

K-means illustrations thanks to
2006 student Martin Makowiecki
General types of clustering
• “Soft” versus “hard” clustering
  – Hard: partition the objects
    • each object in exactly one partition
  – Soft: assign a degree to which each object is in a cluster
    • view as a probability or score
• Flat versus hierarchical clustering
  – hierarchical = clusters within clusters

Applications: Many
– biology
– astronomy
– computer-aided design of circuits
– information organization
– marketing
– …
Clustering in information search and analysis
• Group information objects
  ⇒ discover topics
  ? other groupings desirable
• Clustering versus classifying
  – classifying: have pre-determined classes
    with example members
  – clustering:
    - get groups of similar objects
    - added problem of labeling clusters by topic
    - e.g. common terms within a cluster of docs

Example applications in search
• Query evaluation: cluster pruning (§7.1.6), sketched below
  - cluster all documents
  - choose a representative for each cluster
  - evaluate the query w.r.t. the cluster representatives
  - evaluate the query for docs in the cluster(s) having
    the most similar cluster representative(s)
• Results presentation: labeled clusters
  - cluster only the query results
  - e.g. Yippy.com (metasearch)
(hard / soft? flat / hier?)
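A minimal sketch of the cluster-pruning query evaluation above, assuming the documents and the query are dense vectors and the clustering is already computed; the names (cluster_pruning_search, reps, clusters) are illustrative, not from the slides:

import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two dense vectors
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster_pruning_search(query, reps, clusters, top_clusters=1):
    # rank the cluster representatives by similarity to the query
    ranked = sorted(range(len(reps)),
                    key=lambda i: cosine_sim(query, reps[i]),
                    reverse=True)
    # score only the documents inside the best cluster(s),
    # pruning the rest of the collection
    scored = []
    for i in ranked[:top_clusters]:
        for doc_id, vec in clusters[i]:
            scored.append((cosine_sim(query, vec), doc_id))
    return sorted(scored, reverse=True)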
Issues
• What attributes represent items for clustering purposes?
• What is the measure of similarity between items?
  – General objects and a matrix of pairwise similarities
  – Objects with specific properties that allow other
    specifications of the measure
    • Most common: objects are d-dimensional vectors
      » Euclidean distance
      » cosine similarity
      (both sketched after the next slide)
• What is the measure of similarity between clusters?

Issues continued
• Cluster goals?
  – Number of clusters?
  – Flat or hierarchical clustering?
  – Cohesiveness of clusters?
• How to evaluate cluster results?
  – relates to the measure of closeness between clusters
• Efficiency of clustering algorithms
  – large data sets => external storage
• Maintain clusters in a dynamic setting?
• Clustering methods? MANY!
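A minimal sketch of the two most common vector-model measures (NumPy; names are illustrative):

import numpy as np

def euclidean_distance(x, y):
    # straight-line distance between two d-dimensional vectors
    return np.linalg.norm(x - y)

def cosine_similarity(x, y):
    # similarity by angle; 1.0 means identical direction
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])
print(euclidean_distance(x, y))   # ≈ 2.236
print(cosine_similarity(x, y))    # ≈ 0.802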
Quality of clustering
• In applications, the quality of a clustering depends on
  how well it solves the problem at hand
• An algorithm uses a measure of quality that can be
  optimized, but that may or may not do a good job of
  capturing the application needs
• Underlying graph-theoretic problems usually NP-complete
  – e.g. graph partitioning
• Usually the algorithm is not finding an optimal clustering

General types of clustering methods
• Constructive versus iterative improvement
  – constructive: decide in which cluster each object
    belongs and don’t change
    • often faster
  – iterative improvement: start with a clustering and
    move objects around to see if the clustering can be improved
    • often slower but better
K-means overview
• Well known, widely used
• Flat clustering
• Number of clusters picked ahead of time
• Iterative improvement
• Uses the notion of a centroid
• Typically uses Euclidean distance

Vector model: K-means algorithm (sketched below)
• Choose k points among the set to cluster
  – Call them the k centroids
• For each point not selected, assign it to its
  closest centroid
  – These assignments give the initial clustering
• Until “happy” do:
  – Recompute the centroids of the clusters
    • New centroids may not be points of the original set
  – Reassign all points to the closest centroid
    • Updates clusters
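A minimal sketch of the loop above, assuming the points are the rows of a NumPy array; function and parameter names are illustrative:

import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # choose k of the original points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster came up empty
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)])
        # "happy" = the centroids stopped moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels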
An Example
  [Figure: start: choose centroids and cluster]
  [Figure: recompute centroids]
  [Figure: re-cluster around new centroids]
  [Figure: 2nd recompute of centroids and re-cluster]
An Example
  [Figure: 3rd (final) recompute and re-cluster]

Details for K-means
• Need definition of centroid:
  c_i = (1/|C_i|) ∑_{x∈C_i} x  for the ith cluster C_i containing objects x
  – notion of sum of objects?
• Need definition of distance to (similarity to) centroid
• Typically vector model with Euclidean distance
• Minimizing the sum of squared distances of each point
  to its centroid = Residual Sum of Squares:
  RSS = ∑_{i=1}^{K} ∑_{x∈C_i} dist(c_i, x)^2
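Under the same NumPy conventions as the kmeans sketch above, RSS is one line (illustrative):

import numpy as np

def rss(points, centroids, labels):
    # sum over all points of the squared distance to the
    # centroid of the cluster each point belongs to
    return float(((points - centroids[labels]) ** 2).sum())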
K-means performance
• Can prove RSS decreases with each iteration,
  so the algorithm converges
• Can achieve a local optimum
  – No change in centroids
• Running time depends on how demanding the
  stopping criteria are
• Works well in practice
  – speed
  – quality

Time Complexity of K-means
• Let tdist be the time to calculate the distance
  between two objects
• Each iteration has time complexity O(K·n·tdist),
  where n = number of objects
• Bounding the number of iterations by I gives
  O(I·K·n·tdist)
• For m-dimensional vectors: O(I·K·n·m)
  – m large and centroids not sparse
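For a feel for the sizes involved (illustrative numbers, not from the slides): n = 1,000,000 objects, K = 100 clusters, m = 500 dimensions, and I = 20 iterations give I·K·n·m = 10^12 basic operations per run. Note that even when the object vectors are sparse, the centroids generally are not, so m cannot be replaced by the count of nonzero entries.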
Space Complexity of K-means
• Store points and centroids
  – vector model: O((n + K)m)
• External algorithm versus internal?
  – store the K centroids in memory
  – run through the points each iteration

Choosing Initial Centroids
• Bad initialization leads to poor results
  [Figure: an optimal clustering versus a non-optimal
   clustering reached from bad initial centroids]
Choosing Initial Centroids (continued)
Many people have spent much time examining how to choose seeds:
• Random
  – fast and easy, but often poor results
• Run random multiple times, take the best (sketched below)
  – slower, and still no guarantee of good results
• Pre-conditioning
  – remove outliers
• Choose seeds algorithmically
  – run hierarchical clustering on a sample of the points
    and use the resulting centroids
  – works well on small samples and for few initial
    centroids
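A sketch of “run random multiple times, take the best”, reusing the illustrative kmeans and rss functions from the earlier slides:

def kmeans_restarts(points, k, restarts=10):
    best = None
    for seed in range(restarts):
        centroids, labels = kmeans(points, k, seed=seed)
        score = rss(points, centroids, labels)   # lower RSS = tighter clusters
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best[1], best[2]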
K-means weakness
• Non-globular clusters
  [Figure: non-globular clusters]
K-means weakness
• Wrong number of clusters
  [Figure: wrong number of clusters]

K-means weakness
• Outliers and empty clusters
  [Figure: outliers and empty clusters]
Real cases tend to be harder
• Different attributes of the feature vector can have
  vastly different sizes
  – e.g. the size of a star versus its color
• Can weight different features
  – how they are weighted greatly affects the outcome
• Difficulties can be overcome
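One common remedy for wildly different attribute scales is to standardize (z-score) each feature before clustering and then apply any desired weights; this is a standard technique, not something the slides prescribe, sketched under the same NumPy assumptions:

import numpy as np

def standardize(points):
    # z-score each column so attributes measured on very
    # different scales (e.g. star size vs. color) contribute
    # comparably to Euclidean distances
    return (points - points.mean(axis=0)) / points.std(axis=0)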