Application of k-means clustering
Application              | What is clustered?       | Benefit
-------------------------|--------------------------|-------------------------------------------------------------
Search result clustering | search results           | more effective information presentation to user
Scatter-Gather           | (subsets of) collection  | alternative user interface: “search without typing”
Collection clustering    | collection               | effective information presentation for exploratory browsing
Cluster-based retrieval  | collection               | higher efficiency: faster search
Clustering for improving recall
• To improve search recall:
– Cluster docs in collection a priori
– When a query matches a doc d, also return other docs in the
cluster containing d
• Hope: if we do this, the query “car” will also return docs
containing “automobile”
– Because the clustering algorithm groups together docs
containing “car” with those containing “automobile”.
– Both types of documents contain words like “parts”,
“dealer”, “mercedes”, “road trip”.
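A minimal sketch of this idea in Python (an illustration, not part of the original slides; the names baseline_search, doc_to_cluster and cluster_to_docs are hypothetical stand-ins for a real search engine and a precomputed clustering):

from collections import defaultdict

def expand_with_clusters(query, baseline_search, doc_to_cluster, cluster_to_docs):
    """Return the baseline hits plus all other docs from the hits' clusters."""
    hits = baseline_search(query)                      # e.g. docs containing "car"
    expanded = set(hits)
    for d in hits:
        cluster_id = doc_to_cluster[d]                 # clustering was computed a priori
        expanded.update(cluster_to_docs[cluster_id])   # may pull in "automobile" docs
    return expanded

# Toy wiring: d1 and d2 share a cluster, d3 is elsewhere
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1}
cluster_to_docs = defaultdict(set)
for d, c in doc_to_cluster.items():
    cluster_to_docs[c].add(d)
baseline_search = lambda q: {"d1"}                     # stand-in for a real search
print(expand_with_clusters("car", baseline_search, doc_to_cluster, cluster_to_docs))
# -> {'d1', 'd2'}: d2 is returned only because it shares d1's cluster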
Data set with clear cluster structure
(Figure: a scatter plot of points exhibiting a clear cluster structure.)
Desiderata for clustering
• General goal: put related docs in the same cluster, put unrelated
docs in different clusters.
• We’ll see different ways of formalizing this.
• The number of clusters should be appropriate for the data set we
are clustering.
• Initially, we will assume the number of clusters K is given.
• Later: Semiautomatic methods for determining K
• Secondary goals in clustering
  – Avoid very small and very large clusters
  – Define clusters that are easy to explain to the user
  – Many others . . .
Flat algorithms
• Flat algorithms compute a partition of N documents
into a set of K clusters.
• Given: a set of documents and the number K
• Find: a partition into K clusters that optimizes the
chosen partitioning criterion
• Global optimization: exhaustively enumerate all
partitions, pick the optimal one
• Not tractable
• Effective heuristic method: K-means algorithm
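To make “not tractable” concrete (a standard combinatorial fact, added here for context rather than taken from the slide): the number of partitions of N documents into K non-empty clusters is the Stirling number of the second kind,

S(N, K) = \frac{1}{K!} \sum_{j=0}^{K} (-1)^{j} \binom{K}{j} (K - j)^{N} \approx \frac{K^{N}}{K!},

which is astronomically large even for modest N, so exhaustive enumeration is out of the question.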
K-means
Exercise:
(i) Guess what the optimal clustering into two clusters is in this
case;
(ii) compute the centroids of the clusters
Example: K-means iteration step by step (figures omitted)
1. Random selection of initial centroids
2. Assign points to the closest centroid
3. Recompute cluster centroids
4. Reassign points to the closest centroid
5. Recompute cluster centroids
… (assignment and recomputation repeat)
Final state: centroids and assignments after convergence
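A compact sketch of the loop these figures illustrate (a minimal illustration assuming documents are already represented as dense NumPy vectors; this is not how a production IR system would store them):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Lloyd's K-means: alternate assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    # Random selection of initial centroids: K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point moves to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                  # converged: nothing changed
        assignment = new_assignment
        # Recomputation step: each centroid becomes the mean of its cluster
        for k in range(K):
            members = X[assignment == k]
            if len(members) > 0:                   # guard against empty clusters
                centroids[k] = members.mean(axis=0)
    return centroids, assignment

# Toy usage: two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [4.2, 3.9]])
centroids, assignment = kmeans(X, K=2)
print(assignment)                                  # e.g. [0 0 1 1] (labels may be permuted)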
K-means is guaranteed to converge: Proof
• RSS = sum of all squared distances between document vector
and closest centroid
• RSS decreases during each reassignment step.
– because each vector is moved to a closer centroid
• RSS decreases during each recomputation step.
• There is only a finite number of clusterings.
• Thus: We must reach a fixed point.
• Assumption: Ties are broken consistently.
• Finite set & monotonically decreasing → convergence
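For reference, the quantities in the proof, written out (standard definitions consistent with the slide):

\text{RSS} = \sum_{k=1}^{K} \text{RSS}_k,
\qquad
\text{RSS}_k = \sum_{\vec{x} \in \omega_k} \bigl\| \vec{x} - \vec{\mu}(\omega_k) \bigr\|^2,
\qquad
\vec{\mu}(\omega_k) = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} \vec{x}

where \omega_k is the set of document vectors assigned to cluster k and \vec{\mu}(\omega_k) is its centroid.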
Recomputation decreases average distance
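The derivation that originally accompanied this slide title is not in the extracted text; a standard reconstruction (my addition, not verbatim from the slides) is that the within-cluster RSS, viewed as a function of the cluster's reference vector \vec{v}, is minimized exactly at the centroid:

\frac{\partial}{\partial \vec{v}} \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2
= \sum_{\vec{x} \in \omega_k} 2 (\vec{v} - \vec{x}) = \vec{0}
\quad \Longrightarrow \quad
\vec{v} = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} \vec{x} = \vec{\mu}(\omega_k)

Hence replacing the old centroid by the recomputed one can only lower (or keep equal) each cluster's RSS, and therefore the average distance of documents to their centroid.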
Internal and external criteria for clustering quality
• Internal criteria
  – Example of an internal criterion: RSS in K-means
• But an internal criterion often does not evaluate the actual utility of a clustering in the application.
• Alternative: External criteria
  – Evaluate with respect to a human-defined classification
External criteria for clustering quality
• Based on a gold standard data set, e.g., the Reuters
collection we also used for the evaluation of classification
• Goal: Clustering should reproduce the classes in the gold
standard
• (But we only want to reproduce how documents are
divided into groups, not the class labels.)
• First measure for how well we were able to reproduce the
classes: purity
External criterion: Purity
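The body of this slide is not in the extracted text; the standard definition of purity (stated here for completeness) assigns each cluster its most frequent gold-standard class and measures the fraction of documents that fall into their cluster's majority class:

\text{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \, |\omega_k \cap c_j|

where \Omega = \{\omega_1, \ldots, \omega_K\} is the set of clusters, C = \{c_1, \ldots, c_J\} the set of classes, and N the number of documents. Purity is 1 when the clustering reproduces the classes perfectly, but it is trivially maximized by putting every document in its own cluster, so it is usually reported alongside other measures.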