37 Application of k means clustering

The document discusses text clustering, particularly focusing on K-means clustering as a method for grouping similar documents. It outlines the differences between classification and clustering, the applications of clustering in information retrieval, and the importance of the cluster hypothesis. Additionally, it explains the K-means algorithm, its convergence properties, initialization methods, and evaluation criteria for clustering quality.


Information Retrieval

Topic- Text Clustering


(Application of K-means clustering)
Lecture-37

Prepared By

Dr. Rasmita Rautray & Dr. Rasmita Dash


Associate Professor
Dept. of CSE
Text Clustering
Content
• Clustering
• Classification Vs Clustering
• Applications of clustering in information
retrieval
• K-means Clustering
Clustering: Definition

• (Document) clustering is the process of grouping a set of
documents into clusters of similar documents.
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• Clustering is the most common form of unsupervised learning.
• Unsupervised = there are no labeled or annotated data.
Classification vs. Clustering
• Classification: supervised learning
• Clustering: unsupervised learning
• Classification: Classes are human-defined and part of the input
to the learning algorithm.
• Clustering: Clusters are inferred from the data without human
input.
• However, there are many ways of influencing the outcome of
clustering: number of clusters, similarity measure,
representation of documents, . . .
The cluster hypothesis

• Cluster hypothesis. Documents in the same cluster behave
similarly with respect to relevance to information needs.
• All applications of clustering in IR are based (directly or
indirectly) on the cluster hypothesis.
Applications of clustering in IR

Application               What is clustered?       Benefit
search result clustering  search results           more effective information
                                                   presentation to user
Scatter-Gather            (subsets of) collection  alternative user interface:
                                                   “search without typing”
collection clustering     collection               effective information
                                                   presentation for exploratory
                                                   browsing
cluster-based retrieval   collection               higher efficiency: faster search
Clustering for improving recall
• To improve search recall:
– Cluster docs in collection a priori
– When a query matches a doc d, also return other docs in the
cluster containing d
• Hope: if we do this: the query “car” will also return docs
containing “automobile”
– Because the clustering algorithm groups together docs
containing “car” with those containing “automobile”.
– Both types of documents contain words like “parts”,
“dealer”, “mercedes”, “road trip”.
Data set with clear cluster structure

(Figure: scatter plot of points with visually obvious cluster structure)
Desiderata for clustering
• General goal: put related docs in the same cluster, put unrelated
docs in different clusters.
• We’ll see different ways of formalizing this.
• The number of clusters should be appropriate for the data set we
are clustering.
• Initially, we will assume the number of clusters K is given.
• Later: Semiautomatic methods for determining K
• Secondary goals in clustering
• Avoid very small and very large clusters
• Define clusters that are easy to explain to the user
• Many others . . .
Flat algorithms
• Flat algorithms compute a partition of N documents
into a set of K clusters.
• Given: a set of documents and the number K
• Find: a partition into K clusters that optimizes the
chosen partitioning criterion
• Global optimization: exhaustively enumerate all
partitions, pick the optimal one
• Not tractable
• Effective heuristic method: K-means algorithm
K-means

• The best known clustering algorithm
• Simple, works well in many cases
• Use as default / baseline for clustering documents
Document representations in clustering
• Vector space model
• As in vector space classification, we measure relatedness
between vectors by Euclidean distance . . .
• . . . which is almost equivalent to cosine similarity.
• Almost: centroids are not length-normalized.
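The near-equivalence claimed above can be checked numerically: for length-normalized vectors, squared Euclidean distance and cosine similarity are monotonically related, so either can be used to rank relatedness. A small sketch in Python (the example vectors are invented for illustration):

```python
import math

def cosine(a, b):
    # cosine similarity: dot product over the product of the lengths
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def sq_euclidean(a, b):
    # squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def unit(v):
    # length-normalize a vector
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# hypothetical document vectors, length-normalized
a, b = unit([1.0, 2.0, 0.0]), unit([2.0, 1.0, 1.0])

# for unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)),
# so ranking by Euclidean distance equals ranking by cosine similarity
assert abs(sq_euclidean(a, b) - 2 * (1 - cosine(a, b))) < 1e-12
```

The identity fails once centroids are no longer unit length, which is the "almost" in the slide above.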
K-means: Basic idea
• Each cluster in K-means is defined by a centroid.
• Objective/partitioning criterion: minimize the average
squared difference from the centroid
• Recall definition of centroid:

      μ(ω) = (1/|ω|) Σ_{x ∈ ω} x

where we use ω to denote a cluster.


• We try to find the minimum average squared difference by
iterating two steps:
• reassignment: assign each vector to its closest centroid
• recomputation: recompute each centroid as the average of
the vectors that were assigned to it in reassignment
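The two-step iteration above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions (random seed selection, a fixed iteration cap); the function names and toy points are made up, not the lecture's reference code:

```python
import random

def sq_dist(a, b):
    # squared Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    # componentwise mean: mu(w) = (1/|w|) * sum of x in w
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iters):
        # reassignment: assign each vector to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # recomputation: each centroid becomes the mean of its cluster
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # fixed point reached: converged
            break
        centroids = new_centroids
    return centroids, clusters

# toy 2-D "documents" for illustration
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0),
          (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
cents, clusters = kmeans(points, 2)
```

Empty clusters keep their old centroid here; real implementations often reseed them instead.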
Example: Set of points to be clustered

Exercise:
(i) Guess what the optimal clustering into two clusters is in this
case;
(ii) compute the centroids of the clusters
Example: Random selection of initial
centroids
Example: Assign points to closest
centroid
Example: Recompute cluster centroids
Example: Assign points to closest
centroid
Example: Assignment
Example: Recompute cluster centroids
Example: Centroids and
assignments after convergence
K-means is guaranteed to converge:
Proof
• RSS = sum of all squared distances between document vector
and closest centroid
• RSS decreases during each reassignment step.
– because each vector is moved to a closer centroid
• RSS decreases during each recomputation step.
• There is only a finite number of clusterings.
• Thus: We must reach a fixed point.
• Assumption: Ties are broken consistently.
• Finite set & monotonically decreasing → convergence
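The monotone decrease of RSS can also be observed empirically by recording RSS after every iteration. A minimal sketch, assuming a plain K-means loop as described above (the toy 2-D points are invented):

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_rss_trace(points, k, iters=20, seed=0):
    # run K-means and record RSS after every full iteration
    cents = random.Random(seed).sample(points, k)
    trace = []
    for _ in range(iters):
        # reassignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, cents[i]))].append(p)
        # recomputation step (empty clusters keep their old centroid)
        cents = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else cents[i]
                 for i, c in enumerate(clusters)]
        # RSS: sum of squared distances of points to their centroid
        trace.append(sum(sq_dist(p, cents[i])
                         for i, c in enumerate(clusters) for p in c))
    return trace

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
trace = kmeans_rss_trace(points, 2)

# RSS never increases from one iteration to the next
assert all(a >= b - 1e-9 for a, b in zip(trace, trace[1:]))
```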
Recomputation decreases average
distance

RSS_k(v) = Σ_{x ∈ ω_k} ‖v − x‖². Setting each partial derivative
∂RSS_k/∂v_m = Σ_{x ∈ ω_k} 2(v_m − x_m) to zero gives
v_m = (1/|ω_k|) Σ_{x ∈ ω_k} x_m.

The last line is the componentwise definition of the centroid! We
minimize RSS_k when the old centroid is replaced with the new
centroid. RSS, the sum of the RSS_k, must then also decrease
during recomputation.
K-means is guaranteed to converge

• But we don’t know how long convergence will take!
• If we don’t care about a few docs switching back and
forth, then convergence is usually fast (< 10-20
iterations).
• However, complete convergence can take many more
iterations.
Optimality of K-means
• Convergence does not imply optimality
• Convergence does not mean that we converge to the
optimal clustering!
• This is the great weakness of K-means.
• If we start with a bad set of seeds, the resulting
clustering can be horrible.
Initialization of K-means
• Random seed selection is just one of many ways K-means
can be initialized.
• Random seed selection is not very robust: It’s easy to get
a suboptimal clustering.
• Better ways of computing initial centroids:
– Select seeds not randomly, but using some heuristic
(e.g., filter out outliers or find a set of seeds that has
“good coverage” of the document space)
– Use hierarchical clustering to find good seeds
– Select i (e.g., i = 10) different random sets of seeds, do
a K-means clustering for each, select the clustering
with lowest RSS
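The last strategy (run K-means from i different random seed sets and keep the run with lowest RSS) can be sketched as follows. The kmeans helper and toy points are illustrative assumptions, not the lecture's code:

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iters=50):
    # minimal K-means: random seeds, fixed number of iterations
    cents = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, cents[i]))].append(p)
        cents = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else cents[i]
                 for i, c in enumerate(clusters)]
    return cents, clusters

def rss(cents, clusters):
    # residual sum of squares of the final clustering
    return sum(sq_dist(p, cents[i]) for i, c in enumerate(clusters) for p in c)

def kmeans_restarts(points, k, n_restarts=10):
    # run K-means from several random seed sets, keep the lowest-RSS result
    return min((kmeans(points, k, seed) for seed in range(n_restarts)),
               key=lambda res: rss(*res))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
best_cents, best_clusters = kmeans_restarts(points, 2)
```

This is essentially what the n_init parameter of scikit-learn's KMeans automates.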
Evaluation
What is a good clustering?

• Internal criteria
– Example of an internal criterion: RSS in K-means
• But an internal criterion often does not evaluate the actual
utility of a clustering in the application.
• Alternative: External criteria
– Evaluate with respect to a human-defined
classification
External criteria for clustering quality
• Based on a gold standard data set, e.g., the Reuters
collection we also used for the evaluation of classification
• Goal: Clustering should reproduce the classes in the gold
standard
• (But we only want to reproduce how documents are
divided into groups, not the class labels.)
• First measure for how well we were able to reproduce the
classes: purity
External criterion: Purity

• Ω = {ω1, ω2, . . . , ωK} is the set of clusters and
• C = {c1, c2, . . . , cJ} is the set of classes.
• For each cluster ωk : find class cj with most members
nkj in ωk
• Sum all nkj and divide by total number of points
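Purity as described above (majority class per cluster, summed, divided by N) takes only a few lines; the document ids and class labels here are a made-up toy example:

```python
from collections import Counter

def purity(clusters, labels):
    # clusters: list of lists of document ids
    # labels: dict mapping document id -> gold-standard class
    n = sum(len(c) for c in clusters)
    # for each cluster, count the size of its majority class, then sum
    majority = sum(Counter(labels[d] for d in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / n

# toy gold standard: docs 1-2 are class 'x', 3-5 are 'o', 6 is 'd'
labels = {1: 'x', 2: 'x', 3: 'o', 4: 'o', 5: 'o', 6: 'd'}
clusters = [[1, 2, 3], [4, 5, 6]]
# majority classes: 'x' (2 docs) and 'o' (2 docs) -> purity (2 + 2) / 6
print(purity(clusters, labels))
```

Note that purity rewards many small clusters: with one document per cluster it is trivially 1, which motivates the other measures below.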
Another external criterion: Rand index
Rand Index: Example
Rand measure for the o/⋄/x example
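The worked o/⋄/x example is not reproduced in these notes, but the Rand index itself is simple: over all pairs of documents, count the fraction of pairwise decisions on which clustering and gold standard agree, RI = (TP + TN) / (TP + FP + FN + TN). A small sketch (toy ids and labels invented):

```python
from itertools import combinations

def rand_index(assignment, labels):
    # assignment: dict doc id -> cluster; labels: dict doc id -> gold class
    # a pair "agrees" if clustering and gold standard make the same decision:
    # both put the pair together (TP) or both keep it apart (TN)
    pairs = list(combinations(assignment.keys(), 2))
    agree = sum(1 for a, b in pairs
                if (assignment[a] == assignment[b]) == (labels[a] == labels[b]))
    return agree / len(pairs)

labels = {1: 'x', 2: 'x', 3: 'o', 4: 'o'}
assignment = {1: 0, 2: 0, 3: 1, 4: 1}
# clustering exactly reproduces the classes -> every pair agrees
print(rand_index(assignment, labels))  # 1.0
```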
Two other external evaluation measures
• Two other measures
• Normalized mutual information (NMI)
– How much information does the clustering contain
about the classification?
– Singleton clusters (number of clusters = number of
docs) have maximum MI
– Therefore: normalize by entropy of clusters and
classes
• F measure
– Like Rand, but “precision” and “recall” can be
weighted
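A sketch of NMI as described above: mutual information between clustering and classification, normalized by the average entropy of clusters and classes so that singleton clusters no longer score maximally. The toy data is invented for illustration:

```python
import math
from collections import Counter

def entropy(counts, n):
    # entropy of a partition given its part sizes
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def nmi(assignment, labels):
    # assignment: dict doc id -> cluster; labels: dict doc id -> gold class
    n = len(assignment)
    omega = Counter(assignment.values())          # cluster sizes
    classes = Counter(labels.values())            # class sizes
    joint = Counter((assignment[d], labels[d]) for d in assignment)
    # mutual information I(Omega; C)
    mi = sum(nij / n * math.log(n * nij / (omega[w] * classes[c]))
             for (w, c), nij in joint.items())
    # normalize by the average of the two entropies
    denom = (entropy(omega.values(), n) + entropy(classes.values(), n)) / 2
    return mi / denom if denom else 1.0

labels = {1: 'x', 2: 'x', 3: 'o', 4: 'o'}
# clustering reproduces the classes exactly -> NMI = 1.0
print(nmi({1: 0, 2: 0, 3: 1, 4: 1}, labels))
```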
Thank You
