Cluster Search Result
www.elsevier.com/locate/datak
Abstract
We develop a new algorithm for clustering search results. Differently from many other clustering systems that have been recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy called dynamic SVD clustering to discover the optimal number of singular values to be used for clustering purposes. Moreover, the algorithm is such that the SVD computation step has in practice good performance, which makes it feasible to perform clustering when term vectors are available. We show that the algorithm has very good classification performance, and that it can be effectively used to cluster the results of a search engine to make them easier to browse by users. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.
2006 Elsevier B.V. All rights reserved.
Keywords: Web; Search engines; Clustering; Latent semantic indexing; Document classification
doi:10.1016/j.datak.2006.10.006
[13], the transition from the current syntactic Web to the next generation Web appears to be a slow process
and these proposals can hardly be considered as a ready-made solution to today's search problems.
At the moment, the most promising improvement over traditional search techniques are the so-called clustering engines. The idea of clustering search results is not new, and has been investigated quite deeply in information retrieval, based on the so-called cluster hypothesis [34], according to which clustering may be beneficial to users of an information retrieval system, since it is likely that results that are relevant to the user are close to each other in the document space, and therefore tend to fall into relatively few clusters.
Several commercial clustering engines have recently emerged on the market. Well known examples are Vivisimo [35] and Grokker [12]. Although the underlying clustering techniques are not fully disclosed to the public,
based on the available documentation it is possible to say that these systems share a number of common features with research systems introduced in the literature, mainly the Grouper system [38,39] and Lingo/Carrot
Search [23,25]. We summarize these features in the following.
First, these tools are usually not search engines by themselves. On the contrary, when a user poses a query,
the clustering engine uses one or more traditional search engines to gather a number of results; then, it does a
form of post-processing on these results in order to cluster them into meaningful groups. The cluster tree is
then presented to the user so that s/he can browse it in order to explore the result set. Fig. 1 shows the clusters
produced by Vivisimo for the query term "power". It can be seen that such a technique may be helpful to users, since they can quickly grasp the different meanings and articulations of the search terms, and more easily select a subset of relevant clusters.
Being based on a post-processing step, all of these clustering engines work by analyzing snippets, i.e., short
document abstracts returned by the search engine, usually containing words around query term occurrences.
The reason for this is performance: each snippet contains from 0 to 40 words, and therefore can be analyzed
very quickly, so that users do not experience excessive delays due to the clustering step. However, snippets are
often hardly representative of the whole document content, and this may in some cases seriously worsen the
quality of the clusters.
1.1. Contributions of the paper
This research aims at reconsidering some of the techniques for clustering search results. More specifically, we want to investigate the trade-off between performance and quality of the clustering when choosing snippets
versus whole documents. This is particularly relevant if we consider that in some emerging contexts, like for
example, desktop search engines, (a) it is reasonable to assume that the document-term vectors are available to
the clustering engine; (b) snippets may not be available at all, depending on the different nature of the documents.
The main goal of the paper is to evaluate the quality of the clustering in terms of its ability to correctly
classify documents, i.e., to dynamically build a set of clusters that correctly reflect the different categories in the document collection returned by the search engine. Similarly to [7], we believe that this ability may assist less-skilled users in browsing the document set and finding relevant results.
The main contributions of the paper can be summarized as follows:
we develop a new document clustering algorithm called dynamic SVD clustering (DSC), based on latent semantic indexing [9]; the novelty of the algorithm is twofold: (a) first, it is based on an incremental computation of singular values, and does not require computing the whole SVD of the original matrix; (b) second, it uses an original strategy to select k, i.e., the number of singular values used to represent the concepts in the document space; differently from other proposals in the literature, like for example [28] or [23], our strategy does not assume a fixed value of k, nor a fixed approximation threshold;
based on experimental results, we show that the algorithm has very good classification power; in many cases it is able to cluster pre-classified document collections with 100% accuracy; it is worth noting that the quality of the classification severely degrades when snippets are used in place of the whole document content, thus providing further evidence that snippets are often too poor and not sufficiently informative; by comparing our results to those of other proposals in the literature, we show that our algorithm has comparatively better classification performance;
finally, we show that the complexity of the chosen SVD computation strategy is such that the algorithm has in practice good performance, and lends itself to a very natural clustering strategy based on the minimum spanning tree of the projected document space.
To the best of our knowledge, this is the first paper to propose a dynamic strategy to discover the optimal number of singular values to be used in a classification task. This strategy represents the main contribution of
this paper.
The paper is organized as follows. Section 2 introduces a number of preliminary definitions and discusses some of the techniques that will be used in the rest of the paper. Section 3 is devoted to the description of the clustering algorithm, and Section 4 discusses implementation and experimental results. Related work is discussed in Section 5.
2. Preliminaries
This section introduces a number of techniques that will be used in the rest of the paper.
2.1. Document indexing
As is common in information retrieval, we represent documents as vectors in a multidimensional term space [34]. Documents are first preprocessed to remove stop words. Terms may be stemmed. There are several weighting schemes [34,8] that can be used to construct document vectors. Generally speaking, the weight wij of term ti in document Dj, i.e., the ith coordinate of vector vj, is given by the product of three different factors:
wij = Lij · Gi · Nj
where Lij is the local weight of term i in document j, Gi is the global weight of term i in the document collection,
and Nj is the normalization factor for document j.
We have experimented with several weighting schemes, as follows. Let us call fij the frequency of term i in document j, Fi the global frequency of term i in the whole document collection, ni the number of documents in which term i appears, N the total number of documents, and m the size of a document vector vj.
Table 1 summarizes the three forms of local weight that have been considered in the paper.
507
Table 1
Local weights

Id      Formula                                             Description
FREQ    Lij = fij                                           Term frequency
LOGA    Lij = 1 + log(fij) if fij > 0; 0 otherwise          Augmented logarithmic
SQRT    Lij = 1 + sqrt(fij - 0.5) if fij > 0; 0 otherwise   Square root
Table 2
Global weights

Id      Formula            Description
NONE    Gi = 1             No global weight
IGFF    Gi = Fi / ni       Global frequency over document frequency
IDFB    Gi = log(N / ni)   Inverse document frequency
Table 3
Normalization weights

Id      Formula                                   Description
NONE    Nj = 1                                    No normalization
COSN    Nj = 1 / sqrt(Σ_{i=1}^{m} (Gi · Lij)^2)   Cosine normalization
With respect to global weights, we have considered the following (see Table 2).
Finally, with respect to normalization factors, we have considered the following (see Table 3).
Note that, when using the cosine normalization factor, all term vectors have length 1.
In order to compare distances and similarities between vectors, we use cosine similarity, that is, we compute
the similarity s(vi, vj) between vectors vi and vj as the cosine of their angle θ, as follows:

s(vi, vj) = cos(θ) = (vi · vj) / (‖vi‖ ‖vj‖)

where vi · vj is the dot product of the two vectors, and ‖vi‖ ‖vj‖ is the product of their norms. Since all vectors have length 1, we have that s(vi, vj) = cos(θ) = vi · vj.
2.2. Latent semantic indexing
Latent semantic indexing (LSI) [9] is a document projection technique based on singular value decomposition
(SVD). Suppose we are given d documents and t terms; let us represent each document as a term vector of t coordinates. We call A the t × d matrix whose columns are the term vectors; let r be the rank of A. SVD decomposes matrix A into the product of three new matrices, as follows:

A = U Σ V^T = Σ_{i=1}^{r} σi ui vi^T

where Σ is a diagonal matrix of size r × r made of the singular values of A in decreasing order, Σ = diag(σ1, σ2, ..., σr), σ1 ≥ σ2 ≥ ... ≥ σr > 0; U and V are of size t × r and d × r, respectively; vectors ui are called the left singular vectors and vi^T the right singular vectors of A.
A singular value and its left and right singular vectors are also called a singular triplet. In order to obtain an approximation of A, let us fix k ≤ r and call Σk = diag(σ1, σ2, ..., σk), i.e., the k × k head minor of Σ. Similarly, let us call Uk the restriction of U to the first k left singular vectors, and similarly for Vk^T; we define:
Ak = Uk Σk Vk^T
It is possible to prove [9] that Ak is the best k-rank approximation of the original matrix A. Informally speaking, by appropriately choosing a value for k we are selecting the largest singular values of the original matrix, and ignoring the smallest ones, therefore somehow preserving the main features of the original vector space while filtering out some noise. It can be seen that the smaller k is with respect to r, the less the
new space resembles the original one. In traditional information retrieval applications, in which LSI is used
as a means to improve retrieval performance, it is crucial that the essential topological properties of the
vector space are preserved; as a consequence, the value of k is usually quite high (empirical studies [28]
show that for a typical information retrieval application a value between 150 and 300 is usually the best
choice).
It is known [4] that the computation of the singular triplets of a matrix A may be reduced to the computation of eigenvalues and eigenvectors of the matrix AA^T or A^T A. Therefore, in order to discuss the complexity of SVD, we
shall refer to the complexity of the eigenvalue problem. In particular, there are several algorithms for computing eigenvalues and eigenvectors of a matrix. One example are the implicitly restarted Arnoldi/Lanczos methods [32] (one of these methods is also used in the well-known Matlab toolkit to compute eigenvalues/eigenvectors). This family of methods is particularly relevant to this work, since it can be implemented in an incremental way, that is, to obtain the first k eigenvalues one by one. It also has very interesting computational complexity; more specifically, it has been shown that it can be used to compute the first k singular triplets in time 2k²n + O(k³) using storage 2kn + O(k³) [6].
3. Clustering algorithm
In this section we introduce the clustering algorithm used by Noodles. We shall rst give some insight on
the main ideas behind the algorithm, and then elaborate on the technical details.
3.1. Intuition
Our algorithm is heavily based on latent semantic indexing. LSI has a natural application in clustering and classification applications. It is often said that, by means of SVD, LSI performs a transformation of the original vector space, in which each coordinate is a term, into a new vector space, in which each coordinate is some relevant concept in the document collection, i.e., some topic that is common to several documents. Note that the coordinates of the original documents in this space are given by matrix Vk Σk. We shall call this d × k space the projected document space.
In this respect, a possible approach to clustering would be the following: (a) compute the SVD of the original matrix, for some value k, to obtain a representation of the original documents in the new concept space, Vk Σk, in which each coordinate represents some topic in the original document collection, and therefore some cluster of documents; (b) run a clustering algorithm in this space to cluster documents with respect to their topics. This method has been used for clustering purposes, for example, in [23].
A critical step, in this approach, is the selection of k. A natural intuition suggests that, assuming the document collection contains x hidden clusters, the natural value of k to be used for SVD is exactly x. This is a consequence of the fact that one of the properties of SVD is that of producing in Vk Σk an optimal alignment of the original documents along the k axes [26]. However, such a value is unknown and must be discovered by the clustering engine. There are a few interesting observations with respect to this point:
first, such a value of k can be significantly lower than the number of documents, d, and the number of terms, t, since it is unlikely that there are more than a dozen relevant clusters among the results of a search; this means that the projected document space does not preserve much of the features of the original space; this is, however, not a concern, since in our case we are using this space only as a means to discover clusters, and not for retrieval purposes;
it is therefore apparent that fixing the value of k, for example assuming k in the order of 150-300, as is often done in the literature, would not give good classification performance; the fact that lower values of k are to be preferred in clustering applications is confirmed, for example, in [28]; nevertheless, in the latter work fixed values of k = 20 and k = 50 are compared, whereas the optimal value should ideally be discovered dynamically;
in light of these considerations, we also discard the other typical approach used to establish the value of k, i.e., that of fixing a threshold for the approximation that Ak gives of the original space A, as is done, for example, in [23].
In fact, the strategy used to dynamically select the optimal value of k is one of the main original contributions of our algorithm.
The main intuition behind the selection of k is that, assuming the original collection contains x clusters (where each cluster informally corresponds to some clearly defined topic), then the points in the projected document space should naturally fall into x clusters that are reasonably compact and well separated.
To see this, consider for example Fig. 2, which refers to one of our test document collections, corresponding to search results for the query term "life". We have identified four natural clusters in the collection, namely: (a) documents about biological life; (b) documents about life cycles of a system or a living entity; (c) documents about social life and related events; (d) documents about the Game of Life. The figure shows the minimum spanning tree of document distances in space V4 Σ4. Edges are drawn differently based on their length; more specifically, the longest ones are drawn as a double line, the shortest ones as a dashed line. It can be seen that, on the one side, SVD has clearly brought documents belonging to the same cluster close to each other, and, on the other side, that an accurate clustering can be obtained simply by removing the k − 1 longest edges, that is, by applying a variant of a spanning-tree based clustering algorithm [37,16].
Based on this observation, our algorithm can be informally sketched as follows:
given the original term-document space A, we incrementally compute the SVD, starting with a low number of singular values, and, for each value of k, generate the projected space Vk Σk;
we find the minimum spanning tree of points in the projected space; assuming we have found the optimal value for k, we stop the algorithm and obtain our clusters simply by removing the k − 1 longest edges from the minimum spanning tree, and considering each connected component as a cluster;
to discover such an optimal value of k, we define a quality function for the clusters obtained from the minimum spanning tree; intuitively, this function rates higher those trees in which the k − 1 longest edges are significantly longer than the average (details are given below); based on this, we stop as soon as we find a local maximum of the quality function; this is a hint that we have found a natural distribution of documents in the projected space, which would be worsened if we chose higher values of k, i.e., more clusters.
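The spanning-tree step of the sketch above can be illustrated with SciPy's graph routines (a sketch under the assumption of Euclidean distances in the projected space; the function name and the toy points are our own):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(points, k):
    # Build the complete distance graph, take its minimum spanning tree,
    # drop the k - 1 longest edges, and read clusters off the components.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    mst = minimum_spanning_tree(csr_matrix(d)).toarray()
    if k > 1:
        edges = sorted(map(tuple, np.argwhere(mst > 0)), key=lambda e: mst[e])
        for e in edges[-(k - 1):]:
            mst[e] = 0.0
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

# Two well-separated groups of points fall into two clusters.
pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels = mst_clusters(pts, k=2)
assert labels[0] == labels[1] and labels[2] == labels[3]
assert labels[0] != labels[2]
```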
A plot of the quality function for the "life" example discussed above is reported in Fig. 3. It can be seen that the quality function has a maximum for k = 4, which is exactly the number of clusters we are
seeking in this example. It is also worth noting that such a strategy nicely solves a typical problem of
clustering algorithms, i.e., that of dynamically choosing the right number of clusters for a given dataset
[16].
The following section describes the clustering algorithm in detail.
3.2. Description of the dynamic SVD clustering algorithm
Let us first formalize the quality function for minimum spanning trees. We fix a value of k, and assume we have computed the projected space Vk Σk by means of SVD. Let us call MSTk the minimum spanning tree of document distances in the projected space. We call the k − 1 longest edges in the tree e1, e2, ..., ek−1, respectively.
We find the average avgk and the standard deviation σk of the distance distribution in the projected space. Based on this, we assign to each edge e of length l(e) in the spanning tree a cost c(e) as follows:

c(e) = l(e) − (avgk + σk)
Then, we define the quality Q(MSTk) of MSTk as follows:

Q(MSTk) = k · Σ_{i=1}^{k−1} c(ei)
Intuitively, the quality function is higher when the edges that are candidates for removal have lengths that are significantly above the average. The term k is necessary as a multiplicative factor since we have observed that, by increasing the value of k, the distribution of distances in the space changes significantly. More specifically, for higher values of k we have both a higher average distance and a higher standard deviation, i.e., the overall space is more dispersed, and therefore edges tend to make a smaller contribution to the quality function.
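Under these definitions, the quality function can be sketched as follows (our own helper, assuming the MST edge lengths and the pairwise distances of the projected space are given):

```python
import math

def quality(mst_edge_lengths, distances, k):
    # Q(MST_k) = k * sum of c(e_i) over the k - 1 longest MST edges,
    # with c(e) = l(e) - (avg_k + sigma_k) computed over the distance
    # distribution of the projected space.
    avg = sum(distances) / len(distances)
    sigma = math.sqrt(sum((d - avg) ** 2 for d in distances) / len(distances))
    longest = sorted(mst_edge_lengths)[-(k - 1):]
    return k * sum(l - (avg + sigma) for l in longest)
```

For instance, with distances [1, 1, 1, 1, 10] (average 2.8, standard deviation 3.6) and MST edges [1, 1, 10], k = 2 selects only the longest edge and yields 2 · (10 − 6.4) = 7.2.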
This said, we can formalize the DSC algorithm as follows. The input to the algorithm is a document collection D = {D0, D1, ..., Dd}. These may either be whole documents, or document summaries, such as snippets. We assume that documents have been previously indexed, i.e., for each document Di in D we have built a term vector after performing stop-word removal. Note that we are not fixing the weighting scheme used to construct the vectors. In Section 4 we compare the impact of the different weighting schemes on the clustering.
derivation of the minimum spanning tree are likely to have the highest impact in terms of running time. In
practice, however, SVD computation is by far the bottleneck in terms of efficiency.
4. Implementation and experiments
The clustering algorithm has been implemented in the Noodles desktop search engine [30], a snapshot of which is shown in Fig. 4. The system is written in Java, using Apache Lucene (http://lucene.apache.org) as an indexing engine, and the Spring Rich Client Platform (http://www.springframework.org) as a desktop application framework. It has been conceived as a general-purpose search tool that can run both Web and desktop searches. In order to perform desktop searches, it incorporates a crawler module that runs as a daemon and incrementally indexes portions of the local disks that have been specified by the user.
It also incorporates a testbed for the clustering engine, which we have used to conduct a number of experiments. Overall, we have run several hundreds of experiments, combining different datasets, different summarization schemes, different weighting schemes, different criteria for the selection of k in SVD, and different clustering algorithms. The most interesting experimental evidence is described in this section.
4.1. Description of the datasets
We concentrated on two different categories of datasets. To construct datasets of the first category, we ran queries on Google and selected a number of the top-ranked search results. Those results were manually classified into a number of clusters. Then, the algorithm was run on the document collection to compare the suggested clusters with those identified manually. Although this kind of experiment closely mimics the situation in which a user runs a query on a search engine and then inspects the results browsing the clusters produced by the clustering engine, the manual classification step necessary to assess the quality of the clustering tends to be labor-intensive and quite error-prone. As a consequence, document collections of this first category tend to be quite small.
In order to test the classification power of the algorithm on larger document collections, we implemented a
data fetching utility for the Open Directory Project (DMOZ) [10]. This utility was used in order to sample
categories inside DMOZ, and then run the clustering algorithm on the pre-categorized samples. In this case,
the clusters produced by the algorithm were compared with the original DMOZ categories.
Overall, we selected 12 different datasets, ranging from 16 to about 100 documents; three datasets are of the first category, and nine of the second category. The datasets are described in Table 4. For each dataset, we have worked both on the full documents and on snippets.
4.2. Quality measures
In order to assess the clusters produced by the algorithm, we have computed several quality measures.
Table 4
List of datasets

Dataset         Description                      docs   clust
Life            Google search results ("life")   19     4
Amazon          Google search results            16     3
Power           Google search results ("power")  20     4
DMOZsamples3    Sampling of DMOZ categories      69     3
DMOZsamples4A   Sampling of DMOZ categories      52     4
DMOZsamples4B   Sampling of DMOZ categories      97     4
DMOZsamples4C   Sampling of DMOZ categories      33     4
DMOZsamples5A   Sampling of DMOZ categories      40     5
DMOZsamples5B   Sampling of DMOZ categories      53     5
DMOZsamples6    Sampling of DMOZ categories      63     6
DMOZsamples7    Sampling of DMOZ categories      90     7
DMOZsamples8    Sampling of DMOZ categories      73     8

(The Clusters column, which listed the cluster names for each dataset, was lost in extraction.)
As a primary measure of quality we use the F-measure [34], i.e., the harmonic mean of precision and recall. More specifically, given a cluster of documents C, to evaluate its quality with respect to an ideal cluster Ci we first compute precision and recall as usual:
Pr(C, Ci) = |C ∩ Ci| / |C|

Rec(C, Ci) = |C ∩ Ci| / |Ci|
Then, we define:

F(C, Ci) = 2 · Pr(C, Ci) · Rec(C, Ci) / (Pr(C, Ci) + Rec(C, Ci))
Given a collection of clusters {C1, C2, ..., Ck}, to evaluate its F-measure with respect to a collection of ideal clusters {Ci1, Ci2, ..., Cih} we do as follows: (a) we find for each ideal cluster Cin a distinct cluster Cm that best approximates it in the collection to evaluate, and evaluate F(Cm, Cin); (b) then, we take as the F-measure of the collection the average value of the F-measures over all ideal clusters.
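This evaluation can be sketched as follows (a simplification that, for each ideal cluster, greedily picks the best-matching produced cluster without enforcing the distinctness constraint mentioned above):

```python
def f_measure(c, ci):
    # Harmonic mean of precision and recall of cluster c w.r.t. ideal ci.
    c, ci = set(c), set(ci)
    inter = len(c & ci)
    if inter == 0:
        return 0.0
    pr, rec = inter / len(c), inter / len(ci)
    return 2 * pr * rec / (pr + rec)

def collection_f_measure(clusters, ideal_clusters):
    # Average, over the ideal clusters, of the best F-measure achieved
    # by any produced cluster.
    return sum(max(f_measure(c, i) for c in clusters)
               for i in ideal_clusters) / len(ideal_clusters)
```

For example, a clustering that exactly reproduces the ideal clusters scores 1.0, while `f_measure([1, 2, 3], [1, 2])` has precision 2/3 and recall 1, giving F = 0.8.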
To gain an in-depth understanding of the classification power of our technique, for each dataset we have computed two different F-measures. First, we ran the clustering algorithm and stopped at the value of k chosen by the algorithm; this experiment was used to compute a first F-measure for the clustering algorithm. However, since we were also interested in evaluating the classification power of our SVD-based technique independently of the stop criterion, we also ran a second experiment, in which we computed clusterings for all values of k, from 2 to n, and took the maximum F-measure obtained by the algorithm. In the following, we shall refer to this second measure as F-measure Max.
Other works in the literature have used different quality metrics to assess their experimental results. To compare our results to these, we have also computed several other quality measures. More specifically, Grouper [38] introduced a customized quality measure, called the Grouper Quality Function, which in essence looks at all pairs of documents in a single cluster and counts the number of true positive pairs (the two documents were also in the same ideal cluster) and false positive pairs (they were not in the ideal cluster); then, it combines these two figures to calculate an overall quality function. Lingo [23] uses an alternative measure, cluster contamination [22], which intuitively measures how much a cluster produced by the system mixes documents that should ideally belong to different classes; of course, 1 − cluster contamination can be considered as a measure of the purity of the clustering.
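One common formalization of this idea, counting the fraction of mixed-class pairs inside a produced cluster, can be sketched as follows (the exact normalization used in [22] may differ):

```python
from itertools import combinations

def contamination(cluster, label_of):
    # Fraction of document pairs in the cluster whose members belong to
    # different ideal classes; 1 - contamination then measures purity.
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    mixed = sum(1 for a, b in pairs if label_of[a] != label_of[b])
    return mixed / len(pairs)
```

A pure cluster scores 0 (so 1 − contamination = 1), while a cluster that evenly mixes two classes approaches the maximum contamination.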
Besides F-measure and F-measure Max, we also computed the Grouper Quality Function and cluster contamination for our experiments.
4.3. Indexing scheme
We are now ready to discuss a number of experimental results. As a first step, we dealt with the problem of selecting the correct weighting scheme. We ran several preliminary experiments, both on full documents and on snippets, with and without SVD, comparing six weighting schemes obtained by the combination of different local and global weights. In all cases we used cosine normalization, since we soon noticed that results strongly degrade in the absence of normalization. Fig. 5 shows average F-measures for the six weighting schemes.
It can be seen that the best results were obtained using the augmented logarithmic local weight and no global weight. Therefore, we adopted this weighting scheme for the rest of our experiments. This phenomenon seems to be related to the adoption of LSI. In essence, global weights are often used in information retrieval as a means to reduce noise in the document space; to give an example, one typical consequence of the adoption of an inverse document frequency global weight is that of assigning weight 0 to all terms that appear in every document. However, similar effects are also obtained by applying SVD. It seems that SVD has indeed superior noise-reduction power in clustering applications, so that global weights are no longer necessary. To confirm this hypothesis, we performed several other experiments on document summaries. Being Web pages, most of our documents typically contain some header (e.g., banner and navigation links) and footer material. We ran the clustering algorithm after removing such sections, which in most cases are rather uninformative; in essence, we extracted what, in information extraction terms, is called a rich region from the pages. Also in this
case, no significant improvement in F-measures was observed (around 2%), thus confirming the intuition that SVD does a very good job in filtering out these noisy portions of the documents.
A second preliminary evaluation was related to the adoption of stemming after stop-word removal in the document indexing phase. We compared the F-measures obtained in the 12 experiments after stemming to those without stemming. As can be seen in Fig. 6, there is no significant improvement in quality related to the adoption of stemming. However, we noticed that stemming on average reduced the number of terms by 23%, therefore making matrices smaller and computation more efficient. As a consequence, we decided to consistently perform stemming.
Fig. 6 also shows a comparison of the F-measure and maximum F-measure obtained by the algorithm on full documents and on document snippets. It can be easily seen that performance degrades by more than 40% when analyzing snippets. This seems to confirm one of our initial assumptions, i.e., that snippets are often too short to represent a meaningful summary of the document, and that the analysis of the whole document content is very important in order to improve the quality of the classification. It is worth noting that this is somehow in contrast with other works, for example [38], according to which there is no significant degradation when going from documents to snippets. Such a big discrepancy might be justified by the completely different nature of our approach with respect to theirs.
Table 5
Quality measures

Collection      Pre-classified  Generated  F-measure  F-measure Max  Grouper       1 - Cluster
                clusters        clusters   (%)        (%)            quality (%)   contamination (%)
Life            4               4          100        100            100           100
Amazon          3               3+1        90.67      100            100           100
Power           4               4          89         91             53.4          80.8
DMOZsamples3    3               4+1        91.7       100            100           100
DMOZsamples4A   4               4          100        100            100           100
DMOZsamples4B   4               4+1        98.25      100            100           100
DMOZsamples4C   4               5          96         96             100           100
DMOZsamples5A   5               5+1        97.6       98.6           100           100
DMOZsamples5B   5               5          100        100            100           100
DMOZsamples6    6               6+1        77.5       99             72.5          92
DMOZsamples7    7               9          92.4       92.4           95.1          96.3
DMOZsamples8    8               7+1        77         89.37          62.5          84.2
Average                                    92.5       97.2           90.3          96.1
the net effect of such an organization of the user interface is that, in the user's perception, the clustering is almost immediate.
Fig. 9 reports computing times for the 12 experiments. More specifically, for each experiment, we recorded the number of seconds spent in the two most expensive steps, namely computing the SVD and generating the minimum spanning trees. The remaining steps have negligible running times. As can be seen, the computation of minimum spanning trees is not a real issue, since in all cases it took less than 2 s (including the generation of distance matrices in the projected spaces).
On the contrary, SVD computations are significantly more expensive, thus confirming that this is the most delicate task in the algorithm. On average, given the low values of k, our incremental SVD computation strategy performed rather well. However, in three of our experiments with the largest documents, we
had computing times above 10 s. While these times may still be considered acceptable, we believe that performance was negatively influenced by the implementation of SVD used in the system. After a comparative analysis, we selected the COLT matrix manipulation package (http://dsd.lbl.gov/hoschek/colt), one of the open source matrix packages available in Java, and customized the SVD algorithm to fit our needs. However, we noticed that COLT performance tends to severely degrade on large-size matrices. To confirm this impression, we compared its performance to that of Matlab and JScience (http://www.jscience.org), the newest math package for the Java language. We noticed that the Matlab implementation of the Arnoldi/Lanczos method is orders of magnitude faster than that of COLT. We suspect that this might be a consequence of a poor implementation of the sparse-matrix routines in COLT. To confirm this impression, we ran a number of comparative tests using JScience; also in this case, JScience (which is based on a high-performance library called Javolution, http://javolution.org) showed significantly better performance both in terms of computing time and heap usage with respect to COLT and other similar packages. Unfortunately, JScience currently does not offer support for SVD, although this might be added in a future version. Re-implementing Arnoldi/Lanczos methods in JScience was beyond the purpose of this paper. Nevertheless, we believe that JScience might significantly improve computing times once support for SVD is implemented.
5. Related work
As we have discussed in Section 1, there are several commercial search engines that incorporate some form
of clustering. Besides Vivisimo [35] and Grokker [12], other examples are Ask.com [2], iBoogie [15], Kartoo
[17], and WiseNut [36].
In fact, the idea of clustering search results as a means to improve retrieval performance has been investigated quite deeply in information retrieval. A seminal work in this respect is the Scatter/Gather project [14,31]. Scatter/Gather provides a simple graphical user interface to do clustering on a traditional information retrieval system. After the user has posed her/his query, s/he can decide to scatter the results into a fixed number of clusters; then, s/he can gather the most promising clusters, possibly to scatter them again in order to further refine the search. One limitation of Scatter/Gather is the fact that the system is not able to
4 http://dsd.lbl.gov/hoschek/colt.
5 http://www.jscience.org.
6 http://javolution.org.
infer the optimal number of clusters for a given query, and requires that it be specified by the user in advance.
This may in some cases have an adverse effect on the quality of the clustering.
Other traditional works on clustering along the same lines include [7,1,28]. In [7] the authors develop a
classification algorithm for Web documents, based on Yahoo! categories. The classification algorithm learns
representative terms from a number of pre-classified documents, and then tries to assign new documents to the
correct category. The reported recall values are in the range 60–90%. More recently, the issue of classifying
Web documents into a hierarchy of topics has been studied in [18].
The Paraphrase system [1] introduces latent semantic indexing as a means to cluster search results. The
paper focuses on the relationship between the clusters produced by LSI and their labels, obtained by taking the
most representative words of each cluster. The latter technique is similar to the one we use for labeling our
clusters. Besides this, the authors follow a rather typical approach to the selection of the number k of singular values. More specifically, k is fixed and equals 200; in fact, values in the range 100–200
were considered optimal in retrieval applications, competitive with or even superior to term-based similarity
search.
In [28] the authors further explore the use of projection techniques for document clustering. Their results
show that LSI performs better than document truncation. Also in this case, the number k of singular values is
fixed. The authors compare the quality of the clustering for k = 20, k = 50 and k = 150. An interesting point
is that, although k = 150 was considered a more typical value, the paper concludes that in clustering applications k = 20 gives better performance and clustering quality. This conclusion is consistent with our idea that the
optimal value of k for clustering documents is equal to the number of classes (or ideal clusters), and therefore
usually much lower than the number of documents.
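This intuition can be checked on synthetic data (a toy illustration of ours, not an experiment from [28]): when documents are drawn from c well-separated classes, the singular value spectrum of the term-document matrix drops sharply after position c, suggesting k = c as the natural truncation point.

```python
# Toy illustration: documents drawn from c disjoint topics yield a
# term-document matrix whose singular values drop sharply after the c-th.
import numpy as np

rng = np.random.default_rng(42)
c, docs_per_class, terms_per_class = 4, 15, 30
n_terms = c * terms_per_class

blocks = []
for i in range(c):
    block = np.zeros((n_terms, docs_per_class))
    # each class draws its term weights from its own band of terms
    block[i * terms_per_class:(i + 1) * terms_per_class, :] = \
        1.0 + rng.random((terms_per_class, docs_per_class))
    blocks.append(block)
# term-document matrix: c topical blocks plus light background noise
A = np.hstack(blocks) + 0.05 * rng.random((n_terms, c * docs_per_class))

s = np.linalg.svd(A, compute_uv=False)
print([round(float(x), 2) for x in s[:6]])
# the first c values dominate; the gap s[c-1]/s[c] marks the cluster count
```

The large gap between the c-th and (c+1)-th singular values is the kind of signal a dynamic selection strategy can exploit, instead of fixing k in advance.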
The advent of Web search engines brought about a shift of perspective on the problem of clustering. The
requirements for effective clustering of Web documents are summarized in [19], with emphasis on performance. In fact, most of the recent clustering systems for the Web share a number of common features, namely:

- they work in conjunction with one or more traditional search engines in order to run the queries submitted by users and gather top results;
- clustering is based on a form of post-processing of the document snippets returned by the back-end search engine;
- the clustering algorithm is based on some form of common phrase extraction.
Grouper [38,39] is a snippet-based clustering engine built on the HuskySearch meta search engine. The
main feature of Grouper is the introduction of a phrase-analysis algorithm called STC (Suffix Tree Clustering). In essence, the algorithm builds a suffix tree of the phrases in snippets; each representative phrase becomes
a candidate cluster; candidates with large overlap are merged together. The main contribution of Grouper
lies in the low complexity of the clustering algorithm, which allows for very fast processing of large result
sets. However, due to the inherent limitations of snippets, the quality of the results is not always excellent. A key observation of the paper is that there is no significant degradation in the quality of clusters when going from full
documents to snippets. This experimental evidence for the STC algorithm is largely in contrast with our
experimental results on DSC.
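The candidate-and-merge scheme of STC can be sketched as follows (a simplified illustration of ours, not Grouper's implementation: shared word bigrams stand in for the suffix-tree phrases, and the overlap threshold is ours):

```python
# Simplified STC-style sketch: phrases shared by several snippets become
# candidate clusters; candidates with high document overlap are merged.
def shared_phrases(snippets, n=2):
    """Map each word n-gram to the set of snippet ids containing it."""
    index = {}
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            index.setdefault(tuple(words[i:i + n]), set()).add(doc_id)
    return index

def stc_clusters(snippets, min_docs=2, overlap=0.5):
    # candidate clusters: document sets of phrases shared by enough snippets
    candidates = [docs for docs in shared_phrases(snippets).values()
                  if len(docs) >= min_docs]
    clusters = []
    for cand in candidates:
        for cl in clusters:
            # merge candidates whose Jaccard overlap with a cluster is high
            if len(cand & cl) / len(cand | cl) > overlap:
                cl |= cand
                break
        else:
            clusters.append(set(cand))
    return clusters

docs = ["apple pie recipe", "easy apple pie", "jaguar car review", "new jaguar car"]
print(stc_clusters(docs))  # two clusters: {0, 1} and {2, 3}
```

Since the suffix tree is built in a single pass over the snippets, the real algorithm runs in time linear in the total text length, which is the source of Grouper's speed on large result sets.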
Grouper has inspired a number of other proposals along the same lines. Two examples are SHOC [41]
and Lingo/Carrot Search [23,25]. Both these works extend the STC algorithm with the use of SVD in order
to filter some of the noise in the snippets and improve the quality of the produced clusters. SHOC's clustering is
based on two steps: during the first step, phrase analysis is used to generate a snippet-topic matrix in which
a topic is either a term or a phrase; then, as a second step, SVD is performed on the matrix in order to
identify the most relevant topics. A key difference with our work is that the stop criterion for SVD is based
on a fixed approximation threshold. No experimental results are reported in [41] to assess the quality of the
clustering algorithm.
Lingo/Carrot Search is similar in spirit, but it uses a different strategy. A primary concern is to produce
meaningful descriptions for the clusters. To do this, SVD is first performed on a snippet-term matrix to identify a number of relevant topics. Also in this case, the selection of k is based on a fixed approximation threshold specified by the user. Then, phrase analysis is done to identify, for each of the selected topics, a phrase that
represents a good description. Finally, documents are assigned to clusters based on the contained phrases.
Experimental results in terms of cluster contamination have been reported in [22].
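The fixed-threshold stop criterion used by SHOC and Lingo can be sketched as follows (a generic reconstruction of ours; the threshold q and function names are hypothetical): k is chosen as the smallest value for which the rank-k approximation retains a fraction q of the Frobenius norm of the matrix.

```python
# Sketch of the fixed-threshold criterion for choosing k, reconstructed
# generically from the description above; the threshold q is user-supplied.
import numpy as np

def choose_k(singular_values, q=0.9):
    """Smallest k whose rank-k approximation retains a fraction q
    of the Frobenius norm (sum of squared singular values)."""
    energy = np.cumsum(np.square(singular_values))
    return int(np.searchsorted(energy / energy[-1], q ** 2)) + 1

s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
print(choose_k(s, q=0.9))  # k = 2 for this spectrum
```

In contrast with this criterion, our DSC strategy looks for structure in the spectrum itself rather than a user-supplied quality threshold, which is why it can adapt k to the number of underlying classes.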
Other works along the same lines are [40,11]. In all these cases, variants of phrase-analysis algorithms are
developed in order to improve the quality of the original STC algorithm. As an example, in [40] the problem of
improving cluster labels is studied. A machine learning approach is used: given a corpus of training data, the
system is able to identify the most salient phrases and use them as cluster titles. The algorithm is therefore
supervised. It is worth noting that this work was developed by Microsoft Research Asia, and has been used
to provide an experimental clustering service based on MSN contents [33].
SnakeT [11] introduces advanced cluster labeling with variable-length sentences, based on the use of gapped sentences, i.e., sentences made of terms that may not appear contiguously in the original snippet. The
main focus of the system is on personalization: the authors show how to plug the clustering engine into a
search engine to obtain a form of personalization.
A different approach has been recently undertaken in the Infocious system [20]. Infocious is a full-fledged search engine based on natural language processing techniques. Linguistic processing is built into
the search engine in order to provide a deeper form of processing of documents and search terms, and a
much richer search interface to users. The authors compare the classification power of a Naive Bayes
classifier to that of a classifier enhanced with NLP techniques, and show that the latter may reduce the error rate
by about 7%.
6. Conclusions
The paper has introduced a new algorithm for clustering the results of a search engine. With respect to snippet-based clustering algorithms, we have shown that, by considering the whole document content and employing appropriate SVD-based compression techniques, it is possible to achieve very good classification results,
significantly better than those obtained by analyzing document snippets only.
We believe that these results represent promising directions to improve the quality of clustering in all contexts in which it is reasonable to assume that document vectors are available to the clustering system. One of
these cases is that of desktop search engines. This impression is confirmed by our experience with a prototypical implementation of the algorithm in the Noodles desktop search engine, which showed how the proposed
approach represents a good compromise between quality and efficiency.
Another possible context of application is one in which the clustering algorithm is integrated into the
search engine, and not run as a post-processing step. In fact, Google has experimented for a while with forms
of clustering of their search results [21], and has recently introduced a "Refine Your Query" feature7 that
essentially allows users to select one of a few topics to narrow a search. Similar experiments are also being conducted by Microsoft [33]. We believe that this might draw more attention to Web document
clustering in the future.
Acknowledgments
The authors thank Martin Funk, Donatella Occorsio, and Maria Grazia Russo for the insightful discussions during the early stages of this research. Thanks also go to Donatello Santoro for his excellent work
in the implementation of the desktop search engine.
References
[1] P. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, in: ACM SIGIR, 1997, pp. 314–323.
[2] Ask.com Search Engine. http://search.ask.com.
[3] T. Berners-Lee, O. Lassila, J. Hendler, The semantic web, Scientific American 284 (5) (2001) 34–43.
7 http://www.google.com/help/features.html#refine.
[4] M.W. Berry, Large-scale sparse singular value computations, The International Journal of Supercomputer Applications 6 (1) (1992) 13–49.
[5] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30 (1–7) (1998) 107–117.
[6] D. Calvetti, L. Reichel, D.C. Sorensen, An implicitly restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transactions on Numerical Analysis 2 (1994) 1–21.
[7] C. Chekuri, P. Raghavan, Web search using automatic classication, in: Proceedings of the World Wide Web Conference,
1997.
[8] E. Chisholm, T.G. Kolda, New term weighting formulas for the vector space method in information retrieval, Technical Report No. ORNL/TM-13756, Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.
[9] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (6) (1990) 391–407.
[10] Open Directory Project (DMOZ). http://www.dmoz.org.
[11] P. Ferragina, A. Gulli, A personalized search engine based on web snippet hierarchical clustering, in: Proceedings of the World Wide
Web Conference, 2005.
[12] Grokker Search Engine. http://www.grokker.com.
[13] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the World Wide Web Conference, 2003.
[14] M.A. Hearst, J.O. Pedersen, Re-examining the cluster hypothesis: Scatter/gather on retrieval results, in: Proceedings of the ACM
SIGIR Conference, 1996.
[15] iBoogie Search Engine. http://www.iboogie.com.
[16] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 265–323.
[17] Kartoo Search Engine. http://www.kartoo.com.
[18] K. Kummamuru, R. Lotlikar, S. Roy, K. Singal, R. Krishnapuram, A hierarchical monothetic document clustering algorithm for summarization and browsing search results, in: Proceedings of the World Wide Web Conference, 2004, pp. 658–665.
[19] Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral document clustering for web applications, IBM Research Report RJ
10186, IBM, 2000.
[20] A. Ntoulas, G. Chao, J. Cho, The Infocious web search engine: improving web searching through linguistic analysis, in: Proceedings of the World Wide Web Conference, 2005, pp. 840–849.
[21] Web 2.0 Exclusive Demonstration of Clustering from Google. http://www.searchenginelowdown.com/2004/10/web-20-exclusivedemonstration-of.html.
[22] S. Osinski, Dimensionality reduction techniques for search result clustering, Master's thesis, Department of Computer Science, University of Sheffield, 2004.
[23] S. Osinski, J. Stefanowski, D. Weiss, Lingo: Search results clustering algorithm based on singular value decomposition, in:
Proceedings of the International Conference on Intelligent Information Systems (IIPWM), 2004.
[24] S. Osinski, D. Weiss, Conceptual clustering using the Lingo algorithm: evaluation on open directory project data, in: Proceedings of the International Conference on Intelligent Information Systems (IIPWM), 2004, pp. 369–377.
[25] S. Osinski, D. Weiss, A concept-driven algorithm for clustering search results, IEEE Intelligent Systems 20 (3) (2005) 48–54.
[26] C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis, Journal of Computer and System Sciences 61 (2) (2000) 217–235.
[27] R.C. Prim, Shortest connection networks and some generalisations, Bell Systems Technical Journal 36 (1957) 1389–1401.
[28] H. Schütze, C. Silverstein, Projections for efficient document clustering, in: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1997.
[29] Google Desktop Search. http://desktop.google.com.
[30] Noodles Project Web Site. http://www.db.unibas.it/projects/noodles.
[31] The Scatter-Gather Project Web Site. http://www.sims.berkeley.edu/~hearst/sg-overview.html.
[32] D.C. Sorensen, Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations, Technical Report TR-96-40, Department of Computational and Applied Mathematics, Rice University, 1996.
[33] SRC Search Engine. http://rwsm.directtaps.net/.
[34] C.J. van Rijsbergen, Information Retrieval, second ed., Butterworth, London, 1979.
[35] Vivisimo Search Engine. http://www.vivisimo.com.
[36] WiseNut Search Engine. http://www.wisenut.com.
[37] C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers C-20 (1971) 68–86.
[38] O. Zamir, O. Etzioni, Web document clustering: A feasibility demonstration, in: Proceedings of the ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR), 1998.
[39] O. Zamir, O. Etzioni, Grouper: a dynamic clustering interface for web search results, Computer Networks 31 (11–16) (1999) 1361–1374.
[40] H.J. Zeng, Q.C. He, Z. Chen, W.Y. Ma, J. Ma, Learning to cluster web search results, in: Proceedings of the ACM SIGIR
Conference, 2004.
[41] D. Zhang, Y. Dong, Semantic, hierarchical, online clustering of web search results, in: Proceedings of the Asia-Pacific Web Conference, 2004.
Salvatore Raunich is a research assistant at Università della Basilicata. He graduated in Computer Science at
Università della Basilicata in 2003, and received a master's degree in Computer Science in 2005. His research interests
include Web clustering and data integration.
Alessandro Pappalardo holds a master's degree in Computer Science from Università della Basilicata. He graduated in
Computer Science at Università della Basilicata in 2004, and received his master's in 2006. His research interests
include Web clustering and data integration.