0% found this document useful (0 votes)

51 views

Cluster Search Result

The document presents a new algorithm for clustering search results called dynamic SVD clustering (DSC). Unlike other clustering systems that analyze snippets, DSC uses latent semantic indexing on full document content to project documents into a concept space. It then uses an incremental SVD computation and novel strategy to dynamically determine the optimal number of singular values k to use, improving on fixed k approaches. Experimental results show DSC has very good classification accuracy, outperforming other methods, and remains feasible for clustering large document sets.

Uploaded by

Bama Raja Segaran

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

51 views

Cluster Search Result

Uploaded by

Bama Raja Segaran

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

Data & Knowledge Engineering 62 (2007) 504522

www.elsevier.com/locate/datak

A new algorithm for clustering search results

Giansalvatore Mecca *, Salvatore Raunich, Alessandro Pappalardo
Dipartimento di Matematica e Informatica, Universita` della Basilicata, viale dellAteneo Lucano, 10, 85100 Potenza, Italy
Received 5 June 2006; received in revised form 15 September 2006; accepted 16 October 2006
Available online 13 November 2006

Abstract
We develop a new algorithm for clustering search results. Dierently from many other clustering systems that have been
recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets,
but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy
called dynamic SVD clustering to discover the optimal number of singular values to be used for clustering purposes. Moreover, the algorithm is such that the SVD computation step has in practice good performance, which makes it feasible to perform clustering when term vectors are available. We show that the algorithm has very good classication performance, and
that it can be eectively used to cluster results of a search engine to make them easier to browse by users. The algorithm has
being integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.
2006 Elsevier B.V. All rights reserved.
Keywords: Web; Search engines; Clustering; Latent semantic indexing; Document classication

1. Introduction and motivations

Web search engines like Google [5] can nowadays be considered as a cornerstone service for any Internet
user. The keyword-based, boolean search style used by these engines has rapidly permeated user habits, to
such an extent that it is now extending to other classes of applications, for example desktop search [29].
A key factor in the success of Web search engines is their ability to rapidly nd good quality results to queries that are based on rather specic terms, like Java Server Faces or Juventus Football Club. On the other
side, however, Google-like search services usually fall short when asked to answer much broader queries
those that are often posed by less-experienced users like, for example, to nd documents about the term
power or the term amazon. The poor quality of results in these cases is mainly due to two dierent factors:
(a) polysemy and/or synonymity in search terms (b) excessively high number of results returned to the user. As
a consequence, less skilled users are often frustrated in their research eorts.
The so-called Semantic Web [3] promises to solve most of these problems by adding semantics to Web
resources. Although there have been some proposals in the literature towards a semantic Web search engine
*

Corresponding author. Tel.: +39 0971 205855.

E-mail address: [email protected] (G. Mecca).

0169-023X/$ - see front matter 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.datak.2006.10.006

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

505

[13], the transition from the current syntactic Web to the next generation Web appears to be a slow process
and these proposals can hardly be considered as a ready-made solution to todays search problems.
At the moment, the most promising improvement over traditional search techniques are the so-called clustering engines. The idea of clustering search results is not new, and has been investigated quite deeply in information retrieval, based on the so called cluster hypothesis [34] according to which clustering may be benecial
to users of an information retrieval system since it is likely that results that are relevant to the user are close to
each other in the document space, and therefore tend to fall into relatively few clusters.
Several commercial clustering engines have recently emerged on the market. Well known examples are Vivisimo [35] and Grokker [12]. Although the underlying clustering techniques are not fully disclosed to the public,
based on the available documentation it is possible to say that these systems share a number of common features with research systems introduced in literature, mainly the Grouper system [38,39] and Lingo/Carrot
Search [23,25]. We summarize these features in the following.
First, these tools are usually not search engines by themselves. On the contrary, when a user poses a query,
the clustering engine uses one or more traditional search engines to gather a number of results; then, it does a
form of post-processing on these results in order to cluster them into meaningful groups. The cluster tree is
then presented to the user so that s/he can browse it in order to explore the result set. Fig. 1 shows the clusters
produced by Vivisimo for the query term power. It can be seen that such a technique may be helpful to
users, since they can quickly grasp the dierent meanings and articulations of the search terms, and more easily select a subset of relevant clusters.
Being based on a post-processing step, all of these clustering engines work by analyzing snippets, i.e., short
document abstracts returned by the search engine, usually containing words around query term occurrences.
The reason for this is performance: each snippet contains from 0 to 40 words, and therefore can be analyzed
very quickly, so that users do not experience excessive delays due to the clustering step. However, snippets are
often hardly representative of the whole document content, and this may in some cases seriously worsen the
quality of the clusters.
1.1. Contributions of the paper
This research aims at reconsidering some of the techniques for clustering search results. More specically,
we want to investigate the trade-o between performance and quality of the clustering when choosing snippets

Fig. 1. Search results clustered by Vivisimo.

506

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

versus whole documents. This is particularly relevant if we consider that in some emerging contexts, like for
example, desktop search engines, (a) it is reasonable to assume that the document-term vectors are available to
the clustering engine; (b) snippets may not be available at all, based on the dierent nature of documents.
The main goal of the paper is to evaluate the quality of the clustering in terms of its ability to correctly
classify documents, i.e., to dynamically build a bunch of clusters that correctly reect the dierent categories
in the document collection returned by the search engine. Similarly to [7], we believe that this ability may assist
less-skilled users in browsing the document set and nding relevant results.
The main contributions of the paper can be summarized as follows:
we develop a new document clustering algorithm called dynamic SVD clustering (DSC), based on latent
semantic indexing [9]; the novelty of the algorithm is twofold: (a) rst, it is based on an incremental computation of singular values, and does not require to compute the whole SVD of the original matrix; (b) second, it uses an original strategy to select k, i.e., the number of singular values used to represent the
concepts in the document space; dierently from other proposals in the literature, like for example [28
or 23], our strategy does not assume a xed value of k, neither a xed approximation threshold;
based on experimental results, we show that the algorithm has very good classication power; in many cases
it is able to cluster pre-classied documents collections with 100% accuracy; it is worth noting that the quality of the classication severely degrades when snippets are used in place of the whole document content,
thus providing further evidence that snippets are often too poor and not suciently informative; by comparing our results to those of other proposals in the literature, we show that our algorithm has comparatively better classication performance;
nally, we show that the complexity of the chosen SVD computation strategy is such that the algorithm has
in practice good performance, and lends to a very natural clustering strategy based on the minimum spanning tree of the projected document space.
To the best of our knowledge, this is the rst paper to propose a dynamic strategy to discover the optimal
number of singular values to be used in a classication task. This strategy represents the main contribution of
this paper.
The paper is organized as follows. Section 2 introduces a number of preliminary denitions and discusses
some of the techniques that will be used in the rest of the paper. Section 3 is devoted to the description of the
clustering algorithm and Section 4 discusses implementation and experimental results. Related works are in
Section 5.
2. Preliminaries
This section introduces a number of techniques that will be used in the rest of the paper.
2.1. Document indexing
As it is common in information retrieval, we represent documents as vectors in a multidimensional term
space [34]. Documents are rst preprocessed to remove stop words. Terms may be stemmed. There are several
weighting schemes [34,8] that can be used to construct document vectors. Generally speaking, the weight wij of
term ti in document Dj, i.e., the ith coordinate of vector j, is given by the product of three dierent factors:
wij Lij Gi N j
where Lij is the local weight of term i in document j, Gi is the global weight of term i in the document collection,
and Nj is the normalization factor for document j.
We have experimented several weighting schemes, as follows. Let us call fij the frequency of term i in document j, Fi the global frequency of term i in the whole document collection, ni the number of documents in
which term i appears, N the total number of documents, and m the size of a document vector vj.
The following table summarizes the three forms of local weight that have been considered in the paper (see
Table 1).

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

507

Table 1
Local weights
Id

Formula

Description

FREQ

Lij = fij

1 logfij if f ij > 0
Lij
0
if f ij 0
p

1 fij 0:5 if f ij > 0
Lij
0
if f ij 0

Frequency of term i in document j

LOGA
SQRT

Augmented logarithmic local weight

Augmented square root local weight

Table 2
Global weights
Id

Formula

Description

NONE
IGFF

Gi = 1
Gi Fnii

IDFB

Gi log

All terms have the same global weight

Inverted global frequency (a term has higher global weight if its total number of occurrences exceeds the
number of documents in which it appears)
Inverse document frequency (a term has global weight 0 if it appears in every document)

N
ni

Table 3
Normalization weights
Id

Formula

Description

NONE
COSN

Nj = 1
1

N j pP
m
2

All documents have the same normalization weight

Cosine normalization factor

Gi Lij

With respect to global weights, we have considered the following (see Table 2).
Finally, with respect to normalization factors, we have considered (see Table 3).
Note that, when using the cosine normalization factor, all term vectors have length 1.
In order to compare distances and similarities between vectors, we use cosine similarity, that is, we compute
the similarity s(vi,vj) between vectors vi and vj as the cosine of their angle h, as follows:
vi vj
svi ; vj cosh
kvi kkvj k
here vi vj is the dot product of the two vectors, and kvikkvjk is the product of their norms. Since all vectors
have length 1, we have that s(vi,vj) = cos(h) = vi vj.
2.2. Latent semantic indexing
Latent semantic indexing (LSI) [9] is a document projection technique based on singular value decomposition
(SVD). Suppose we are given d documents and t terms; let us represent each document as a term vector of t
coordinates. We call A the t d matrix whose columns are the term vectors; let r be the rank of A. SVD
decomposes matrix A into the product of three new matrices, as follows:
r
X
A U RV T
ri ui vTi
i0

where R is a diagonal matrix of size r r made of the singular values of A in decreasing order: R = diag(r1,r2, . . . rr),r1 P r2 P . . . P rr > 0, U and V are of size t r and d r respectively; vectors ui are called
the left singular vectors and vTi are the right singular vectors of A.
A singular value and its left and right singular vectors are also called a singular triplet. In order to obtain an
approximation of A let us x k 6 r and call Rk = diag(r1,r2, ,rk) , i.e., the k k head minor of R . Similarly,
let us call Uk the restriction of U to the rst k left singular vectors; similarly for V Tk ; we dene:

508

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

Ak U k Rk V Tk
It is possible to prove [9] that Ak is the best k-rank approximation of the original matrix A. Informally
speaking, by appropriately choosing a value for k we are selecting the largest singular values of the original
matrix, and ignoring the smallest ones, therefore somehow preserving the main features of the original vector space while ltering out some noise. It can be seen that the smaller is k with respect to r, the less the
new space resembles the original one. In traditional information retrieval applications, in which LSI is used
as a means to improve retrieval performance, it is crucial that the essential topological properties of the
vector space are preserved; as a consequence, the value of k is usually quite high (empirical studies [28]
show that for a typical information retrieval application a value between 150 and 300 is usually the best
choice).
It is known [4] that the computation of singular triplets of a matrix A may be reduced to the computation of
eigenvalues and eigenvectors of matrix AAT or ATA . Therefore, in order to discuss the complexity of SVD, we
shall refer to the complexity of the eigenvalue problem. In particular, there are several algorithms for computing eigenvalues and eigenvectors of a matrix. One example of this are the implicitly restarted Arnoldi/Lanczos
methods [32].1 This family of methods is particularly relevant with respect to this work, since it can be implemented in an incremental way, that is, to obtain the rst k eigenvalues one by one. It has also very interesting
computational complexity; more specically, it has been shown that it can be used to compute the rst k singular triplets in time 2k2n + O(k3) using storage 2kn + O(k3) [6].
3. Clustering algorithm
In this section we introduce the clustering algorithm used by Noodles. We shall rst give some insight on
the main ideas behind the algorithm, and then elaborate on the technical details.
3.1. Intuition
Our algorithm is heavily based on Latent Semantic Indexing. LSI has a natural application in clustering
and classication applications. It is often said that, by means of SVD, LSI does perform a transformation
of the original vector space in which each coordinate is a term into a new vector space, in which each coordinate is some relevant concept in the document collection, i.e., some topic that is common to several documents. Note that the coordinates of the original documents in this space are given by matrix VkRk . We shall
call this d k space the projected document space.
In this respect, a possible approach to clustering would be the following: (a) compute SVD over the original
matrix, for some value k, to obtain a representation of the original documents in the new concept space,
VkRk, in which each coordinate represents some topic in the original document collection, and therefore
some cluster of documents; (b) run a clustering algorithm in this space to cluster documents with respect to
their topics. This method has been used for clustering purposes for example in [23].
A critical step, in this approach, is the selection of k. A natural intuition suggests that, assuming
the document collection contains x hidden clusters, the natural value of k to be used for SVD is exactly
x. This is a consequence of the fact that one of the property of SVD is that of producing in VkRk an
optimal alignment of the original documents along the k axes [26]. However, such a value is unknown
and must be discovered by the clustering engine. There are a few interesting observations with respect
to this point:
rst, such a value of k can be signicantly lower that the number of documents, d, and the number of terms,
t, since it is unlikely that there are more than a dozen relevant clusters among the results of a search; this
means that the projected document space does not preserve much of the features of the original space; this
is, however, not a concern, since in our case we are using this space only as a means to discover clusters, and
not for retrieval purposes;
1

One of these methods is also used in the well known Matlab toolkit to compute eigenvalues/eigenvectors.

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

509

it is therefore apparent that xing the value of k, for example assuming k in the order of 150300, as it is
often done in the literature, would not give good classication performance; the fact that lower values of k
are to be preferred in clustering applications is conrmed for example in [28]; nevertheless, in the latter
work xed values of k = 20 and k = 50 are compared, whereas the optimal value should ideally be discovered dynamically;
in light of these considerations, we also discard the other typical approach used to establish the value of k,
i.e., that of xing a threshold for the approximation that Ak gives of the original space A, as it is done, for
example, in [23].
In fact, the strategy used to dynamically select the optimal value of k is one of the main originality of our
algorithm.
The main intuition behind the selection of k is that assuming the original collection contains x clusters,
and that each cluster informally corresponds to some clearly dened topic then the points in the projected
document space should naturally fall into x clusters that are reasonably compact and well separated.
To see this, consider for example Fig. 2, which refers to one of our test document collection, corresponding
to search results for the query term life. We have identied four natural clusters in the collection, namely:
(a) documents about biological life; (b) documents about life cycles of a system or a living entity; (c) documents about social life and related events; (d) documents about the game of life. The gure shows the minimum spanning tree of document distances in space V4R4. Edges are drawn dierently based on their length;
more specically, the longest ones are drawn as a double line, the shortest ones as a dashed line. It can be seen
that, on the one side SVD has clearly brought documents belonging to the same cluster close to each other,
and on the other side that an accurate clustering can be obtained simply by removing the k 1 longest edges,
that is, by applying a variant of a spanning-tree based clustering algorithm [37,16].
Based on this observation, our algorithm can be informally sketched as follows:
given the original term-document space A we incrementally compute SVD, starting with a low number of
singular values, and, for each value of k , generate the projected space VkRk;
we nd the minimum spanning tree of points in the projected space; assuming we have found the optimal
value for k, we stop the algorithm and obtain our clusters simply by removing the k 1 longest edges from
the minimum spanning tree, and considering each connected component as a cluster;
to discover such optimal value of k, we dene a quality function for the clusters obtained from the minimum spanning tree; intuitively, this function rates higher those trees in which the k 1 longest edges are
signicantly longer than the average (details are given below); based on this, we stop as soon as we nd a
local maximum of the quality function; this is a hint that we have found a natural distribution of documents in the projected space, that would be worsened if we choose higher values of k, i.e., more clusters.

Fig. 2. Example of a minimum spanning tree for the life example.

510

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

Fig. 3. Quality function for the life example.

A plot of the quality function for the life example discussed above is reported in Fig. 3. It can be
seen that the quality function has a maximum for k = 4, that is exactly the number of clusters we are
seeking in this example. It is also worth noting that such a strategy nicely solves a typical problem of
clustering algorithms, i.e., that of dynamically choosing the right number of clusters for a given dataset
[16].
The following section describes the clustering algorithm in detail.
3.2. Description of the dynamic SVD clustering algorithm
Let us rst formalize the quality function for minimum spanning trees. We x a value of k, and assume we
have computed the projected space VkRk by means of SVD. Let us call MSTk the minimum spanning tree of
document distances in the projected space. We call the k 1 longest edges in the tree e1, e2, . . . ek1,
respectively.
We nd the average avgk and the standard deviation rk of the distance distribution in the projected
space. Based on this, we assign to each edge e of length l(e) in the spanning tree a cost c(e) as
follows:
ce le avgk rk
Then, we dene the quality Q(MSTk) of MSTk as follows:
QMSTk k

k1
X

cei

Intuitively, the quality function is higher when edges that are candidate to be removed have lengths
that are signicantly above the average. Term k is necessary as a multiplicative factor since we have
observed that, by increasing the value of k, the distribution of distances in the space changes signicantly.
More specically, for higher values of k we have both higher average distance and standard deviation
i.e., the overall space is more disperse and therefore edges tend to bring a smaller contribution to the
quality function.
This said, we can formalize the DSC algorithm as follows. The input to the algorithm is a document collection D = {D0, D1, . . . Dd}. These may either be whole documents, or document summaries, such as snippets. We assume that documents have been previously indexed, i.e., for each document Di in D we have
build a term vector after performing stop-word removal. Note that we are not xing the weighting scheme
used to construct the vectors. In Section 4 we compare the impact of the dierent weighting schemes on the
clustering.

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

511

The algorithm proceeds as follows.

Step 1: Initialization of the Vector Space Given the term vectors associated to documents in D we build matrix A.
Step 2: Incremental SVD We incrementally compute SVD for matrix A using the Arnoldi/Lanczos implicitly
restarted method, starting with k = 2; for each value of k, we derive the projected space VkRk.
Step 3: Quality Evaluation based on Minimum Spanning Trees We build the minimum spanning tree MSTk of
the graph of distances in VkRk and calculate Q(MSTk) . We stop as soon as we nd a value of k such
that:
QMSTk1 6 QMSTk > QMSTk1
Step 4: Clustering Once we have xed k, we obtain the clusters by removing the k 1 longest edges from
MSTk and building a cluster for each connected component obtained after the removal. In order to
avoid the production of singleton clusters, all clusters containing less than two documents are merged
into a single outliers clusters.
Step 5: Cluster Labeling To give a meaningful label to each cluster, we analyze the collection of term vectors
associated with documents in the cluster, and select the n most frequent terms (currently 4). Those
terms will be used as a label for the cluster. The cluster of outliers is labeled as Other Documents.
Note that our labeling scheme is similar to that used by [14]. As an alternative labeling scheme, we might
easily perform phrase analysis on snippets, using for example the algorithm of [38]. However, we have decided
not to implement this solution in our system since our main focus is on the clustering algorithm.
3.3. Computational complexity
A crucial requirement for any clustering algorithm is eciency. This is particularly true in our case, since:
(a) document vectors tend to have very high cardinality, and this may in principle produce high overheads in
order to calculate distances during the clustering phase; (b) performing a full SVD on a large matrix is a
demanding computation.
Our algorithm solves these two problems by virtue of its incremental nature. To show this, let us comment
on the complexity of the various steps. Let us rst note that a critical assumption here is that document indexing has been performed in advance, i.e., that the clustering engine has access to term vectors for the retrieved
documents. We call t the number of terms occurring in the document collection after stop-word removal.
Then, we have the following upper bounds:
step 1 (initialization of the vector space) is linear in time with respect to t and produces a matrix of size
t d;
step 2 (incremental SVD) is by far the most delicate step in terms of complexity; as discussed above, its time
cost is O(k3); therefore, the overall time complexity is strongly related to the value of k; however, note that
in clustering applications k is usually quite low typically less than 10 and this should guarantee good
performance in practical cases;
to evaluate the cost of step 3 (quality evaluation based on minimum spanning trees), please note that the projected space has size d k, with k t; as a consequence, the lower is the value of k, the faster we can compute distances between points in the space; in particular, each distance requires exactly k products; in order
to build the minimum spanning tree, we rst need to construct the complete graph of document distances in
space VkRk; this requires to compute d(d 1)/2 distances; then, to build the minimum spanning tree on this
graph, we use the classical Prims algorithm [27]; Prims algorithm implemented using a Fibonacci heap on
a graph of E edges and V nodes has O(E + Vlog(V)) time complexity; contextually, we can also nd the
k 1 longest edges in the space, and evaluate function Q();
step 4 (clustering) is linear in d, since it requires to visit the minimum spanning tree and nd connected components after removing the longest edges.
We will elaborate further on this in Section 4, where we report a study of computing times in practical
experiments. For the moment, let us note how this analysis suggests that the SVD computation and the

512

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

Fig. 4. A snapshot of the noodles desktop search engine.

derivation of the minimum spanning tree are likely to have the highest impact in terms of running time. In
practice, however, SVD computation is by far the bottleneck in terms of eciency.
4. Implementation and experiments
The clustering algorithm has been implemented in the Noodles desktop search engine [30], a snapshot of
which is shown in Fig. 4. The system is written in Java, using Apache Lucene2 as an indexing engine, and
the Spring Rich Client Platform3 as a desktop application framework. It has been conceived as a general-purpose search tool, that can run both Web and desktop searches. In order to perform desktop searches, it incorporates a crawler module that runs as a daemon and incrementally indexes portions of the local disks that
have been specied by the user.
It also incorporates a testbed for the clustering engine, which we have used to conduct a number of experiments. Overall, we have run several hundreds of experiments, combining dierent datasets, dierent summarization schemes, dierent weighting schemes, dierent criteria for the selection of k in SVD, and dierent
clustering algorithms. The most interesting experimental evidences are described in this section.
4.1. Description of the datasets
We concentrated on two dierent categories of datasets. To construct datasets of the rst category, we ran
queries on Google and selected a number of the top-ranked search results. Those results were manually classied into a number of clusters. Then, the algorithm was run on the document collection to compare the suggested clusters with those identied manually. Although this kind of experiments closely mimic the situation in
which a user runs a query on a search engine and then inspects the results browsing the clusters produced by
the clustering engine, the manual classication step necessary to assess the quality of the clustering tends to
be labor-intensive and quite error-prone. As a consequence, document collections of this rst category tend to
be quite small.
2
3

http://lucene.apache.org.
http://www.springframework.org.

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

513

In order to test the classication power of the algorithm on larger document collections, we implemented a
data fetching utility for the Open Directory Project (DMOZ) [10]. This utility was used in order to sample
categories inside DMOZ, and then run the clustering algorithm on the pre-categorized samples. In this case,
the clusters produced by the algorithm were compared with the original DMOZ categories.
Overall, we selected 12 dierent datasets, ranging from 16 to about 100 documents; three datasets are of the
rst category, and nine of the second category. The datasets are described in Table 4. For each dataset, we
have worked both on the full document and on a snippet.
4.2. Quality measures
In order to assess the clusters produced by the algorithm, we have computed several quality measures.

Table 4
List of datasets
Dataset

Description

docs

clust

Clusters

Amazon

Search results for

keywordamazon
Search results for keyword
life
Search results for keyword
power
Sampling of DMOZ
categories

amazon.com, Amazon forest, Amazons in mythology

Biological life, cycle of life, game of life, social life

DMOZsamples4A

Sampling of DMOZ
categories

DMOZsamples4B

Sampling of DMOZ
categories

DMOZsamples4C

Sampling of DMOZ
categories

DMOZsamples5A

Sampling of DMOZ
categories

DMOZsamples5B

Sampling of DMOZ
categories

DMOZsamples6

Sampling of DMOZ
categories

DMOZsamples7

Sampling of DMOZ
categories

DMOZsamples8

Sampling of DMOZ
categories

Brain power, energy power, military power, numerical

power
top/shopping/jewelry/diamonds, top/shopping/
antiques_and_collectibles/coins,top/shopping/vehicles/
motorcycles
top/arts/literature/myths_and_folktales/myths/greek, top/
computers/virtual_reality, top/recreation/food,top/
shopping/jewelry/diamonds
top/news/weather, top/regional/caribbean/bahamas/
travel_and_tourism,top/shopping/
antiques_and_collectibles/coins, top/shopping/jewelry/
diamonds
top/sports/basketball/professional/nba/
san_antonio_spurs, top/sports/events/olympics/baseball,
top/sports/volleyball,top/sports/water_sports/surng
top/computers/algorithms, top/games/video_games/
adventure, top/recreation/pets/dogs,top/science/math/
number_theory, top/sports/volleyball
top/arts/animation/studios/disney, top/health/medicine/
medical_specialties/neurology,top/home/gardening/
gardens/public, top/shopping/antiques_and_collectibles/
coins,top/sports/water_sports/surng
top/business/transportation_and_logistics, top/regional/
caribbean/bahamas/travel_and_tourism,top/science/
astronomy, top/shopping/vehicles/motorcycles, top/
society/religion_and_spirituality/yoga/practices,top/
sports/events/olympics/baseball
top/arts/movies/history, top/games/video_games/
adventure, top/recreation/pets/dogs,top/shopping/
antiques_and_collectibles/coins, top/shopping/jewelry/
diamonds,top/sports/basketball/professional/nba/
san_antonio_spurs, top/sports/volleyball
top/arts/animation/studios/disney, top/arts/literature/
myths_and_folktales/myths/greek,top/computers/
virtual_reality, top/health/medicine/medical_specialties/
neurology,top/home/gardening/gardens/public, top/
recreation/food, top/shopping/antiques_and_collectibles/
coins,top/sports/water_sports/surng

Life
Power
DMOZsamples3

514

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

As a primary measure of quality we use the F-measure [34], i.e., the harmonic means of precision and recall.
More specically, given a cluster of documents C , to evaluate his quality with respect to an ideal cluster Ci we
rst compute precision and recall as usual:
PrC; C i

jC \ C i j
jCj

RecC; C i

jC \ C i j
jC i j

Then, we dene:
F C; C i

2PrC; C i RecC; C i
PrC; C i RecC; C i

Given a collection of clusters, {C1, C2, . . . Ck}, to evaluate its F-measure with respect to a collection of ideal
clusters fC i1 ; C i2 ; . . . C ih g we do as follows: (a) we nd for each ideal cluster C in a distinct cluster Cm that best
approximates it in the collection to evaluate, and evaluate F C m ; C in ; (b) then, we take as an F-measure of
the collection the average value of the F-measures over all ideal clusters.
To gain an in-depth understanding of the classication power of our technique, for each dataset we have
computed two dierent F-measures. First, we ran the clustering algorithm and stopped at the value of k chosen
by the algorithm; this experiment was used to compute a rst F-measure for the clustering algorithm. However, since we were also interested in evaluating the classication power of our SVD-based technique independently of the stop criterion, we also ran a second experiment, in which we computed clusterings for all values
of k, from 2 to n, and took the maximum F-measure obtained by the algorithm. In the following, we shall refer
to this second measure as F-measure Max.
Other works in the literature have used dierent quality metrics to assess their experimental results. To
compare our results to these, we have also computed several other quality measures. More specically,
Grouper [38] introduced a customized quality measure, called Grouper Quality Function, which in essence
looks at all pair of documents in a single cluster and counts the number of true positive pairs (the two documents were also in the same ideal cluster), and false positive pairs (they were not in the ideal cluster); then, it
combines these two gures to calculate an overall quality function. Lingo [23] uses an alternative measure,
cluster contamination [22], which intuitively measures how much a cluster produced by the system mixes documents that should ideally belong to dierent classes; of course, 1 cluster contamination can be considered as
a measure of the purity of the clustering.
Beside F-measure and F-measure Max, we also computed the Grouper Quality Function and cluster contamination for our experiments.
4.3. Indexing scheme
We are now ready to discuss a number of experimental evidences. As a rst step, we dealt with the problem of selecting the correct weighting scheme. We ran several preliminary experiments, both on full documents and on snippets, with and without SVD, comparing six weighting schemes obtained by the
combination of dierent local and global weights. In all cases we used cosine normalization, since we soon
noticed that results strongly degrade in absence of normalization. Fig. 5 shows average F-measures for the
six weighting schemes.
It can be seen that the best results were obtained using the augmented logarithmic local weight and no global weight. Therefore, we adopted this weighting scheme for the rest of our experiments. This phenomenon
seems to be related to the adoption of LSI. In essence, global weights are often used in information retrieval
as a means to reduce noise in the document space; to give an example, one typical consequence of the adoption
of a inverse document frequency global weight is that of assigning weight 0 to all terms that appear in every
document. However, similar eects are also obtained by applying SVD. It seems that SVD has indeed superior
noise-reduction power in clustering applications, so that global weights are no longer necessary. To conrm
this hypothesis, we performed several other experiments on document summaries. Being Web pages, most of
our documents typically contain some header e.g., banner and navigation links and footer material. We
ran the clustering algorithm after removing such sections, which in most cases are rather uninformative; in
essence, we extracted what, in information extraction terms, is called a rich region from the pages. Also in this

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

515

Fig. 5. Average F-measure for dierent weighting schemes.

case, no signicant improvement in F-measures was observed (around 2%), thus conrming the intuition that
SVD does a very good job in ltering out this noisy portions of the documents.
A second preliminary evaluation was related to the adoption of stemming after stop-word removal in the
document indexing phase. We compared F-measures obtained in the 12 experiments after stemming to those
without stemming. As it can be seen in Fig. 6, there is no signicant improvement in quality related to the
adoption of stemming. However, we noticed that stemming on average reduced the number of terms by
23%, therefore making matrices smaller and computation more ecient. As a consequence, we decided to consistently perform stemming.
Fig. 6 also shows a comparison of F-measure and maximum F-measure obtained by the algorithm on full
documents and on document snippets. It can be easily seen that performance degrades of more than 40% when
analyzing snippets. This seems to conrm one of our initial assumptions, i.e. that snippets are often too short
to represent a meaningful summary of the document, and that the analysis of the whole document content is
very important in order to improve the quality of the classication. It is worth noting that this is somehow in
contrast with other works, for example [38], according to which there is no signicant degradation when going
from documents to snippets. Such a big discrepancy might be justied by the completely dierent nature of our
approach with respect to theirs.

Fig. 6. Stemming vs. no stemming and full documents vs. snippets.

516

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

4.4. Quality of the algorithm

We are now ready to discuss the quality of the algorithm. Quality gures are reported in detail in Table 5
and in Figs. 7 and 8. Let us rst notice that the algorithm has shown excellent classication performance. In
fact, the maximum F-measure was very close to 100% in all the experiments.
If we consider the actual clustering produced by the algorithm, we can draw several conclusions. First,
the number of generated clusters was in all cases very close to the ideal one; note that in the table, 4 + 1
means that the algorithm produced 4 labeled clusters plus a cluster of Other documents containing outliers, i.e., documents that were considered somehow distant from other clusters. In essence, as it can be
seen in Table 5, the algorithm was able to choose the right value of k for most datasets; in the three cases
in which the number of clusters generated by the system was higher than the ideal one (DMOZSamples3,
DMOZSamples4C, DMOZSamples7) one or two additional small clusters were produced as a result of
splitting larger ones. Note that results on similar experiments reported in [24] with respect to Lingo
and Grouper seem to suggest a tendency by their algorithms to over classify DMOZ categories; in these
experiments, input documents drawn from 2 to 10 DMOZ categories produced 2436 clusters. Our system
does not show this kind of behavior, and is able to reproduce quite accurately the original DMOZ
categories.
A second observation is related to the quality of the clusterings that were produced by the algorithm.
First, F-measures are on average very high, well above 90%. Moreover, the other quality measures are in
both cases higher with respect to those reported by [38,22]. For example, papers about Grouper report that
average quality has a maximum for document collections with very few ideal clusters; this maximum is
below 80%; then, by increasing the number of clusters, the quality measure tends to decrease. Similarly,
Lingo reports cluster contamination on average at 25%, whereas in our case we had values below 5% on
average. These results further support the idea that clustering based on snippets has inherently lower quality
than on full documents, and therefore that in all cases in which the full document is available, it should be
used.
4.5. Computing times
On average, computation times in our experiments were of a few seconds. Also, we developed our user
interface in such a way to minimize the latency associated with the display of clusters. More specically,
when a user runs a query, the system very quickly returns the ranked search results. At the same time, it
starts to cluster top-ranked results on a background thread, and shows a Clustering button on the screen.
If the user wants to see the clusters, s/he has to select this button. Considering typical user reaction times,

Table 5
Quality measures
Collection

# of pre-classied
clusters

# of generated
clusters

F-measure
(%)

F-measure
Max (%)

Grouper
quality (%)

1-Cluster
contamination (%)

Life
Amazon
Power
DMOZsamples3
DMOZsamples4A
DMOZsamples4B
DMOZsamples4C
DMOZsamples5A
DMOZsamples5B
DMOZsamples6
DMOZsamples7
DMOZsamples8
Average

4
3
4
3
4
4
4
5
5
6
7
8

4
3+1
4
4+1
4
4+1
5
5+1
5
6+1
9
7+1

100
90.67
89
91.7
100
98.25
96
97.6
100
77.5
92.4
77
92.5

100
100
91
100
100
100
96
98.6
100
99
92.4
89.37
97.2

100
100
53.4
100
100
100
100
100
100
72.5
95.1
62.5
90.3

100
100
80.8
100
100
100
100
100
100
92
96.3
84.2
96.1

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

517

Fig. 7. F-measure and F-measure Max.

Fig. 8. Other quality measures.

the net eect of such an organization of the user interface is that, in the user perception, the clustering is
almost immediate.
Fig. 9 reports computing times for the 12 experiments. More specically, for each experiment, we recorded
the number of seconds spent in the two most expensive steps, namely computing SVD and generating minimum spanning trees. The remaining steps have negligible running times. As it can be seen, the computation of
minimum spanning trees is not a real issue, since in all cases took less than 2 s (including the generation of
distance matrices in the projected spaces).
On the contrary, SVD computations are signicantly more expensive, thus conrming that this is the
most delicate task in the algorithm. On average, given the low values of k, our incremental SVD compu-

518

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

Fig. 9. Computing times.

tation strategy performed rather well. However, in three of our experiments with the largest documents, we
had computing times above 10 s. While these times may still be considered acceptable, we believe that performance was negatively inuenced by the implementation of SVD used in the system. After a comparative
analysis, we selected the COLT4 matrix manipulation package, one of the open source matrix packages
available in Java, and customized the SVD algorithm to t our needs. However, we noticed that COLT performance tends to severely degrade on large-size matrices. To conrm this impression, we compared its performance to that of Matlab and JScience,5 the newest math package for the Java language. We noticed that
Matlab implementation of the Arnoldi/Lanczos method is orders of magnitude faster than that of COLT.
We suspect that this might be a consequence of a poor implementation of the sparse-matrices routines in
COLT. To conrm this impressions, we ran a number of comparative tests using JScience; also in this case,
JScience which is based on a high performance library called Javolution6 showed signicantly better performance both in terms of computing time and heap usage with respect to COLT and other similar packages. Unfortunately, JScience currently does not oer support for SVD, although this might be added to
future version. Re-implementing Arnoldi/Lanczos methods in JScience was beyond the purpose of this
paper. Nevertheless, we believe that JScience might signicantly improve computing times once support
for SVD is implemented.
5. Related works
As we have discussed in Section 1, there are several commercial search engines that incorporate some form
of clustering. Besides Vivisimo [35] and Grokker [12], other examples are Ask.com [2], iBoogie [15], Kartoo
[17], and WiseNut [36].
In fact, the idea of clustering search results as a means to improve retrieval performance has been investigated quite deeply in Information Retrieval. A seminal work in this respect is the Scatter/Gather project
[14,31]. ScatterGather provides a simple graphical user interface to do clustering on a traditional information
retrieval system. After the user has posed her/his query, s/he can decide to scatter the results into a xed
number of clusters; then, s/he can gather the most promising clusters, possibly to scatter them again in
order to further rene the search. One limitation in ScatterGather is the fact that the system is not able to

4
5
6

http://dsd.lbl.gov/hoschek/colt.
http://www.jscience.org.
http://javolution.org.

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

519

infer the optimal number of clusters for a given query, and requires that this is specied by the user in advance.
This may in some cases have an adverse eect on the quality of the clustering.
Other traditional works on clustering along the same lines include [7,1, and 28]. In [7] the authors develop a
classication algorithm for Web documents, based on Yahoo! categories. The classication algorithm learns
representative terms for a number of pre-classied documents, and then tries to assign new documents to the
correct category. The reported recall values are in the range 6090%. More recently, the issue of classifying
Web document into a hierarchy of topics has been studied in [18].
The Paraphrase system [1] introduces latent semantic indexing as a means to cluster search results. The
paper focuses on the relationship among clusters produced by LSI and their labels, obtained by taking the
most representative words for each cluster. The latter technique is similar to the one we use for labeling our
clusters. Besides this, the authors follow a rather typical approach with respect to the selection of the number k of singular values. More specically, k is xed and equals 200; in fact, values in the range 100200
were considered as optimal in retrieval applications, competitive or even superior to term-based similarity
search.
In [28] the authors further explore the use of projection techniques for document clustering. Their results
show that LSI performs better than document truncation. Also in this case, the number k of singular values is
xed. The authors compare the quality of the clustering with k = 20, k = 50 and k = 150. An interesting point
is that, although k = 150 was considered a more typical value, the paper concludes that in clustering applications k = 20 gives better performance and clustering quality. This conclusion is coherent with our idea that the
optimal value of k for clustering documents is equal to the number of classes (or ideal clusters), and therefore
usually much lower than the number of documents.
The advent of Web search engines brought to a shift of perspective about the problem of clustering. The
requirements for eective clustering of Web documents are summarized in [19], with emphasis on performance. In fact, most of the recent clustering systems for the Web share a number of common features, namely:
they work in conjunction with one or more traditional search engines in order to run the queries submitted
by users and gather top results,
clustering is based on a form of post-processing of document snippets returned by the back-end search
engine,
the clustering algorithm is based on some form of common phrase extraction.
Grouper [38,39] is a snippet-based clustering engine based on the HuskySearch meta search engine. The
main feature of Grouper is the introduction of a phrase-analysis algorithm called STC (Sux Tree Clustering). In essence, the algorithm builds a sux tree of phrases in snippets; each representative phrase becomes
a candidate cluster; candidates with large overlap are merged together. The main contribution of Grouper
stands in the complexity of the clustering algorithm, which allows for very fast processing of large result
sets. However, due to inherent limitations of snippets, quality results are not always excellent. A key observation of the paper is that there is no signicant degradation in the quality of clusters when going from full
document to snippets. This experimental evidence for the STC algorithm is largely in contrast with our
experimental results on DSC.
Grouper has inspired a number of other proposals along the same lines. Two examples are SHOC [41]
and Lingo/Carrot Search [23,25]. Both these works extend the STC algorithm with the use of SVD in order
to lter some noise in the snippets and improve the quality of the produced clusters. SHOCs clustering is
based on two steps: during the rst step phrase analysis is used to generate a snippet-topic matrix in which
a topic is either a term or a phrase; then, as a second step, SVD is performed on the matrix in order to
identify the most relevant topics. A key dierence with our work is that the stop criterion for SVD is based
on a xed approximation threshold. No experimental results are reported in [41] to asses the quality of the
clustering algorithm.
Lingo/Carrot Search is similar in spirit, but it uses a dierent strategy. A primary concern is to produce
meaningful descriptions for the clusters. To do this, rst SVD is performed on a snippet-term matrix to identify a number of relevant topics. Also in this case, the selection of k is based on a xed approximation threshold specied by the user. Then, phrase analysis is done to identify, for each of the selected topics, a phrase that

520

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

represents a good description. Finally, documents are assigned to clusters based on the contained phrases.
Experimental results in terms of cluster contamination have been reported in [22].
Other works along the same lines are [40,11]. In all these cases variants of phrase-analysis algorithms are
developed in order to improve the quality of the original STC algorithm. As an example, in [40] the problem of
improving cluster labels is studied. A machine learning approach is used: given a corpus of training data, the
system is able to identify the most salient phrases and use them as cluster titles. The algorithm is therefore
supervised. It is worth noting that this work was developed by Microsoft Research Asia, and has been used
to provide an experimental clustering service based on MSN contents [33].
SnakeT [11] introduces advanced cluster labeling with variable-length sentences, based on the use of gapped sentences, i.e., sentences made of terms that may not appear contiguously in the original snippet. The
main focus of the system in on personalization: the authors show how to plug the clustering engine into a
search engine to obtain a form of personalization.
A dierent approach has been recently undertaken in the Infocious System [20]. Infocious is a fulledged search engine based on natural language processing techniques. Linguistic processing is built into
the search engine in order to provide a deeper form of processing of documents and search terms, and a
much richer search interface to users. The authors compare the classication power of a Naive Bayes
Classier to that of a classier enhanced with NLP techniques, and show that this may reduce error rate
of about 7%.
6. Conclusions
The paper has introduced a new algorithm for clustering the results of a search engine. With respect to snippet-based clustering algorithms, we have shown that, by considering the whole document content and employing appropriate SVD-based compression techniques, it is possible to achieve very good classication results,
signicantly better than those obtained by analyzing document snippets only.
We believe that these results represent promising directions to improve the quality of clustering in all context in which it is reasonable to assume that document vectors are available to the clustering system. One of
these cases is that of desktop search engines. This impression is conrmed by our experiences with a prototypical implementation of the algorithm in the Noodles desktop search engine, which showed how the proposed
approach represents a good compromise between quality and eciency.
Another possible context of application is that in which the clustering algorithm is integrated into the
search engine, and not run as a postprocessing step. In fact, Google has experienced for a while with forms
of clustering of their search results [21], and has recently introduced a Rene Your Query feature7 that
essentially allows to select one of a few topics to narrow a search. Similar experiments are also being conducted by Microsoft [33]. We believe that this might draw more attention in the future around Web document
clustering.
Acknowledgments
The authors thank Martin Funk, Donatella Occorsio, and Maria Grazia Russo for the insightful discussions during the early stages of this research. Thanks also go to Donatello Santoro for his excellent work
in the implementation of the desktop search engine.
References
[1] P. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, in: ACM SIGIR, 1997, pp.
314323.
[2] Ask.com Search Engine. http://search.ask.com.
[3] T. Berners-Lee, O. Lassila, J. Hendler, The semantic web, Scientic American 284 (5) (2001) 3443.

http://www.google.com/help/features.html#rene.

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

521

[4] M.W. Berry, Large-scale sparse singular value computations, The International Journal of Supercomputer Applications 6 (1) (1992)
1349.
[5] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30 (17)
(1998) 107117.
[6] D. Calvetti, L. Reichel, D.C. Sorensen, An implicitly restarted lanczos method for large symmetric eigenvalue problems, Electronic
Transactions on Numerical Analysis 2 (1994) 121.
[7] C. Chekuri, P. Raghavan, Web search using automatic classication, in: Proceedings of the World Wide Web Conference,
1997.
[8] E. Chisholm, T.G. Kolda, New term weighting formulas for the vector space method in information retrieval, Technical Report No.
ORNL/TM-13756, Computer Science and Mathematics Division Oak Ridge National Laboratory, 1999.
[9] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, Journal of the
American Society for Information Sciences 41 (6) (1990) 391407.
[10] Open Directory Project (DMOZ). http://www.dmoz.org.
[11] P. Ferragina, A. Gulli, A personalized search engine based on web snippet hierarchical clustering, in: Proceedings of the World Wide
Web Conference, 2005.
[12] Grokker Search Engine. http://www.grokker.com.
[13] R. Guha, R. Mc Cool, E. Miller, Semantic search, in: Proceedings of the World Wide Web Conference, 2003.
[14] M.A. Hearst, J.O. Pedersen, Re-examining the cluster hypothesis: Scatter/gather on retrieval results, in: Proceedings of the ACM
SIGIR Conference, 1996.
[15] iBoogie Search Engine. http://www.iboogie.com.
[16] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 265323.
[17] Kartoo Search Engine. http://www.kartoo.com.
[18] K. Kummamuru, R. Lotlikar, S. Roy, K. Singal, R. Krishnapuram, A hierarchical monothetic document clustering algorithm for
summarization and browsing search results, in: Proceedings of the World Wide Web Conference, 2004, pp. 658 665.
[19] Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral document clustering for web applications, IBM Research Report RJ
10186, IBM, 2000.
[20] A. Ntoulas, G. Chao, J. Cho, The infocious web search engine: Improving web searching through linguistic analysis, in: Proceedings
of the World Wide Web Conference, 2005, pp. 840849.
[21] Web 2.0 Exclusive Demonstration of Clustering from Google. http://www.searchenginelowdown.com/2004/10/web-20-exclusivedemonstration-of.html.
[22] S. Osinski, Dimensionality reduction techniques for search result clustering, Masters thesis, Department of Computer Science
University of Sheeld, 2004.
[23] S. Osinski, J. Stefanowski, D. Weiss, Lingo: Search results clustering algorithm based on singular value decomposition, in:
Proceedings of the International Conference on Intelligent Information Systems (IIPWM), 2004.
[24] S. Osinski, D. Weiss, Conceptual clustering using lingo algorithm: evaluation on open directory project data, in: Proceedings of the
International Conference on Intelligent Information Systems (IIPWM), 2004, pp. 369377.
[25] S. Osinski, D. Weiss, A concept-driven algorithm for clustering search results, IEEE Intelligent Systems 20 (3) (2005) 4854.
[26] C. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: a probabilistic analysis, Journal of Computer and
System Sciences 61 (2) (2000) 217235.
[27] R.C. Prim, Shortest connection networks and some generalisations, Bell Systems Technical Journal 36 (1957) 13891401.
[28] H. Schutze, C. Silverstein, Projections for ecient document clustering, in: Proceedings of the ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR), 1997.
[29] Google Desktop Search. http://desktop.google.com.
[30] Noodles Project Web Site. http://www.db.unibas.it/projects/noodles.
[31] The Scatter-Gather Project Web Site. http://www.sims.berkeley.edu/~hearst/sg-overview.html.
[32] D.C. Sorensen, Implicitly restarted arnoldi/lanczos methods for large scale eigenvalue calculations, Technical Report TR-96-40,
Department of Computational and Applied Mathematics Rice Universiry, 1996.
[33] SRC Search Engine. http://rwsm.directtaps.net/.
[34] C.J. van Rijsbergen, in: Information Retrieval, second ed., Butterworth, London, 1979.
[35] Vivisimo Search Engine. http://www.vivisimo.com.
[36] WiseNut Search Engine. http://www.wisenut.com.
[37] C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers C-20 (1971)
6886.
[38] O. Zamir, O. Etzioni, Web document clustering: A feasibility demonstration, in: Proceedings of the ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR), 1998.
[39] O. Zamir, O. Etzioni, Grouper: a dynamic clustering interface for web search results, Computer Networks 31 (1116) (1999) 1361
1374.
[40] H.J. Zeng, Q.C. He, Z. Chen, W.Y. Ma, J. Ma, Learning to cluster web search results, in: Proceedings of the ACM SIGIR
Conference, 2004.
[41] D. Zhang, Y. Dong, Semantic, hierarchical, online clustering of web search results, in: Proceedings of the Asia-Pacic Web
Conference, 2004.

522

G. Mecca et al. / Data & Knowledge Engineering 62 (2007) 504522

Giansalvatore Mecca is full professor of Computer Science at Universita della Basilicata. He graduated in
Computer Engineering in 1992 from Universita di Roma La Sapienza. In 1996 he received his PhD, also from
Universita di Roma La Sapienza. He is with Universita della Basilicata from 1995. His research interests
include information extraction, data management techniques for XML and Web data, and information extraction. He has also worked on cooperative database systems, string databases, deductive databases, and objectoriented databases.

Salvatore Raunich is a research assistant at Universita della Basilicata. He graduated in Computer Science at
Universita della Basilicata in 2003, and received a master in Computer Science in 2005. His research interests
include Web clustering and data integration.

Alessandro Pappalardo holds a master in Computer Science from Universita della Basilicata. He graduated in
Computer Science at Universita della Basilicata in 2004. He received his Master in 2006. His research interests
include Web clustering and data integration.