Chapter - 6 Part 1
Web IR is different from classical IR for two kinds of reasons: concepts and technologies. The
characteristics of the Web make the task of retrieving information from it quite different from
pre-Web (traditional) information retrieval.
Web IR Tools
Categories of Web IR tools:
Search Tools
Search tools employ robots for indexing Web documents. They feature a user interface
for specifying queries and browsing the results. At the heart of a search tool is the search
engine, which is responsible for searching the index to retrieve documents relevant to a
user query.
Search tools can be divided into two categories based on the transparency of the index to
the user.
✓ Class 1 search tools
✓ Class 2 search tools
Class 1 search tools: General-Purpose Search Engines: These tools completely hide the
organization and content of the index from the user. Example: AltaVista: www.altavista.com;
Excite: www.excite.com; Google: www.google.com; Lycos: www.lycos.com; Hotbot:
www.hotbot.com
Class 2 search tools: Subject Directories: These feature a hierarchically organized subject
catalog or directory of the Web, which is visible to users as they browse and search. Example:
Yahoo!: www.search.yahoo.com; WWW Virtual Library: http://vlib.org/; Galaxy:
https://usegalaxy.org/
Search Services
Search services provide users with a layer of abstraction over several search tools and
databases, and aim at simplifying Web search. Search services broadcast user queries to
several search engines and various other information sources simultaneously. Then they
merge the results returned by these sources, check for duplicates, and present them to
the user as an HTML page with clickable URLs.
Example: MetaCrawler: http://www.metacrawler.com/; Dogpile:
http://www.dogpile.com/
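As a rough illustration of this merging step, the Python sketch below combines two hypothetical result lists (the URLs and titles are made up), removes duplicate URLs, and returns a single combined list; it is not the implementation of any particular metasearch service.

```python
# Minimal sketch of a metasearch merge step: combine ranked result lists
# from several engines, drop duplicate URLs, and keep the earliest hit seen.
# The engines and results below are hypothetical.

def merge_results(result_lists):
    """result_lists: list of ranked lists, each a list of (url, title)."""
    seen = set()
    merged = []
    # Interleave the lists round-robin so every source contributes early hits.
    for rank in range(max(len(lst) for lst in result_lists)):
        for lst in result_lists:
            if rank < len(lst):
                url, title = lst[rank]
                if url not in seen:          # duplicate check
                    seen.add(url)
                    merged.append((url, title))
    return merged

engine_a = [("http://example.org/a", "Page A"), ("http://example.org/b", "Page B")]
engine_b = [("http://example.org/b", "Page B"), ("http://example.org/c", "Page C")]
print(merge_results([engine_a, engine_b]))
```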
Web IR Architecture (Search Engine Architecture)
(Figure: search engine architecture, showing the document repository and the index.)
Web crawlers follow links to find documents; they must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness).
The crawler and indexer assume that words contained in headings, metadata and the first few
sentences are likely to be more important in the context of the page, and that keywords in such prime
locations suggest that the page is really ‘about’ those keywords.
Text acquisition identifies and stores the documents for indexing: it crawls the Web, converts the
gathered information into a consistent format, and finally stores the result in a document
repository.
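A minimal sketch of this acquisition step, using only the Python standard library; the seed URL is a placeholder, and a real crawler would also respect robots.txt, apply rate limits, and revisit pages to maintain freshness.

```python
# Minimal breadth-first crawler sketch: fetch pages, extract links,
# and store raw text in a simple in-memory "document repository".
import re
import urllib.request
from collections import deque

def crawl(seed, max_pages=10):
    repository = {}                      # url -> raw HTML (document repository)
    frontier = deque([seed])             # URLs still to visit
    seen = {seed}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable pages
        repository[url] = html
        # Follow absolute links found in the page (coverage).
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# docs = crawl("http://example.org/")   # hypothetical seed URL
```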
Text Transformation
Once the text is acquired, the next step is to transform the captured documents into index
terms or features. This is a kind of preprocessing step which involves parsing, stop-word
removal, stemming, link analysis & information extraction as the sub-steps.
Parser: Parsing is the processing of the sequence of text tokens in the document to recognize
structural elements, e.g., titles, links, headings, etc.
Stopping: Commonly occurring words are unlikely to give useful information and may be
removed from the vocabulary to speed processing. Stop-word removal, or stopping, is a
process that removes common words such as “and”, “or”, “the”, “in”.
Stemming: Stemming is the process of removing suffixes from words to reduce them to a common stem;
it groups words that are derived from the same root. For e.g., “engineer”, “engineers”, “engineering”
and “engineered” all reduce to the stem “engineer”.
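The snippet below sketches stopping and a deliberately crude suffix-stripping stemmer; the stop list and suffix rules are illustrative assumptions, not the Porter algorithm used in practice.

```python
# Sketch of stopping and naive stemming. The stop list and suffix rules
# are illustrative only; real systems use Porter/Snowball stemmers.
STOP_WORDS = {"and", "or", "the", "in", "a", "of", "to"}
SUFFIXES = ["ing", "ed", "s"]            # checked longest-first below

def stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform(tokens):
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(transform(["the", "engineers", "engineered", "the", "engine"]))
# -> ['engineer', 'engineer', 'engine']
```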
Link Analysis: It makes use of links and anchor text in web pages and identifies popularity and
community information, for example, PageRank.
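As a concrete example of a link-analysis measure, here is a minimal PageRank computed by power iteration over a made-up three-page link graph; the damping factor of 0.85 is the commonly cited default.

```python
# Minimal PageRank via power iteration over a toy link graph.
# graph maps each page to the pages it links to; the graph is illustrative.
def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))               # C accumulates the most rank
```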
Information Extraction: This process identifies the classes of index terms that are important for some
applications. For e.g., named entity recognizers identify classes such as people, locations, companies,
dates, etc.
Classifier: It identifies class-related metadata for documents, i.e., it assigns labels to documents,
for example topic, reading level, sentiment or genre. The use of a classifier depends on the application.
Index Creation
Document statistics are gathered, such as counts of index-term occurrences, the positions in the
documents where the index terms occur, counts of occurrences over groups of documents, lengths
of documents in terms of the number of tokens, and other features commonly used in ranking
algorithms. The actual data gathered depends on the retrieval model and the associated
ranking method.
The index-term weights are then computed (for example, the tf.idf weight, which is a
combination of the term frequency in the document and the inverse document frequency in the
collection).
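A small worked example of one common tf.idf variant (raw term frequency times log-scaled inverse document frequency) over a toy collection; actual systems differ in the exact weighting formula used.

```python
# Sketch of tf.idf weighting: term frequency in a document combined with
# the inverse document frequency over the whole collection.
import math

docs = [["web", "search", "engine"],
        ["web", "crawler", "index"],
        ["ranking", "search", "results"]]

def tf_idf(term, doc, collection):
    tf = doc.count(term)                          # raw term frequency
    df = sum(1 for d in collection if term in d)  # document frequency
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

print(tf_idf("web", docs[0], docs))     # appears in 2 of 3 docs -> low idf
print(tf_idf("crawler", docs[1], docs)) # appears in 1 of 3 docs -> higher idf
```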
Most indices use variants of inverted files.
An inverted file is a sorted list of words (index terms), each with pointers to the related pages. It is
referred to as “inverted” because documents are associated with words, rather than words
with documents. A logical view of the text is indexed, and each pointer may be associated with a short
description of the page it points to.
Inversion is the core of the indexing process: it converts document-term information into
term-document form for indexing, which is difficult to do efficiently for very large numbers of documents.
The format of the inverted file is designed for fast query processing and must also handle
updates.
An inverted index allows quick lookup of the document ids that contain a particular word. It is
built by associating each index term with an inverted list of the documents in which it occurs.
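The sketch below builds a toy in-memory inverted index mapping each term to an inverted list of (document id, positions); production indexes are compressed and constructed out-of-core to cope with very large collections.

```python
# Toy inverted index: each term maps to a postings structure of
# {doc_id: [positions]}, enabling quick lookup of documents by word.
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(dict)                 # term -> {doc_id: [positions]}
    for doc_id, tokens in documents.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

documents = {1: ["web", "search", "web"], 2: ["search", "engine"]}
index = build_inverted_index(documents)
print(index["web"])      # {1: [0, 2]}
print(index["search"])   # {1: [1], 2: [0]}
```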
Query processing takes the user’s query and, depending on the application, the context, and
other inputs, automatically builds a better query, submits the enhanced query to the
search engine on the user’s behalf, and displays the ranked results.
Thus, the query process comprises a user interaction module, which supports creation and
refinement of the query and displays the results, and a ranking module, which uses the query
and the indexes (generated during the indexing process) to produce a ranked list of documents.
The Query Process
(Figure: the query process, showing the user interaction, ranking and evaluation modules with log data, operating over the document repository and index.)
The user interacts with the system through an interface in which the query is entered.
Query transformation is then employed to improve the initial query, both before and after
the initial search. It may use a variety of techniques, such as spell checking and query
suggestion, which provide alternatives to the original query.
Query expansion approaches attempt to expand the original search query by adding
further new or related terms. These additional terms are inserted into an existing query either
by the user (interactive query expansion, IQE) or by the retrieval system (automatic query
expansion, AQE), and are intended to increase the accuracy of the search.
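A minimal sketch of automatic query expansion using a tiny hand-made table of related terms; real systems derive expansion terms from thesauri, relevance feedback, or query logs.

```python
# Sketch of automatic query expansion: add related terms from a small,
# hand-made synonym table (illustrative only) to the original query.
RELATED_TERMS = {"car": ["automobile"], "cheap": ["affordable", "budget"]}

def expand_query(query_terms):
    expanded = list(query_terms)
    for term in query_terms:
        for related in RELATED_TERMS.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["cheap", "car", "rental"]))
# -> ['cheap', 'car', 'rental', 'affordable', 'budget', 'automobile']
```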
The fundamental challenge for a search engine is to rank the pages that match the input query and return an
ordered list. Search engines rank individual web pages of a website, not the entire site. There are many
variations of ranking algorithms and retrieval models.
Search engines use two different kinds of ranking factors: query-dependent factors and query-
independent factors.
Query-dependent factors are all ranking factors that are specific to a given query. These include measures such
as the frequency of the query words in the document, the position of the query terms within the document, or the inverse
document frequency, which are all measures used in traditional Information Retrieval.
Query-independent factors are attached to the documents regardless of a given query and consider
measures such as an emphasis on anchor text, the language of the document in relation to the language
of the query, or the “geographical distance between the user and the document”. They
are used to estimate the quality of a given document, so that the search engine can provide the
user with the highest possible quality and omit low-quality documents.
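One illustrative way to combine the two kinds of factors is a weighted sum of a query-dependent text score and a query-independent quality prior, as sketched below; the weights and scores are arbitrary assumptions rather than any engine's documented formula.

```python
# Sketch of a final ranking score combining a query-dependent text score
# (e.g., a tf.idf similarity) with a query-independent prior (e.g., PageRank).
# The 0.7 / 0.3 weights are arbitrary illustrative choices.
def final_score(text_score, quality_prior, w_text=0.7, w_quality=0.3):
    return w_text * text_score + w_quality * quality_prior

pages = {"pageA": (0.9, 0.2), "pageB": (0.6, 0.8)}   # (text score, prior)
ranked = sorted(pages, key=lambda p: final_score(*pages[p]), reverse=True)
print(ranked)    # pageA's strong text match outweighs pageB's higher prior
```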
The Web Search can be categorized into two phases, namely the Offline phase which
includes the ‘Crawling’ & ‘Indexing’ Components; and the Online phase which includes the
‘Querying’ & ‘Ranking’ components of the Web IR system.
The results may be evaluated to monitor and measure retrieval effectiveness and
efficiency. This is generally done offline and involves logging user queries and their
interactions, which can be crucial for improving search effectiveness and efficiency. Query logs
and click-through data are used for query suggestion, spell checking, query caching, ranking,
advertising search, and other components.
A ranking analysis can be done to measure and tune ranking effectiveness, whereas a
performance analysis is carried out to measure and tune system efficiency.