Chapter - 6 Part 1
Web IR is different from classical IR for two kinds of reasons: concepts and technologies. The
characteristics of the Web make the task of retrieving information from it quite different from
pre-Web (traditional) information retrieval.
Web IR Tools
Categories of Web IR tools:
Search Tools
Search tools employ robots for indexing Web documents. They feature a user interface
for specifying queries and browsing the results. At the heart of a search tool is the search
engine, which is responsible for searching the index to retrieve documents relevant to a
user query.
Search tools can be divided into two categories based on the transparency of the index to
the user.
✓ Class 1 search tools
✓ Class 2 search tools
Class 1 search tools: General-Purpose Search Engines: These tools completely hide the
organization and content of the index from the user. Example: AltaVista: www.altavista.com;
Excite: www.excite.com; Google: www.google.com; Lycos: www.lycos.com; Hotbot:
www.hotbot.com
Class 2 search tools: Subject Directories: These feature a hierarchically organized subject
catalog or directory of the Web, which is visible to users as they browse and search. Example:
Yahoo!: www.search.yahoo.com; WWW Virtual Library: http://vlib.org/; Galaxy:
https://usegalaxy.org/
Search Services
Search services provide users with a layer of abstraction over several search tools and
databases, and aim at simplifying Web search. Search services broadcast user queries to
several search engines and various other information sources simultaneously. Then they
merge the results returned by these sources, check for duplicates, and present them to
the user as an HTML page with clickable URLs.
Example: MetaCrawler: http://www.metacrawler.com/; Dogpile:
http://www.dogpile.com/
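As a rough illustration of this merging step, the Python sketch below combines two hypothetical result lists (the URLs and titles are made up), removes duplicate URLs, and returns a single combined list; it is not the implementation of any particular metasearch service.

```python
# Minimal sketch of a metasearch merge step: combine ranked result lists
# from several engines, drop duplicate URLs, and keep the earliest hit seen.
# The engines and results below are hypothetical.

def merge_results(result_lists):
    """result_lists: list of ranked lists, each a list of (url, title)."""
    seen = set()
    merged = []
    # Interleave the lists round-robin so every source contributes early hits.
    for rank in range(max(len(lst) for lst in result_lists)):
        for lst in result_lists:
            if rank < len(lst):
                url, title = lst[rank]
                if url not in seen:          # duplicate check
                    seen.add(url)
                    merged.append((url, title))
    return merged

engine_a = [("http://example.org/a", "Page A"), ("http://example.org/b", "Page B")]
engine_b = [("http://example.org/b", "Page B"), ("http://example.org/c", "Page C")]
print(merge_results([engine_a, engine_b]))
```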
Web IR Architecture (Search Engine Architecture)
(Figure: search engine architecture, showing the document repository and the index.)
Web crawlers follow links to find documents; they must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness).
The crawler and indexer assume that words contained in headings, metadata and the first few
sentences are likely to be more important in the context of the page, and that keywords in such prime
locations suggest that the page is really ‘about’ those keywords.
Text acquisition identifies and stores the documents for indexing: it crawls the Web, converts the
gathered information into a consistent format, and finally stores the result in a document
repository.
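A minimal sketch of this acquisition step, using only the Python standard library; the seed URL is a placeholder, and a real crawler would also respect robots.txt, apply rate limits, and revisit pages to maintain freshness.

```python
# Minimal breadth-first crawler sketch: fetch pages, extract links,
# and store raw text in a simple in-memory "document repository".
import re
import urllib.request
from collections import deque

def crawl(seed, max_pages=10):
    repository = {}                      # url -> raw HTML (document repository)
    frontier = deque([seed])             # URLs still to visit
    seen = {seed}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable pages
        repository[url] = html
        # Follow absolute links found in the page (coverage).
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# docs = crawl("http://example.org/")   # hypothetical seed URL
```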
Text Transformation
Once the text is acquired, the next step is to transform the captured documents into index
terms or features. This is a kind of preprocessing step which involves parsing, stop-word
removal, stemming, link analysis & information extraction as the sub-steps.
Parser: Parsing is the processing of the sequence of text tokens in the document to recognize
structural elements, e.g., titles, links, headings, etc.
Stopping: Commonly occurring words are unlikely to give useful information and may be
removed from the vocabulary to speed processing. Stop-word removal, or stopping, is a
process that removes common words such as “and”, “or”, “the”, “in”.
Stemming: Stemming is the process of removing suffixes from words to reduce them to a common stem;
it groups words that are derived from the same root. For e.g., “engineer”, “engineers”, “engineering”
and “engineered” all reduce to the stem “engineer”.
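The snippet below sketches stopping and a deliberately crude suffix-stripping stemmer; the stop list and suffix rules are illustrative assumptions, not the Porter algorithm used in practice.

```python
# Sketch of stopping and naive stemming. The stop list and suffix rules
# are illustrative only; real systems use Porter/Snowball stemmers.
STOP_WORDS = {"and", "or", "the", "in", "a", "of", "to"}
SUFFIXES = ["ing", "ed", "s"]            # checked longest-first below

def stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform(tokens):
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(transform(["the", "engineers", "engineered", "the", "engine"]))
# -> ['engineer', 'engineer', 'engine']
```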
Link Analysis: It makes use of links and anchor text in web pages and identifies popularity and
community information, for example, PageRank.
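As a concrete example of a link-analysis measure, here is a minimal PageRank computed by power iteration over a made-up three-page link graph; the damping factor of 0.85 is the commonly cited default.

```python
# Minimal PageRank via power iteration over a toy link graph.
# graph maps each page to the pages it links to; the graph is illustrative.
def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))               # C accumulates the most rank
```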
Information Extraction: This process identifies the classes of index terms that are important for some
applications. For e.g., named entity recognizers identify classes such as people, locations, companies,
dates, etc.
Classifier: It identifies class-related metadata for documents, i.e., it assigns labels to documents,
for example topic, reading level, sentiment or genre. The use of a classifier depends on the application.
Index Creation
Document statistics are gathered, such as counts of index-term occurrences, the positions in the
documents where the index terms occur, counts of occurrences over groups of documents, lengths
of documents in terms of the number of tokens, and other features commonly used in ranking
algorithms. The actual data gathered depends on the retrieval model and the associated
ranking method.
The index-term weights are then computed (for example, the tf.idf weight, which is a
combination of the term frequency in the document and the inverse document frequency in the
collection).
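A small worked example of one common tf.idf variant (raw term frequency times log-scaled inverse document frequency) over a toy collection; actual systems differ in the exact weighting formula used.

```python
# Sketch of tf.idf weighting: term frequency in a document combined with
# the inverse document frequency over the whole collection.
import math

docs = [["web", "search", "engine"],
        ["web", "crawler", "index"],
        ["ranking", "search", "results"]]

def tf_idf(term, doc, collection):
    tf = doc.count(term)                          # raw term frequency
    df = sum(1 for d in collection if term in d)  # document frequency
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

print(tf_idf("web", docs[0], docs))     # appears in 2 of 3 docs -> low idf
print(tf_idf("crawler", docs[1], docs)) # appears in 1 of 3 docs -> higher idf
```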
Most indices use variants of inverted files.
An inverted file is a sorted list of words (index terms), each with pointers to the related pages. It is
referred to as “inverted” because documents are associated with words, rather than words
with documents. A logical view of the text is indexed, and each pointer may be associated with a short
description of the page it points to.
Inversion is the core of the indexing process: it converts document-term information into
term-document form for indexing, which is difficult to do efficiently for very large numbers of documents.
The format of the inverted file is designed for fast query processing and must also handle
updates.
An inverted index allows quick lookup of the document ids that contain a particular word. It is
built by associating each index term with an inverted list of the documents in which it occurs.
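The sketch below builds a toy in-memory inverted index mapping each term to an inverted list of (document id, positions); production indexes are compressed and constructed out-of-core to cope with very large collections.

```python
# Toy inverted index: each term maps to a postings structure of
# {doc_id: [positions]}, enabling quick lookup of documents by word.
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(dict)                 # term -> {doc_id: [positions]}
    for doc_id, tokens in documents.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

documents = {1: ["web", "search", "web"], 2: ["search", "engine"]}
index = build_inverted_index(documents)
print(index["web"])      # {1: [0, 2]}
print(index["search"])   # {1: [1], 2: [0]}
```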
Query processing takes the user’s query and, depending on the application, the context, and
other inputs, automatically builds a better query, submits the enhanced query to the
search engine on the user’s behalf, and displays the ranked results.
Thus, the query process comprises a user interaction module, which supports creation and
refinement of the query and displays the results, and a ranking module, which uses the query
and the indexes (generated during the indexing process) to produce a ranked list of documents.
The Query Process
(Figure: the query process, showing the user interaction, ranking and evaluation modules with log data, operating over the document repository and index.)
The user interacts with the system through an interface in which the query is entered.
Query transformation is then employed to improve the initial query, both before and after
the initial search. It may use a variety of techniques, such as spell checking and query
suggestion, which provide alternatives to the original query.
Query expansion approaches attempt to expand the original search query by adding
further new or related terms. These additional terms are inserted into an existing query either
by the user (interactive query expansion, IQE) or by the retrieval system (automatic query
expansion, AQE), and are intended to increase the accuracy of the search.
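A minimal sketch of automatic query expansion using a tiny hand-made table of related terms; real systems derive expansion terms from thesauri, relevance feedback, or query logs.

```python
# Sketch of automatic query expansion: add related terms from a small,
# hand-made synonym table (illustrative only) to the original query.
RELATED_TERMS = {"car": ["automobile"], "cheap": ["affordable", "budget"]}

def expand_query(query_terms):
    expanded = list(query_terms)
    for term in query_terms:
        for related in RELATED_TERMS.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["cheap", "car", "rental"]))
# -> ['cheap', 'car', 'rental', 'affordable', 'budget', 'automobile']
```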
The fundamental challenge for a search engine is to rank the pages that match the input query and return an
ordered list. Search engines rank individual web pages of a website, not the entire site. There are many
variations of ranking algorithms and retrieval models.
Search engines use two different kinds of ranking factors: query-dependent factors and query-
independent factors.
Query-dependent factors are all ranking factors that are specific to a given query. These include measures such
as the frequency of the query words in the document, the position of the query terms within the document, or the inverse
document frequency, which are all measures used in traditional Information Retrieval.
Query-independent factors are attached to the documents regardless of a given query and consider
measures such as an emphasis on anchor text, the language of the document in relation to the language
of the query, or the “geographical distance between the user and the document”. They
are used to estimate the quality of a given document, so that the search engine can provide the
user with the highest possible quality and omit low-quality documents.
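One illustrative way to combine the two kinds of factors is a weighted sum of a query-dependent text score and a query-independent quality prior, as sketched below; the weights and scores are arbitrary assumptions rather than any engine's documented formula.

```python
# Sketch of a final ranking score combining a query-dependent text score
# (e.g., a tf.idf similarity) with a query-independent prior (e.g., PageRank).
# The 0.7 / 0.3 weights are arbitrary illustrative choices.
def final_score(text_score, quality_prior, w_text=0.7, w_quality=0.3):
    return w_text * text_score + w_quality * quality_prior

pages = {"pageA": (0.9, 0.2), "pageB": (0.6, 0.8)}   # (text score, prior)
ranked = sorted(pages, key=lambda p: final_score(*pages[p]), reverse=True)
print(ranked)    # pageA's strong text match outweighs pageB's higher prior
```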
The Web Search can be categorized into two phases, namely the Offline phase which
includes the ‘Crawling’ & ‘Indexing’ Components; and the Online phase which includes the
‘Querying’ & ‘Ranking’ components of the Web IR system.
The results may be evaluated to monitor and measure retrieval effectiveness and
efficiency. This is generally done offline and involves logging user queries and their
interactions, which can be crucial for improving search effectiveness and efficiency. Query logs
and click-through data are used for query suggestion, spell checking, query caching, ranking,
advertising search, and other components.
A ranking analysis can be done to measure and tune ranking effectiveness, whereas a
performance analysis is carried out to measure and tune system efficiency.