An Introduction to Information
Retrieval Systems
Intelligent Systems
March 18, 2004
Ramashis Das
Definition
We discuss Automatic Information
Retrieval.
Automatic – as against ‘manual’.
Information – as against ‘data’.
Defn : An information retrieval system does not
inform (i.e. change the knowledge of) the user on
the subject of his inquiry. It merely informs on the
existence (or non-existence) and whereabouts of
documents relating to his request.
IR vs Data Retrieval

                      Data Retrieval     Information Retrieval
Matching              Exact match        Partial match, best match
Inference             Deduction          Induction
Model                 Deterministic      Probabilistic
Classification        Monothetic         Polythetic
Query language        Artificial         Natural
Query specification   Complete           Incomplete
Items wanted          Matching           Relevant
Error response        Sensitive          Insensitive
Classification
Monothetic classification is one with classes
defined by objects possessing attributes both
necessary and sufficient to belong to a class.
Polythetic classification is one where each
individual in a class will possess only a
proportion of all the attributes possessed by all
the members of that class.
Hence no attribute is either necessary or
sufficient for membership of a class.
Experimental Vs Operational IR Systems
Many Automatic Information Retrieval
Systems are Experimental. Experimental
IR is mainly carried out in a ‘Laboratory’
situation.
The other kind is Operational Systems (or
‘Real World’ IR Systems) – commercial
systems that charge for the service they
provide.
Why IR? – A Simple E.g.
Suppose there is a store of documents and a
person (user of the store) formulates a
question (request or query) to which the
answer is a set of documents satisfying the
information need expressed by his question.
Solution : User can read all the documents in
the store, retain the relevant documents and
discard all the others – Perfect Retrieval…
NOT POSSIBLE !!!
Alternative : Use a High Speed Computer to
read the entire document collection and
extract the relevant documents.
Black Box Model
[Diagram: Documents and Queries enter at the INPUT; the PROCESSOR
acts on them to produce the OUTPUT; a FEEDBACK loop runs from the
output back to the queries.]
The main problem here is to obtain a
Representation of each Document and Query
suitable for a computer to use.
Most Computer-Based Retrieval Systems
store only a representation of the Document
(or Query)
This implies the actual text is lost; an artificial
language is used instead.
The user needs to be taught to express his
information need in this language.
Feedback and PROCESSOR
Feedback – an on-line change to the request
during a search session, in the light of a
sample retrieval, in the hope of improving
the subsequent retrieval run.
PROCESSOR – the Retrieval Process:
Structuring the information in an appropriate way.
The actual Retrieval Function – a Search Strategy
executed in response to a Query.
OUTPUT
Set of Citations or Document Numbers.
For Experimental Systems, proper
Evaluation technique follows.
Historical Development
Three main areas of Research:
Content Analysis : Describing the contents
of documents in a form suitable for
computer processing;
Information Structures : Exploiting
relationships between documents to
improve the efficiency and effectiveness of
retrieval strategies;
Evaluation : the measurement of the
effectiveness of retrieval.
Information Representation
Luhn’s approach : frequency count of words in
the Document.
List of Keywords or Terms.
Freq. of occurrence of Keyword in body of
Document indicates its significance.
Statistical Association between Keywords -
exploited by Maron and Kuhns and Stiles
Sparck Jones - measures of association
between keywords based on their frequency of
co-occurrence.
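Luhn's frequency-count idea above can be sketched with a toy helper (hypothetical names; real systems tokenise far more carefully than splitting on whitespace):

```python
from collections import Counter

def term_frequencies(text):
    # Luhn's approach: the frequency of occurrence of a word in the
    # body of a document is taken as a crude measure of significance.
    return Counter(text.lower().split())

doc = "information retrieval systems retrieve information about information needs"
freqs = term_frequencies(doc)
# 'information' occurs most often, so under Luhn's heuristic it is
# the most significant keyword in this toy document.
```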
Information Structure
Fairly recent, slow development – researchers
have been loath to try out new organization
techniques for faster and better retrieval.
Serial File Organization
Inverted File (?)
Clustering – Good, Fairthorne; Doyle;
Rocchio
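The inverted file idea can be sketched as follows (a toy structure for illustration, not any of the cited authors' implementations):

```python
from collections import defaultdict

def build_inverted_file(docs):
    # An inverted file maps each index term to the set of document ids
    # in which it occurs, so a query term is looked up directly instead
    # of scanning every document as a serial file organization would.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "information retrieval", 2: "data retrieval", 3: "information structures"}
index = build_inverted_file(docs)
# index["retrieval"] -> {1, 2}
```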
Evaluation of Retrieval Systems
Extremely Difficult
Dichotomous Scale : Relevant and Non-
Relevant.
Precision - the ratio of the number of relevant
documents retrieved to the total number of
documents retrieved
Recall - ratio of the number of relevant
documents retrieved to the total number of
relevant documents (both retrieved and not
retrieved).
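The two ratios above can be computed directly from the retrieved and relevant sets (a minimal sketch on the dichotomous relevance scale):

```python
def precision_recall(retrieved, relevant):
    # Every document is judged either relevant or non-relevant.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
# p == 0.5  (2 of the 4 retrieved documents are relevant)
# r == 2/3  (2 of the 3 relevant documents were retrieved)
```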
Steps…
1. Generation of Machine Representations for the
Information.
2. Explanation of the Logical Structures that may be
arrived at by Clustering.
3. Representing these Structures in the Computer, or in
other words, choice of File Structures to Represent
the Logical Structure.
4. Search Strategies.
5. Probabilistic Retrieval, i.e. to create a Formal Model
for certain kinds of Search Strategies.
6. Ways of Evaluating the Effectiveness of Retrieval.
AUTOMATIC TEXT ANALYSIS
Storing Information
Original : In form of Documents
Document Representation is stored
Emphasis is on the statistical rather than
linguistic approaches.
We start with the original ideas of Luhn.
Luhn’s Ideas
Frequency of word occurrence in an
article furnishes a useful measurement
of word significance.
The relative position within a sentence of
words having given values of significance
furnishes a useful measurement for
determining the significance of sentences.
Demonstration
f – Frequency of occurrence of words
r – Rank Order
Zipf’s Law - the product of the frequency
of use of words and the rank order is
approximately constant.
Luhn used the above law to define two
cut-offs.
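Zipf's law can be checked on a toy text by listing the product f·r for each rank (a sketch; Luhn's cut-offs then discard words above an upper cut-off, too common to discriminate, and below a lower cut-off, too rare to matter):

```python
from collections import Counter

def rank_frequency_products(text):
    # Under Zipf's law, frequency * rank is approximately constant,
    # so the third element of each triple should be roughly the same.
    counts = Counter(text.lower().split()).most_common()
    return [(rank, freq, rank * freq)
            for rank, (_, freq) in enumerate(counts, start=1)]

triples = rank_frequency_products("a a a a b b c")
# -> [(1, 4, 4), (2, 2, 4), (3, 1, 3)]
```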
Generating Document
Representatives - conflation
Text Processing System
Input text – full text, abstract or title
Output – a doc representative adequate for use in
an automatic retrieval system
The document representative consists of a list
of class names, each name representing a
class of words occurring in the total input text.
A document will be indexed by a name if one
of its significant words occurs as a member of
that class.
Text Processing System
Such a system will consist of three parts:
Removal of high frequency words
Suffix stripping
Detecting equivalent stems
Removal of High Freq words :
One way of implementing Luhn’s upper cut-off.
Maintain a ‘stop list’ of such words; compare and remove.
Document size reduces by 30 to 50 %.
Text Processing System
Suffix stripping – more involved.
Keep a complete list of suffixes; match and remove the
longest possible one.
Context-free removal leads to errors : removing ‘UAL’ is
right for FACTUAL (→ FACT) but wrong for EQUAL (→ EQ).
Solution : Have some rules.
Equivalent Stems :
Words that map to the same morphological form on
removal of suffixes.
Other kinds do not match on mere removal of suffixes
(ABSORB- and ABSORPT-).
For these, a list of equivalent stem-endings is maintained.
(For e.g. ‘B’ and ‘PT’ are equivalent stem endings)
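The stop-word removal and suffix-stripping steps above can be sketched as follows (toy stop list, suffix list, and exception rule, all illustrative; equivalent-stem matching is omitted for brevity, and real systems use full stemming algorithms such as Porter's):

```python
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}  # tiny illustrative stop list
SUFFIXES = ("ing", "ual", "ed", "s")                      # toy suffix list
NO_STRIP = {"factual", "equal"}                           # rule: keep 'UAL' in these words

def conflate(word):
    # Strip the longest matching suffix, unless a rule forbids it.
    if word in NO_STRIP:
        return word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_representative(text):
    # Remove high-frequency words, then conflate the remainder to stems.
    return sorted({conflate(w) for w in text.lower().split()
                   if w not in STOP_WORDS})

document_representative("the retrieving of factual systems")
# -> ['factual', 'retriev', 'system']
```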
Text Processing System
The final output from a conflation algorithm is
a set of classes, one for each stem detected.
A class name is assigned to a document if and
only if one of its members occurs as a
significant word in the text of the document.
A document representative then becomes a
list of class names. These are often referred to
as the document’s index terms or keywords.
Queries : Queries are handled in the same
way.
Indexing
The index language is the language used to
describe documents and requests.
The elements of the index language are
index terms, which may be derived from
the text of the document to be described,
or may be arrived at independently.
Some distinctions
Index Languages can be described as :
Pre-coordinate : terms are coordinated at the time
of indexing
Post-coordinate : at the time of searching.
Vocabulary of Index Language :
Controlled : a list of approved index terms that an
indexer may use. One may also impose other kinds
of syntactic controls (e.g. certain terms may be
used only as adjectives).
Uncontrolled
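The post-coordinate case can be sketched as terms combined at search time (a toy AND combination over an inverted-file-style index; names are illustrative):

```python
def post_coordinate_search(index, terms):
    # Post-coordinate: index terms are combined (here with AND) only
    # at search time, not when the documents were indexed.
    doc_sets = [index.get(t, set()) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

index = {"information": {1, 3}, "retrieval": {1, 2}, "data": {2}}
post_coordinate_search(index, ["information", "retrieval"])  # -> {1}
```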