An Introduction to Information
Retrieval Systems
Intelligent Systems
March 18, 2004
Ramashis Das
Definition
We discuss Automatic Information
Retrieval.
Automatic – as against ‘manual’.
Information – as against ‘data’.
Defn : An information retrieval system does not
inform (i.e. change the knowledge of) the user on
the subject of his inquiry. It merely informs on the
existence (or non-existence) and whereabouts of
documents relating to his request.
IR vs Data Retrieval

                      Data Retrieval     Information Retrieval
Matching              Exact match        Partial match, best match
Inference             Deduction          Induction
Model                 Deterministic      Probabilistic
Classification        Monothetic         Polythetic
Query language        Artificial         Natural
Query specification   Complete           Incomplete
Items wanted          Matching           Relevant
Error response        Sensitive          Insensitive
Classification
Monothetic classification is one with classes
defined by objects possessing attributes both
necessary and sufficient to belong to a class.
Polythetic classification is one where each
individual in a class will possess only a
proportion of all the attributes possessed by all
the members of that class.
Hence no attribute is either necessary or
sufficient for membership of a class.
Experimental Vs Operational IR Systems
Many Automatic Information Retrieval
Systems are Experimental. Experimental
IR is mainly carried out in a ‘Laboratory’
situation.
The other kind is Operational Systems (or
‘Real World’ IR Systems) – commercial
systems that charge for the service they
provide.
Why IR? – A Simple E.g.
Suppose there is a store of documents and a
person (user of the store) formulates a
question (request or query) to which the
answer is a set of documents satisfying the
information need expressed by his question.
Solution : User can read all the documents in
the store, retain the relevant documents and
discard all the others – Perfect Retrieval…
NOT POSSIBLE !!!
Alternative : Use a High Speed Computer to
read the entire document collection and
extract the relevant documents.
Black Box Model
[Diagram: Documents and Queries enter at the INPUT; the PROCESSOR
acts on them to produce the OUTPUT; a FEEDBACK loop runs from the
output back to the queries.]
The main problem here is to obtain a
Representation of each Document and Query
suitable for a computer to use.
Most Computer-Based Retrieval Systems
store only a representation of the Document
(or Query)
This implies the actual text is lost; an artificial
language is used instead.
The user needs to be taught to express his
information need in this language.
Feedback and PROCESSOR
Feedback – an on-line change to the request
during a search session, in the light of a
sample retrieval, in the hope of improving
the subsequent retrieval run.
PROCESSOR – the Retrieval Process:
Structuring the information in an appropriate way.
The actual Retrieval Function – a Search Strategy
executed in response to a Query.
OUTPUT
Set of Citations or Document Numbers.
For Experimental Systems, proper
Evaluation technique follows.
Historical Development
Three main areas of Research:
Content Analysis : Describing the contents
of documents in a form suitable for
computer processing;
Information Structures : Exploiting
relationships between documents to
improve the efficiency and effectiveness of
retrieval strategies;
Evaluation : the measurement of the
effectiveness of retrieval.
Information Representation
Luhn’s approach : frequency count of words in
the Document.
List of Keywords or Terms.
Freq. of occurrence of Keyword in body of
Document indicates its significance.
Statistical Association between Keywords -
exploited by Maron and Kuhns and Stiles
Sparck Jones - measures of association
between keywords based on their frequency of
co-occurrence.
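Luhn's frequency-count idea above can be sketched with a toy helper (hypothetical names; real systems tokenise far more carefully than splitting on whitespace):

```python
from collections import Counter

def term_frequencies(text):
    # Luhn's approach: the frequency of occurrence of a word in the
    # body of a document is taken as a crude measure of significance.
    return Counter(text.lower().split())

doc = "information retrieval systems retrieve information about information needs"
freqs = term_frequencies(doc)
# 'information' occurs most often, so under Luhn's heuristic it is
# the most significant keyword in this toy document.
```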
Information Structure
Fairly recent, slow development – researchers
have been loath to try out new organization
techniques for faster and better retrieval.
Serial File Organization
Inverted File (?)
Clustering – Good, Fairthorne; Doyle;
Rocchio
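The inverted file idea can be sketched as follows (a toy structure for illustration, not any of the cited authors' implementations):

```python
from collections import defaultdict

def build_inverted_file(docs):
    # An inverted file maps each index term to the set of document ids
    # in which it occurs, so a query term is looked up directly instead
    # of scanning every document as a serial file organization would.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "information retrieval", 2: "data retrieval", 3: "information structures"}
index = build_inverted_file(docs)
# index["retrieval"] -> {1, 2}
```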
Evaluation of Retrieval Systems
Extremely Difficult
Dichotomous Scale : Relevant and Non-
Relevant.
Precision - the ratio of the number of relevant
documents retrieved to the total number of
documents retrieved
Recall - ratio of the number of relevant
documents retrieved to the total number of
relevant documents (both retrieved and not
retrieved).
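The two ratios above can be computed directly from the retrieved and relevant sets (a minimal sketch on the dichotomous relevance scale):

```python
def precision_recall(retrieved, relevant):
    # Every document is judged either relevant or non-relevant.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
# p == 0.5  (2 of the 4 retrieved documents are relevant)
# r == 2/3  (2 of the 3 relevant documents were retrieved)
```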
Steps…
1. Generation of Machine Representations for the
Information.
2. Explanation of the Logical Structures that may be
arrived at by Clustering.
3. Representing these Structures in the Computer, or in
other words, choice of File Structures to Represent
the Logical Structure.
4. Search Strategies.
5. Probabilistic Retrieval, i.e. to create a Formal Model
for certain kinds of Search Strategies.
6. Ways of Evaluating the Effectiveness of Retrieval.
AUTOMATIC TEXT ANALYSIS
Storing Information
Original : In form of Documents
Document Representation is stored
Emphasis is on the statistical rather than
linguistic approaches.
We start with the original ideas of Luhn.
Luhn’s Ideas
Frequency of word occurrence in an
article furnishes a useful measurement
of word significance.
The relative position within a sentence of
words having given values of significance
furnishes a useful measurement for
determining the significance of sentences.
Demonstration
f – Frequency of occurrence of words
r – Rank Order
Zipf’s Law - the product of the frequency
of use of words and the rank order is
approximately constant.
Luhn used the above law to define two
cut-offs.
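Zipf's law can be checked on a toy text by listing the product f·r for each rank (a sketch; Luhn's cut-offs then discard words above an upper cut-off, too common to discriminate, and below a lower cut-off, too rare to matter):

```python
from collections import Counter

def rank_frequency_products(text):
    # Under Zipf's law, frequency * rank is approximately constant,
    # so the third element of each triple should be roughly the same.
    counts = Counter(text.lower().split()).most_common()
    return [(rank, freq, rank * freq)
            for rank, (_, freq) in enumerate(counts, start=1)]

triples = rank_frequency_products("a a a a b b c")
# -> [(1, 4, 4), (2, 2, 4), (3, 1, 3)]
```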
Generating Document
Representatives - conflation
Text Processing System
Input text – full text, abstract or title
Output – a doc representative adequate for use in
an automatic retrieval system
The document representative consists of a list
of class names, each name representing a
class of words occurring in the total input text.
A document will be indexed by a name if one
of its significant words occurs as a member of
that class.
Text Processing System
Such a system will consist of three parts:
Removal of high frequency words
Suffix stripping
Detecting equivalent stems
Removal of High Freq words :
One way of implementing Luhn’s upper cut-off.
Maintain a ‘stop list’ of such words; compare and remove.
Document size reduces by 30 to 50 %.
Text Processing System
Suffix stripping – more involved.
Keep a complete list of suffixes; match and remove the
longest possible one.
Context-free removal leads to errors : removing ‘UAL’ is
right for FACTUAL (→ FACT) but wrong for EQUAL (→ EQ).
Solution : Have some rules.
Equivalent Stems :
Words that map to the same morphological form on
removal of suffixes.
Other kinds do not match on mere removal of suffixes
(ABSORB- and ABSORPT-).
For these, a list of equivalent stem-endings is maintained.
(For e.g. ‘B’ and ‘PT’ are equivalent stem endings)
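The stop-word removal and suffix-stripping steps above can be sketched as follows (toy stop list, suffix list, and exception rule, all illustrative; equivalent-stem matching is omitted for brevity, and real systems use full stemming algorithms such as Porter's):

```python
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}  # tiny illustrative stop list
SUFFIXES = ("ing", "ual", "ed", "s")                      # toy suffix list
NO_STRIP = {"factual", "equal"}                           # rule: keep 'UAL' in these words

def conflate(word):
    # Strip the longest matching suffix, unless a rule forbids it.
    if word in NO_STRIP:
        return word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_representative(text):
    # Remove high-frequency words, then conflate the remainder to stems.
    return sorted({conflate(w) for w in text.lower().split()
                   if w not in STOP_WORDS})

document_representative("the retrieving of factual systems")
# -> ['factual', 'retriev', 'system']
```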
Text Processing System
The final output from a conflation algorithm is
a set of classes, one for each stem detected.
A class name is assigned to a document if and
only if one of its members occurs as a
significant word in the text of the document.
A document representative then becomes a
list of class names. These are often referred to
as the document’s index terms or keywords.
Queries : Queries are handled in the same
way.
Indexing
The index language is the language used to
describe documents and requests.
The elements of the index language are
index terms, which may be derived from
the text of the document to be described,
or may be arrived at independently.
Some distinctions
Index Languages can be described as :
Pre-coordinate : terms are coordinated at the time
of indexing
Post-coordinate : at the time of searching.
Vocabulary of Index Language :
Controlled : a list of approved index terms that an
indexer may use. One may also impose other kinds
of syntactic controls (e.g. certain terms may be
used only as adjectives).
Uncontrolled
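The post-coordinate case can be sketched as terms combined at search time (a toy AND combination over an inverted-file-style index; names are illustrative):

```python
def post_coordinate_search(index, terms):
    # Post-coordinate: index terms are combined (here with AND) only
    # at search time, not when the documents were indexed.
    doc_sets = [index.get(t, set()) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

index = {"information": {1, 3}, "retrieval": {1, 2}, "data": {2}}
post_coordinate_search(index, ["information", "retrieval"])  # -> {1}
```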