1 Information Retrieval System

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval System

An information retrieval system is a software programme that stores and manages information on
documents, often textual documents but possibly multimedia. The system assists users in finding the
information they need. It does not explicitly return information or answer questions. Instead, it informs
the user of the existence and location of documents that might contain the desired information.

Difference Between Information Retrieval and Data Retrieval

   Information Retrieval | Data Retrieval

1. Retrieves information based on the similarity between the query and the document. | Retrieves data based on the keywords in the query entered by the user.
2. Small errors are tolerated and will likely go unnoticed. | There is no room for errors, since an error results in complete system failure.
3. It is ambiguous and doesn't have a defined structure. | It has a defined structure with respect to semantics.
4. Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
5. An information retrieval system produces approximate results. | A data retrieval system produces exact results.
6. Displayed results are sorted by relevance. | Displayed results are not sorted by relevance.
7. The IR model is probabilistic by nature. | The data retrieval model is deterministic by nature.

Information retrieval system architecture

First of all, before the retrieval process can even be initiated, it is necessary to define the text
database. This is usually done by the manager of the database, who specifies the following: (a)
the documents to be used, (b) the operations to be performed on the text, and (c) the text model
(i.e., the text structure and what elements can be retrieved). The text operations transform the
original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager builds an index
of the text. An index is a critical data structure because it allows fast searching over large
volumes of data. Different index structures might be used, but the most popular one is the
inverted file. The resources (time and storage space) spent on defining the text database and
building the index are amortized by querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be initiated. The
user first specifies a user need which is then parsed and transformed by the same text operations
applied to the text. Then, query operations might be applied before the actual query, which
provides a system representation for the user need, is generated. The query is then processed to
obtain the retrieved documents. Fast query processing is made possible by the index structure
previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood
of relevance. The user then examines the set of ranked documents in the search for useful
information. At this point, the user might pinpoint a subset of the documents seen as definitely of
interest and initiate a user feedback cycle. In such a cycle, the system uses the documents
selected by the user to change the query formulation. Hopefully, this modified query is a better
representation of the real user need.
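
To make the indexing and querying steps concrete, here is a minimal inverted-file sketch in Python. The toy documents, the whitespace tokenizer, and the conjunctive (AND) query model are simplifying assumptions for illustration, not a prescribed design:

    from collections import defaultdict

    def build_index(docs):
        # Inverted file: maps each term to the set of document ids containing it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query_terms):
        # Conjunctive (AND) query: documents that contain every query term.
        result = None
        for term in query_terms:
            postings = index.get(term.lower(), set())
            result = postings if result is None else result & postings
        return result or set()

    # Toy document collection (contents are made up for illustration).
    docs = {
        1: "information retrieval finds relevant documents",
        2: "data retrieval returns exact matches",
        3: "an index allows fast searching over large volumes of data",
    }
    index = build_index(docs)
    print(search(index, ["retrieval", "documents"]))  # {1}

The index is built once and then reused across many queries, which is how the cost of indexing is amortized.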

Issues with IR systems:

 Are the retrieved documents relevant? (precision)
 Are all the relevant documents retrieved? (recall)

Evaluation of IR system

Two of the evaluation measures are precision and recall.

 Precision is the proportion of retrieved documents that are relevant.
 Recall is the proportion of relevant documents that are retrieved.

Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|

Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|

 When the recall measure is used, there is an assumption that all the
relevant documents for a given query are known. Such an assumption is
clearly problematic in a web search environment, but with a smaller test
collection of documents this measure can be useful. It is not suitable
for large volumes of log data.

You can increase recall by returning more documents:

 Recall is a non-decreasing function of the number of documents retrieved.
 A system that returns all documents has 100% recall!
 The converse is also true (usually): it is easy to get high precision at very low recall.

Q. Calculate precision and recall given the following counts from a retrieval truth table.

Sol.
TP = 20, FP = 40, FN = 60

Precision = TP / (TP + FP) = 20 / 60 ≈ 0.33
Recall = TP / (TP + FN) = 20 / 80 = 0.25
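
The same numbers can be reproduced directly from document sets; a minimal sketch, with hypothetical document ids chosen to match the counts above:

    def precision_recall(retrieved, relevant):
        # Precision: fraction of retrieved documents that are relevant.
        # Recall: fraction of relevant documents that were retrieved.
        hits = retrieved & relevant
        return len(hits) / len(retrieved), len(hits) / len(relevant)

    # Hypothetical document ids: 20 true positives, 40 false positives,
    # 60 false negatives, as in the example above.
    retrieved = set(range(60))        # the 60 documents the system returned
    relevant = set(range(40, 120))    # the 80 documents that are actually relevant
    p, r = precision_recall(retrieved, relevant)
    print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.33, recall = 0.25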

Luhn’s Idea

One of the first text summarization algorithms was published in 1958 by Hans Peter Luhn, working at
IBM Research. Luhn's algorithm is a naive approach based on TF-IDF that looks at the "window size" of
non-important words between words of high importance.

Luhn's algorithm selects only the words of higher importance, as judged by their frequency. Higher
weights are assigned to the words present at the beginning of the document. The algorithm considers
only the words lying in the shaded middle region of Luhn's word-frequency graph: the region on the
right holds the highest-frequency words, the words on the left are the least frequent, and both
extremes are excluded. Luhn introduced the following criteria during text pre-processing:

1. Removing stopwords

2. Stemming (Likes->Like)

In this method we select the sentences with the highest concentration of salient content terms. For
example, suppose we have 10 words in a sentence, 4 of the words are significant, and the span
containing those significant words is 6 words long.
To calculate the significance, instead of dividing the number of significant words by the total number
of words, we divide its square by the span that contains these words. Thus the score obtained from
our example would be:

Score = 4² / 6 = 16 / 6 ≈ 2.7
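
A minimal sketch of this scoring rule; the sentence and the set of significant words are made up for illustration:

    def luhn_score(words, significant):
        # Luhn's sentence score: (significant words)^2 / span containing them.
        positions = [i for i, w in enumerate(words) if w in significant]
        if not positions:
            return 0.0
        span = positions[-1] - positions[0] + 1   # length of the window
        return len(positions) ** 2 / span

    # 10-word sentence with 4 significant words spread over a 6-word span:
    words = ["w0", "sig", "w2", "sig", "sig", "w5", "sig", "w7", "w8", "w9"]
    print(luhn_score(words, {"sig"}))  # 4**2 / 6 = 2.66...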

Application
Luhn's method relies on the following observations:

1. Very low-frequency words are not significant.

2. Very high-frequency words are also not significant (e.g. "is", "and").

3. Removing low-frequency words is easy:

 set a minimum frequency threshold

4. Removing common (high-frequency) words (see the sketch after this list):

 set a maximum frequency threshold (statistically obtained)

 compare against a common-word list

5. The method is well suited to summarizing technical documents.
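
A minimal sketch of the threshold filtering described in points 3 and 4; the cut-off values and the common-word list are hypothetical and would in practice be tuned or derived statistically:

    from collections import Counter

    def significant_words(tokens, min_freq=2, max_freq=10, common_words=frozenset()):
        # Keep words whose frequency lies between the two cut-offs and
        # which do not appear on a common-word (stopword) list.
        counts = Counter(tokens)
        return {w for w, c in counts.items()
                if min_freq <= c <= max_freq and w not in common_words}

    tokens = "the cat sat on the mat the cat purred".split()
    print(significant_words(tokens, min_freq=2, max_freq=2, common_words={"the"}))
    # {'cat'}  ("the" is on the common-word list; the other words occur only once)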

Algorithm
Luhn's method is a simple technique for generating a summary from the given text. The algorithm can be
implemented in two stages.
In the first stage, we determine which words are more significant towards the meaning of the
document. Luhn states that this is done by a frequency analysis, finding words that are
significant but are not common English words.
In the second phase, we find the most common words in the document and then take the subset of
those that are not common English words but are still important. The method usually consists of the
following three steps:
1. It begins with transforming the content of sentences into a mathematical expression, or vector
(for example, a binary representation). Here we use a bag of words, which ignores all the
filler words. Filler words are usually the supporting words that do not have any impact on the
document's meaning. Then we count all the valuable words left to us. For example, for the sentence
"an apple a day keeps the doctor away", stopwords like "an", "a" and "the" are not considered in the
evaluation, while the remaining content words are each counted once.
2. In this step we evaluate sentences using a sentence-scoring technique. We can use the scoring
method illustrated below:

Score = (Number of meaningful words)² / (Span of meaningful words)

A span here refers to the part of the sentence (or, in general, of the document) consisting of all the
meaningful words. TF-IDF can also be used to prioritize the words in a sentence.
3. Once the sentence scoring is complete, the last step is simply to select those sentences with the
highest overall rankings (a compact end-to-end sketch follows this list).
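
Putting the three steps together, here is a compact end-to-end sketch of a Luhn-style summarizer. The stopword list, the frequency threshold, and the regex sentence splitter are simplified assumptions rather than Luhn's original parameters:

    import re
    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "is", "and", "of", "in", "to", "it"}  # toy list

    def luhn_summary(text, num_sentences=2, min_freq=2):
        # Stage 1: frequency analysis to find significant content words.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in STOP_WORDS]
        counts = Counter(tokens)
        significant = {w for w, c in counts.items() if c >= min_freq}

        # Stage 2: score each sentence by (significant words)^2 / span.
        def score(sentence):
            words = re.findall(r"[a-z]+", sentence.lower())
            positions = [i for i, w in enumerate(words) if w in significant]
            if not positions:
                return 0.0
            span = positions[-1] - positions[0] + 1
            return len(positions) ** 2 / span

        # Step 3: keep the highest-ranking sentences.
        return sorted(sentences, key=score, reverse=True)[:num_sentences]

    print(luhn_summary("Cats purr. Cats and dogs play. Dogs bark loudly.", 1))
    # ['Cats and dogs play.']  (it contains both significant words, "cats" and "dogs")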
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how
relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse
document frequency of the word across a set of documents.

TF-IDF was invented for document search and information retrieval. It works by increasing
proportionally to the number of times a word appears in a document, but it is offset by the number of
documents that contain the word. So, words that are common in every document, such as "this", "what"
and "if", rank low even though they may appear many times, since they don't mean much to that
document.

How is TF-IDF calculated?

TF-IDF for a word in a document is calculated by multiplying two different metrics:

 The term frequency of the word in a document. There are several ways of calculating this
frequency, the simplest being a raw count of the instances of the word in the document. The
frequency can then be adjusted, for example by the length of the document, or by the raw
frequency of the most frequent word in the document.
 The inverse document frequency of the word across a set of documents. This measures how
common or rare the word is in the entire document set. The closer it is to 0, the more common the
word is. This metric can be calculated by taking the total number of documents, dividing it by the
number of documents that contain the word, and taking the logarithm of the result. So, if the word
is very common and appears in many documents, this number will approach 0; otherwise, it will
grow larger.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the
score, the more relevant that word is in that document.

To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the
document set D is calculated as follows:

tfidf(t, d, D) = tf(t, d) · idf(t, D)

where:
tf(t, d) = frequency of term t in document d
idf(t, D) = log(N / |{d ∈ D : t ∈ d}|), with N the total number of documents in D
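
A minimal sketch of this computation; the two toy documents are made up, and raw counts normalized by document length stand in for the "adjusted" frequencies mentioned above:

    import math

    def tf(term, doc):
        # Term frequency: raw count normalized by document length.
        return doc.count(term) / len(doc)

    def idf(term, docs):
        # log(N / number of documents that contain the term)
        containing = sum(1 for d in docs if term in d)
        return math.log(len(docs) / containing) if containing else 0.0

    def tf_idf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    docs = [
        "this is a short sample text".split(),
        "this document is another example document".split(),
    ]
    print(tf_idf("document", docs[1], docs))  # > 0: appears in only one document
    print(tf_idf("this", docs[1], docs))      # 0.0: appears in every document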
Conflation algorithms
Conflation algorithms are used in Information Retrieval (IR) systems for matching the morphological
variants of terms for efficient indexing and faster retrieval operations. The conflation process can be done
either manually or automatically. The automatic conflation operation is also called stemming.

Conflation algorithms are used for improving IR performance by finding morphological variants of search
terms. For example, if a searcher enters the term stemming as part of a query, it is likely that he or she will
also be interested in such variants as stemmed and stem. We use the term conflation, meaning the act of
fusing or combining, as the general term for the process of matching morphological term variants.
Conflation can be either manual, for example using regular expressions, or automatic, via programs called
stemmers.
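
A minimal illustration of manual conflation with a regular expression; the pattern is ours and purely illustrative:

    import re

    # Manual conflation: one hand-written pattern matches several
    # morphological variants of the search term "stem".
    pattern = re.compile(r"\bstem(?:ming|med|s)?\b", re.IGNORECASE)

    text = "Stemming reduces stemmed and stems to the stem form."
    print(pattern.findall(text))  # ['Stemming', 'stemmed', 'stems', 'stem']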

Stemming is the process of reducing a word to its word stem by removing affixes (suffixes and
prefixes), or to the root form of the word, known as a lemma. For example, words such as "Likes",
"liked", "likely" and "liking" will be reduced to "like" after stemming.
There are four automatic approaches. Affix removal algorithms remove suffixes and/or prefixes
from terms, leaving a stem. Successor variety stemmers use the frequencies of letter
sequences in the text as the basis for stemming. Table lookup stemmers simply look terms and
their stems up in a precomputed table. The n-gram method conflates terms based
on the number of digrams or n-grams they share. Stemmers are judged on correctness, retrieval
effectiveness and compression performance. There are two ways stemming can be
incorrect: over-stemming and under-stemming. When a term is over-stemmed, too much of
the term is removed, which may cause unrelated terms to be conflated. Under-stemming is the
removal of too little of a term, and it prevents related terms from being conflated.
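
To make the failure modes concrete, here is a toy affix-removal stemmer; the suffix list is our own simplification and is far cruder than production stemmers such as Porter's algorithm:

    # Illustrative suffix list, checked longest-first; real stemmers use far
    # richer rule sets with recoding steps.
    SUFFIXES = ("ing", "ed", "ly", "s")

    def naive_stem(word):
        word = word.lower()
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            # Only strip if at least a 3-letter stem remains.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    for w in ["likes", "liked", "likely", "liking"]:
        print(w, "->", naive_stem(w))
    # likes -> like, likely -> like, but liked -> lik and liking -> lik:
    # the crude rules strip the stem's final "e", so related terms fail to
    # conflate, illustrating the over-/under-stemming pitfalls described above.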
