1 Information Retrieval System
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
An information retrieval system is a software program that stores and manages information on
documents, often textual documents but possibly multimedia. The system assists users in finding the
information they need. It does not explicitly return information or answer questions; instead, it informs
the user of the existence and location of documents that might contain the desired information.
Information Retrieval | Data Retrieval
1. Retrieves information based on the similarity between the query and the document. | Retrieves data based on the keywords in the query entered by the user.
2. Small errors are tolerated and will likely go unnoticed. | There is no room for errors, since it results in complete system failure.
3. It is ambiguous and doesn't have a defined structure. | It has a defined structure with respect to semantics.
4. Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
6. Displayed results are sorted by relevance. | Displayed results are not sorted by relevance.
First of all, before the retrieval process can even be initiated, it is necessary to define the text
database. This is usually done by the manager of the database, who specifies the following: (a)
the documents to be used, (b) the operations to be performed on the text, and (c) the text model
(i.e., the text structure and what elements can be retrieved). The text operations transform the
original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager builds an index
of the text. An index is a critical data structure because it allows fast searching over large
volumes of data. Different index structures might be used, but the most popular one is the
inverted file. The resources (time and storage space) spent on defining the text database and
building the index are amortized by querying the retrieval system many times.
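As a minimal sketch of the idea (the toy documents and naive whitespace tokenization are assumptions made for illustration, not part of the original text), an inverted file can be built roughly as follows:

```python
from collections import defaultdict

# Toy document collection (assumed), already reduced to its logical view.
docs = {
    1: "information retrieval finds relevant documents",
    2: "data retrieval matches exact keywords",
    3: "an index allows fast searching over large volumes of data",
}

def build_inverted_index(docs):
    """Inverted file: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive tokenization as the only text operation
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["retrieval"]))   # [1, 2]
```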
Given that the document database is indexed, the retrieval process can be initiated. The
user first specifies a user need which is then parsed and transformed by the same text operations
applied to the text. Then, query operations might be applied before the actual query, which
provides a system representation for the user need, is generated. The query is then processed to
obtain the retrieved documents. Fast query processing is made possible by the index structure
previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood
of relevance. The user then examines the set of ranked documents in the search for useful
information. At this point, he might pinpoint a subset of the documents seen as definitely of
interest and initiate a user feedback cycle. In such a cycle, the system uses the documents
selected by the user to change the query formulation. Hopefully, this modified query is a better
representation of the real user need.
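In the same hypothetical setting as the sketch above, query processing can be illustrated as follows; the term-overlap score used here is only a placeholder for a real ranking function, and the small index is an assumption made for the example:

```python
from collections import defaultdict

# Toy inverted index (assumed), mapping terms to the ids of documents containing them.
index = {
    "information": {1}, "retrieval": {1, 2}, "relevant": {1},
    "data": {2, 3}, "keywords": {2}, "searching": {3}, "fast": {3},
}

def search(query, index):
    """Apply the same naive text operations to the query, then rank by term overlap."""
    scores = defaultdict(int)
    for term in query.lower().split():      # same tokenization used at indexing time
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1             # one point per matching query term
    # Rank retrieved documents by a (very crude) likelihood of relevance.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("fast retrieval of data", index))  # docs 3 and 2 score 2, doc 1 scores 1
```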
Evaluation of IR system
When the recall measure is used, there is an assumption that all the
relevant documents for a given query are known. Such an assumption is
clearly problematic in a web search environment, but with a smaller test
collection of documents, this measure can be useful. It is not suitable
for large volumes of log data.
Sol.
TP = 20, FP = 40, FN = 60
Precision = TP / (TP + FP) = 20 / 60 ≈ 0.33
Recall = TP / (TP + FN) = 20 / 80 = 0.25
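A quick check of the arithmetic in Python, using the counts given in the solution above:

```python
tp, fp, fn = 20, 40, 60

precision = tp / (tp + fp)   # fraction of retrieved documents that are relevant
recall = tp / (tp + fn)      # fraction of relevant documents that were retrieved

print(f"precision = {precision:.3f}")   # 0.333
print(f"recall    = {recall:.3f}")      # 0.250
```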
Luhn’s Idea
One of the first text summarization algorithms was published in 1958 by Hans Peter Luhn, working at
IBM research. Luhn's algorithm is a naive approach based on TF-IDF that looks at the "window size" of
non-important words between words of high importance.
Luhn's algorithm is an approach based on TF-IDF. It selects only the words of higher importance as per
their frequency, and higher weights are assigned to the words present at the beginning of the document. It
considers the words lying in the shaded region of the word-frequency graph: the region on the right
signifies the highest-occurring elements, while words on the left signify the least-occurring elements.
Luhn introduced the following criteria during text pre-processing:
1. Removing stopwords
2. Stemming (Likes->Like)
In this method we select the sentences with the highest concentration of salient content terms. For example,
suppose we have 10 words in a sentence and 4 of the words are significant.
To calculate the significance, instead of dividing the number of significant words by the number of all words,
we divide the square of the number of significant words by the span that contains these words. If the 4
significant words lie within a span of 6 words, the score obtained from our example would be
Score = 4² / 6 ≈ 2.7
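A small sketch of this span-based score (the tokens and the set of significant words are made up for illustration, and the span is simply taken as the stretch bracketed by the first and last significant word):

```python
def luhn_span_score(words, significant):
    """(number of significant words)^2 divided by the length of the span containing them."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    span_length = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span_length

# A 10-word sentence whose 4 significant words lie within a 6-word span, as in the example above.
words = "w1 sig1 sig2 w2 sig3 w3 sig4 w4 w5 w6".split()
print(luhn_span_score(words, {"sig1", "sig2", "sig3", "sig4"}))   # 16 / 6 ≈ 2.67
```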
Application
Luhn's method is based on the following assumptions:
1. Too low-frequency words are not significant.
2. Too high-frequency words are also not significant (e.g. "is", "and").
Algorithm
Luhn's method is a simple technique to generate a summary from a given text. The algorithm can be
implemented in two stages.
In the first stage, we try to determine which words are more significant towards the meaning of the
document. Luhn states that this is first done by a frequency analysis, then by finding words which are
significant but are not unimportant, common English words.
In the second stage, we find the most common words in the document and then take the subset of
those that are not common English words but are still important. The procedure usually consists of the
following three steps:
1. It begins with transforming the content of each sentence into a mathematical expression, or vector
(here a binary representation). We use a bag of words, which ignores all the filler words. Filler words
are usually the supporting words that do not have any impact on the meaning of the document. Then we
count all the valuable words that are left. For example, stopwords like "an" and "a" are not considered
in the evaluation.
2. In this step we evaluate sentences using a sentence-scoring technique. We can use the scoring
method illustrated earlier (the square of the number of significant words divided by the span).
A span here refers to the part of the sentence (or of the document) consisting of all the meaningful words.
TF-IDF can also be used to prioritize the words in a sentence.
3. Once the sentence scoring is complete, the last step is simply to select those sentences with the
highest overall rankings.
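Putting the three steps together, below is a minimal, self-contained sketch of a Luhn-style summarizer; the stopword list, frequency threshold, and scoring details are simplified assumptions for illustration, not a faithful reproduction of Luhn's 1958 implementation:

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "and", "of", "in", "to", "we", "it"}  # tiny illustrative list

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def summarize(text, top_n=2, min_freq=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Step 1: bag of words over the whole document, ignoring filler (stop) words.
    counts = Counter(w for w in tokenize(text) if w not in STOPWORDS)
    significant = {w for w, c in counts.items() if c >= min_freq}
    # Step 2: score each sentence by (significant words)^2 / span containing them.
    def score(sentence):
        words = tokenize(sentence)
        pos = [i for i, w in enumerate(words) if w in significant]
        if not pos:
            return 0.0
        return len(pos) ** 2 / (pos[-1] - pos[0] + 1)
    # Step 3: keep the highest-ranked sentences, in their original order.
    ranked = sorted(sentences, key=score, reverse=True)[:top_n]
    return [s for s in sentences if s in ranked]

doc = ("Information retrieval systems index documents. "
       "An index makes searching documents fast. "
       "The weather was pleasant yesterday. "
       "Fast retrieval of documents depends on the index.")
print(summarize(doc))   # keeps the first and last sentences, drops the others
```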
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how
relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics: how many times a word appears in a document, and the inverse
document frequency of the word across a set of documents.
TF-IDF was invented for document search and information retrieval. A word's score increases
proportionally to the number of times the word appears in a document, but is offset by the number of
documents that contain the word. So, words that are common in every document, such as this, what, and
if, rank low even though they may appear many times, since they don't mean much to that document.
The term frequency of a word in a document. There are several ways of calculating this
frequency, with the simplest being a raw count of instances a word appears in a document. Then,
there are ways to adjust the frequency: by the length of a document, or by the raw frequency of the
most frequent word in a document.
The inverse document frequency of the word across a set of documents. This means, how
common or rare a word is in the entire document set. The closer it is to 0, the more common a
word is. This metric can be calculated by taking the total number of documents, dividing it by the
number of documents that contain a word, and calculating the logarithm.
So, if the word is very common and appears in many documents, this number will approach 0.
Otherwise, for rare words it grows larger (up to log N for a word that appears in only one document).
Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the
score, the more relevant that word is in that document.
To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the
document set D is calculated as follows:

tf-idf(t, d, D) = tf(t, d) · idf(t, D), with idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

Where:
- tf(t, d) is the frequency of term t in document d,
- N is the total number of documents in the set D, and
- |{d ∈ D : t ∈ d}| is the number of documents in D that contain the term t.
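As a concrete sketch of the formula (the toy corpus, whitespace tokenization, and the use of the natural logarithm are assumptions made for illustration):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw-count term frequency of a term in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus_tokens):
    """log(N / number of documents containing the term)."""
    n_docs = len(corpus_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [
    "this document is about retrieval".split(),
    "this document is about stemming".split(),
    "this one covers tf idf".split(),
]
print(tf_idf("retrieval", corpus[0], corpus))   # 1 * log(3/1) ≈ 1.099
print(tf_idf("this", corpus[0], corpus))        # 1 * log(3/3) = 0.0
```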
Conflation algorithms
Conflation algorithms are used in Information Retrieval (IR) systems for matching the morphological
variants of terms for efficient indexing and faster retrieval operations. The conflation process can be done
either manually or automatically. The automatic conflation operation is also called stemming.
Conflation algorithms are used for improving IR performance by finding morphological variants of search
terms. For example, if a searcher enters the term stemming as part of a query, it is likely that he or she will
also be interested in such variants as stemmed and stem. We use the term conflation, meaning the act of
fusing or combining, as the general term for the process of matching morphological term variants.
Conflation can be either manual, using regular expressions, or automatic, via programs called stemmers.
Stemming is the process of reducing an inflected word to its word stem by removing suffixes and
prefixes, or to the root form of the word, known as a lemma. For example, words such as "Likes",
"liked", "likely" and "liking" will be reduced to "like" after stemming.
There are four automatic approaches. Affix removal algorithms remove affixes (prefixes or suffixes)
from terms, leaving a stem. Successor variety stemmers use the frequencies of letter sequences in the
text as the basis for stemming. The n-gram method conflates terms based on the number of digrams or
n-grams they share. Table lookup stemmers store terms and their stems in a table and stem by lookup.
Stemmers are judged by correctness, retrieval effectiveness, and compression performance. There are
two ways stemming can be incorrect: over-stemming and under-stemming. When a term is over-stemmed,
too much of the term is removed, which may cause unrelated terms to be conflated. Under-stemming is
the removal of too little of a term, and it prevents related terms from being conflated.
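A minimal sketch of the affix-removal approach (the suffix list and ordering are illustrative assumptions; real stemmers such as Porter's algorithm use much richer rule sets):

```python
# A crude affix-removal stemmer: strip the first matching suffix (illustrative rule set only).
SUFFIXES = ["ing", "ed", "ly", "s"]

def strip_suffix(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["likes", "liked", "liking", "stemming", "stemmed"]:
    print(w, "->", strip_suffix(w))
# likes -> like, liked -> lik, liking -> lik, stemming -> stemm, stemmed -> stemm
```

Even this toy rule set shows why stemmers must be judged carefully: "liked" and "liking" reduce to "lik" while "likes" reduces to "like", so related terms fail to conflate, which is exactly the kind of behaviour richer rule sets are designed to avoid.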