1 Information Retrieval System
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
An information retrieval system is a software program that stores and manages information on
documents, often textual documents but possibly multimedia. The system assists users in finding the
information they need. It does not explicitly return information or answer questions; instead, it informs
the user of the existence and location of documents that might contain the desired information.
Information Retrieval | Data Retrieval
1. Retrieves information based on the similarity between the query and the document. | Retrieves data based on the keywords in the query entered by the user.
2. Small errors are tolerated and will likely go unnoticed. | There is no room for errors, since it results in complete system failure.
3. It is ambiguous and doesn't have a defined structure. | It has a defined structure with respect to semantics.
4. Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
6. Displayed results are sorted by relevance. | Displayed results are not sorted by relevance.
First of all, before the retrieval process can even be initiated, it is necessary to define the text
database. This is usually done by the manager of the database, who specifies the following: (a)
the documents to be used, (b) the operations to be performed on the text, and (c) the text model
(i.e., the text structure and what elements can be retrieved). The text operations transform the
original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager builds an index
of the text. An index is a critical data structure because it allows fast searching over large
volumes of data. Different index structures might be used, but the most popular one is the
inverted file. The resources (time and storage space) spent on defining the text database and
building the index are amortized by querying the retrieval system many times.
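As a minimal sketch of the idea (the toy documents and naive whitespace tokenization are assumptions made for illustration, not part of the original text), an inverted file can be built roughly as follows:

```python
from collections import defaultdict

# Toy document collection (assumed), already reduced to its logical view.
docs = {
    1: "information retrieval finds relevant documents",
    2: "data retrieval matches exact keywords",
    3: "an index allows fast searching over large volumes of data",
}

def build_inverted_index(docs):
    """Inverted file: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive tokenization as the only text operation
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["retrieval"]))   # [1, 2]
```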
Given that the document database is indexed, the retrieval process can be initiated. The
user first specifies a user need which is then parsed and transformed by the same text operations
applied to the text. Then, query operations might be applied before the actual query, which
provides a system representation for the user need, is generated. The query is then processed to
obtain the retrieved documents. Fast query processing is made possible by the index structure
previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood
of relevance. The user then examines the set of ranked documents in the search for useful
information. At this point, he might pinpoint a subset of the documents seen as definitely of
interest and initiate a user feedback cycle. In such a cycle, the system uses the documents
selected by the user to change the query formulation. Hopefully, this modified query is a better
representation of the real user need.
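In the same hypothetical setting as the sketch above, query processing can be illustrated as follows; the term-overlap score used here is only a placeholder for a real ranking function, and the small index is an assumption made for the example:

```python
from collections import defaultdict

# Toy inverted index (assumed), mapping terms to the ids of documents containing them.
index = {
    "information": {1}, "retrieval": {1, 2}, "relevant": {1},
    "data": {2, 3}, "keywords": {2}, "searching": {3}, "fast": {3},
}

def search(query, index):
    """Apply the same naive text operations to the query, then rank by term overlap."""
    scores = defaultdict(int)
    for term in query.lower().split():      # same tokenization used at indexing time
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1             # one point per matching query term
    # Rank retrieved documents by a (very crude) likelihood of relevance.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("fast retrieval of data", index))  # docs 3 and 2 score 2, doc 1 scores 1
```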
Evaluation of IR system
When the recall measure is used, there is an assumption that all the
relevant documents for a given query are known. Such an assumption is
clearly problematic in a web search environment, but with a smaller test
collection of documents, this measure can be useful. It is not suitable
for large volumes of log data.
Sol.
TP = 20, FP = 40, FN = 60
Precision = TP / (TP + FP) = 20 / 60 ≈ 0.33
Recall = TP / (TP + FN) = 20 / 80 = 0.25
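A quick check of the arithmetic in Python, using the counts given in the solution above:

```python
tp, fp, fn = 20, 40, 60

precision = tp / (tp + fp)   # fraction of retrieved documents that are relevant
recall = tp / (tp + fn)      # fraction of relevant documents that were retrieved

print(f"precision = {precision:.3f}")   # 0.333
print(f"recall    = {recall:.3f}")      # 0.250
```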
Luhn’s Idea
One of the first text summarization algorithms was published in 1958 by Hans Peter Luhn, working at
IBM research. Luhn's algorithm is a naive approach based on TF-IDF that looks at the "window size" of
non-important words between words of high importance.
Luhn's algorithm is an approach based on TF-IDF. It selects only the words of higher importance as per
their frequency, and higher weights are assigned to the words present at the beginning of the document. It
considers the words lying in the shaded region of the word-frequency graph: the region on the right
signifies the highest-occurring elements, while words on the left signify the least-occurring elements.
Luhn introduced the following criteria during text pre-processing:
1. Removing stopwords
2. Stemming (Likes->Like)
In this method we select the sentences with the highest concentration of salient content terms. For example,
suppose we have 10 words in a sentence and 4 of the words are significant.
To calculate the significance, instead of dividing the number of significant words by the number of all words,
we divide the square of the number of significant words by the span that contains these words. If the 4
significant words lie within a span of 6 words, the score obtained from our example would be
Score = 4² / 6 ≈ 2.7
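A small sketch of this span-based score (the tokens and the set of significant words are made up for illustration, and the span is simply taken as the stretch bracketed by the first and last significant word):

```python
def luhn_span_score(words, significant):
    """(number of significant words)^2 divided by the length of the span containing them."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    span_length = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span_length

# A 10-word sentence whose 4 significant words lie within a 6-word span, as in the example above.
words = "w1 sig1 sig2 w2 sig3 w3 sig4 w4 w5 w6".split()
print(luhn_span_score(words, {"sig1", "sig2", "sig3", "sig4"}))   # 16 / 6 ≈ 2.67
```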
Application
Luhn's method is based on the following assumptions:
1. Too low-frequency words are not significant.
2. Too high-frequency words are also not significant (e.g. "is", "and").
Algorithm
Luhn's method is a simple technique to generate a summary from a given text. The algorithm can be
implemented in two stages.
In the first stage, we try to determine which words are more significant towards the meaning of the
document. Luhn states that this is first done by a frequency analysis, then by finding words which are
significant but are not unimportant, common English words.
In the second stage, we find the most common words in the document and then take the subset of
those that are not common English words but are still important. The procedure usually consists of the
following three steps:
1. It begins with transforming the content of each sentence into a mathematical expression, or vector
(here a binary representation). We use a bag of words, which ignores all the filler words. Filler words
are usually the supporting words that do not have any impact on the meaning of the document. Then we
count all the valuable words that are left. For example, stopwords like "an" and "a" are not considered
in the evaluation.
2. In this step we evaluate sentences using a sentence-scoring technique. We can use the scoring
method illustrated earlier (the square of the number of significant words divided by the span).
A span here refers to the part of the sentence (or of the document) consisting of all the meaningful words.
TF-IDF can also be used to prioritize the words in a sentence.
3. Once the sentence scoring is complete, the last step is simply to select those sentences with the
highest overall rankings.
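Putting the three steps together, below is a minimal, self-contained sketch of a Luhn-style summarizer; the stopword list, frequency threshold, and scoring details are simplified assumptions for illustration, not a faithful reproduction of Luhn's 1958 implementation:

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "and", "of", "in", "to", "we", "it"}  # tiny illustrative list

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def summarize(text, top_n=2, min_freq=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Step 1: bag of words over the whole document, ignoring filler (stop) words.
    counts = Counter(w for w in tokenize(text) if w not in STOPWORDS)
    significant = {w for w, c in counts.items() if c >= min_freq}
    # Step 2: score each sentence by (significant words)^2 / span containing them.
    def score(sentence):
        words = tokenize(sentence)
        pos = [i for i, w in enumerate(words) if w in significant]
        if not pos:
            return 0.0
        return len(pos) ** 2 / (pos[-1] - pos[0] + 1)
    # Step 3: keep the highest-ranked sentences, in their original order.
    ranked = sorted(sentences, key=score, reverse=True)[:top_n]
    return [s for s in sentences if s in ranked]

doc = ("Information retrieval systems index documents. "
       "An index makes searching documents fast. "
       "The weather was pleasant yesterday. "
       "Fast retrieval of documents depends on the index.")
print(summarize(doc))   # keeps the first and last sentences, drops the others
```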
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how
relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics: how many times a word appears in a document, and the inverse
document frequency of the word across a set of documents.
TF-IDF was invented for document search and information retrieval. A word's score increases
proportionally to the number of times the word appears in a document, but is offset by the number of
documents that contain the word. So, words that are common in every document, such as this, what, and
if, rank low even though they may appear many times, since they don't mean much to that document.
The term frequency of a word in a document. There are several ways of calculating this
frequency, with the simplest being a raw count of instances a word appears in a document. Then,
there are ways to adjust the frequency: by the length of a document, or by the raw frequency of the
most frequent word in a document.
The inverse document frequency of the word across a set of documents. This means, how
common or rare a word is in the entire document set. The closer it is to 0, the more common a
word is. This metric can be calculated by taking the total number of documents, dividing it by the
number of documents that contain a word, and calculating the logarithm.
So, if the word is very common and appears in many documents, this number will approach 0.
Otherwise, for rare words it grows larger (up to log N for a word that appears in only one document).
Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the
score, the more relevant that word is in that document.
To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the
document set D is calculated as follows:

tf-idf(t, d, D) = tf(t, d) · idf(t, D), with idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

Where:
- tf(t, d) is the frequency of term t in document d,
- N is the total number of documents in the set D, and
- |{d ∈ D : t ∈ d}| is the number of documents in D that contain the term t.
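As a concrete sketch of the formula (the toy corpus, whitespace tokenization, and the use of the natural logarithm are assumptions made for illustration):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw-count term frequency of a term in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus_tokens):
    """log(N / number of documents containing the term)."""
    n_docs = len(corpus_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [
    "this document is about retrieval".split(),
    "this document is about stemming".split(),
    "this one covers tf idf".split(),
]
print(tf_idf("retrieval", corpus[0], corpus))   # 1 * log(3/1) ≈ 1.099
print(tf_idf("this", corpus[0], corpus))        # 1 * log(3/3) = 0.0
```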
Conflation algorithms
Conflation algorithms are used in Information Retrieval (IR) systems for matching the morphological
variants of terms for efficient indexing and faster retrieval operations. The conflation process can be done
either manually or automatically. The automatic conflation operation is also called stemming.
Conflation algorithms are used for improving IR performance by finding morphological variants of search
terms. For example, if a searcher enters the term stemming as part of a query, it is likely that he or she will
also be interested in such variants as stemmed and stem. We use the term conflation, meaning the act of
fusing or combining, as the general term for the process of matching morphological term variants.
Conflation can be either manual, using regular expressions, or automatic, via programs called stemmers.
Stemming is the process of reducing an inflected word to its word stem by removing suffixes and
prefixes, or to the root form of the word, known as a lemma. For example, words such as "Likes",
"liked", "likely" and "liking" will be reduced to "like" after stemming.
There are four automatic approaches. Affix removal algorithms remove affixes (prefixes or suffixes)
from terms, leaving a stem. Successor variety stemmers use the frequencies of letter sequences in the
text as the basis for stemming. The n-gram method conflates terms based on the number of digrams or
n-grams they share. Table lookup stemmers store terms and their stems in a table and stem by lookup.
Stemmers are judged by correctness, retrieval effectiveness, and compression performance. There are
two ways stemming can be incorrect: over-stemming and under-stemming. When a term is over-stemmed,
too much of the term is removed, which may cause unrelated terms to be conflated. Under-stemming is
the removal of too little of a term, and it prevents related terms from being conflated.
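A minimal sketch of the affix-removal approach (the suffix list and ordering are illustrative assumptions; real stemmers such as Porter's algorithm use much richer rule sets):

```python
# A crude affix-removal stemmer: strip the first matching suffix (illustrative rule set only).
SUFFIXES = ["ing", "ed", "ly", "s"]

def strip_suffix(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["likes", "liked", "liking", "stemming", "stemmed"]:
    print(w, "->", strip_suffix(w))
# likes -> like, liked -> lik, liking -> lik, stemming -> stemm, stemmed -> stemm
```

Even this toy rule set shows why stemmers must be judged carefully: "liked" and "liking" reduce to "lik" while "likes" reduces to "like", so related terms fail to conflate, which is exactly the kind of behaviour richer rule sets are designed to avoid.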