Chapter Five

IR models
IR Models - Basic Concepts
• Word evidence:
 IR systems usually adopt index terms to index and retrieve documents
 Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
• An index term is a word useful for capturing the document's main themes
• Not all terms are equally useful for representing the document contents:
 less frequent terms identify a narrower, more specific set of documents
• But no ordering information is attached to the Bag of Words identified from the document collection, as the sketch below illustrates.
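A minimal sketch (not from the slides) of extracting a Bag of Words in Python; the tokenizer is deliberately naive:

    # Sketch: turning a document into a Bag of Words.
    # A set records only term presence -- word order is discarded.
    def bag_of_words(text):
        # A real indexer would also remove stop words and stem terms.
        return set(text.lower().split())

    print(bag_of_words("Shipment of gold damaged in a fire"))
    # e.g. {'shipment', 'of', 'gold', 'damaged', 'in', 'a', 'fire'}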
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
 Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
 Documents appearing at the top of this ordering
are considered more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
IR models

[Diagram: classification of IR models, including probabilistic relevance]
General Procedures Followed
To find relevant documents for a given query:
• First, map documents and queries into the term-document vector space.
 Note that queries are treated as short documents
• Second, represent queries and documents as weighted vectors, wij
 There are binary weights & non-binary weighting techniques
• Third, rank documents by the closeness of their vectors to the query.
 Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows
occurrence of terms in the document collection or query

$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j})$;  $q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.
         T1    T2    ...   TN
    D1   w11   w12   ...   w1N
    D2   w21   w22   ...   w2N
    :    :     :           :
    DM   wM1   wM2   ...   wMN
    Qi   wi1   wi2   ...   wiN

– The document collection is mapped to a term-by-document matrix
– Each row is viewed as a vector in a multidimensional space: nearby vectors are related
– Normalize for vector length to avoid the effect of document length
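A hedged sketch of this mapping in Python, using raw term counts to stand in for the weights wij:

    # Sketch: building a term-by-document matrix of raw term counts.
    from collections import Counter

    docs = ["Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"]
    query = "gold silver truck"   # queries are treated as short documents

    terms = sorted({t for d in docs for t in d.lower().split()})

    def to_vector(text):
        counts = Counter(text.lower().split())
        return [counts[t] for t in terms]   # one weight per index term

    matrix = [to_vector(d) for d in docs]   # M x N term-by-document matrix
    q_vec = to_vector(query)                # query mapped into the same space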
How to evaluate Models?
• We need to investigate what procedures the IR models follow and what techniques they use:
– What weighting technique does the model use for measuring the importance of terms in documents?
• Does it use binary or non-binary weights?
– What matching technique does the model use?
• Does it measure similarity or dissimilarity?
– Does it apply exact matching or partial matching when finding relevant documents for a given query?
– Does it apply the best-matching principle to measure the degree of relevance of documents and display them in ranked order?
• Is any ranking mechanism applied before displaying relevant documents to the users?
The Boolean Model
• Boolean model is a simple model based on set theory
 The Boolean model imposes a binary criterion
for deciding relevance
• Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
• $sim(q, d_j)$ = 1 if the document satisfies the boolean query, and 0 otherwise
– Note that no weights are assigned between 0 and 1; the only possible values are 0 and 1

         T1    T2    ...   TN
    D1   w11   w12   ...   w1N
    D2   w21   w22   ...   w2N
    :    :     :           :
    DM   wM1   wM2   ...   wMN
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the query "gold silver truck"
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
The table below shows the document-term (ti) matrix:

            arrive  damage  deliver  fire  gold  silver  ship  truck
    D1        0       1       0       1     1      0      1      0
    D2        1       0       1       0     0      1      0      1
    D3        1       0       0       0     1      0      1      1
    query     0       0       0       0     1      1      0      1

Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
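A sketch of the Boolean matching (assuming the free-text query is read conjunctively, i.e. gold AND silver AND truck):

    # Sketch: Boolean retrieval over the document-term matrix above.
    docs = {
        "D1": {"damage", "fire", "gold", "ship"},
        "D2": {"arrive", "deliver", "silver", "truck"},
        "D3": {"arrive", "gold", "ship", "truck"},
    }
    query = {"gold", "silver", "truck"}

    # sim(q, dj) = 1 iff every query term occurs in the document
    print([d for d, terms in docs.items() if query <= terms])
    # [] -- no single document contains all three terms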
The Boolean Model: Further Example
• Given the following, determine the documents retrieved by a Boolean model based IR system
• Index terms: K1, ..., K8
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
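The same answer can be checked with set operations (a sketch, not from the slides):

    # Sketch: evaluating K1 AND (K2 OR NOT K3) with set operations.
    docs = {
        "D1": {"K1", "K2", "K3", "K4", "K5"},
        "D2": {"K1", "K2", "K3", "K4"},
        "D3": {"K2", "K4", "K6", "K8"},
        "D4": {"K1", "K3", "K5", "K7"},
        "D5": {"K4", "K5", "K6", "K7", "K8"},
        "D6": {"K1", "K2", "K3", "K4"},
    }

    def having(term):   # documents containing the given index term
        return {d for d, ks in docs.items() if term in ks}

    answer = having("K1") & (having("K2") | (set(docs) - having("K3")))
    print(sorted(answer))   # ['D1', 'D2', 'D6']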
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”

• What are the relevant documents retrieved for the queries:
– Q1 = "information ∧ retrieval"
– Q2 = "information ∧ ¬computer"
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
 As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
Vector-Space Model
• This is the most commonly used strategy for measuring the relevance of documents to a given query, because:
 use of binary weights is too limiting
 non-binary weights allow for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
 A ranked set of documents provides better matching
• The idea behind the VSM is that
 the meaning of a document is conveyed by the words used in that document
Vector-Space Model
To find relevant documents for a given query:
• First, map documents and queries into the term-document vector space.
 Note that queries are treated as short documents
• Second, in the vector space, represent queries and documents as weighted vectors, wij
 There are different weighting techniques; the most widely used one computes the TF*IDF weight of each term
• Third, use a similarity measure to rank documents by the closeness of their vectors to the query.
 To measure the closeness of documents to the query, most search engines use the cosine similarity score
Term-document matrix.
• A collection of n documents and a query can be represented in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the "weight" of a term in the document;
– zero means the term has no significance in the document or simply doesn't exist in it. Otherwise, $w_{ij} > 0$ whenever $k_i \in d_j$

         T1    T2    ...   TN
    D1   w11   w12   ...   w1N
    D2   w21   w22   ...   w2N
    :    :     :           :
    DM   wM1   wM2   ...   wMN

• How do we compute the weights for term i in document j and in query q, i.e. wij and wiq?
Example: Computing weights
• A collection includes 10,000 documents
 The term A appears 20 times in a particular document j
 The maximum appearance of any term in document j is
50
 The term A appears in 2,000 of the collection
documents.

• Compute the TF*IDF weight of term A:
 tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(A) = log2(N/DFA) = log2(10,000/2,000) = log2(5) = 2.32
 wAj = tf(A,j) × idf(A) = 0.4 × 2.32 = 0.928
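The same computation can be checked in Python (a sketch; note that log(5) = 2.32 only for a base-2 logarithm):

    # Sketch: TF*IDF weight of term A in document j.
    import math

    N, df_A = 10_000, 2_000      # collection size, documents containing A
    freq_A, max_freq = 20, 50    # counts inside document j

    tf = freq_A / max_freq       # 0.4
    idf = math.log2(N / df_A)    # log2(5) ~= 2.32
    print(round(tf * idf, 3))    # ~0.929 (0.928 with IDF rounded to 2.32)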
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between document dj and the user's query q:

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$

• Using a similarity score between the query and each document:
– It is possible to rank the retrieved documents in order of presumed relevance.
– It is possible to enforce a score threshold so that the size of the retrieved set of documents can be controlled, as in the sketch below.
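A minimal sketch of both uses (the function name and the 0.5 cutoff are illustrative assumptions, not from the slides):

    # Sketch: rank by similarity, then cut off low-scoring documents
    # to control the size of the retrieved set.
    def retrieve(scores, threshold=0.5):
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(doc, s) for doc, s in ranked if s >= threshold]

    print(retrieve({"D1": 0.08, "D2": 0.82, "D3": 0.33}))   # [('D2', 0.82)]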
Vector Space with Term Weights and Cosine Matching

[Figure: two-dimensional term space (Term A vs. Term B) plotting D1 = (0.8, 0.3), D2 = (0.2, 0.7), and the query Q = (0.4, 0.8)]

$D_i = (d_{1i}, w_{1di};\, d_{2i}, w_{2di};\, \ldots;\, d_{ti}, w_{tdi})$
$Q = (q_{1i}, w_{1qi};\, q_{2i}, w_{2qi};\, \ldots;\, q_{ti}, w_{tqi})$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{jq}\, w_{jd_i}}{\sqrt{\sum_{j=1}^{t} (w_{jq})^2 \,\sum_{j=1}^{t} (w_{jd_i})^2}}$$

$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{0.65} \approx 0.98$$

$$sim(Q, D_1) = \frac{(0.4 \times 0.8) + (0.8 \times 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{0.76} \approx 0.73$$
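These worked numbers can be verified with a short sketch:

    # Sketch: cosine similarity for the two-term example above.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

    Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
    print(round(cosine(Q, D2), 2))   # 0.98
    print(round(cosine(Q, D1), 2))   # 0.73 -> D2 ranks above D1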
Vector-Space Model: Example
• Suppose user query for: Q = “gold silver truck”. The
database collection consists of three documents with the
following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show the retrieval results in ranked order.
1. Assume that full-text terms are used during indexing, without removing common terms or stop words, and that no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing.
3. Also compare your results with and without normalizing term frequency.
Vector-Space Model: Example
    Terms     |  Counts (TF)       | DF | IDF   | Wi = TF*IDF
              |  Q   D1   D2   D3  |    |       |  Q      D1     D2     D3
    a         |  0   1    1    1   | 3  | 0     |  0      0      0      0
    arrived   |  0   0    1    1   | 2  | 0.176 |  0      0      0.176  0.176
    damaged   |  0   1    0    0   | 1  | 0.477 |  0      0.477  0      0
    delivery  |  0   0    1    0   | 1  | 0.477 |  0      0      0.477  0
    fire      |  0   1    0    0   | 1  | 0.477 |  0      0.477  0      0
    gold      |  1   1    0    1   | 2  | 0.176 |  0.176  0.176  0      0.176
    in        |  0   1    1    1   | 3  | 0     |  0      0      0      0
    of        |  0   1    1    1   | 3  | 0     |  0      0      0      0
    silver    |  1   0    2    0   | 1  | 0.477 |  0.477  0      0.954  0
    shipment  |  0   1    0    1   | 2  | 0.176 |  0      0.176  0      0.176
    truck     |  1   0    1    1   | 2  | 0.176 |  0.176  0      0.176  0.176

(IDF = log10(N/DF) with N = 3 documents)
Vector-Space Model
    Terms     |  Q      D1     D2     D3
    a         |  0      0      0      0
    arrived   |  0      0      0.176  0.176
    damaged   |  0      0.477  0      0
    delivery  |  0      0      0.477  0
    fire      |  0      0.477  0      0
    gold      |  0.176  0.176  0      0.176
    in        |  0      0      0      0
    of        |  0      0      0      0
    silver    |  0.477  0      0.954  0
    shipment  |  0      0.176  0      0.176
    truck     |  0.176  0      0.176  0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure
• First, for each document and the query, compute all vector lengths (zero terms ignored):

$|d_1| = \sqrt{0.477^2 + 0.477^2 + 0.176^2 + 0.176^2} = \sqrt{0.517} = 0.719$
$|d_2| = \sqrt{0.176^2 + 0.477^2 + 0.954^2 + 0.176^2} = \sqrt{1.1996} = 1.095$
$|d_3| = \sqrt{0.176^2 + 0.176^2 + 0.176^2 + 0.176^2} = \sqrt{0.124} = 0.352$
$|q| = \sqrt{0.176^2 + 0.477^2 + 0.176^2} = \sqrt{0.2896} = 0.538$

• Next, compute the dot products (zero products ignored):

$Q \cdot d_1 = 0.176 \times 0.176 = 0.0310$
$Q \cdot d_2 = 0.954 \times 0.477 + 0.176 \times 0.176 = 0.4862$
$Q \cdot d_3 = 0.176 \times 0.176 + 0.176 \times 0.176 = 0.0620$
Vector-Space Model: Example
Now, compute the similarity scores:
Sim(q,d1) = 0.0310 / (0.538 × 0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538 × 1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538 × 0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
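The whole example can be reproduced end to end with a sketch under the slide's assumptions (raw counts as TF, IDF = log10(N/DF), no stop-word removal or stemming):

    # Sketch: vector-space ranking for Q = "gold silver truck".
    import math
    from collections import Counter

    docs = ["Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"]
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)

    terms = sorted({t for toks in tokenized for t in toks})
    df = {t: sum(t in toks for toks in tokenized) for t in terms}
    idf = {t: math.log10(N / df[t]) for t in terms}

    def weights(tokens):                 # TF*IDF vector over the term list
        tf = Counter(tokens)
        return [tf[t] * idf[t] for t in terms]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / math.sqrt(sum(a*a for a in u) * sum(b*b for b in v))

    q = weights("gold silver truck".split())
    ranked = sorted(((round(cosine(weights(toks), q), 2), f"D{i+1}")
                     for i, toks in enumerate(tokenized)), reverse=True)
    print(ranked)   # [(0.82, 'D2'), (0.33, 'D3'), (0.08, 'D1')]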
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in ranked
order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t
relate one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following
documents.
 c1: Human machine interface for Lab ABC computer applications
 c2: A survey of user opinion of computer system response time
 c3: The EPS user interface management system
 c4: System and human system engineering testing of EPS
 c5: Relation of user-perceived response time to error measure
 m1: The generation of random, binary, unordered trees
 m2: The intersection graph of paths in trees
 m3: Graph minors: Widths of trees and well-quasi-ordering
 m4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction"
Probabilistic Model
• IR is an uncertain process
– Mapping an information need to a query is not perfect
– Mapping documents to index terms is a logical representation
– Query terms and index terms often mismatch
• This situation leads to several statistical approaches: probability theory, fuzzy logic, theory of evidence, language modeling, etc.
• The probabilistic retrieval model is a formal model that attempts to predict the probability that a given document will be relevant to a given query, i.e. Prob(R|(q,di))
– It uses probability to estimate the "odds" of relevance of a query to a document.
– It relies on accurate estimates of probabilities
Term Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

$$w_i = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}$$
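A direct transcription of this weight (a sketch; the slides do not fix the logarithm base, so the natural log is used here, and the example values are illustrative):

    # Sketch: the term relevance weight above.
    # The 0.5 terms smooth away zero counts.
    import math

    def relevance_weight(N, n, R, r):
        # N: docs in collection; n: docs containing ti;
        # R: relevant docs retrieved; r: relevant docs containing ti
        return math.log((r + 0.5) * (N - n - R + r + 0.5) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    print(round(relevance_weight(N=100, n=20, R=10, r=8), 2))   # ~3.06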
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
