Chapter Five
IR models
• Word evidence:
IR systems usually adopt index terms to index and retrieve
documents
Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
• An index term is a word useful for remembering the
document main themes
• Not all terms are equally useful for representing the
document contents:
less frequent terms allow identifying a narrower set of
documents
• But no ordering information is attached to the Bag of
Words identified from the document collection.
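A minimal sketch of the bag-of-words idea in Python (the sample sentence is only an illustration): a document is reduced to an unordered multiset of its terms, so term counts are kept but word order is lost.

from collections import Counter

def bag_of_words(text):
    # Unordered multiset of lowercase terms; positional information is discarded.
    return Counter(text.lower().split())

doc = "Shipment of gold damaged in a fire"   # sample document
print(bag_of_words(doc))   # e.g. Counter({'shipment': 1, 'of': 1, 'gold': 1, ...})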
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
 Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
 Documents appearing at the top of this ordering
are considered to be more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
IR models
[Figure: taxonomy of IR models, including the classic Boolean, vector space, and probabilistic relevance models]
General Procedures Followed
To find relevant documents for a given query,
• First, map documents and queries into term-document
vector space.
 Note that queries are treated as short documents
• Second, queries and documents are represented as
weighted vectors, wij
 There are binary and non-binary weighting
techniques
• Third, rank documents by the closeness of their vectors to
the query.
 Documents are ranked by closeness to the query.
Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows
occurrence of terms in the document collection or query
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.
dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
Qi wi1 wi2 … wiN
–Document collection is mapped to
term-by-document matrix
–View as vector in multidimensional
space
• Nearby vectors are related
–Normalize for vector length to avoid
the effect of document length
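A minimal Python sketch of building such a term-by-document matrix with binary occurrence weights; the three sample documents are the shipment/delivery examples used later in this chapter.

def term_document_matrix(docs):
    # Binary weights: 1 if the term occurs in the document, 0 otherwise.
    tokenized = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*tokenized))
    matrix = [[1 if term in terms else 0 for term in vocab] for terms in tokenized]
    return vocab, matrix

docs = [
    "Shipment of gold damaged in a fire",              # D1
    "Delivery of silver arrived in a silver truck",    # D2
    "Shipment of gold arrived in a truck",             # D3
]
vocab, M = term_document_matrix(docs)
print(vocab)    # the index terms (columns)
print(M[0])     # binary occurrence vector for D1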
How to evaluate Models?
• We need to investigate what procedures the IR Models
follow and what techniques they use:
– What is the weighting technique used by the IR Models for
measuring importance of terms in documents?
• Are they using binary or non-binary weight?
– What is the matching technique used by the IR models?
• Are they measuring similarity or dissimilarity?
– Are they applying exact matching or partial matching in the
course of finding relevant documents for a given query?
– Are they applying best matching principle to measure the
degree of relevance of documents to display in ranked-order?
• Is there any Ranking mechanism applied before displaying
relevant documents for the users?
The Boolean Model
• Boolean model is a simple model based on set theory
The Boolean model imposes a binary criterion
for deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1, if the document satisfies the Boolean query;
0 otherwise
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
- Note that no weights are
assigned between 0 and 1;
only the values 0 or 1 are used
Given the following three documents, construct the term-document
matrix and find the relevant documents retrieved by the
Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
The table below shows the document-term (ti) matrix
The Boolean Model: Example
Arrive damage deliver fire gold silver ship truck
D1 0 1 0 1 1 0 1 0
D2 1 0 1 0 0 1 0 1
D3 1 0 0 0 1 0 1 1
query 0 0 0 0 1 1 0 1
Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
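A small Python sketch of Boolean retrieval over these documents, assuming the multi-word query is an AND of its terms (the operator is left implicit on the slide):

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

def boolean_and(query, docs):
    # Return documents containing every query term (exact, binary matching).
    q_terms = set(query.lower().split())
    return [name for name, text in docs.items()
            if q_terms.issubset(set(text.lower().split()))]

print(boolean_and("gold silver truck", docs))   # [] - no document contains all three terms
print(boolean_and("silver truck", docs))        # ['D2']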
• Given the following, determine the documents retrieved by a
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
The Boolean Model: Further Example
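A short Python sketch that verifies the set algebra above for the query K1 ∧ (K2 ∨ ¬K3):

docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def having(term):
    # Set of document ids whose index-term set contains the given term.
    return {d for d, terms in docs.items() if term in terms}

# K1 AND (K2 OR NOT K3)
result = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(result))   # ['D1', 'D2', 'D6']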
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents retrieved for the
queries:
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
 As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
 Ranked set of documents provides for better
matching
• The idea behind VSM is that
 the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, map documents and queries into term-document
vector space.
Note that queries are treated as short documents
• Second, in the vector space, queries and documents are
represented as weighted vectors, wij
There are different weighting techniques; the most widely used
one computes a TF*IDF weight for each term
• Third, similarity measurement is used to rank documents
by the closeness of their vectors to the query.
To measure the closeness of documents to the query, the cosine
similarity score is used by most search engines
Term-document matrix.
• A collection of n documents and a query can be represented
in the vector space model by a term-document matrix.
–An entry in the matrix corresponds to the “weight” of a term in
the document;
–zero means the term has no significance in the document or
it simply doesn’t exist in the document. Otherwise, wij > 0
whenever ki ∈ dj
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
• How to compute weights
for term i in document j and
in query q; wij and wiq ?
• A collection includes 10,000 documents
The term A appears 20 times in a particular document j
 The maximum appearance of any term in document j is
50
 The term A appears in 2,000 of the collection
documents.
• Compute the TF*IDF weight of term A.
tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
idf(A) = log2(N/DFA) = log2(10,000/2,000) = log2(5) = 2.32
wAj = tf(A,j) * idf(A) = 0.4 * 2.32 = 0.928
Example: Computing weights
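A short Python sketch reproducing the weight above; the base-2 logarithm is made explicit here, since log2(5) ≈ 2.32 matches the idf value on the slide.

import math

def tf_idf(freq, max_freq, N, df):
    tf = freq / max_freq          # 20 / 50 = 0.4
    idf = math.log2(N / df)       # log2(10,000 / 2,000) = log2(5) ~ 2.32
    return tf * idf

# ~0.929 (the slide rounds idf to 2.32 first and reports 0.928)
print(round(tf_idf(freq=20, max_freq=50, N=10_000, df=2_000), 3))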
Similarity Measure
• A similarity measure is a function that computes the
degree of similarity between document j and the user's
query.
• Using a similarity score between the query and each
document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set of documents can be controlled.
sim(dj, q) = (dj · q) / (|dj| × |q|)
= Σi (wi,j × wi,q) / ( sqrt(Σi wi,j²) × sqrt(Σi wi,q²) ),  with the sums running over i = 1 … n
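A minimal Python version of this cosine measure, assuming the document and query weight vectors are already built over the same vocabulary:

import math

def cosine_sim(d, q):
    # d and q are parallel lists of term weights over the same vocabulary.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0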
Vector Space with Term Weights and
Cosine Matching
[Figure: query Q and documents D1, D2 plotted as weighted vectors in a two-dimensional space with axes Term A and Term B; θ1 and θ2 are the angles between Q and each document vector]
Di = (d1i, w1di; d2i, w2di; …; dti, wtdi)
Q = (q1i, w1qi; q2i, w2qi; …; qti, wtqi)
sim(Q, Di) = Σj (wqj × wdij) / ( sqrt(Σj wqj²) × sqrt(Σj wdij²) ),  summing over j = 1 … t
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
sim(Q, D2) = (0.4×0.2 + 0.8×0.7) / ( sqrt[(0.4)² + (0.8)²] × sqrt[(0.2)² + (0.7)²] )
= 0.64 / sqrt(0.80 × 0.53) = 0.64 / 0.65 ≈ 0.98
sim(Q, D1) = 0.56 / sqrt(0.80 × 0.73) = 0.56 / 0.76 ≈ 0.73
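Plugging the two-dimensional vectors above into the cosine formula gives a quick numerical check of these figures:

import math

def cos(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cos(D1, Q), 2))   # ~0.73
print(round(cos(D2, Q), 2))   # ~0.98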
• Suppose a user queries for Q = “gold silver truck”. The
database collection consists of three documents with the
following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show the retrieval results in ranked order.
1. Assume that full-text terms are used during indexing,
without removing common terms or stop words, and that no
terms are stemmed.
2. Assume that content-bearing terms are selected during
indexing.
3. Also compare your results with and without normalizing
term frequency.
Vector-Space Model: Example
Term counts (TF), document frequency (DF), IDF = log10(3/DF), and weights Wi = TF*IDF:
Terms      TF counts         DF   IDF    Wi = TF*IDF
           Q  D1  D2  D3                 Q      D1     D2     D3
a          0  1   1   1      3    0      0      0      0      0
arrived    0  0   1   1      2    0.176  0      0      0.176  0.176
damaged    0  1   0   0      1    0.477  0      0.477  0      0
delivery   0  0   1   0      1    0.477  0      0      0.477  0
fire       0  1   0   0      1    0.477  0      0.477  0      0
gold       1  1   0   1      2    0.176  0.176  0.176  0      0.176
in         0  1   1   1      3    0      0      0      0      0
of         0  1   1   1      3    0      0      0      0      0
silver     1  0   2   0      1    0.477  0.477  0      0.954  0
shipment   0  1   0   1      2    0.176  0      0.176  0      0.176
truck      1  0   1   1      2    0.176  0.176  0      0.176  0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure, e.g. Sim(q,d1)
• First, for each document and the query, compute the vector
lengths (zero terms ignored):
|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
|d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.1996) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
|q| = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute the dot products (zero products ignored):
Q·d1 = 0.176×0.176 = 0.0310
Q·d2 = 0.954×0.477 + 0.176×0.176 = 0.4862
Q·d3 = 0.176×0.176 + 0.176×0.176 = 0.0620
Vector-Space Model: Example
Now, compute similarity score
Sim(q,d1) = 0.0310 / (0.538×0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538×1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538×0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
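The whole pipeline for this example can be sketched in a few lines of Python. The weighting follows the worked table above (raw term count times log10(N/df), with no tf normalization), which is an assumption read off the table's numbers rather than a general rule.

import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
q_counts = Counter(query.lower().split())
vocab = sorted({t for c in counts.values() for t in c})

N = len(docs)
df = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}    # log10(3/1)=0.477, log10(3/2)=0.176

def weights(counter):
    # Raw term count times idf, as in the worked table (no tf normalization).
    return [counter[t] * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

qv = weights(q_counts)
scores = {name: cosine(weights(c), qv) for name, c in counts.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))    # D2 ~0.8246, D3 ~0.3271, D1 ~0.0801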
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in ranked
order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t
relate one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to “human computer”
Probabilistic Model
•IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch
•This situation leads to several statistical approaches:
probability theory, fuzzy logic, theory of evidence,
language modeling, etc.
•The probabilistic retrieval model is a formal model that
attempts to predict the probability that a given document
will be relevant to a given query, i.e. Prob(R|(q,di))
–Use probability to estimate the “odds” of relevance of a query
to a document.
–It relies on accurate estimates of probabilities
Term Existence in Relevant Documents
N=the total number of documents in the collection
n= the total number of documents that contain term ti
R=the total number of relevant documents retrieved
r=the total number of relevant documents retrieved that contain term ti
wi = log [ (r + 0.5)(N − n − R + r + 0.5) / ( (R − r + 0.5)(n − r + 0.5) ) ]
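A small Python sketch of this weight as reconstructed above (the smoothed log-odds form); the figures passed in are made up purely for illustration.

import math

def rsj_weight(N, n, R, r):
    # Smoothed log-odds that term ti occurs in relevant vs. non-relevant documents.
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

# Hypothetical figures: 1,000 docs, term in 100 of them; 20 relevant docs retrieved, 15 contain the term.
print(round(rsj_weight(N=1000, n=100, R=20, r=15), 2))   # a strongly positive weight (~3.4)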
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector space
‐
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
Editor's Notes

• #19: Why is D2 more similar? The query gives a higher weight to the second term (term B), and D2 also emphasizes term B, so its vector lies closer to the query vector.
• #21: Look at this table carefully; it is exam material.
• #22: Three similarity scores: for D1, D2, and D3.