Chapter Five

IR models
IR Models - Basic Concepts
• Word evidence:
 IR systems usually adopt index terms to index and retrieve documents
 Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
• An index term is a word useful for capturing the document's main themes
• Not all terms are equally useful for representing the document contents:
 less frequent terms identify a narrower, more specific set of documents
• But no ordering information is attached to the Bag of Words identified from the document collection, as the sketch below illustrates.
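A minimal sketch (not from the slides) of extracting a Bag of Words in Python; the tokenizer is deliberately naive:

    # Sketch: turning a document into a Bag of Words.
    # A set records only term presence -- word order is discarded.
    def bag_of_words(text):
        # A real indexer would also remove stop words and stem terms.
        return set(text.lower().split())

    print(bag_of_words("Shipment of gold damaged in a fire"))
    # e.g. {'shipment', 'of', 'gold', 'damaged', 'in', 'a', 'fire'}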
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting the degree of relevance of
documents for a given query
 Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
 Documents appearing at the top of this ordering
are considered more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what
is relevant and what is not, based on the notion of
relevance implemented by the system
IR models

[Diagram: classification of IR models, including probabilistic relevance]
General Procedures Followed
To find relevant documents for a given query:
• First, map documents and queries into the term-document vector space.
 Note that queries are treated as short documents
• Second, represent queries and documents as weighted vectors, wij
 There are binary weights & non-binary weighting techniques
• Third, rank documents by the closeness of their vectors to the query.
 Closeness is determined by a similarity score calculation
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows
occurrence of terms in the document collection or query

$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j})$;  $q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term doesn’t exist in
the document.
         T1    T2    ...   TN
    D1   w11   w12   ...   w1N
    D2   w21   w22   ...   w2N
    :    :     :           :
    DM   wM1   wM2   ...   wMN
    Qi   wi1   wi2   ...   wiN

– The document collection is mapped to a term-by-document matrix
– Each row is viewed as a vector in a multidimensional space: nearby vectors are related
– Normalize for vector length to avoid the effect of document length
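A hedged sketch of this mapping in Python, using raw term counts to stand in for the weights wij:

    # Sketch: building a term-by-document matrix of raw term counts.
    from collections import Counter

    docs = ["Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"]
    query = "gold silver truck"   # queries are treated as short documents

    terms = sorted({t for d in docs for t in d.lower().split()})

    def to_vector(text):
        counts = Counter(text.lower().split())
        return [counts[t] for t in terms]   # one weight per index term

    matrix = [to_vector(d) for d in docs]   # M x N term-by-document matrix
    q_vec = to_vector(query)                # query mapped into the same space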
How to evaluate Models?
• We need to investigate what procedures the IR models follow and what techniques they use:
– What weighting technique does the model use for measuring the importance of terms in documents?
• Does it use binary or non-binary weights?
– What matching technique does the model use?
• Does it measure similarity or dissimilarity?
– Does it apply exact matching or partial matching when finding relevant documents for a given query?
– Does it apply the best-matching principle to measure the degree of relevance of documents and display them in ranked order?
• Is any ranking mechanism applied before displaying relevant documents to the users?
The Boolean Model
• Boolean model is a simple model based on set theory
 The Boolean model imposes a binary criterion
for deciding relevance
• Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
• $sim(q, d_j)$ = 1 if the document satisfies the boolean query, and 0 otherwise
– Note that no weights are assigned between 0 and 1; the only possible values are 0 and 1

         T1    T2    ...   TN
    D1   w11   w12   ...   w1N
    D2   w21   w22   ...   w2N
    :    :     :           :
    DM   wM1   wM2   ...   wMN
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the query "gold silver truck"
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
The table below shows the document-term (ti) matrix:

            arrive  damage  deliver  fire  gold  silver  ship  truck
    D1        0       1       0       1     1      0      1      0
    D2        1       0       1       0     0      1      0      1
    D3        1       0       0       0     1      0      1      1
    query     0       0       0       0     1      1      0      1

Also find the documents relevant for the queries:
(a) gold delivery; (b) ship gold; (c) silver truck
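A sketch of the Boolean matching (assuming the free-text query is read conjunctively, i.e. gold AND silver AND truck):

    # Sketch: Boolean retrieval over the document-term matrix above.
    docs = {
        "D1": {"damage", "fire", "gold", "ship"},
        "D2": {"arrive", "deliver", "silver", "truck"},
        "D3": {"arrive", "gold", "ship", "truck"},
    }
    query = {"gold", "silver", "truck"}

    # sim(q, dj) = 1 iff every query term occurs in the document
    print([d for d, terms in docs.items() if query <= terms])
    # [] -- no single document contains all three terms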
The Boolean Model: Further Example
• Given the following, determine the documents retrieved by a Boolean model based IR system
• Index terms: K1, ..., K8
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
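The same answer can be checked with set operations (a sketch, not from the slides):

    # Sketch: evaluating K1 AND (K2 OR NOT K3) with set operations.
    docs = {
        "D1": {"K1", "K2", "K3", "K4", "K5"},
        "D2": {"K1", "K2", "K3", "K4"},
        "D3": {"K2", "K4", "K6", "K8"},
        "D4": {"K1", "K3", "K5", "K7"},
        "D5": {"K4", "K5", "K6", "K7", "K8"},
        "D6": {"K1", "K2", "K3", "K4"},
    }

    def having(term):   # documents containing the given index term
        return {d for d, ks in docs.items() if term in ks}

    answer = having("K1") & (having("K2") | (set(docs) - having("K3")))
    print(sorted(answer))   # ['D1', 'D2', 'D6']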
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”

• What are the relevant documents retrieved for the queries:
– Q1 = "information ∧ retrieval"
– Q2 = "information ∧ ¬computer"
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
 As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
Vector-Space Model
• This is the most commonly used strategy for measuring the relevance of documents to a given query, because:
 use of binary weights is too limiting
 non-binary weights allow for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
 A ranked set of documents provides better matching
• The idea behind the VSM is that
 the meaning of a document is conveyed by the words used in that document
Vector-Space Model
To find relevant documents for a given query:
• First, map documents and queries into the term-document vector space.
 Note that queries are treated as short documents
• Second, in the vector space, represent queries and documents as weighted vectors, wij
 There are different weighting techniques; the most widely used one computes the TF*IDF weight of each term
• Third, use a similarity measure to rank documents by the closeness of their vectors to the query.
 To measure the closeness of documents to the query, most search engines use the cosine similarity score
Term-document matrix.
• A collection of n documents and a query can be represented in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the "weight" of a term in the document;
– zero means the term has no significance in the document or simply doesn't exist in it. Otherwise, $w_{ij} > 0$ whenever $k_i \in d_j$

         T1    T2    ...   TN
    D1   w11   w12   ...   w1N
    D2   w21   w22   ...   w2N
    :    :     :           :
    DM   wM1   wM2   ...   wMN

• How do we compute the weights for term i in document j and in query q, i.e. wij and wiq?
Example: Computing weights
• A collection includes 10,000 documents
 The term A appears 20 times in a particular document j
 The maximum appearance of any term in document j is
50
 The term A appears in 2,000 of the collection
documents.

• Compute the TF*IDF weight of term A:
 tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(A) = log2(N/DFA) = log2(10,000/2,000) = log2(5) = 2.32
 wAj = tf(A,j) × idf(A) = 0.4 × 2.32 = 0.928
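The same computation can be checked in Python (a sketch; note that log(5) = 2.32 only for a base-2 logarithm):

    # Sketch: TF*IDF weight of term A in document j.
    import math

    N, df_A = 10_000, 2_000      # collection size, documents containing A
    freq_A, max_freq = 20, 50    # counts inside document j

    tf = freq_A / max_freq       # 0.4
    idf = math.log2(N / df_A)    # log2(5) ~= 2.32
    print(round(tf * idf, 3))    # ~0.929 (0.928 with IDF rounded to 2.32)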
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between document dj and the user's query q:

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$

• Using a similarity score between the query and each document:
– It is possible to rank the retrieved documents in order of presumed relevance.
– It is possible to enforce a score threshold so that the size of the retrieved set of documents can be controlled, as in the sketch below.
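A minimal sketch of both uses (the function name and the 0.5 cutoff are illustrative assumptions, not from the slides):

    # Sketch: rank by similarity, then cut off low-scoring documents
    # to control the size of the retrieved set.
    def retrieve(scores, threshold=0.5):
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(doc, s) for doc, s in ranked if s >= threshold]

    print(retrieve({"D1": 0.08, "D2": 0.82, "D3": 0.33}))   # [('D2', 0.82)]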
Vector Space with Term Weights and Cosine Matching

[Figure: two-dimensional term space (Term A vs. Term B) plotting D1 = (0.8, 0.3), D2 = (0.2, 0.7), and the query Q = (0.4, 0.8)]

$D_i = (d_{1i}, w_{1di};\, d_{2i}, w_{2di};\, \ldots;\, d_{ti}, w_{tdi})$
$Q = (q_{1i}, w_{1qi};\, q_{2i}, w_{2qi};\, \ldots;\, q_{ti}, w_{tqi})$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{jq}\, w_{jd_i}}{\sqrt{\sum_{j=1}^{t} (w_{jq})^2 \,\sum_{j=1}^{t} (w_{jd_i})^2}}$$

$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{0.65} \approx 0.98$$

$$sim(Q, D_1) = \frac{(0.4 \times 0.8) + (0.8 \times 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{0.76} \approx 0.73$$
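These worked numbers can be verified with a short sketch:

    # Sketch: cosine similarity for the two-term example above.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

    Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
    print(round(cosine(Q, D2), 2))   # 0.98
    print(round(cosine(Q, D1), 2))   # 0.73 -> D2 ranks above D1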
Vector-Space Model: Example
• Suppose user query for: Q = “gold silver truck”. The
database collection consists of three documents with the
following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show the retrieval results in ranked order.
1. Assume that full-text terms are used during indexing, without removing common terms or stop words, and that no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing.
3. Also compare your results with and without normalizing term frequency.
Vector-Space Model: Example
    Terms     |  Counts (TF)       | DF | IDF   | Wi = TF*IDF
              |  Q   D1   D2   D3  |    |       |  Q      D1     D2     D3
    a         |  0   1    1    1   | 3  | 0     |  0      0      0      0
    arrived   |  0   0    1    1   | 2  | 0.176 |  0      0      0.176  0.176
    damaged   |  0   1    0    0   | 1  | 0.477 |  0      0.477  0      0
    delivery  |  0   0    1    0   | 1  | 0.477 |  0      0      0.477  0
    fire      |  0   1    0    0   | 1  | 0.477 |  0      0.477  0      0
    gold      |  1   1    0    1   | 2  | 0.176 |  0.176  0.176  0      0.176
    in        |  0   1    1    1   | 3  | 0     |  0      0      0      0
    of        |  0   1    1    1   | 3  | 0     |  0      0      0      0
    silver    |  1   0    2    0   | 1  | 0.477 |  0.477  0      0.954  0
    shipment  |  0   1    0    1   | 2  | 0.176 |  0      0.176  0      0.176
    truck     |  1   0    1    1   | 2  | 0.176 |  0.176  0      0.176  0.176

(IDF = log10(N/DF) with N = 3 documents)
Vector-Space Model
    Terms     |  Q      D1     D2     D3
    a         |  0      0      0      0
    arrived   |  0      0      0.176  0.176
    damaged   |  0      0.477  0      0
    delivery  |  0      0      0.477  0
    fire      |  0      0.477  0      0
    gold      |  0.176  0.176  0      0.176
    in        |  0      0      0      0
    of        |  0      0      0      0
    silver    |  0.477  0      0.954  0
    shipment  |  0      0.176  0      0.176
    truck     |  0.176  0      0.176  0.176
Vector-Space Model: Example
• Compute similarity using the cosine measure
• First, for each document and the query, compute all vector lengths (zero terms ignored):

$|d_1| = \sqrt{0.477^2 + 0.477^2 + 0.176^2 + 0.176^2} = \sqrt{0.517} = 0.719$
$|d_2| = \sqrt{0.176^2 + 0.477^2 + 0.954^2 + 0.176^2} = \sqrt{1.1996} = 1.095$
$|d_3| = \sqrt{0.176^2 + 0.176^2 + 0.176^2 + 0.176^2} = \sqrt{0.124} = 0.352$
$|q| = \sqrt{0.176^2 + 0.477^2 + 0.176^2} = \sqrt{0.2896} = 0.538$

• Next, compute the dot products (zero products ignored):

$Q \cdot d_1 = 0.176 \times 0.176 = 0.0310$
$Q \cdot d_2 = 0.954 \times 0.477 + 0.176 \times 0.176 = 0.4862$
$Q \cdot d_3 = 0.176 \times 0.176 + 0.176 \times 0.176 = 0.0620$
Vector-Space Model: Example
Now, compute the similarity scores:
Sim(q,d1) = 0.0310 / (0.538 × 0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538 × 1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538 × 0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
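The whole example can be reproduced end to end with a sketch under the slide's assumptions (raw counts as TF, IDF = log10(N/DF), no stop-word removal or stemming):

    # Sketch: vector-space ranking for Q = "gold silver truck".
    import math
    from collections import Counter

    docs = ["Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"]
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)

    terms = sorted({t for toks in tokenized for t in toks})
    df = {t: sum(t in toks for toks in tokenized) for t in terms}
    idf = {t: math.log10(N / df[t]) for t in terms}

    def weights(tokens):                 # TF*IDF vector over the term list
        tf = Counter(tokens)
        return [tf[t] * idf[t] for t in terms]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / math.sqrt(sum(a*a for a in u) * sum(b*b for b in v))

    q = weights("gold silver truck".split())
    ranked = sorted(((round(cosine(weights(toks), q), 2), f"D{i+1}")
                     for i, toks in enumerate(tokenized)), reverse=True)
    print(ranked)   # [(0.82, 'D2'), (0.33, 'D3'), (0.08, 'D1')]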
Vector-Space Model
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in ranked
order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms. It doesn’t
relate one term with another term
• Computationally expensive since it measures the
similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following
documents.
 c1: Human machine interface for Lab ABC computer applications
 c2: A survey of user opinion of computer system response time
 c3: The EPS user interface management system
 c4: System and human system engineering testing of EPS
 c5: Relation of user-perceived response time to error measure
 m1: The generation of random, binary, unordered trees
 m2: The intersection graph of paths in trees
 m3: Graph minors: Widths of trees and well-quasi-ordering
 m4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction"
Probabilistic Model
• IR is an uncertain process
– Mapping an information need to a query is not perfect
– Mapping documents to index terms is a logical representation
– Query terms and index terms often mismatch
• This situation leads to several statistical approaches: probability theory, fuzzy logic, theory of evidence, language modeling, etc.
• The probabilistic retrieval model is a formal model that attempts to predict the probability that a given document will be relevant to a given query, i.e. Prob(R|(q,di))
– It uses probability to estimate the "odds" of relevance of a query to a document.
– It relies on accurate estimates of probabilities
Term Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

$$w_i = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}$$
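A direct transcription of this weight (a sketch; the slides do not fix the logarithm base, so the natural log is used here, and the example values are illustrative):

    # Sketch: the term relevance weight above.
    # The 0.5 terms smooth away zero counts.
    import math

    def relevance_weight(N, n, R, r):
        # N: docs in collection; n: docs containing ti;
        # R: relevant docs retrieved; r: relevant docs containing ti
        return math.log((r + 0.5) * (N - n - R + r + 0.5) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    print(round(relevance_weight(N=100, n=20, R=10, r=8), 2))   # ~3.06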
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
