
CAIM: Cerca i Anàlisi d’Informació Massiva
(Search and Analysis of Massive Information)

FIB, Grau en Enginyeria Informàtica
(Degree in Informatics Engineering)

Slides by Marta Arias, José Luis Balcázar,
Ramon Ferrer-i-Cancho, Ricard Gavaldà
Department of Computer Science, UPC

Fall 2018
http://www.cs.upc.edu/~caim

2. Information Retrieval Models
Information Retrieval Models, I
Setting the stage to think about IR

What is an Information Retrieval Model?


We need to clarify:

- A proposal for a logical view of documents
  (what info is stored/indexed about each document?),
- a query language
  (what kinds of queries will be allowed?),
- and a notion of relevance
  (how to handle each document, given a query?).

Information Retrieval Models, II
A couple of IR models

Focus for this course:

- Boolean model:
  - Boolean queries, exact answers;
  - extension: phrase queries.
- Vector model:
  - weights on terms and documents;
  - similarity queries, approximate answers, ranking.

Boolean Model of Information Retrieval
Relevance assumed binary

Documents:
A document is completely identified by the set of terms that it
contains.
- Order of occurrence considered irrelevant,
- number of occurrences considered irrelevant
  (but a closely related model, called bag-of-words or BoW,
  does consider the number of occurrences relevant).

Thus, for a set of terms T = {t_1, ..., t_T}, a document is just a
subset of T.

Each document can be seen as a bit vector of length T,
d = (d_1, ..., d_T), where
- d_i = 1 if and only if t_i appears in d, or, equivalently,
- d_i = 0 if and only if t_i does not appear in d.

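As a quick illustration, a minimal Python sketch of this representation (the vocabulary and helper name are ours, not from the slides; the document is d1 from the toy example a few slides ahead):

    # Encode a document (a set of terms) as a 0/1 vector over the vocabulary.
    vocabulary = ["five", "four", "one", "six", "three", "two"]

    def to_bit_vector(doc_terms, vocabulary):
        return [1 if t in doc_terms else 0 for t in vocabulary]

    d1 = {"one", "three"}
    print(to_bit_vector(d1, vocabulary))  # [0, 0, 1, 0, 1, 0]
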
Queries in the Boolean Model, I
Boolean queries, exact answers

Atomic query:
a single term.
The answer is the set of documents that contain it.

Combining queries:
- OR, AND: operate as union or intersection of answers;
- set difference: t1 BUTNOT t2 ≡ t1 AND NOT t2;
  motivation: avoid unmanageably large answer sets.

In Lucene: +/− signs on query terms, Boolean operators.

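A minimal sketch of these operators as set operations over an inverted index (the index contents and names here are illustrative, consistent with the toy collection shown later):

    # `index` maps each term to the set of documents that contain it.
    index = {
        "one": {"d1", "d3", "d4"},
        "two": {"d2", "d4"},
    }

    def answer(term):
        return index.get(term, set())

    print(sorted(answer("one") & answer("two")))  # AND: intersection -> ['d4']
    print(sorted(answer("one") | answer("two")))  # OR: union -> ['d1', 'd2', 'd3', 'd4']
    print(sorted(answer("one") - answer("two")))  # BUTNOT: difference -> ['d1', 'd3']
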
Queries in the Boolean Model, II
A close relative to propositional logic

Analogy:
- Terms act as propositional variables;
- documents act as propositional models;
- a document is relevant for a term if it contains the term,
  that is, if, as a propositional model, it satisfies the variable;
- queries are propositional formulas
  (with a syntactic condition of avoiding global negation);
- a document is relevant for a query if, as a propositional
  model, it satisfies the propositional formula.

Example, I
A very simple toy case

Consider 7 documents with a vocabulary of 6 terms:

d1 = one three
d2 = two two three
d3 = one three four five five five
d4 = one two two two two three six six
d5 = three four four four six
d6 = three three three six six
d7 = four five

Example, II
Our documents in the Boolean model

       five  four  one  six  three  two

d1 = [   0     0    1    0     1     0 ]
d2 = [   0     0    0    0     1     1 ]
d3 = [   1     1    1    0     1     0 ]
d4 = [   0     0    1    1     1     1 ]
d5 = [   0     1    0    1     1     0 ]
d6 = [   0     0    0    1     1     0 ]
d7 = [   1     1    0    0     0     0 ]

(Invent some queries and compute their answers!)
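
One possible answer to that invitation, as a small Python sketch over the toy collection (the document sets come from the previous slide; the query choices are ours):

    docs = {
        "d1": {"one", "three"},
        "d2": {"two", "three"},
        "d3": {"one", "three", "four", "five"},
        "d4": {"one", "two", "three", "six"},
        "d5": {"three", "four", "six"},
        "d6": {"three", "six"},
        "d7": {"four", "five"},
    }

    def matches(term):
        """Answer set of an atomic query: documents containing the term."""
        return {name for name, terms in docs.items() if term in terms}

    print(sorted(matches("three") & matches("four")))  # three AND four -> ['d3', 'd5']
    print(sorted(matches("three") - matches("six")))   # three BUTNOT six -> ['d1', 'd2', 'd3']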

Queries in the Boolean Model, III
No ranking of answers

Answers are not quantified:

A document either
- matches the query (is fully relevant),
- or does not match the query (is fully irrelevant).

Depending on user needs and application, this feature may be
good or may be bad.

Phrase Queries, I
Slightly beyond the Boolean model

Phrase queries: conjunction plus adjacency

Ability to answer with the set of documents that contain the
terms of the query consecutively.
- A user querying “Keith Richards” may not want a document
  that mentions both Keith Emerson and Emil Richards.
- Requires extending the notion of “basic query” to include
  adjacency.

Phrase Queries, II
Options to “hack them in”

Options:
- Run as a conjunctive query, then double-check the whole
  answer set to filter out non-adjacent cases.
  This option may be very slow when there are many
  “false positives”.
- Keep dedicated information in the index about the adjacency
  of any two terms in a document (e.g. term positions); see the
  sketch below.
- Keep dedicated information in the index about a choice of
  “interesting pairs” of words.

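A minimal sketch of the positional-index option, assuming tokenized documents (all names here are illustrative):

    def positions(tokens):
        """Map each term to the list of positions where it occurs."""
        index = {}
        for i, t in enumerate(tokens):
            index.setdefault(t, []).append(i)
        return index

    def phrase_match(tokens, phrase):
        """True iff the phrase terms occur consecutively in the document."""
        idx = positions(tokens)
        return any(
            all(p + k in idx.get(t, []) for k, t in enumerate(phrase[1:], 1))
            for p in idx.get(phrase[0], [])
        )

    doc = "keith emerson and emil richards".split()
    print(phrase_match(doc, ["keith", "richards"]))  # False: terms not adjacent
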
Vector Space Model of Information Retrieval, I
Basis of all successful approaches

- Order of words is still irrelevant.
- Frequency is relevant.
- Not all words are equally important.
- For a set of terms T = {t_1, ..., t_T}, a document is a vector
  d = (w_1, ..., w_T) of floats instead of bits.
- w_i is the weight of t_i in d.

Vector Space Model of Information Retrieval, II
Moving to vector space

- A document is now a vector in R^T.
- The document collection conceptually becomes a
  terms × documents matrix,
  but we never compute the matrix explicitly.
- Queries may also be seen as vectors in R^T.

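Since the matrix is never materialized, a natural concrete representation keeps, per document, only its nonzero weights. A sketch (names are ours; the numbers anticipate the tf-idf weights defined next):

    # Sparse document vectors: terms absent from a document implicitly weigh 0.
    doc_vectors = {
        "d1": {"one": 1.22, "three": 0.22},
        "d7": {"five": 1.81, "four": 1.22},
    }

    def weight(doc, term):
        return doc_vectors[doc].get(term, 0.0)

    print(weight("d1", "two"))  # 0.0: term absent from d1
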
The tf-idf scheme
A way to assign weight vectors to documents

Two principles:
- The more frequent t is in d, the higher its weight should be.
- The more frequent t is in the whole collection, the less it
  discriminates among documents, so the lower its weight
  should be in all documents.

The tf-idf scheme, II
The formula
A document is a vector of weights

    d = [w_{d,1}, ..., w_{d,i}, ..., w_{d,T}].

Each weight is the product of two factors:

    w_{d,i} = tf_{d,i} · idf_i.

The term frequency factor tf is

    tf_{d,i} = f_{d,i} / max_j f_{d,j},

where f_{d,j} is the frequency of t_j in d.

And the inverse document frequency factor idf is

    idf_i = log2(D / df_i),

where D = number of documents
and df_i = number of documents that contain term t_i.
Example, I

       five  four  one  six  three  two   maxf

d1 = [   0     0    1    0     1     0 ]    1
d2 = [   0     0    0    0     1     2 ]    2
d3 = [   3     1    1    0     1     0 ]    3
d4 = [   0     0    1    2     1     4 ]    4
d5 = [   0     3    0    1     1     0 ]    3
d6 = [   0     0    0    2     3     0 ]    3
d7 = [   1     1    0    0     0     0 ]    1

df =     2     3    3    3     6     2

Example, II

df =     2     3    3    3     6     2
d3 = [   3     1    1    0     1     0 ]   (maxf = 3)

d3 = [ (3/3)·log2(7/2)  (1/3)·log2(7/3)  (1/3)·log2(7/3)  (0/3)·log2(7/3)  (1/3)·log2(7/6)  (0/3)·log2(7/2) ]
   = [ 1.81  0.41  0.41  0  0.07  0 ]

d4 = [   0     0    1    2     1     4 ]   (maxf = 4)

d4 = [ (0/4)·log2(7/2)  (0/4)·log2(7/3)  (1/4)·log2(7/3)  (2/4)·log2(7/3)  (1/4)·log2(7/6)  (4/4)·log2(7/2) ]
   = [ 0  0  0.31  0.61  0.06  1.81 ]

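A short Python check of these numbers under the tf-idf formula of the previous slides (helper names are ours):

    from math import log2

    # Columns: five four one six three two; df from the previous slide.
    df = [2, 3, 3, 3, 6, 2]
    D = 7  # number of documents

    def tf_idf(f):
        """tf-idf weights of a document given its term frequencies f."""
        m = max(f)
        return [round((fi / m) * log2(D / dfi), 2) for fi, dfi in zip(f, df)]

    print(tf_idf([3, 1, 1, 0, 1, 0]))  # d3: [1.81, 0.41, 0.41, 0.0, 0.07, 0.0]
    print(tf_idf([0, 0, 1, 2, 1, 4]))  # d4: [0.0, 0.0, 0.31, 0.61, 0.06, 1.81]
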
Similarity of Documents in the Vector Space Model
The cosine similarity measure

- “Similar vectors” may happen to have very different sizes.
- We had better compare only their directions.
- Equivalently, we normalize them to the same Euclidean length
  before comparing them.

    sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = (d1 / |d1|) · (d2 / |d2|)

where

    v · w = sum_i v_i · w_i,   and   |v| = sqrt(v · v) = sqrt(sum_i v_i^2).

- Our weights are all nonnegative.
- Therefore, all cosines / similarities are between 0 and 1.
Cosine similarity, Example

d3 = [ 1.81  0.41  0.41  0  0.07  0 ]
d4 = [ 0  0  0.31  0.61  0.06  1.81 ]

Then

    |d3| = 1.898,   |d4| = 1.933,   d3 · d4 = 0.13,

and sim(d3, d4) = 0.035 (i.e., small similarity).

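Recomputing this in Python from the rounded weights above (a sketch; the last digit differs slightly because of the rounding):

    from math import sqrt

    d3 = [1.81, 0.41, 0.41, 0, 0.07, 0]
    d4 = [0, 0, 0.31, 0.61, 0.06, 1.81]

    def dot(v, w):
        return sum(vi * wi for vi, wi in zip(v, w))

    def sim(v, w):
        """Cosine similarity of two weight vectors."""
        return dot(v, w) / (sqrt(dot(v, v)) * sqrt(dot(w, w)))

    print(round(sim(d3, d4), 3))  # 0.036 (0.035 with unrounded weights)
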
Query Answering

- Queries can be transformed to vectors too.
- Sometimes, tf-idf weights; often, binary weights.
- sim(doc, query) ∈ [0, 1].
- Answer: list of documents sorted by decreasing similarity.
- We will find uses for comparing sim(d1, d2) too.
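
A final sketch tying this together, assuming binary query weights and the two tf-idf vectors computed earlier (all names are illustrative):

    from math import sqrt

    terms = ["five", "four", "one", "six", "three", "two"]
    docs = {
        "d3": [1.81, 0.41, 0.41, 0, 0.07, 0],
        "d4": [0, 0, 0.31, 0.61, 0.06, 1.81],
    }

    def sim(v, w):
        dot = sum(a * b for a, b in zip(v, w))
        return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))

    # Binary query vector for the query "one two".
    query = [1 if t in {"one", "two"} else 0 for t in terms]

    ranking = sorted(docs, key=lambda d: sim(docs[d], query), reverse=True)
    print(ranking)  # ['d4', 'd3']: d4 mentions "two" heavily, so it ranks first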
