CAIM: Cerca i Anàlisi d'Informació Massiva (Search and Analysis of Massive Information): FIB, Grau en Enginyeria Informàtica
Fall 2018
http://www.cs.upc.edu/~caim
2. Information Retrieval Models
Information Retrieval Models, I
Setting the stage to think about IR
Information Retrieval Models, II
A couple of IR models
Boolean Model of Information Retrieval
Relevance assumed binary
Documents:
A document is completely identified by the set of terms that it
contains.
- Order of occurrence is considered irrelevant;
- the number of occurrences is considered irrelevant
  (but a closely related model, called bag-of-words or BoW,
  does take the number of occurrences into account).
Queries in the Boolean Model, I
Boolean queries, exact answers
Atomic query:
a single term.
The answer is the set of documents that contain it.
Combining queries:
- OR, AND: operate as union or intersection of answers;
- set difference, t1 BUTNOT t2 ≡ t1 AND NOT t2;
- motivation: avoid unmanageably large answer sets.
In Lucene: +/− signs on query terms, Boolean operators.
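The set operations above can be sketched over an inverted index. A minimal sketch in Python; the three-document corpus below is a subset of the toy example used later, and the function names (`atomic`, `AND`, `OR`, `BUTNOT`) are illustrative, not Lucene's API:

```python
# Boolean retrieval over an inverted index, using Python sets.
from functools import reduce

docs = {
    "d1": "one three",
    "d2": "two two three",
    "d3": "one three four five five five",
}

# Build the inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def atomic(term):
    """Answer to an atomic query: documents containing the term."""
    return index.get(term, set())

def AND(*answers):
    return reduce(set.__and__, answers)   # intersection of answer sets

def OR(*answers):
    return reduce(set.__or__, answers)    # union of answer sets

def BUTNOT(a, b):
    return a - b                          # t1 AND NOT t2 as set difference

print(sorted(AND(atomic("one"), atomic("three"))))     # ['d1', 'd3']
print(sorted(BUTNOT(atomic("three"), atomic("two"))))  # ['d1', 'd3']
```

Note that answers are exact sets: a document either satisfies the query or it does not.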
Queries in the Boolean Model, II
A close relative to propositional logic
Analogy:
- Terms act as propositional variables;
- documents act as propositional models;
- a document is relevant for a term if it contains the term,
  that is, if, as a propositional model, it satisfies the variable;
- queries are propositional formulas
  (with the syntactic condition that global negation is avoided);
- a document is relevant for a query if, as a propositional
  model, it satisfies the propositional formula.
Example, I
A very simple toy case
d1 = one three
d2 = two two three
d3 = one three four five five five
d4 = one two two two two three six six
d5 = three four four four six
d6 = three three three six six
d7 = four five
Example, II
Our documents in the Boolean model
(columns in alphabetical term order: five, four, one, six, three, two;
a 1 means the term occurs in the document)
d1 = [ 0 0 1 0 1 0 ]
d2 = [ 0 0 0 0 1 1 ]
d3 = [ 1 1 1 0 1 0 ]
d4 = [ 0 0 1 1 1 1 ]
d5 = [ 0 1 0 1 1 0 ]
d6 = [ 0 0 0 1 1 0 ]
d7 = [ 1 1 0 0 0 0 ]
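These incidence vectors can be recomputed mechanically. A short Python sketch; the alphabetical ordering of the vocabulary is an assumption inferred from the table above:

```python
# Build 0/1 incidence vectors for the toy corpus.
docs = {
    "d1": "one three",
    "d2": "two two three",
    "d3": "one three four five five five",
    "d4": "one two two two two three six six",
    "d5": "three four four four six",
    "d6": "three three three six six",
    "d7": "four five",
}

# Vocabulary in alphabetical order (matches the column order above).
vocab = sorted({t for text in docs.values() for t in text.split()})
print(vocab)  # ['five', 'four', 'one', 'six', 'three', 'two']

# One 0/1 vector per document: 1 iff the term occurs in it.
vectors = {
    doc_id: [1 if t in set(text.split()) else 0 for t in vocab]
    for doc_id, text in docs.items()
}
print(vectors["d3"])  # [1, 1, 1, 0, 1, 0]
print(vectors["d7"])  # [1, 1, 0, 0, 0, 0]
```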
Queries in the Boolean Model, III
No ranking of answers
Phrase Queries, I
Slightly beyond the Boolean model
Phrase Queries, II
Options to “hack them in”
Options:
- Run it as a conjunctive query, then double-check the whole
  answer set to filter out non-adjacency cases.
  This option may be very slow when there are large numbers of
  “false positives”.
- Keep in the index dedicated information about the adjacency
  of any two terms in a document (e.g. positions).
- Keep in the index dedicated information about a choice of
  “interesting pairs” of words.
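The second option (a positional index) can be sketched as follows. The layout and the `phrase` helper are illustrative; real engines merge sorted position lists more efficiently:

```python
# Positional index: term -> doc_id -> list of positions of the term.
docs = {
    "d2": "two two three",
    "d4": "one two two two two three six six",
    "d5": "three four four four six",
    "d6": "three three three six six",
}

positions = {}
for doc_id, text in docs.items():
    for i, term in enumerate(text.split()):
        positions.setdefault(term, {}).setdefault(doc_id, []).append(i)

def phrase(t1, t2):
    """Documents where t2 occurs immediately after t1."""
    hits = set()
    # Only documents containing both terms can match (conjunctive step).
    common = positions.get(t1, {}).keys() & positions.get(t2, {}).keys()
    for doc_id in common:
        p2 = set(positions[t2][doc_id])
        # Adjacency check: some occurrence of t1 is directly before t2.
        if any(p + 1 in p2 for p in positions[t1][doc_id]):
            hits.add(doc_id)
    return hits

print(sorted(phrase("two", "three")))  # ['d2', 'd4']
print(phrase("six", "three"))          # set(): never adjacent in this order
```

The conjunctive step alone would also return d6 for the query "six three" (it contains both terms); the position check is what filters such false positives out.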
Vector Space Model of Information Retrieval, I
Basis of all successful approaches
Vector Space Model of Information Retrieval, II
Moving to vector space
The tf-idf scheme
A way to assign a weight vector to each document
Two principles:
- The more frequent t is in d, the higher weight it should
  have (term frequency).
- The more documents of the collection contain t, the less
  discriminative t is, so the lower weight it should have
  (inverse document frequency).
The tf-idf scheme, II
The formula
For a collection of N documents, the weight of term t in document d is

  w(t, d) = tf(t, d) · idf(t), where
  tf(t, d) = f(t, d) / max_t' f(t', d)  (term frequency, normalized by the
                                         most frequent term in d)
  idf(t) = log2( N / df(t) )            (df(t): number of documents
                                         containing t)

Example, I
A document is a vector of weights
(raw term counts; the rightmost column is the maximum term frequency
in each document)
d1 = [ 0 0 1 0 1 0 ] 1
d2 = [ 0 0 0 0 1 2 ] 2
d3 = [ 3 1 1 0 1 0 ] 3
d4 = [ 0 0 1 2 1 4 ] 4
d5 = [ 0 3 0 1 1 0 ] 3
d6 = [ 0 0 0 2 3 0 ] 3
d7 = [ 1 1 0 0 0 0 ] 1
df = 2 3 3 3 6 2
Example, II
From counts to weights (N = 7 documents)
df = 2 3 3 3 6 2

d3 = [ 3 1 1 0 1 0 ]  (max tf = 3)
→
d3 = [ (3/3)·log2(7/2)  (1/3)·log2(7/3)  (1/3)·log2(7/3)
       (0/3)·log2(7/3)  (1/3)·log2(7/6)  (0/3)·log2(7/2) ]

d4 = [ 0 0 1 2 1 4 ]  (max tf = 4)
→
d4 = [ (0/4)·log2(7/2)  (0/4)·log2(7/3)  (1/4)·log2(7/3)
       (2/4)·log2(7/3)  (1/4)·log2(7/6)  (4/4)·log2(7/2) ]
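These weights can be recomputed directly. A sketch in Python, assuming base-2 logarithms and max-tf normalization as in the example:

```python
# tf-idf weights for the toy corpus: w(t,d) = (f(t,d)/max_tf(d)) * log2(N/df(t)).
from math import log2

docs = {
    "d1": "one three",
    "d2": "two two three",
    "d3": "one three four five five five",
    "d4": "one two two two two three six six",
    "d5": "three four four four six",
    "d6": "three three three six six",
    "d7": "four five",
}
vocab = sorted({t for text in docs.values() for t in text.split()})
N = len(docs)  # 7

# Document frequency: in how many documents each term occurs.
df = [sum(1 for text in docs.values() if t in text.split()) for t in vocab]
print(df)  # [2, 3, 3, 3, 6, 2]

def weights(text):
    counts = [text.split().count(t) for t in vocab]
    max_tf = max(counts)
    return [(c / max_tf) * log2(N / df[i]) for i, c in enumerate(counts)]

w3 = weights(docs["d3"])
print([round(w, 3) for w in w3])  # [1.807, 0.407, 0.407, 0.0, 0.074, 0.0]
```

The first component of d3 is (3/3)·log2(7/2) ≈ 1.807: "five" is frequent in d3 and rare in the collection, so it gets the highest weight.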
Similarity of Documents in the Vector Space Model
The cosine similarity measure
  sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = (d1 / |d1|) · (d2 / |d2|)

where

  v · w = Σ_i v_i · w_i ,  and  |v| = √(v · v) = √( Σ_i v_i² ) .
Query Answering