Indexing
Information Retrieval
Computer Science Tripos Part II
Ronan Cummins¹
2016
¹ Adapted from Simone Teufel's original slides
IR System Components

[Diagram: a Document Collection and a user Query are fed into the IR System, which returns a set of relevant documents.]
IR System Components

[Diagram: documents from the collection pass through Document Normalisation into the Indexer, which builds the Indexes; the user's Query enters through the UI and passes through Query Normalisation; the Ranking/Matching Module matches the normalised query against the indexes and returns the set of relevant documents.]
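The slides render this as a diagram; purely as a toy illustration, the same pipeline can be wired up in a few lines of Python. Every name below is invented for the sketch, and the matching module is plain Boolean AND rather than a real ranking component.

def normalise(text):
    """Document/query normalisation: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(collection):
    """Indexer: map each term to a sorted list of docIDs containing it."""
    index = {}
    for doc_id in sorted(collection):
        for term in normalise(collection[doc_id]):
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

def match(index, query):
    """Matching module: return docIDs containing all query terms."""
    postings = [set(index.get(term, ())) for term in normalise(query)]
    return sorted(set.intersection(*postings)) if postings else []

collection = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}
print(match(build_index(collection), "caesar"))   # -> [1, 2]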
Overview
1 Index construction
Postings list and Skip lists
Single-pass Indexing
Example: index creation by sorting

Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."

Tokenisation yields (term, docID) pairs in order of occurrence:

I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1 (from Doc 1); so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2 (from Doc 2).

Sorting the pairs (primary key: term; secondary key: docID) gives:

ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2.
Index creation; grouping step ("uniq")

Term & doc. freq.    Postings list
ambitious 1          → 2
be 1                 → 2
brutus 2             → 1 → 2
capitol 1            → 1
caesar 2             → 1 → 2
did 1                → 1
enact 1              → 1
hath 1               → 2
I 1                  → 1
i' 1                 → 1
it 1                 → 2
julius 1             → 1
killed 1             → 1
let 1                → 2
me 1                 → 1
noble 1              → 2
so 1                 → 2
the 2                → 1 → 2
told 1               → 2
you 1                → 2
was 2                → 1 → 2
with 1               → 2

Primary sort by term (dictionary); secondary sort (within each postings list) by document ID.
Document frequency (= length of postings list) is stored for more efficient Boolean searching (later today) and for term weighting (lecture 4).
Keep the dictionary in memory; keep the postings lists (much larger) on disk.
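A minimal sketch of the sort-then-group construction in Python, with in-memory lists standing in for the on-disk postings files a real indexer would use:

from itertools import groupby

def index_by_sorting(collection):
    """Build term -> (doc. freq., postings list) by sorting then grouping."""
    # Tokenisation: emit (term, docID) pairs in order of occurrence.
    pairs = [(term, doc_id)
             for doc_id, text in collection.items()
             for term in text.lower().split()]
    # Sort: primary key term, secondary key docID.
    pairs.sort()
    # Group ("uniq"): collapse duplicate pairs into postings lists.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)  # df = length of postings list
    return index

docs = {1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious"}
print(index_by_sorting(docs)["brutus"])   # -> (2, [1, 2])
print(index_by_sorting(docs)["capitol"])  # -> (1, [1])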
Data structures for Postings Lists
Optimisation: Skip Lists
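The original slide shows this as a figure. As a sketch of the underlying idea, here is postings-list intersection with skip pointers, under the common heuristic of skips spaced about √L apart; the function and data layout are illustrative, not from the slides.

import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(L)-spaced skips."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take a skip if the skipped-to docID does not overshoot...
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:                      # ...otherwise advance one posting.
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64],
                           [1, 2, 3, 5, 8, 13, 21, 34]))   # -> [2, 8]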
Tradeoff: Skip Lists
Single-pass in-memory indexing (1)
Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block.
Key idea 2: Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.
Worked example!
Single-pass in-memory indexing (2)
Single-pass in-memory indexing (3)
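Slides (2) and (3) present the worked example as figures. The SPIMI idea itself fits in a short sketch, assuming blocks are just slices of the (term, docID) token stream and the final merge happens in memory rather than streaming block files from disk:

from collections import defaultdict

def spimi_invert(token_stream):
    """Build a complete in-memory index (dictionary + postings) for one block."""
    block_index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = block_index[term]          # new dictionary entry on demand
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # accumulate postings as they occur
    return dict(block_index)

def merge_blocks(blocks):
    """Merge the per-block indexes into one big index."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            merged[term] = sorted(set(merged[term]) | set(postings))
    return dict(merged)

stream = [("caesar", 1), ("killed", 1), ("caesar", 2), ("brutus", 2)]
blocks = [spimi_invert(stream[:2]), spimi_invert(stream[2:])]   # two "blocks"
print(merge_blocks(blocks)["caesar"])   # -> [1, 2]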
Overview
1 Index construction
Postings list and Skip lists
Single-pass Indexing
Documents
Parsing a document
Character decoding
Format/Language: Complications
Normalisation
Tokenisation
"Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

Possible tokenisations of O'Neill: neill, oneill, o'neill, o' neill, o neill. Which is correct?
Possible tokenisations of aren't: aren't, arent, are n't, aren t. Which is correct?
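To make the choices concrete, here is how two simple regex tokenisers (both invented for this illustration) split the example sentence:

import re

sentence = ("Mr. O'Neill thinks that the boys' stories "
            "about Chile's capital aren't amusing.")

# Tokeniser A: letters only; splits both O'Neill and aren't apart.
print(re.findall(r"[A-Za-z]+", sentence))
# [..., 'O', 'Neill', ..., 'aren', 't', 'amusing']

# Tokeniser B: allow one internal apostrophe; keeps O'Neill and aren't whole.
print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence))
# [..., "O'Neill", ..., "aren't", 'amusing']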
Tokenisation problems: One word or two? (or several)
Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University
Numbers
20/3/91
3/20/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333
Chinese: No Whitespace
Chinese: Ambiguous segmentation
Other cases of “no whitespace”: Compounding
German
Lebensversicherungsgesellschaftsangestellter
leben+s+versicherung+s+gesellschaft+s+angestellter
Other cases of “no whitespace”: Agglutination
“Agglutinative” languages do this not just for compounds:
Inuit
tusaatsiarunnangittualuujunga
(= “I can’t hear very well”)
Finnish
epäjärjestelmällistyttämättömyydellänsäkäänköhän
(= "I wonder if – even with his/her quality of not having been made unsystematized")
Turkish
Çekoslovakyalılaştıramadıklarımızdanmışçasına
(= "as if you were one of those whom we could not make resemble the Czechoslovakian people")
Japanese
Arabic script and bidirectionality
Accents and diacritics
Case Folding
Stop words
a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of,
on, that, the, to, was, were, will, with
to be or not to be
prince of Denmark
bamboo in water
More equivalence classing
Lemmatisation
am, are, is → be
car, car’s, cars’, cars → car
the boy’s cars are different colours → the boy car be different color
Stemming
Porter Stemmer
Stemming: Representation of a word
[C] (VC){m}[V]
C : one or more adjacent consonants
V : one or more adjacent vowels
[ ] : optionality
( ) : group operator
{x} : repetition x times
m : the “measure” of a word
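A small sketch of computing m under this representation, treating y naively as a consonant (the real Porter algorithm is more careful about y):

import re

def measure(word):
    """Porter's m: the number of VC blocks in [C](VC){m}[V]."""
    # Collapse the word to a consonant/vowel pattern, then count VC runs.
    pattern = "".join("V" if ch in "aeiou" else "C" for ch in word.lower())
    return len(re.findall(r"V+C+", pattern))

for word in ["tree", "trouble", "oats", "private", "orrery"]:
    print(word, measure(word))   # tree 0, trouble 1, oats 1, private 2, orrery 2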
Porter stemmer: selected rules

SSES → SS      caresses → caress
IES  → I
SS   → SS
S    →         cares → care

(m > 0) EED → EE
feed → feed (stem "f" has m = 0, so the rule does not fire)
agreed → agree
BUT: freed, succeed
Porter Stemmer: selected rules
(*v*) ED →     (delete ED if the stem contains a vowel)
plastered → plaster
bled → bled
Three stemmers: a comparison
Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation.
Porter Stemmer
such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to a pictur of express that is more biolog transpar
and access to interpret
Lovins Stemmer
such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to
interpres
Paice Stemmer
such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret
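NLTK ships implementations of two of these families: PorterStemmer and LancasterStemmer (Lancaster is the Paice/Husk stemmer). The outputs may differ in detail from the Lovins/Paice columns above, which come from other implementations:

from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for word in ["analysis", "features", "expression", "biologically"]:
    print(word, porter.stem(word), lancaster.stem(word))
# e.g. analysis -> analysi (Porter) vs analys (Lancaster)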
Does stemming improve effectiveness?
In general, stemming increases effectiveness for some queries
and decreases it for others.
Example queries where stemming helps
tartan sweaters → sweater, sweaters
sightseeing tour san francisco → tour, tours
Phrase Queries
Biword indexes
Longer phrase queries
Issues with biword indexes
Positional indexes
Positional indexes: Example
be, 178239:
⟨1: ⟨17, 25⟩;
 4: ⟨17, 191, 291, 430, 434⟩;
 5: ⟨14, 19, 101⟩;
 …⟩

Document 4 is a match.
(As always: term, document frequency, docIDs; new: position offsets within each document.)
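As a sketch, a positional index can be held as a nested mapping from term to docID to sorted positions; this is an illustrative in-memory layout, not the compressed on-disk format a production system would use.

from collections import defaultdict

def build_positional_index(collection):
    """term -> {docID -> sorted positions}; doc. freq. is len(index[term])."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in collection.items():
        for offset, term in enumerate(text.lower().split(), start=1):
            index[term][doc_id].append(offset)
    return index

docs = {1: "to be or not to be", 4: "be careful or be quiet"}
index = build_positional_index(docs)
print(dict(index["be"]))   # -> {1: [2, 6], 4: [1, 4]}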
Proximity search
employment /4 place
Find all documents that contain employment and place within
4 words of each other.
HIT: Employment agencies that place healthcare workers are
seeing growth.
NO HIT: Employment agencies that have learned to adapt
now place healthcare workers.
Proximity search
Proximity intersection
PositionalIntersect(p1, p2, k)
  answer ← ⟨⟩
  while p1 ≠ nil and p2 ≠ nil
    do if docID(p1) = docID(p2)
       then l ← ⟨⟩
            pp1 ← positions(p1)
            pp2 ← positions(p2)
            while pp1 ≠ nil
              do while pp2 ≠ nil
                   do if |pos(pp1) − pos(pp2)| ≤ k
                        then Add(l, pos(pp2))
                        else if pos(pp2) > pos(pp1)
                               then break
                      pp2 ← next(pp2)
                 while l ≠ ⟨⟩ and |l[0] − pos(pp1)| > k
                   do Delete(l[0])
                 for each ps ∈ l
                   do Add(answer, ⟨docID(p1), pos(pp1), ps⟩)
                 pp1 ← next(pp1)
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
            else p2 ← next(p2)
  return answer
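A runnable Python version of the same algorithm, assuming each term's postings are given as a dict mapping docID to a sorted list of positions (a simplification of the structures above). It maintains a sliding window over the second term's positions rather than the textbook's explicit add/delete loop, but computes the same answer:

def positional_intersect(p1, p2, k):
    """Return (docID, pos1, pos2) triples where the terms lie within k words."""
    answer = []
    docs1, docs2 = sorted(p1), sorted(p2)
    i = j = 0
    while i < len(docs1) and j < len(docs2):
        if docs1[i] == docs2[j]:
            doc = docs1[i]
            positions2 = p2[doc]
            window, jj = [], 0      # positions of term 2 near the current pos1
            for pos1 in p1[doc]:
                # Pull in term-2 positions that are not too far ahead of pos1.
                while jj < len(positions2) and positions2[jj] <= pos1 + k:
                    window.append(positions2[jj])
                    jj += 1
                # Drop term-2 positions that are now too far behind pos1.
                while window and pos1 - window[0] > k:
                    window.pop(0)
                for pos2 in window:
                    answer.append((doc, pos1, pos2))
            i += 1
            j += 1
        elif docs1[i] < docs2[j]:
            i += 1
        else:
            j += 1
    return answer

# Example: "employment /4 place"
employment = {1: [3, 20], 2: [7]}
place = {1: [7, 40], 3: [1]}
print(positional_intersect(employment, place, 4))   # -> [(1, 3, 7)]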
Combination scheme
RCV1 collection

N  documents                     800,000
M  terms (= word types)          400,000
T  non-positional postings   100,000,000
Effect of preprocessing for Reuters
How big is the term vocabulary?
Heaps' law for Reuters

Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 · T^0.49, with k = 10^1.64 ≈ 44 and b = 0.49.
Empirical fit for Reuters

Predicted vocabulary size for the first 1,000,020 tokens: 44 × 1,000,020^0.49 ≈ 38,323 terms.
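The arithmetic is easy to check with the fitted k = 44 and b = 0.49:

# Heaps' law prediction M = k * T^b with the Reuters fit k = 44, b = 0.49.
k, b = 44, 0.49
T = 1_000_020                    # tokens seen so far
print(round(k * T ** b))         # -> 38323 predicted vocabulary size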
Take-away
Reading