lecture2-indexing

The document discusses the construction of inverted indexes in information retrieval systems, detailing processes such as document normalization, tokenization, and the creation of postings lists. It introduces various data structures for managing postings, including linked lists and skip lists, and describes the single-pass in-memory indexing algorithm (SPIMI) for efficient indexing. Additionally, it covers challenges in document parsing and normalization, emphasizing the importance of handling different languages and formats in indexing.

Lecture 2: Data structures and Algorithms for

Indexing
Information Retrieval
Computer Science Tripos Part II

Ronan Cummins1

Natural Language and Information Processing (NLIP) Group

[email protected]

2016

1
Adapted from Simone Teufel’s original slides
41
IR System Components

Document
Collection

IR System
Query

Set of relevant
documents

Today: The indexer

42
IR System Components

Document
Collection

Document Normalisation

Indexer
Query Norm.

IR System
Query
UI

Indexes

Ranking/Matching Module

Set of relevant
documents

Today: The indexer

43
Overview

1 Index construction
Postings list and Skip lists
Single-pass Indexing

2 Document and Term Normalisation


Documents
Terms
Reuters RCV1 and Heaps’ Law
Index construction

The major steps in inverted index construction:


Collect the documents to be indexed.
Tokenize the text.
Perform linguistic preprocessing of tokens.
Index the documents that each term occurs in.

45
Example: index creation by sorting
Doc 1: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

After tokenisation, one (term, docID) pair per token, in order of occurrence:

I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i’ 1, the 1,
capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2,
the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (primary key: term; secondary key: docID):

ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2,
did 1, enact 1, hath 2, I 1, I 1, i’ 1, it 2, julius 1, killed 1, killed 1,
let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

46
Index creation; grouping step (“uniq”)
Term, document frequency, and postings list:

ambitious 1 → 2          julius 1 → 1
be 1 → 2                 killed 1 → 1
brutus 2 → 1 → 2         let 1 → 2
capitol 1 → 1            me 1 → 1
caesar 2 → 1 → 2         noble 1 → 2
did 1 → 1                so 1 → 2
enact 1 → 1              the 2 → 1 → 2
hath 1 → 2               told 1 → 2
I 1 → 1                  you 1 → 2
i’ 1 → 1                 was 2 → 1 → 2
it 1 → 2                 with 1 → 2

Primary sort by term (dictionary)
Secondary sort (within each postings list) by document ID
Document frequency (= length of postings list):
for more efficient Boolean searching (later today)
for term weighting (lecture 4)
Keep the dictionary in memory
Keep the postings lists (much larger) on disk

47
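The collect–tokenise–sort–group pipeline of the last two slides can be sketched in a few lines. This is a minimal illustration under my own assumptions (the function name `build_index` and the crude whitespace-plus-punctuation-stripping tokenisation are mine), not the lecture's actual code.

```python
from itertools import groupby

def build_index(docs):
    """Build an inverted index by sorting (term, docID) pairs and grouping.

    docs: dict mapping docID -> text.
    Returns: dict mapping term -> (document frequency, postings list)."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token.strip(".,:;"), doc_id))   # crude tokenisation
    pairs.sort()                          # primary key: term, secondary: docID
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        postings = sorted({doc_id for _, doc_id in group})  # "uniq" step
        index[term] = (len(postings), postings)   # df = length of postings list
    return index

docs = {1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."}
index = build_index(docs)
```

Note that `killed`, which occurs twice in document 1, collapses to a single posting in the grouping step, exactly as on the slide.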
Data structures for Postings Lists

Singly linked list


Allow cheap insertion of documents into postings lists (e.g.,
when recrawling)
Naturally extend to skip lists for faster access
Variable length array
Better in terms of space requirements
Also better in terms of time requirements if memory caches are
used, as they use contiguous memory
Hybrid scheme: linked list of variable length array for each
term.
write posting lists on disk as contiguous block without explicit
pointers
minimises the size of postings lists and number of disk seeks

48
Optimisation: Skip Lists

Some postings lists can contain several million entries


Check skip list if present to skip multiple entries
For a list of length L, √L skips can be placed evenly.

49
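The evenly spaced placement can be illustrated with a list-intersection routine that follows a skip pointer whenever doing so does not overshoot the target. A sketch under my own assumptions (the helper name `intersect_with_skips` and the array representation are mine):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, with skip pointers
    spaced floor(sqrt(L)) entries apart for a list of length L."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if it does not jump past p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

On a long list intersected with a short one, most advances move √L entries at a time instead of one, which is where the saving comes from.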
Tradeoff Skip Lists

Number of items skipped vs. frequency that skip can be taken


More skips: each pointer skips only a few items, but we can
frequently use it.
Fewer skips: each skip pointer skips many items, but we can
not use it very often.
Skip pointers used to help a lot, but with today’s fast CPUs,
they don’t help that much anymore.
50
Algorithm: single-pass in-memory indexing or SPIMI

As we build the index, we parse docs one at a time.


The final postings for any term are incomplete until the end.
But for large collections, we cannot keep all postings in
memory and then sort in-memory at the end
We cannot sort very large sets of records on disk either (too
many disk seeks, expensive)
Thus: We need to store intermediate results on disk.
We need a scalable Block-Based sorting algorithm.

51
Single-pass in-memory indexing (1)

Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block.
Key idea 2: Accumulate postings in postings lists as they
occur.
With these two ideas we can generate a complete inverted
index for each block.
These separate indexes can then be merged into one big index.
Worked example!

52
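The two key ideas can be rendered as a toy sketch: a fresh dictionary per block, with postings accumulated as (term, docID) pairs arrive, and a final merge. This is my own simplification, not the lecture's worked example — in real SPIMI each block is sorted and written to disk rather than yielded in memory.

```python
from collections import defaultdict

def spimi_invert(token_stream, block_size):
    """Yield one in-memory index (term -> postings list) per block."""
    block = defaultdict(list)
    for n, (term, doc_id) in enumerate(token_stream, 1):
        postings = block[term]          # dictionary entry created on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)     # accumulate postings as they occur
        if n % block_size == 0:
            yield dict(block)           # real SPIMI: sort terms, write block to disk
            block = defaultdict(list)
    if block:
        yield dict(block)

def merge_blocks(blocks):
    """Merge the separate per-block indexes into one big index."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            for doc_id in postings:
                if doc_id not in merged[term]:
                    merged[term].append(doc_id)
    return {term: sorted(ps) for term, ps in merged.items()}

stream = [("caesar", 1), ("killed", 1), ("caesar", 2), ("noble", 2)]
index = merge_blocks(spimi_invert(stream, block_size=2))
```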
Single-pass in-memory indexing (2)

53
Single-pass in-memory indexing (3)

We could save space in memory by assigning term-ids to


terms for each block-based dictionary
However, we then need to have an in-memory term-term-id
mapping which often does not fit in memory (on a single
machine at least)
This approach is called blocked sort-based indexing (BSBI);
you can read about it in the book (Chapter 4.2)

54
Overview

1 Index construction
Postings list and Skip lists
Single-pass Indexing

2 Document and Term Normalisation


Documents
Terms
Reuters RCV1 and Heaps’ Law
Document and Term Normalisation

To build an inverted index, we need to get from

Input: Friends, Romans, countrymen. So let it be with Caesar. . .


Output: friend roman countryman so
Each token is a candidate for a postings entry.
What are valid tokens to emit?

55
Documents

Up to now, we assumed that


We know what a document is.
We can “machine-read” each document
More complex in reality

56
Parsing a document

We need to deal with the format and language of each document


Format could be excel, pdf, latex, word. . .
What language is it in?
What character set is it in?
Each of these is a statistical classification problem
Alternatively we can use heuristics

57
Character decoding

Text is not just a linear stream of logical “characters”...


Determine correct character encoding (Unicode UTF-8) – by
ML or by metadata or heuristics.
Compressions, binary representation (DOC)
Treat XML character entities separately (e.g., &amp;amp;)

58
Format/Language: Complications

A single index usually contains terms of several languages.


Documents or their components can contain multiple
languages/format, for instance a French email with a Spanish
pdf attachment
What is the document unit for indexing?
a file?
an email?
an email with 5 attachments?
an email thread?
Answering the question “What is a document?” is not trivial.
Smaller units raise precision, drop recall
Also might have to deal with XML/hierarchies of HTML
documents etc.

59
Normalisation

Need to normalise words in the indexed text as well as query


terms to the same form
Example: We want to match U.S.A. to USA
We most commonly implicitly define equivalence classes of
terms.
Alternatively, we could do asymmetric expansion:

window → window, windows


windows → Windows,
windows, window
Windows → Windows
Either at query time, or at index time
More powerful, but less efficient

60
Tokenisation

Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.

Candidate tokenisations:

O’Neill → neill | oneill | o’neill | o’ neill | o neill ?
aren’t → aren’t | arent | are n’t | aren t ?

61
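Two regex-based tokenisers make the ambiguity concrete: depending on how apostrophes are handled, the same sentence yields different candidate token sets. The patterns below are my own illustrations, not a recommended tokeniser.

```python
import re

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Splitting on non-word characters breaks tokens at every apostrophe:
plain = re.findall(r"\w+", sentence.lower())

# Allowing one internal apostrophe keeps o'neill and aren't as single tokens:
apos = re.findall(r"\w+'?\w*", sentence.lower())
```

`plain` begins `['mr', 'o', 'neill', ...]` while `apos` keeps `o'neill` and `aren't` whole — neither choice is obviously "correct", which is exactly the point of the slide.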
Tokenisation problems: One word or two? (or several)

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

62
Numbers

20/3/91
3/20/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Older IR systems may not index numbers...


... but generally it’s a useful feature.

63
Chinese: No Whitespace

Need to perform word segmentation


Use a lexicon or supervised machine-learning

64
Chinese: Ambiguous segmentation

As one word, means “monk”


As two words, means “and” and “still”

65
Other cases of “no whitespace”: Compounding

Compounding in Dutch, German, Swedish

German
Lebensversicherungsgesellschaftsangestellter
leben+s+versicherung+s+gesellschaft+s+angestellter

66
Other cases of “no whitespace”: Agglutination
“Agglutinative” languages do this not just for compounds:

Inuit
tusaatsiarunnangittualuujunga
(= “I can’t hear very well”)

Finnish
epäjärjestelmällistyttämättömyydellänsäkäänköhän
(= “I wonder if – even with his/her quality of not
having been made unsystematized”)

Turkish
Çekoslovakyalılaştıramadıklarımızdanmışçasına
(= “as if you were one of those whom we could not
make resemble the Czechoslovacian people”)

67
Japanese

Different scripts (alphabets) might be mixed in one language.


Japanese has 4 scripts: kanji, katakana, hiragana, Romaji
no spaces

68
Arabic script and bidirectionality

Direction of writing changes in some scripts (writing systems);


e.g., Arabic.

Rendering vs. conceptual order


Bidirectionality is not a problem if Unicode encoding is chosen

69
Accents and diacritics

résumé vs. resume


Universität
Meaning-changing in some languages:

peña = cliff, pena = sorrow


(Spanish)
Main questions: will users apply it when querying?

70
Case Folding

Reduce all letters to lower case


Even though case can be semantically distinguishing

Fed vs. fed


March vs. march
Turkey vs. turkey
US vs. us

Best to reduce to lowercase, because users will use lowercase
regardless of correct capitalisation.

71
Stop words

Extremely common words which are of little value in helping


select documents matching a user need

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of,
on, that, the, to, was, were, will, with

Used to be standard in older IR systems.


Need them to search for

to be or not to be
prince of Denmark
bamboo in water

Length of practically used stoplists has shrunk over the years.


Most web search engines do index stop words.

72
More equivalence classing

Thesauri: semantic equivalence, car = automobile


Soundex: phonetic equivalence, Müller = Mueller; lecture 3

73
Lemmatisation

Reduce inflectional/variant forms to base form

am, are, is → be
car, car’s, cars’, cars → car
the boy’s cars are different colours → the boy car be different color

Lemmatisation implies doing “proper” reduction to dictionary


headword form (the lemma)
Inflectional morphology (cutting → cut)
Derivational morphology (destruction → destroy)

74
Stemming

Stemming is a crude heuristic process that chops off the ends


of words in the hope of achieving what “principled”
lemmatisation attempts to do with a lot of linguistic
knowledge.
language dependent, but fast and space-efficient
does not require a stem dictionary, only a suffix dictionary
Often both inflectional and derivational

automate, automation, automatic → automat

Root changes (deceive/deception, resume/resumption) aren’t


dealt with, but these are rare

75
Porter Stemmer

M. Porter, “An algorithm for suffix stripping”, Program


14(3):130-137, 1980
Most common algorithm for stemming English
Results suggest it is at least as good as other stemmers
Syllable-like shapes + 5 phases of reductions
Of the rules in a compound command, select the top one and
exit that compound (this rule will have affected the longest
suffix possible, due to the ordering of the rules).

76
Stemming: Representation of a word

[C] (VC){m}[V]
C : one or more adjacent consonants
V : one or more adjacent vowels
[ ] : optionality
( ) : group operator
{x} : repetition x times
m : the “measure” of a word

shoe = [sh]C [oe]V, m = 0
Mississippi = [M]C ([i]V [ss]C)([i]V [ss]C)([i]V [pp]C)[i]V, m = 3
ears = ([ea]V [rs]C), m = 1

Note: the measure m is calculated on the word excluding the suffix of
the rule under consideration

77
Porter stemmer: selected rules

SSES → SS
IES → I
SS → SS
S→
caresses → caress
cares → care

(m>0) EED → EE
feed → feed
agreed → agree
BUT: freed, succeed

78
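The selected rules can be expressed directly in code. The sketch below uses a simplified measure function (vowels = aeiou, ignoring the special role of y) and covers only the rules shown on these slides — it is my own illustration, not the full Porter algorithm.

```python
import re

def measure(stem):
    """Porter's m: the number of (VC) sequences in the [C](VC){m}[V] form."""
    cv = "".join("V" if ch in "aeiou" else "C" for ch in stem.lower())
    return len(re.findall(r"V+C", cv))   # each vowel-run followed by a consonant

def step_1a(word):
    """SSES -> SS, IES -> I, SS -> SS, S -> (empty); the top matching rule wins."""
    for suffix, replacement in [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def rule_eed(word):
    """(m>0) EED -> EE, with m measured on the word excluding the suffix."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-3] + "ee"
    return word
```

The condition on m is what keeps feed unchanged (stem "f", m = 0) while agreed becomes agree (stem "agr", m = 1).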
Porter Stemmer: selected rules

(*v*) ED →

plastered → plaster
bled → bled

79
Three stemmers: a comparison

Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation.

Porter Stemmer
such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to a pictur of express that is more biolog transpar
and access to interpret

Lovins Stemmer
such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to
interpres

Paice Stemmer
such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret

80
Does stemming improve effectiveness?
In general, stemming increases effectiveness for some queries
and decreases it for others.
Example queries where stemming helps
tartan sweaters → sweater, sweaters
sightseeing tour san francisco → tour, tours

Example queries where stemming hurts


operational research → “oper” = operates, operatives, operate,
operation, operational, operative

operating system → operates, operatives, operate, operation,


operational, operative

operative dentistry → operates, operatives, operate, operation,


operational, operative

81
Phrase Queries

We want to answer a query such as [cambridge university] –


as a phrase.
“The Duke of Cambridge recently went for a term-long course
to a famous university” should not be a match
About 10% of web queries are phrase queries.
Consequence for inverted indexes: no longer sufficient to store
docIDs in postings lists.
Two ways of extending the inverted index:
biword index
positional index

82
Biword indexes

Index every consecutive pair of terms in the text as a phrase.

Friends, Romans, Countrymen


Generates two biwords:
friends romans
romans countrymen

Each of these biwords is now a vocabulary term.


Two-word phrases can now easily be answered.

83
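Biword generation is a one-line sliding window over the term sequence; a minimal sketch (the helper name `biwords` and the crude punctuation stripping are my assumptions):

```python
def biwords(text):
    """Index every consecutive pair of terms in the text as a phrase term."""
    terms = [t.strip(",.").lower() for t in text.split()]
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

pairs = biwords("Friends, Romans, Countrymen")
```

Each resulting pair is then treated as an ordinary vocabulary term in the dictionary.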
Longer phrase queries

A long phrase like cambridge university west campus can be


represented as the Boolean query

cambridge university AND university west AND west campus

We need to do post-filtering of hits to identify subset that


actually contains the 4-word phrase.

84
Issues with biword indexes

Why are biword indexes rarely used?

85
Issues with biword indexes

Why are biword indexes rarely used?


False positives, as noted above
Index blowup due to very large term vocabulary

85
Positional indexes

Positional indexes are a more efficient alternative to biword


indexes.
Postings lists in a nonpositional index: each posting is just a
docID
Postings lists in a positional index: each posting is a docID
and a list of positions (offsets)

86
Positional indexes: Example

Query: “to_1 be_2 or_3 not_4 to_5 be_6”


to, 993427:
< 1: < 7, 18, 33, 72, 86, 231>;
2: <1, 17, 74, 222, 255>;
4: <8, 16, 190, 429, 433>;
5: <363, 367>;
7: <13, 23, 191>;
... ...>

be, 178239:
< 1: < 17, 25>;
4: < 17, 191, 291, 430, 434>;
5: <14, 19, 101>;
... ...>

Document 4 is a match.
(As always: docid, term, doc freq; new: offsets)

87
Proximity search

We just saw how to use a positional index for phrase searches.


We can also use it for proximity search.

employment /4 place
Find all documents that contain employment and place within
4 words of each other.
HIT: Employment agencies that place healthcare workers are
seeing growth.
NO HIT: Employment agencies that have learned to adapt
now place healthcare workers.

88
Proximity search

Use the positional index


Simplest algorithm: look at cross-product of positions of (i)
“employment” in document and (ii) “place” in document
Very inefficient for frequent words, especially stop words
Note that we want to return the actual matching positions,
not just a list of documents.
This is important for dynamic summaries etc.

89
Proximity intersection
PositionalIntersect(p1, p2, k)
answer ← ⟨⟩
while p1 ≠ nil and p2 ≠ nil
  do if docID(p1) = docID(p2)
     then l ← ⟨⟩
          pp1 ← positions(p1)
          pp2 ← positions(p2)
          while pp1 ≠ nil
            do while pp2 ≠ nil
                 do if |pos(pp1) − pos(pp2)| ≤ k
                      then Add(l, pos(pp2))
                      else if pos(pp2) > pos(pp1)
                             then break
                    pp2 ← next(pp2)
               while l ≠ ⟨⟩ and |l[0] − pos(pp1)| > k
                 do Delete(l[0])
               for each ps ∈ l
                 do Add(answer, ⟨docID(p1), pos(pp1), ps⟩)
               pp1 ← next(pp1)
          p1 ← next(p1)
          p2 ← next(p2)
     else if docID(p1) < docID(p2)
            then p1 ← next(p1)
            else p2 ← next(p2)
return answer
90
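The pseudocode translates almost line for line into Python, assuming each postings list is represented as a list of (docID, sorted position list) pairs — a representation I am choosing for the sketch, not one mandated by the slides.

```python
def positional_intersect(p1, p2, k):
    """Proximity intersection of two positional postings lists.

    p1, p2: lists of (docID, sorted positions) pairs, sorted by docID.
    Returns (docID, pos1, pos2) triples with |pos1 - pos2| <= k."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            window, jj = [], 0            # 'l' in the pseudocode
            for pos1 in positions1:
                while jj < len(positions2):
                    pos2 = positions2[jj]
                    if abs(pos1 - pos2) <= k:
                        window.append(pos2)
                    elif pos2 > pos1:
                        break             # pos2 already beyond the window
                    jj += 1
                while window and abs(window[0] - pos1) > k:
                    window.pop(0)         # drop positions too far behind pos1
                for pos2 in window:
                    answer.append((doc1, pos1, pos2))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer
```

On the employment /4 place example, a document with positions 0 and 3 matches while one with positions 0 and 8 does not.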
Combination scheme

Biword indexes and positional indexes can be profitably


combined.
Many biwords are extremely frequent: Michael Jackson,
Britney Spears etc
For these biwords, increased speed compared to positional
postings intersection is substantial.
Combination scheme: Include frequent biwords as vocabulary
terms in the index. Do all other phrases by positional
intersection.
Williams et al. (2004) evaluate a more sophisticated mixed
indexing scheme. Faster than a positional index, at a cost of
26% more space for index.
For web search engines, positional queries are much more
expensive than regular Boolean queries.

91
RCV1 collection

Shakespeare’s collected works are not large enough to


demonstrate scalable index construction algorithms.
Instead, we will use the Reuters RCV1 collection.
English newswire articles published in a 12 month period
(1995/6)

N documents 800,000
M terms (= word types) 400,000
T non-positional postings 100,000,000

92
Effect of preprocessing for Reuters

                  dictionary              non-positional postings     positional postings
                  (word types = terms)                                (word tokens)
                  size      Δ%   cml%     size          Δ%   cml%     size          Δ%   cml%
unfiltered        484,494                 109,971,179                 197,879,290
no numbers        473,723   -2   -2       100,680,242   -8   -8       179,158,204   -9   -9
case folding      391,523  -17  -19        96,969,056   -3  -12       179,158,204   -0   -9
30 stopwords      391,493   -0  -19        83,390,443  -14  -24       121,857,825  -31  -38
150 stopwords     391,373   -0  -19        67,001,847  -30  -39        94,516,599  -47  -52
stemming          322,383  -17  -33        63,812,300   -4  -42        94,516,599   -0  -52

(Δ% = reduction relative to the previous row; cml% = cumulative reduction.)

93
How big is the term vocabulary?

That is, how many distinct words are there?


Can we assume there is an upper bound?
Not really: At least 70^20 ≈ 10^37 different words of length 20.
The vocabulary will keep growing with collection size.
Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number of tokens in
the collection.
Typical values for the parameters k and b are: 30 ≤ k ≤ 100
and b ≈ 0.5.
Heaps’ law is linear in log-log space.
It is the simplest possible relationship between collection size
and vocabulary size in log-log space.
Empirical law

94
Heaps’ law for Reuters
Vocabulary size M as a
function of collection size
T (number of tokens) for
Reuters-RCV1. For these
data, the dashed line
log10 M = 0.49 · log10 T + 1.64
is the best least-squares fit.
Thus M = 10^1.64 · T^0.49,
i.e. k = 10^1.64 ≈ 44 and
b = 0.49.

95
Empirical fit for Reuters

Good, as we just saw in the graph.


Example: for the first 1,000,020 tokens Heaps’ law predicts
38,323 terms:

44 × 1,000,020^0.49 ≈ 38,323

The actual number is 38,365 terms, very close to the


prediction.
Empirical observation: fit is good in general.

96
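The prediction can be checked directly by plugging the fitted parameters into Heaps’ law:

```python
# Heaps' law: M = k * T**b, with the Reuters-RCV1 fit k = 44, b = 0.49
k, b = 44, 0.49
T = 1_000_020            # number of tokens seen so far
M = k * T ** b           # predicted vocabulary size

# The fit predicts roughly 38,323 terms; the observed value is 38,365.
```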
Take-away

Understanding of the basic units of classical information
retrieval systems, words and documents: What is a document?
What is a term?
Tokenisation: how to get from raw text to terms (or tokens)
More complex indexes for phrases

97
Reading

MRS Chapter 2.2


MRS Chapter 2.4
MRS Chapter 4.3

98
