Summary: This lecture covers the indexer component of an information retrieval system, focusing on index construction and document/term normalisation. Index construction involves collecting documents, tokenising the text, preprocessing tokens, and indexing which terms occur in which documents. Normalisation maps the words in the text to normalised terms (e.g. removing case and morphological variation) so that related word forms are grouped together. Further topics include optimisations such as skip lists in postings lists, the difficulty of determining document boundaries and languages across formats, and biword and positional indexes.

Lecture 2: Datastructures and Algorithms for

Indexing
Information Retrieval
Computer Science Tripos Part II

Simone Teufel

Natural Language and Information Processing (NLIP) Group

[email protected]

Lent 2014

IR System Components

Today: the indexer
Overview

1 Index construction

2 Document and Term Normalisation


Documents
Terms

3 Other types of indexes


Biword indexes
Positional indexes
Index construction

The major steps in inverted index construction:


1. Collect the documents to be indexed.
2. Tokenise the text.
3. Perform linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.

Definitions

Word: a delimited string of characters as it appears in the text.
Term: a “normalised” word (case, morphology, spelling, etc.); an equivalence class of words.
Token: an instance of a word or term occurring in a document.
Type: an equivalence class of tokens (the same as “term” in most cases).

Example: index creation by sorting

Doc 1: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Tokenisation produces (term, docID) pairs in text order:

I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i’ 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sorting by term (then by docID) gives:

ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i’ 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Index creation; grouping step (“uniq”)

Term (doc. freq.) → postings list

ambitious (1) → 2
be (1) → 2
brutus (2) → 1 → 2
capitol (1) → 1
caesar (2) → 1 → 2
did (1) → 1
enact (1) → 1
hath (1) → 2
I (1) → 1
i’ (1) → 1
it (1) → 2
julius (1) → 1
killed (1) → 1
let (1) → 2
me (1) → 1
noble (1) → 2
so (1) → 2
the (2) → 1 → 2
told (1) → 2
was (2) → 1 → 2
with (1) → 2
you (1) → 2

Primary sort is by term (the dictionary); secondary sort (within each postings list) is by document ID.
The document frequency (= length of the postings list) is kept for more efficient Boolean searching (cf. lecture 1) and for term weighting (lecture 4).
Keep the dictionary in memory; keep the postings lists (much larger) on disk.
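The sort-based construction above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (the regular-expression tokeniser and the two hard-coded documents are just for demonstration, not a reference implementation):

```python
import re
from collections import defaultdict

def tokenise(text):
    """Case-fold and split on non-letters; keeps internal
    apostrophes so that i' survives as a token."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(docs):
    """docs: docID -> text.  Returns term -> postings list
    (sorted list of docIDs, without duplicates)."""
    # Steps 1-2: tokenise and collect (term, docID) pairs in text order.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in tokenise(text)]
    # Step 3: sort by term, then by docID.
    pairs.sort()
    # Step 4: group ("uniq") duplicate pairs into postings lists.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_index(docs)
print(index["caesar"])       # [1, 2]
print(len(index["brutus"]))  # document frequency of "brutus": 2
```

The length of each list gives the document frequency directly, matching the table above.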
Optimisation: Skip Lists

Some postings lists can contain several million entries


Enter skip lists
If a skip pointer is present, follow it to skip over multiple entries during list intersection.

Tradeoff: How many skips to place?


More skips: each pointer skips only a few items, but we can
frequently use it.
Fewer skips: each skip pointer skips many items, but we can
not use it very often.

Workable heuristic: for a postings list of length L, place √L evenly-spaced skip pointers.
With today’s fast CPUs, skip lists don’t help that much
anymore.
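The intersection with skip pointers can be sketched as follows (a minimal illustration assuming √L evenly-spaced skips computed as a stride; real implementations store explicit skip pointers in the postings structure):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists of docIDs, taking a
    skip of length sqrt(L) whenever it does not overshoot the
    current candidate in the other list."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it cannot jump past a match.
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 2, 3, 5, 8, 13, 21, 34, 55, 89],
                           [1, 3, 8, 21, 55, 200]))  # [1, 3, 8, 21, 55]
```

A skip is safe exactly when the skipped-to entry is still ≤ the other pointer's value: every entry jumped over is then too small to match anything remaining in the other list.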
Overview

1 Index construction

2 Document and Term Normalisation


Documents
Terms

3 Other types of indexes


Biword indexes
Positional indexes
Document and Term Normalisation

To build an inverted index, we need to get from Input

Friends, Romans, countrymen. So let it be with Caesar. . .

to Output

friend roman countryman so

Each token is a candidate for a postings entry.


What are valid tokens to emit?

Parsing a document

Up to now, we assumed that

we know what a document is
we can easily “machine-read” each document

In reality, we need to deal with the format and language of each document:

Format could be Excel, LaTeX, HTML, . . .
Document could be compressed or in a binary format (Excel, Word)
Character encoding could be Unicode (e.g. UTF-8), Big-5, or use XML entities (&amp;)
Language could be a French email with a Spanish quote or attachment

Each of these is a statistical classification problem.
Alternatively, we can use heuristics.

Format/Language: Complications

A single index usually contains terms of several languages.


Documents or their components can contain multiple
languages
What is the document unit for indexing?
a file?
an email?
an email with 5 attachments?
an email thread?
Also might have to deal with XML/hierarchies of HTML
documents etc.
Answering the question “What is a document?” is not trivial.
Smaller units raise precision but lower recall.

Normalisation

Need to normalise words in the indexed text, as well as query terms, to the same form.
Example: We want to match U.S.A. to USA
We most commonly implicitly define equivalence classes of
terms.
Alternatively, we could do asymmetric expansion:

window → window, windows


windows → Windows, windows, window
Windows → Windows

Expansion can be applied either at query time or at index time.
More powerful than equivalence classing, but less efficient.
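The two strategies can be contrasted in a toy sketch (the period-deleting normaliser and the expansion table are illustrative assumptions, not any real system's rules):

```python
# Equivalence classing: applied identically to indexed text and
# queries, so all variants collapse to one normalised term.
def normalise(term):
    """Toy normaliser: case-fold and delete periods, so that
    U.S.A. and USA fall into the same equivalence class."""
    return term.lower().replace(".", "")

# Asymmetric expansion: each query term expands to a hand-specified
# set of index terms; lower-case "window" matches more forms than
# capitalised "Windows" does.
EXPANSIONS = {
    "window":  {"window", "windows"},
    "windows": {"Windows", "windows", "window"},
    "Windows": {"Windows"},
}

def expand(query_term):
    """Return the set of index terms a query term should match."""
    return EXPANSIONS.get(query_term, {query_term})

print(normalise("U.S.A."))        # usa
print(sorted(expand("Windows")))  # ['Windows']
```

Note the asymmetry: expanding at query time avoids re-indexing (more powerful) but makes every query touch several postings lists (less efficient).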

Tokenisation

Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.

Possible tokenisations of O’Neill: neill | oneill | o’neill | o’ neill | o neill — which is right?
Possible tokenisations of aren’t: aren’t | arent | aren t | are n’t — which is right?

Tokenisation problems: One word or two? (or several)

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Numbers

20/3/91
3/20/91
Mar 20, 1991
B-52
6-year-old
100.2.86.144
(800) 234-2333
800.234.2333
.74189359872398457

Older IR systems may not index numbers


... but generally it’s a useful feature.

Chinese: No Whitespace

Need to perform word segmentation


Use a lexicon or supervised machine-learning
Ambiguity: the same character sequence can mean “monk” when segmented as one word, but “and” + “still” when segmented as two words.

Script-related Problems

Different scripts (alphabets) might be mixed in one language.


e.g., Japanese has 4 scripts: kanji, katakana, hiragana, romaji
no spaces

Scripts can incorporate different reading directions.


e.g., Arabic script and bidirectionality
Rendering vs. conceptual order
Other cases of “no whitespace”: Compounding

Compounding in Dutch, German, Swedish

German
Lebensversicherungsgesellschaftsangestellter
leben+s+versicherung+s+gesellschaft+s+angestellter

Other cases of “no whitespace”: Agglutination
“Agglutinative” languages do this not just for compounds:

Inuit
tusaatsiarunnangittualuujunga
(= “I can’t hear very well”)

Finnish
epäjärjestelmällistyttämättömyydellänsäkäänköhän
(= “I wonder if - even with his/her quality of not
having been made unsystematized”)

Turkish
Çekoslovakyalılaştıramadıklarımızdanmışçasına
(= “as if you were one of those whom we could not
make resemble the Czechoslovakian people”)

Casefolding, accents, diacritics

Casefolding can be semantically distinguishing:

Fed vs. fed


March vs. march
Turkey vs. turkey
US vs. us

Though in most cases it’s not.


Accents and Diacritics can be semantically distinguishing:
Spanish
peña = cliff, pena = sorrow

Though in most cases they are not (résumé vs. resume)


Most systems case-fold (reduce all letters to lower case) and
throw away accents.
Main decision criterion: will users apply it when querying?

Stop words

Extremely common words which are of little value in helping select documents matching a user need:

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Stop words were standardly excluded from the index in older IR systems.


Need them to search for the following queries:

to be or not to be
prince of Denmark
bamboo in water

Length of practically used stoplists has shrunk over the years.


Most web search engines do index stop words.

Lemmatisation

Reduce inflectional/variant forms to base form

am, are, is → be
car, car’s, cars’, cars → car
the boy’s cars are different colours → the boy car be different color

Lemmatisation implies doing “proper” reduction to dictionary


headword form (the lemma)
Inflectional morphology (cutting → cut)
Derivational morphology (destruction → destroy)

Stemming

Stemming is a crude heuristic process that chops off the ends of words, in the hope of achieving what “principled” lemmatisation attempts to do with a lot of linguistic knowledge.

automate, automation, automatic → automat

language dependent, but fast and space-efficient


does not require a stem dictionary, only a suffix dictionary
Often both inflectional and derivational

Porter Stemmer

M. Porter, “An algorithm for suffix stripping”, Program 14(3):130–137, 1980
Most common algorithm for stemming English
Results suggest it is at least as good as other stemmers
Syllable-like shapes + 5 phases of reductions
Of the rules in a compound command, apply only the top-most one that matches and then exit that compound (due to the ordering of the rules, this rule will have affected the longest possible suffix).

Stemming: Representation of a word

[C] (VC){m}[V]
C : one or more adjacent consonants
V : one or more adjacent vowels
[ ] : optionality
( ) : group operator
{x} : repetition x times
m : the “measure” of a word

shoe [sh]C [oe]V m=0


Mississippi [M]C ([i]V [ss]C )([i]V [ss]C )([i]V [pp]C )[i]V m=3
ears ([ea]V [rs]C ) m=1

Notation: measure m is calculated on the word excluding the suffix of


the rule under consideration
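The measure m can be computed by collapsing runs of vowels and consonants into single V/C symbols and counting VC sequences. This is a sketch under a simplifying assumption: it treats a, e, i, o, u as the vowels and ignores Porter's special handling of 'y':

```python
import re

def measure(stem):
    """Porter's measure m: the number of VC sequences when the
    stem is written as [C](VC){m}[V]."""
    form = re.sub(r"[aeiou]+", "V", stem.lower())  # vowel runs -> V
    form = re.sub(r"[^V]+", "C", form)             # the rest   -> C
    return form.count("VC")

print(measure("shoe"))         # 0
print(measure("mississippi"))  # 3
print(measure("ears"))         # 1
```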

Porter stemmer: selected rules

SSES → SS
IES → I
SS → SS
S→
caresses → caress
cares → care

(m>0) EED → EE

feed → feed
agreed → agree
BUT: freed, succeed

(*v*) ED →

plastered → plaster
bled → bled
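The “apply only the first matching rule, then exit the compound” behaviour can be sketched for step 1a above (an illustrative fragment, not the full Porter algorithm; the suffixes are ordered so the longest match wins):

```python
def step_1a(word):
    """Porter step 1a as a compound command: try the rules in
    order and apply only the first one whose suffix matches."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # caress
print(step_1a("cares"))     # care
print(step_1a("caress"))    # caress  (SS -> SS blocks the bare S rule)
```

The seemingly useless SS → SS rule exists precisely to stop “caress” from falling through to the S → (delete) rule.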

Three stemmers: a comparison

Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation.

Porter Stemmer
such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to a pictur of express that is more biolog transpar
and access to interpret

Lovins Stemmer
such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to
interpres

Paice Stemmer
such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret

Does stemming improve effectiveness?

In general, stemming increases effectiveness for some queries


and decreases it for others.
Example queries where stemming helps
tartan sweaters → sweater, sweaters
sightseeing tour san francisco → tour, tours

Example queries where stemming hurts


operational research → oper (= operates, operatives, operate, operation, operational, operative)
operating system → oper
operative dentistry → oper

More equivalence classing

Thesauri: semantic equivalence, car = automobile


Soundex: phonetic equivalence, Muller = Mueller

Overview

1 Index construction

2 Document and Term Normalisation


Documents
Terms

3 Other types of indexes


Biword indexes
Positional indexes
Phrase Queries

We want to answer a query such as [cambridge university] as a phrase.

Documents that merely contain both words somewhere should not match; only documents containing the exact phrase “cambridge university” should.

Phrase Queries

About 10% of web queries are phrase queries.


Consequence for inverted indexes: no longer sufficient to store
docIDs in postings lists.
Two ways of extending the inverted index:
biword index
positional index

Biword indexes

Index every consecutive pair of terms in the text as a phrase.

Friends, Romans, Countrymen


Generates two biwords:
friends romans
romans countrymen

Each of these biwords is now a vocabulary term.


Two-word phrases can now easily be answered.
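A biword index can be sketched as follows (an illustration assuming pre-tokenised documents; the document contents are invented for the example):

```python
from collections import defaultdict

def build_biword_index(docs):
    """docs: docID -> list of tokens.  Every consecutive pair of
    tokens becomes a vocabulary term of its own."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second].add(doc_id)
    return index

docs = {
    1: ["friends", "romans", "countrymen"],
    2: ["romans", "and", "countrymen"],
}
index = build_biword_index(docs)
print(sorted(index["friends romans"]))     # [1]
print(sorted(index["romans countrymen"]))  # [1]
```

Document 2 contains both “romans” and “countrymen” but not adjacently, so the biword “romans countrymen” correctly excludes it.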

Longer phrase queries

A long phrase like “cambridge university west campus” can be represented as the Boolean query:

cambridge university AND university west AND west campus

We need to post-filter the hits to identify the subset that actually contains the 4-word phrase.

Issues with biword indexes

Why are biword indexes rarely used?


False positives, as noted above
Index blowup due to very large term vocabulary

Positional indexes

Positional indexes are a more efficient alternative to biword


indexes.
Postings lists in a nonpositional index: each posting is just a
docID
Postings lists in a positional index: each posting is a docID
and a list of positions (offsets)

Positional indexes: Example

Query: “to1 be2 or3 not4 to5 be6”


to, 993427:
〈 1: 〈 7, 18, 33, 72, 86, 231 〉 ;
2: 〈 1, 17, 74, 222, 255 〉 ;
4: 〈 8, 16, 190, 429, 433 〉 ;
5: 〈 363, 367 〉 ;
7: 〈 13, 23, 191 〉 ; . . . 〉

be, 178239:
〈 1: 〈 17, 25 〉 ;
4: 〈 17, 191, 291, 430, 434 〉 ;
5: 〈 14, 19, 101 〉 ; . . . 〉

As always, each entry records the term, its document frequency, and its docIDs; what is new is that each docID now carries a list of offsets.

Document 4 is a match: it contains “to” at position 429 immediately followed by “be” at position 430 (likewise at 16/17, 190/191, and 433/434).
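The adjacency check for one biword of the phrase, within a single document, can be sketched as follows (a full phrase query first intersects the docID lists, then chains this check across consecutive query terms):

```python
def phrase_match(positions_a, positions_b):
    """Positions where term A is immediately followed by term B,
    given the two sorted position lists for one document."""
    following = set(positions_b)
    return [p for p in positions_a if p + 1 in following]

# Positional postings of "to" and "be" in document 4 (from the slide):
print(phrase_match([8, 16, 190, 429, 433],
                   [17, 191, 291, 430, 434]))  # [16, 190, 429, 433]
```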
Complexity of search with positional index

Unfortunately, the complexity is Θ(T) rather than Θ(N)


T ...number of tokens in document collection
N ...number of documents in document collection
Combination scheme:
Include frequent biwords as vocabulary terms in the index
(“Cambridge University”, “Britney Spears”)
Resolve all other phrases by positional intersection

Proximity search

We just saw how to use a positional index for phrase searches.


We can also use it for proximity search.

employment /4 place
Find all documents that contain employment and place within
4 words of each other.
HIT: Employment agencies that place healthcare workers are
seeing growth.
NO HIT: Employment agencies that have learned to adapt
now place healthcare workers.

Proximity search with positional index

Simplest algorithm: look at the cross-product of the positions of (i) “employment” and (ii) “place” within each document.
Note that we want to return the actual matching positions,
not just a list of documents.
Very inefficient for frequent words, especially stop words
More efficient algorithm in book
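The cross-product algorithm can be sketched for a single document's position lists (deliberately naive, and quadratic in the number of occurrences; the book's merge-based version is linear):

```python
def proximity_match(positions_a, positions_b, k):
    """All position pairs (pa, pb) within k words of each other:
    the naive cross-product over the two position lists."""
    return [(pa, pb)
            for pa in positions_a
            for pb in positions_b
            if abs(pa - pb) <= k]

# employment /4 place, with illustrative positions in one document:
print(proximity_match([1, 20], [4, 30], 4))  # [(1, 4)]
```

Returning the matching position pairs (not just a boolean per document) is what lets the system highlight the actual hit.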

Take-away

Understanding of the basic units of classical information retrieval systems, words and documents: What is a document? What is a term?
Tokenization: how to get from raw text to terms (or tokens)
More complex indexes for phrase and proximity search
biword index
positional index

Reading

MRS Chapter 2.2


MRS Chapter 2.4

