Summary: This lecture covers the indexer component of an information retrieval system, focusing on index construction and document/term normalisation. Index construction involves collecting documents, tokenising the text, preprocessing tokens, and indexing which terms occur in which documents. Normalisation maps the words in the text to normalised terms (e.g. removing case and morphological variation) so that related word forms are grouped together. Further topics include optimisations such as skip lists in postings lists, the difficulty of determining document boundaries and languages across formats, and biword and positional indexes.

Lecture 2: Datastructures and Algorithms for

Indexing
Information Retrieval
Computer Science Tripos Part II

Simone Teufel

Natural Language and Information Processing (NLIP) Group

[email protected]

Lent 2014

IR System Components

Today: the indexer
Overview

1 Index construction

2 Document and Term Normalisation


Documents
Terms

3 Other types of indexes


Biword indexes
Positional indexes
Index construction

The major steps in inverted index construction:


1. Collect the documents to be indexed.
2. Tokenise the text.
3. Perform linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.

Definitions

Word: a delimited string of characters as it appears in the text.
Term: a “normalised” word (case, morphology, spelling, etc.); an equivalence class of words.
Token: an instance of a word or term occurring in a document.
Type: an equivalence class of tokens (the same as “term” in most cases).

Example: index creation by sorting

Doc 1: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Tokenisation produces (term, docID) pairs in text order:

I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i’ 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sorting by term (then by docID) gives:

ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i’ 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Index creation; grouping step (“uniq”)

Term (doc. freq.) → postings list

ambitious (1) → 2
be (1) → 2
brutus (2) → 1 → 2
capitol (1) → 1
caesar (2) → 1 → 2
did (1) → 1
enact (1) → 1
hath (1) → 2
I (1) → 1
i’ (1) → 1
it (1) → 2
julius (1) → 1
killed (1) → 1
let (1) → 2
me (1) → 1
noble (1) → 2
so (1) → 2
the (2) → 1 → 2
told (1) → 2
was (2) → 1 → 2
with (1) → 2
you (1) → 2

Primary sort is by term (the dictionary); secondary sort (within each postings list) is by document ID.
The document frequency (= length of the postings list) is kept for more efficient Boolean searching (cf. lecture 1) and for term weighting (lecture 4).
Keep the dictionary in memory; keep the postings lists (much larger) on disk.
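The sort-based construction above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (the regular-expression tokeniser and the two hard-coded documents are just for demonstration, not a reference implementation):

```python
import re
from collections import defaultdict

def tokenise(text):
    """Case-fold and split on non-letters; keeps internal
    apostrophes so that i' survives as a token."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(docs):
    """docs: docID -> text.  Returns term -> postings list
    (sorted list of docIDs, without duplicates)."""
    # Steps 1-2: tokenise and collect (term, docID) pairs in text order.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in tokenise(text)]
    # Step 3: sort by term, then by docID.
    pairs.sort()
    # Step 4: group ("uniq") duplicate pairs into postings lists.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_index(docs)
print(index["caesar"])       # [1, 2]
print(len(index["brutus"]))  # document frequency of "brutus": 2
```

The length of each list gives the document frequency directly, matching the table above.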
Optimisation: Skip Lists

Some postings lists can contain several million entries


Enter skip lists
If a skip pointer is present, follow it to skip over multiple entries during list intersection.

Tradeoff: How many skips to place?


More skips: each pointer skips only a few items, but we can
frequently use it.
Fewer skips: each skip pointer skips many items, but we can
not use it very often.

Workable heuristic: for a postings list of length L, place √L evenly-spaced skip pointers.
With today’s fast CPUs, skip lists don’t help that much
anymore.
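The intersection with skip pointers can be sketched as follows (a minimal illustration assuming √L evenly-spaced skips computed as a stride; real implementations store explicit skip pointers in the postings structure):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists of docIDs, taking a
    skip of length sqrt(L) whenever it does not overshoot the
    current candidate in the other list."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it cannot jump past a match.
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 2, 3, 5, 8, 13, 21, 34, 55, 89],
                           [1, 3, 8, 21, 55, 200]))  # [1, 3, 8, 21, 55]
```

A skip is safe exactly when the skipped-to entry is still ≤ the other pointer's value: every entry jumped over is then too small to match anything remaining in the other list.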
Overview

1 Index construction

2 Document and Term Normalisation


Documents
Terms

3 Other types of indexes


Biword indexes
Positional indexes
Document and Term Normalisation

To build an inverted index, we need to get from Input

Friends, Romans, countrymen. So let it be with Caesar. . .

to Output

friend roman countryman so

Each token is a candidate for a postings entry.


What are valid tokens to emit?

Parsing a document

Up to now, we assumed that

we know what a document is
we can easily “machine-read” each document

In reality, we need to deal with the format and language of each document:

Format could be Excel, LaTeX, HTML, . . .
Document could be compressed or in a binary format (Excel, Word)
Character encoding could be Unicode (e.g. UTF-8), Big-5, or use XML entities (&amp;)
Language could be a French email with a Spanish quote or attachment

Each of these is a statistical classification problem.
Alternatively, we can use heuristics.

Format/Language: Complications

A single index usually contains terms of several languages.


Documents or their components can contain multiple
languages
What is the document unit for indexing?
a file?
an email?
an email with 5 attachments?
an email thread?
Also might have to deal with XML/hierarchies of HTML
documents etc.
Answering the question “What is a document?” is not trivial.
Smaller units raise precision but lower recall.

Normalisation

Need to normalise words in the indexed text, as well as query terms, to the same form.
Example: We want to match U.S.A. to USA
We most commonly implicitly define equivalence classes of
terms.
Alternatively, we could do asymmetric expansion:

window → window, windows


windows → Windows, windows, window
Windows → Windows

Expansion can be applied either at query time or at index time.
More powerful than equivalence classing, but less efficient.
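The two strategies can be contrasted in a toy sketch (the period-deleting normaliser and the expansion table are illustrative assumptions, not any real system's rules):

```python
# Equivalence classing: applied identically to indexed text and
# queries, so all variants collapse to one normalised term.
def normalise(term):
    """Toy normaliser: case-fold and delete periods, so that
    U.S.A. and USA fall into the same equivalence class."""
    return term.lower().replace(".", "")

# Asymmetric expansion: each query term expands to a hand-specified
# set of index terms; lower-case "window" matches more forms than
# capitalised "Windows" does.
EXPANSIONS = {
    "window":  {"window", "windows"},
    "windows": {"Windows", "windows", "window"},
    "Windows": {"Windows"},
}

def expand(query_term):
    """Return the set of index terms a query term should match."""
    return EXPANSIONS.get(query_term, {query_term})

print(normalise("U.S.A."))        # usa
print(sorted(expand("Windows")))  # ['Windows']
```

Note the asymmetry: expanding at query time avoids re-indexing (more powerful) but makes every query touch several postings lists (less efficient).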

Tokenisation

Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.

Possible tokenisations of O’Neill: neill | oneill | o’neill | o’ neill | o neill — which is right?
Possible tokenisations of aren’t: aren’t | arent | aren t | are n’t — which is right?

Tokenisation problems: One word or two? (or several)

Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company

cheap San Francisco-Los Angeles fares

York University vs. New York University

Numbers

20/3/91
3/20/91
Mar 20, 1991
B-52
6-year-old
100.2.86.144
(800) 234-2333
800.234.2333
.74189359872398457

Older IR systems may not index numbers


... but generally it’s a useful feature.

Chinese: No Whitespace

Need to perform word segmentation


Use a lexicon or supervised machine-learning
Ambiguity: the same character sequence can mean “monk” when segmented as one word, but “and” + “still” when segmented as two words.

Script-related Problems

Different scripts (alphabets) might be mixed in one language.


e.g., Japanese has 4 scripts: kanji, katakana, hiragana, romaji
no spaces

Scripts can incorporate different reading directions.


e.g., Arabic script and bidirectionality
Rendering vs. conceptual order
Other cases of “no whitespace”: Compounding

Compounding in Dutch, German, Swedish

German
Lebensversicherungsgesellschaftsangestellter
leben+s+versicherung+s+gesellschaft+s+angestellter

Other cases of “no whitespace”: Agglutination
“Agglutinative” languages do this not just for compounds:

Inuit
tusaatsiarunnangittualuujunga
(= “I can’t hear very well”)

Finnish
epäjärjestelmällistyttämättömyydellänsäkäänköhän
(= “I wonder if - even with his/her quality of not
having been made unsystematized”)

Turkish
Çekoslovakyalılaştıramadıklarımızdanmışçasına
(= “as if you were one of those whom we could not
make resemble the Czechoslovakian people”)

Casefolding, accents, diacritics

Casefolding can be semantically distinguishing:

Fed vs. fed


March vs. march
Turkey vs. turkey
US vs. us

Though in most cases it’s not.


Accents and Diacritics can be semantically distinguishing:
Spanish
peña = cliff, pena = sorrow

Though in most cases they are not (résumé vs. resume)


Most systems case-fold (reduce all letters to lower case) and
throw away accents.
Main decision criterion: will users apply it when querying?

Stop words

Extremely common words which are of little value in helping select documents matching a user need:

a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Stop words were standardly excluded from the index in older IR systems.


Need them to search for the following queries:

to be or not to be
prince of Denmark
bamboo in water

Length of practically used stoplists has shrunk over the years.


Most web search engines do index stop words.

Lemmatisation

Reduce inflectional/variant forms to base form

am, are, is → be
car, car’s, cars’, cars → car
the boy’s cars are different colours → the boy car be different color

Lemmatisation implies doing “proper” reduction to dictionary


headword form (the lemma)
Inflectional morphology (cutting → cut)
Derivational morphology (destruction → destroy)

Stemming

Stemming is a crude heuristic process that chops off the ends of words, in the hope of achieving what “principled” lemmatisation attempts to do with a lot of linguistic knowledge.

automate, automation, automatic → automat

language dependent, but fast and space-efficient


does not require a stem dictionary, only a suffix dictionary
Often both inflectional and derivational

Porter Stemmer

M. Porter, “An algorithm for suffix stripping”, Program 14(3):130–137, 1980
Most common algorithm for stemming English
Results suggest it is at least as good as other stemmers
Syllable-like shapes + 5 phases of reductions
Of the rules in a compound command, apply only the top-most one that matches and then exit that compound (due to the ordering of the rules, this rule will have affected the longest possible suffix).

Stemming: Representation of a word

[C] (VC){m}[V]
C : one or more adjacent consonants
V : one or more adjacent vowels
[ ] : optionality
( ) : group operator
{x} : repetition x times
m : the “measure” of a word

shoe [sh]C [oe]V m=0


Mississippi [M]C ([i]V [ss]C )([i]V [ss]C )([i]V [pp]C )[i]V m=3
ears ([ea]V [rs]C ) m=1

Notation: measure m is calculated on the word excluding the suffix of


the rule under consideration
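The measure m can be computed by collapsing runs of vowels and consonants into single V/C symbols and counting VC sequences. This is a sketch under a simplifying assumption: it treats a, e, i, o, u as the vowels and ignores Porter's special handling of 'y':

```python
import re

def measure(stem):
    """Porter's measure m: the number of VC sequences when the
    stem is written as [C](VC){m}[V]."""
    form = re.sub(r"[aeiou]+", "V", stem.lower())  # vowel runs -> V
    form = re.sub(r"[^V]+", "C", form)             # the rest   -> C
    return form.count("VC")

print(measure("shoe"))         # 0
print(measure("mississippi"))  # 3
print(measure("ears"))         # 1
```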

Porter stemmer: selected rules

SSES → SS
IES → I
SS → SS
S→
caresses → caress
cares → care

(m>0) EED → EE

feed → feed
agreed → agree
BUT: freed, succeed

(*v*) ED →

plastered → plaster
bled → bled
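The “apply only the first matching rule, then exit the compound” behaviour can be sketched for step 1a above (an illustrative fragment, not the full Porter algorithm; the suffixes are ordered so the longest match wins):

```python
def step_1a(word):
    """Porter step 1a as a compound command: try the rules in
    order and apply only the first one whose suffix matches."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # caress
print(step_1a("cares"))     # care
print(step_1a("caress"))    # caress  (SS -> SS blocks the bare S rule)
```

The seemingly useless SS → SS rule exists precisely to stop “caress” from falling through to the S → (delete) rule.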

Three stemmers: a comparison

Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation.

Porter Stemmer
such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to a pictur of express that is more biolog transpar
and access to interpret

Lovins Stemmer
such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to
interpres

Paice Stemmer
such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret

Does stemming improve effectiveness?

In general, stemming increases effectiveness for some queries


and decreases it for others.
Example queries where stemming helps
tartan sweaters → sweater, sweaters
sightseeing tour san francisco → tour, tours

Example queries where stemming hurts


operational research → oper (= operates, operatives, operate, operation, operational, operative)
operating system → oper
operative dentistry → oper

More equivalence classing

Thesauri: semantic equivalence, car = automobile


Soundex: phonetic equivalence, Muller = Mueller

Overview

1 Index construction

2 Document and Term Normalisation


Documents
Terms

3 Other types of indexes


Biword indexes
Positional indexes
Phrase Queries

We want to answer a query such as [cambridge university] as a phrase.

Documents that merely contain both words somewhere should not match; only documents containing the exact phrase “cambridge university” should.

Phrase Queries

About 10% of web queries are phrase queries.


Consequence for inverted indexes: no longer sufficient to store
docIDs in postings lists.
Two ways of extending the inverted index:
biword index
positional index

Biword indexes

Index every consecutive pair of terms in the text as a phrase.

Friends, Romans, Countrymen


Generates two biwords:
friends romans
romans countrymen

Each of these biwords is now a vocabulary term.


Two-word phrases can now easily be answered.
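A biword index can be sketched as follows (an illustration assuming pre-tokenised documents; the document contents are invented for the example):

```python
from collections import defaultdict

def build_biword_index(docs):
    """docs: docID -> list of tokens.  Every consecutive pair of
    tokens becomes a vocabulary term of its own."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second].add(doc_id)
    return index

docs = {
    1: ["friends", "romans", "countrymen"],
    2: ["romans", "and", "countrymen"],
}
index = build_biword_index(docs)
print(sorted(index["friends romans"]))     # [1]
print(sorted(index["romans countrymen"]))  # [1]
```

Document 2 contains both “romans” and “countrymen” but not adjacently, so the biword “romans countrymen” correctly excludes it.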

Longer phrase queries

A long phrase like “cambridge university west campus” can be represented as the Boolean query:

cambridge university AND university west AND west campus

We need to post-filter the hits to identify the subset that actually contains the 4-word phrase.

Issues with biword indexes

Why are biword indexes rarely used?


False positives, as noted above
Index blowup due to very large term vocabulary

Positional indexes

Positional indexes are a more efficient alternative to biword


indexes.
Postings lists in a nonpositional index: each posting is just a
docID
Postings lists in a positional index: each posting is a docID
and a list of positions (offsets)

Positional indexes: Example

Query: “to1 be2 or3 not4 to5 be6”


to, 993427:
〈 1: 〈 7, 18, 33, 72, 86, 231 〉 ;
2: 〈 1, 17, 74, 222, 255 〉 ;
4: 〈 8, 16, 190, 429, 433 〉 ;
5: 〈 363, 367 〉 ;
7: 〈 13, 23, 191 〉 ; . . . 〉

be, 178239:
〈 1: 〈 17, 25 〉 ;
4: 〈 17, 191, 291, 430, 434 〉 ;
5: 〈 14, 19, 101 〉 ; . . . 〉

As always, each entry records the term, its document frequency, and its docIDs; what is new is that each docID now carries a list of offsets.

Document 4 is a match: it contains “to” at position 429 immediately followed by “be” at position 430 (likewise at 16/17, 190/191, and 433/434).
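The adjacency check for one biword of the phrase, within a single document, can be sketched as follows (a full phrase query first intersects the docID lists, then chains this check across consecutive query terms):

```python
def phrase_match(positions_a, positions_b):
    """Positions where term A is immediately followed by term B,
    given the two sorted position lists for one document."""
    following = set(positions_b)
    return [p for p in positions_a if p + 1 in following]

# Positional postings of "to" and "be" in document 4 (from the slide):
print(phrase_match([8, 16, 190, 429, 433],
                   [17, 191, 291, 430, 434]))  # [16, 190, 429, 433]
```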
Complexity of search with positional index

Unfortunately, the complexity is Θ(T) rather than Θ(N)


T ...number of tokens in document collection
N ...number of documents in document collection
Combination scheme:
Include frequent biwords as vocabulary terms in the index
(“Cambridge University”, “Britney Spears”)
Resolve all other phrases by positional intersection

Proximity search

We just saw how to use a positional index for phrase searches.


We can also use it for proximity search.

employment /4 place
Find all documents that contain employment and place within
4 words of each other.
HIT: Employment agencies that place healthcare workers are
seeing growth.
NO HIT: Employment agencies that have learned to adapt
now place healthcare workers.

Proximity search with positional index

Simplest algorithm: look at the cross-product of the positions of (i) “employment” and (ii) “place” within each document.
Note that we want to return the actual matching positions,
not just a list of documents.
Very inefficient for frequent words, especially stop words
More efficient algorithm in book
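The cross-product algorithm can be sketched for a single document's position lists (deliberately naive, and quadratic in the number of occurrences; the book's merge-based version is linear):

```python
def proximity_match(positions_a, positions_b, k):
    """All position pairs (pa, pb) within k words of each other:
    the naive cross-product over the two position lists."""
    return [(pa, pb)
            for pa in positions_a
            for pb in positions_b
            if abs(pa - pb) <= k]

# employment /4 place, with illustrative positions in one document:
print(proximity_match([1, 20], [4, 30], 4))  # [(1, 4)]
```

Returning the matching position pairs (not just a boolean per document) is what lets the system highlight the actual hit.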

Take-away

Understanding of the basic units of classical information retrieval systems, words and documents: What is a document? What is a term?
Tokenization: how to get from raw text to terms (or tokens)
More complex indexes for phrase and proximity search
biword index
positional index

Reading

MRS Chapter 2.2


MRS Chapter 2.4

