lecture2-dictionary

The document provides an overview of information retrieval, focusing on the indexing pipeline, tokenization, and normalization processes essential for document ingestion. It discusses various challenges related to document formats, languages, and the complexities of defining what constitutes a document. Additionally, it covers techniques like stemming and lemmatization, as well as the use of skip pointers to optimize query processing in information retrieval systems.

Uploaded by

Elroy Merwyn Monis

Introduction to Information Retrieval

Document ingestion

Recall the basic indexing pipeline

Documents to be indexed: Friends, Romans, countrymen.
↓ Tokenizer
Token stream: Friends Romans Countrymen
↓ Linguistic modules
Modified tokens: friend roman countryman
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
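
The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not the lecture's implementation: the regex tokenizer, the `TOY_LEMMAS` lookup that stands in for the linguistic modules, and the sample documents are all assumptions chosen so that the output reproduces the postings shown on the slide.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" stage.
TOY_LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split the text into runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic modules: lowercase, then map to a base form if one is known."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)

def build_index(docs):
    """Indexer: map each term to a sorted list of the doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(tok) for tok in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)

# Invented documents, numbered to match the slide's postings.
docs = {
    1: "Romans",
    2: "Friends, Romans",
    4: "friends",
    13: "countrymen",
    16: "Countrymen",
}
index = build_index(docs)
```

Because documents are visited in increasing ID order, every postings list comes out sorted, which the merge algorithms later in the lecture rely on.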
Parsing a document (IIR Sec. 2.1)
▪ What format is it in?
▪ pdf/word/excel/html?
▪ What language is it in?
▪ What character set is in use?
▪ (CP1252, UTF-8, …)

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …
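
As one illustration of the heuristic approach, a common trick for the character-set question is to attempt a strict UTF-8 decode first and fall back to a single-byte encoding only if that fails. The function name `decode_guess` and the choice of CP1252 as the fallback are assumptions made for this sketch; the lecture does not prescribe a particular method.

```python
def decode_guess(raw: bytes, fallback: str = "cp1252"):
    """Return (text, encoding_used) using a try-UTF-8-first heuristic."""
    try:
        # Strict UTF-8 decoding rejects most text written in legacy encodings.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" avoids failing on the few bytes CP1252 leaves undefined.
        return raw.decode(fallback, errors="replace"), fallback
```

The heuristic works because non-ASCII text in a single-byte encoding rarely happens to form valid UTF-8 sequences; it cannot distinguish between different single-byte encodings.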

Complications: Format/language (IIR Sec. 2.1)
▪ Documents being indexed can include docs from
many different languages
▪ A single index may contain terms from many languages.
▪ Sometimes a document or its components can
contain multiple languages/formats
▪ French email with a German pdf attachment.
▪ French email quoting clauses from an English-language contract

▪ There are commercial and open source libraries that can handle a lot of this stuff
Complications: What is a document? (IIR Sec. 2.1)

We return from our query “documents” but there are
often interesting questions of grain size:

What is a unit document?

▪ A file?
▪ An email? (Perhaps one of many in a single mbox file)
▪ What about an email with 5 attachments?
▪ A group of files (e.g., PPT or LaTeX split over HTML pages)
Tokens
Tokenization (IIR Sec. 2.2.1)
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index
entry, after further processing
▪ Described below
▪ But what are valid tokens to emit?
Tokenization (IIR Sec. 2.2.1)
▪ Issues in tokenization:
▪ Finland’s capital →
Finland AND s? Finlands? Finland’s?
▪ Hewlett-Packard → Hewlett and Packard as two
tokens?
▪ state-of-the-art: break up hyphenated sequence.
▪ co-education
▪ lowercase, lower-case, lower case ?
▪ It can be effective to get the user to put in possible hyphens
▪ San Francisco: one token or two?
▪ How do you decide it is one token?
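
The apostrophe and hyphen questions above come down to which characters the tokenizer treats as part of a token. The two regular expressions below are a hedged sketch of two possible policies; the function names are hypothetical, and neither policy addresses the multiword case (San Francisco), which needs more than a character-level rule.

```python
import re

def tokenize_split(text):
    """Policy A: every non-alphanumeric character is a token boundary."""
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_keep(text):
    """Policy B: keep apostrophes and hyphens that sit between alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text)

sample = "Finland’s capital, Hewlett-Packard, state-of-the-art"
split_tokens = tokenize_split(sample)
kept_tokens = tokenize_keep(sample)
```

Whichever policy is chosen, the same one has to be applied to queries, or the query tokens will not line up with the indexed ones.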
Numbers (IIR Sec. 2.2.1)
▪ 3/20/91 Mar. 12, 1991 20/3/91
▪ 55 B.C.
▪ B-52
▪ My PGP key is 324a3df234cb23e
▪ (800) 234-2333
▪ Often have embedded spaces
▪ Older IR systems may not index numbers
▪ But often very useful: think about things like looking up error
codes/stacktraces on the web
▪ (One answer is using n-grams: IIR ch. 3)
▪ Will often index “meta-data” separately
▪ Creation date, format, etc.
Tokenization: language issues (IIR Sec. 2.2.1)

▪ French
▪ L'ensemble → one token or two?
▪ L ? L’ ? Le ?
▪ Want l’ensemble to match with un ensemble
▪ Until at least 2003, it didn’t on Google
▪ Internationalization!

▪ German noun compounds are not segmented

▪ Lebensversicherungsgesellschaftsangestellter
▪ ‘life insurance company employee’
▪ German retrieval systems benefit greatly from a compound splitter
module
▪ Can give a 15% performance boost for German
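
A compound splitter can be approximated with a dictionary lookup and a small set of linking elements. The sketch below is an assumption-laden toy, not a production splitter: `split_compound` is a hypothetical name, the vocabulary is hand-picked for the slide's example, and only the linking "s" is handled.

```python
def split_compound(word, vocab, linkers=("", "s")):
    """Split a compound into known vocabulary words, or return None if impossible."""
    word = word.lower()
    if word in vocab:
        return [word]
    # Try the longest known head first, then recurse on the remainder.
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head not in vocab:
            continue
        rest = word[end:]
        for link in linkers:
            if rest.startswith(link) and len(rest) > len(link):
                tail = split_compound(rest[len(link):], vocab, linkers)
                if tail is not None:
                    return [head] + tail
    return None

vocab = {"leben", "versicherung", "gesellschaft", "angestellter"}
parts = split_compound("Lebensversicherungsgesellschaftsangestellter", vocab)
```

Indexing the parts alongside the full compound lets a query for one component match documents that only contain the long form.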
Tokenization: language issues (IIR Sec. 2.2.1)

▪ Chinese and Japanese have no spaces between
words:
▪ 莎拉波娃现在居住在美国东南部的佛罗里达。
▪ Not always guaranteed a unique tokenization
▪ Further complicated in Japanese, with multiple
alphabets intermingled
▪ Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

▪ The example above mixes four scripts: katakana, hiragana, kanji, and romaji.
▪ End-user can express query entirely in hiragana!

Tokenization: language issues (IIR Sec. 2.2.1)

▪ Arabic (or Hebrew) is basically written right to left,
but with certain items like numbers written left to
right
▪ Words are separated, but letter forms within a word
form complex ligatures

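▪ [Figure: an Arabic sentence with arrows marking the reading direction of each run (← → ← → ←), starting from the right]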
▪ ‘Algeria achieved its independence in 1962 after 132
years of French occupation.’
▪ With Unicode, the surface presentation is complex, but the
stored form is straightforward
Terms
The things indexed in an IR system
Stop words (IIR Sec. 2.2.2)
▪ With a stop list, you exclude from the dictionary
entirely the commonest words. Intuition:
▪ They have little semantic content: the, a, and, to, be
▪ There are a lot of them: ~30% of postings for top 30 words
▪ But the trend is away from doing this:
▪ Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
▪ Good query optimization techniques (IIR 7) mean you pay little at
query time for including stop words.
▪ You need them for:
▪ Phrase queries: “King of Denmark”
▪ Various song titles, etc.: “Let it be”, “To be or not to be”
▪ “Relational” queries: “flights to London”
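
To see why stop lists hurt the query types listed above, it helps to run those queries through one. The stop list here is an assumption: it contains the slide's five words plus a few other common function words added for illustration.

```python
# Hypothetical stop list: the slide's examples plus "of", "or", "not", "it".
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not", "it"}

def remove_stop_words(query):
    """Lowercase the query and drop every word on the stop list."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

hamlet = remove_stop_words("To be or not to be")
flights = remove_stop_words("flights to London")
king = remove_stop_words("King of Denmark")
```

The first query disappears entirely, and the other two lose the word that carried the relationship, so "flights to London" can no longer be told apart from "flights from London".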
Normalization to terms (IIR Sec. 2.2.3)
▪ We may need to “normalize” words in indexed text
as well as query words into the same form
▪ We want to match U.S.A. and USA
▪ Result is terms: a term is a (normalized) word type,
which is an entry in our IR system dictionary
▪ We most commonly implicitly define equivalence
classes of terms by, e.g.,
▪ deleting periods to form a term
▪ U.S.A., USA → USA
▪ deleting hyphens to form a term
▪ anti-discriminatory, antidiscriminatory → antidiscriminatory
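
The two deletion rules define equivalence classes implicitly: any tokens that normalize to the same string belong to the same class. A minimal sketch, with hypothetical function names:

```python
def normalize_term(token):
    """Delete periods and hyphens, the two rules named on the slide."""
    return token.replace(".", "").replace("-", "")

def equivalence_classes(tokens):
    """Group tokens by the term they normalize to."""
    classes = {}
    for tok in tokens:
        classes.setdefault(normalize_term(tok), set()).add(tok)
    return classes

classes = equivalence_classes(
    ["U.S.A.", "USA", "anti-discriminatory", "antidiscriminatory"]
)
```

No class is ever written down explicitly; the classes exist only as a side effect of applying `normalize_term` to both the indexed text and the query.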
Normalization: other languages (IIR Sec. 2.2.3)

▪ Accents: e.g., French résumé vs. resume.
▪ Umlauts: e.g., German: Tuebingen vs. Tübingen
▪ Should be equivalent
▪ Most important criterion:
▪ How are your users likely to write their queries for these words?

▪ Even in languages that standardly have accents, users often may not type them
▪ Often best to normalize to a de-accented term
▪ Tuebingen, Tübingen, Tubingen → Tubingen
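
De-accenting can be done with Unicode decomposition: split each character into a base letter plus combining marks, then drop the marks. The sketch below uses Python's standard `unicodedata` module; the function name `strip_accents` is hypothetical. Note that it does not handle the transliterated spelling Tuebingen, which would need a separate, language-specific rule.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

resume = strip_accents("résumé")
city = strip_accents("Tübingen")
```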
Normalization: other languages (IIR Sec. 2.2.3)

▪ Normalization of things like date forms
▪ 7月30日 vs. 7/30
▪ Japanese use of kana vs. Chinese characters

▪ Tokenization and normalization may depend on the language and so are intertwined with language detection
▪ Morgen will ich in MIT … (Is this German “mit”?)

▪ Crucial: Need to “normalize” indexed text as well as query terms identically
Case folding (IIR Sec. 2.2.3)
▪ Reduce all letters to lower case
▪ exception: upper case in mid-sentence?
▪ e.g., General Motors
▪ Fed vs. fed
▪ SAIL vs. sail
▪ Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization…

▪ Longstanding Google example: [fixed in 2011…]

▪ Query C.A.T.
▪ #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.
Normalization to terms (IIR Sec. 2.2.3)

▪ An alternative to equivalence classing is to do asymmetric expansion
▪ An example of where this may be useful
▪ Enter: window Search: window, windows
▪ Enter: windows Search: Windows, windows, window
▪ Enter: Windows Search: Windows
▪ Potentially more powerful, but less efficient
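
The window/windows table above can be read as a query-time lookup against an index whose terms were left unnormalized. The sketch below assumes exactly that; the `EXPANSIONS` table copies the slide, while the function name and the three-document toy index are invented for illustration.

```python
# Query term -> index terms to search, copied from the slide's example.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def search(query_term, index):
    """Union the postings of every index term the query term expands to."""
    docs = set()
    for term in EXPANSIONS.get(query_term, [query_term]):
        docs.update(index.get(term, []))
    return sorted(docs)

# Invented case-preserving index: one document per surface form.
toy_index = {"window": [1], "windows": [2], "Windows": [3]}
```

The cost noted on the slide is visible here: a single query term can require several postings lookups and a union, where equivalence classing needs only one lookup.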
Thesauri and soundex

▪ Do we handle synonyms and homonyms?
▪ E.g., by hand-constructed equivalence classes
▪ car = automobile color = colour
▪ We can rewrite to form equivalence-class terms
▪ When the document contains automobile, index it under car-automobile (and vice-versa)
▪ Or we can expand a query
▪ When the query contains automobile, look under car as well
▪ What about spelling mistakes?
▪ One approach is Soundex, which forms equivalence classes
of words based on phonetic heuristics
▪ More in IIR 3 and IIR 9
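
For a concrete sense of what a phonetic equivalence class looks like, here is a sketch of a simplified four-character Soundex: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop the zeros, and pad to four characters. Soundex has several published variants that differ in small details (for example, how the first letter's own code is treated), so take this as one illustrative version rather than a reference implementation.

```python
# Letter groups and their digit codes; vowels and h, w, y map to "0".
_GROUPS = [
    ("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
    ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6"),
]
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}

def soundex(word):
    """Return a 4-character code: first letter plus three digits."""
    letters = [ch for ch in word.lower() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters[1:]]
    collapsed = []
    for d in digits:
        # Keep one digit from each run of consecutive identical digits.
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]
    return (letters[0].upper() + "".join(kept) + "000")[:4]
```

Names such as Herman and Hermann receive the same code under this scheme, so a query for one spelling can be matched against the other.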
Stemming and Lemmatization
Lemmatization (IIR Sec. 2.2.4)
▪ Reduce inflectional/variant forms to base form
▪ E.g.,
▪ am, are, is → be
▪ car, cars, car's, cars' → car
▪ the boy's cars are different colors → the boy car be
different color
▪ Lemmatization implies doing “proper” reduction to
dictionary headword form
Stemming (IIR Sec. 2.2.4)
▪ Reduce terms to their “roots” before indexing
▪ “Stemming” suggests crude affix chopping
▪ language dependent
▪ e.g., automate(s), automatic, automation all reduced to
automat.

Original: for example compressed and compression are both accepted as equivalent to compress.

Stemmed: for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm (IIR Sec. 2.2.4)
▪ Commonest algorithm for stemming English
▪ Results suggest it’s at least as good as other stemming
options
▪ Conventions + 5 phases of reductions
▪ phases applied sequentially
▪ each phase consists of a set of commands
▪ sample convention: Of the rules in a compound command,
select the one that applies to the longest suffix.
Typical rules in Porter (IIR Sec. 2.2.4)

▪ sses → ss
▪ ies → i
▪ ational → ate
▪ tional → tion

▪ Weight of word sensitive rules

▪ (m>1) EMENT →
▪ replacement → replac
▪ cement → cement
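
The rules above, including the measure condition, can be tried out in a small sketch. This is only a fragment for illustration, not Porter's algorithm: it treats the slide's five rules as a single compound command, applies the longest-suffix convention, and omits the many other rules, phases, and conditions of the real stemmer. Here m is computed as the number of vowel-run to consonant-run transitions in the stem, following Porter's [C](VC)^m[V] description of a word.

```python
def measure(stem):
    """Porter's m: the number of VC sequences in the form [C](VC)^m[V]."""
    is_vowel = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            is_vowel.append(True)
        elif ch == "y":
            # y counts as a vowel only when it follows a consonant.
            is_vowel.append(i > 0 and not is_vowel[i - 1])
        else:
            is_vowel.append(False)
    return sum(1 for a, b in zip(is_vowel, is_vowel[1:]) if a and not b)

# (suffix, replacement, required m threshold or None); m must exceed the threshold.
RULES = [
    ("sses", "ss", None),
    ("ies", "i", None),
    ("ational", "ate", None),
    ("tional", "tion", None),
    ("ement", "", 1),
]

def apply_rules(word):
    """Apply the rule matching the longest suffix, if its condition holds."""
    for suffix, replacement, min_m in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if min_m is not None and measure(stem) <= min_m:
                return word
            return stem + replacement
    return word
```

The measure condition is what protects short words: the stem of cement would be the single letter c, with m = 0, so the rule does not fire.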
Other stemmers (IIR Sec. 2.2.4)
▪ Other stemmers exist:
▪ Lovins stemmer
▪ http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
▪ Single-pass, longest suffix removal (about 250 rules)
▪ Paice/Husk stemmer
▪ Snowball

▪ Full morphological analysis (lemmatization)

▪ At most modest benefits for retrieval
Language-specificity (IIR Sec. 2.2.4)
▪ The above methods embody transformations that
are
▪ Language-specific, and often
▪ Application-specific
▪ These are “plug-in” addenda to the indexing process
▪ Both open source and commercial plug-ins are
available for handling these
Does stemming help? (IIR Sec. 2.2.4)

▪ English: very mixed results. Helps recall for some
queries but harms precision on others
▪ E.g., operative (dentistry) ⇒ oper
▪ Definitely useful for Spanish, German, Finnish, …
▪ 30% performance gains for Finnish!
Faster postings merges: Skip pointers/Skip lists
Recall basic merge (IIR Sec. 2.3)

▪ Walk through the two postings simultaneously, in
time linear in the total number of postings entries

Brutus → 2 4 8 41 48 64 128
Caesar → 1 2 3 8 11 17 21 31
Intersection → 2 8

If the list lengths are m and n, the merge takes O(m+n) operations.

Can we do better?
Yes (if the index isn’t changing too fast).
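
For reference, the basic linear merge can be sketched as follows, with postings held as sorted Python lists of doc IDs (an assumption made for brevity; a real index would use a more compact representation).

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller doc ID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
common = intersect(brutus, caesar)
```

Each loop iteration advances at least one pointer, which is where the O(m+n) bound comes from.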
Augment postings with skip pointers (IIR Sec. 2.3)

(at indexing time)
Upper list: 2 4 8 41 48 64 128 (skip pointers to 41 and 128)
Lower list: 1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)

▪ Why?
▪ To skip postings that will not figure in the search
results.
▪ How?
▪ Where do we place skip pointers?
Query processing with skip pointers (IIR Sec. 2.3)

Upper list: 2 4 8 41 48 64 128 (skip pointers to 41 and 128)
Lower list: 1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, which is still no greater than 41, so we can skip ahead past the intervening postings (17 and 21).
Postings lists intersection with skip pointers
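
A sketch of intersection with skip pointers, under stated assumptions: postings are sorted Python lists, skip pointers are implicit (position i has a skip to position i + step whenever i is a multiple of step and the target exists), and the default step is the integer square root of the list length. The function and parameter names are hypothetical.

```python
import math

def intersect_with_skips(p1, p2, step1=None, step2=None):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    s1 = step1 or max(1, math.isqrt(len(p1)))
    s2 = step2 or max(1, math.isqrt(len(p2)))

    def has_skip(pos, step, length):
        # Skip pointers sit at evenly spaced positions and must land inside the list.
        return pos % step == 0 and pos + step < length

    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if has_skip(i, s1, len(p1)) and p1[i + s1] <= p2[j]:
                # Follow skips while the target does not overshoot the other list.
                while has_skip(i, s1, len(p1)) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            if has_skip(j, s2, len(p2)) and p2[j + s2] <= p1[i]:
                while has_skip(j, s2, len(p2)) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer
```

A skip is taken only when its target is no greater than the current doc ID on the other list, so no potential match is ever jumped over and the result equals that of the plain merge.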
Where do we place skips? (IIR Sec. 2.3)

▪ Tradeoff:
▪ More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
▪ Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
Placing skips (IIR Sec. 2.3)
▪ Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers [Moffat and Zobel 1996]
▪ This ignores the distribution of query terms.
▪ Easy if the index is relatively static; harder if L keeps changing because of updates.

▪ This definitely used to help; with modern hardware it may not unless you’re memory-based [Bahle et al. 2002]
▪ The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
We have a two-word query. For one term the postings list consists of
the following 16 entries:
[4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180]
and for the other it is the one entry postings list:
[47].
Work out how many comparisons would be done to intersect the two
postings lists with the following two strategies. Briefly justify your
answers:
a. Using standard postings lists
b. Using postings lists stored with skip pointers, with a skip length of
√P.

Consider a postings intersection between this postings list, with skip pointers:
3 5 9 15 24 39 60 68 75 81 84 89 92 96 97 100 115
and the following intermediate result postings list (which hence has no
skip pointers):
3 5 89 95 97 99 100 101
Trace through the postings intersection algorithm
a. How often is a skip pointer followed (i.e., p1 is advanced to skip(p1))?
b. How many postings comparisons will be made by this algorithm while intersecting the two lists?
c. How many postings comparisons would be made if the postings lists
are intersected without the use of skip pointers?
