chap2part2
Creating Equivalence Classes
We most commonly implicitly define equivalence classes of terms by, e.g.,
deleting periods to form a term
U.S.A., USA map to USA
deleting hyphens to form a term
anti-discriminatory, antidiscriminatory map to antidiscriminatory
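The two deletion rules above can be sketched in a few lines. This is only an illustration: the function names `normalize_token` and `equivalence_classes` are hypothetical, and a real system would apply more rules than these two.

```python
from collections import defaultdict


def normalize_token(token):
    """Delete periods and hyphens, the two rules listed above."""
    return token.replace(".", "").replace("-", "")


def equivalence_classes(tokens):
    """Group surface tokens by their normalized term."""
    classes = defaultdict(set)
    for token in tokens:
        classes[normalize_token(token)].add(token)
    return dict(classes)


classes = equivalence_classes(
    ["U.S.A.", "USA", "anti-discriminatory", "antidiscriminatory"]
)
print(classes)
```

Every token in one class maps to the same index term, so a query for either spelling retrieves the same postings.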
Normalization to terms
Capitalization/case-folding
Reduce all letters to lower case
exception: upper case in mid-sentence?
e.g., General Motors
Fed (governmental organization) vs. fed
SAIL vs. sail
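A minimal sketch of case-folding, assuming the input is the token list of a single sentence. The function name `case_fold` and the flag `keep_mid_sentence_caps` are hypothetical; the flag models the mid-sentence exception raised above as a crude heuristic, not as a recommended policy.

```python
def case_fold(tokens, keep_mid_sentence_caps=False):
    """Lowercase the tokens of one sentence.

    If keep_mid_sentence_caps is True, a token that starts with an upper-case
    letter and is not the first token of the sentence is left unchanged.
    """
    folded = []
    for position, token in enumerate(tokens):
        starts_upper = token[:1].isupper()
        if keep_mid_sentence_caps and position > 0 and starts_upper:
            folded.append(token)
        else:
            folded.append(token.lower())
    return folded


sentence = ["The", "Fed", "raised", "rates"]
print(case_fold(sentence))
print(case_fold(sentence, keep_mid_sentence_caps=True))
```

With plain folding, Fed and fed collapse into one term; with the exception switched on, the mid-sentence Fed stays distinct, at the cost of missing matches when users type queries in lower case.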
• Porter’s algorithm
Conventions + 5 phases of reductions
phases applied sequentially
each phase consists of a set of rules
Porter’s algorithm
Question:
circus
canaries
boss
ponies
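One way to work through the four words is to run them through a single rule group. The sketch below assumes the rules of step 1a from Porter's published algorithm (SSES to SS, IES to I, SS to SS, S to nothing) and the convention that, within a group, the rule matching the longest suffix is the one applied. The rule table and the function name `porter_step_1a` are written out here for illustration and are not taken from the slides.

```python
# Step 1a rules as (suffix, replacement) pairs.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def porter_step_1a(word):
    """Apply the one rule in the group that matches the longest suffix."""
    by_length = sorted(STEP_1A_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, replacement in by_length:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule in this group applies


for word in ["circus", "canaries", "boss", "ponies"]:
    print(word, "->", porter_step_1a(word))
```

Under these rules circus becomes circu, canaries becomes canari, boss is unchanged, and ponies becomes poni. The results are stems, not dictionary words.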
measure of a word:
a check on the number of syllables, used to see whether the word is long
enough to regard the matching portion of a rule as a suffix rather
than as part of the stem of the word
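The measure can be computed directly. The sketch below assumes the definition from Porter's published algorithm: a word is viewed as [C](VC)^m[V], where C and V are runs of consonants and vowels and m is the measure; a, e, i, o, u are vowels, and y counts as a vowel only when it follows a consonant. The function names are hypothetical.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the VC pairs in the collapsed consonant/vowel pattern."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)  # collapse runs: "CCVV" becomes "CV"
    return "".join(pattern).count("VC")


for word in ["tree", "trouble", "oats", "private", "troubles"]:
    print(word, measure(word))
```

In Porter's algorithm the measure is taken of the stem that would remain after removing the suffix. A rule conditioned on m > 1 for the suffix EMENT, for instance, rewrites replacement to replac (the stem replac has measure 2) but leaves cement alone (the stem c has measure 0).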
Can we do better?
Yes (if the index isn’t changing too fast).
Sec. 2.3
Postings list 1 2 3 8 11 17 21 31, with skip pointers to 11 and 31
• Why?
• To skip postings that will not figure in the
search results.
• How?
• Where do we place skip pointers?
Postings list 1 2 3 8 11 17 21 31, with skip pointers to 11 and 31
Suppose we’ve stepped through the lists until we process 8 on each list. We match it
and advance.
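The walk through the lists can be sketched as a merge that follows a skip pointer whenever the skip target does not overshoot the current posting on the other list. Several things here are assumptions made for illustration: the second list `upper` and its skips, the choice of which entries the skip pointers start from (only the targets 11 and 31 are shown above), and the representation of skips as a dict from list index to target index.

```python
def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect two sorted postings lists.

    skips1 and skips2 map an index in the list to the index its skip
    pointer leads to. Returns the intersection and the number of skips taken.
    """
    answer = []
    skips_followed = 0
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                # keep skipping while the target does not pass p2[j]
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
                    skips_followed += 1
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
                    skips_followed += 1
            else:
                j += 1
    return answer, skips_followed


# Assumed second list, with skips from 2 to 41 and from 41 to 128.
upper = [2, 4, 8, 41, 48, 64, 128]
upper_skips = {0: 3, 3: 6}
# List from the slide; skips assumed to run from 1 to 11 and from 11 to 31.
lower = [1, 2, 3, 8, 11, 17, 21, 31]
lower_skips = {0: 4, 4: 7}

result = intersect_with_skips(upper, upper_skips, lower, lower_skips)
print(result)
```

After 8 is matched, the lists stand at 41 and 11. The skip from 11 leads to 31, which is still no larger than 41, so the merge jumps over 17 and 21 without comparing them.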
24 75 92 115
and the following intermediate result postings list (which hence has no
skip pointers):
3 5 89 95 97 99 100 101
c. How many postings comparisons would be made if the postings lists are
intersected without the use of skip pointers?
17
Placing skips
•