lec7
WILD-CARD QUERIES: *
Wildcard queries are used in any of the following situations:
(1) User is uncertain of the spelling of a query term (e.g., Sydney vs. Sidney,
which leads to the wildcard query S*dney)
(2) User is aware of multiple variants of spelling a term and seeks
documents containing any of the variants (e.g., color vs. colour)
(3) User seeks documents containing variants of a term that would be
caught by stemming, but is unsure whether the search engine performs
stemming (e.g., judicial vs. judiciary, leading to the wildcard query
judicia*)
(4) User is uncertain of the correct rendition of a foreign word or phrase
(e.g., the query Universit* Stuttgart).
Sec. 3.2
Wild-card queries: *
• mon*: find all docs containing any word
beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon:
retrieve all words in range: mon ≤ w < moo
• *mon: find words ending in “mon”: harder
– Maintain an additional B-tree for terms written
backwards (reverse B-tree), e.g., lemon is stored as nomel.
Can retrieve all words in range: nom ≤ w < non.
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent ?
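The two range lookups above can be sketched as follows. This is a minimal illustration, not a definitive implementation: a sorted Python list with `bisect` stands in for the B-tree, the toy lexicon is made up, and the helper names (`prefix_range`, `match_prefix`, `match_suffix`) are hypothetical. Prefixes are assumed to be non-empty.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < successor(prefix)."""
    # Successor of the prefix: bump its last character, e.g. "mon" -> "moo".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Toy lexicon (hypothetical terms, for illustration only).
terms = sorted(["lemon", "money", "monday", "month", "moon", "salmon", "sermon"])
# Reverse lexicon: every term written backwards, then sorted.
reversed_terms = sorted(t[::-1] for t in terms)

def match_prefix(prefix):
    """Handle mon*: range lookup mon <= w < moo in the forward lexicon."""
    return prefix_range(terms, prefix)

def match_suffix(suffix):
    """Handle *mon: range lookup nom <= w < non in the reverse lexicon."""
    hits = prefix_range(reversed_terms, suffix[::-1])
    return sorted(h[::-1] for h in hits)

print(match_prefix("mon"))
print(match_suffix("mon"))
```

Note that moon falls outside the range mon ≤ w < moo, so the prefix lookup does not return it.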
Query processing
• At this point, we have an enumeration of all terms
in the dictionary that match the wild-card query.
• We still have to look up the postings for each
enumerated term.
• E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.
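A rough sketch of this step, assuming a toy inverted index with made-up postings: each wild-card term is expanded to its matching dictionary terms, the postings of those terms are unioned, and the two unions are intersected. This gives the same result as running every pairwise Boolean AND and merging the answers. The names `expand`, `docs_for`, and `wildcard_and` are hypothetical, and the expansion uses a linear scan with `fnmatchcase` in place of a real dictionary structure.

```python
from fnmatch import fnmatchcase

# Toy inverted index: term -> sorted list of doc IDs (hypothetical data).
postings = {
    "senate": [1, 4], "sedate": [2], "separate": [3, 6],
    "filter": [2, 4], "filler": [1, 7], "film": [5],
}

def expand(pattern):
    """Enumerate dictionary terms matching the wild-card pattern (linear scan)."""
    return sorted(t for t in postings if fnmatchcase(t, pattern))

def docs_for(pattern):
    """Union of the postings of every term the pattern expands to."""
    docs = set()
    for term in expand(pattern):
        docs.update(postings[term])
    return docs

def wildcard_and(p1, p2):
    """Documents containing some term matching p1 AND some term matching p2."""
    return sorted(docs_for(p1) & docs_for(p2))

print(expand("se*ate"))
print(wildcard_and("se*ate", "fil*er"))
```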
Handling a * in the middle of a query term gives rise to the Permuterm Index.
Sec. 3.2.1
Permuterm index
• For term hello, index under:
– hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
• Queries:
– X lookup on X$
– X* lookup on $X*
– *X lookup on X$*
– *X* lookup on X*
– X*Y lookup on Y$X*
– X*Y*Z ??? Exercise!
• Example: Query = hel*o
– X=hel, Y=o
– Lookup o$hel*
• Example: Query: fi*mo*er
– 1-Look up er$fi*
– 2-Filter terms to ensure mo in middle (fishmonger but not filibuster).
Sec. 3.2.2
Bigram index example:
mo → among, amortize
on → along, among
Processing wild-cards
• Query mon* can now be run as
– $m AND mo AND on
• Gets terms that match the AND version of our
wild-card query.
• But we'd also enumerate moon, which does not match mon*.
• Must post-filter these terms against the query
(e.g., for red* with a 3-gram index, $re AND red also retrieves retired) [post-filtering step]
• Surviving enumerated terms are then looked up
in the term-document inverted index.
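The steps above can be sketched with a small k-gram index. This is an illustrative sketch under stated assumptions: terms are padded with $ on both sides before their k-grams are taken, the sample terms are made up, and the helper names (`kgrams`, `build_kgram_index`, `query_grams`, `candidates`, `wildcard_terms`) are hypothetical. Post-filtering uses `fnmatchcase`.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def kgrams(s, k=2):
    """Set of all k-grams of the string s (empty if s is shorter than k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_kgram_index(terms, k=2):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def query_grams(query, k=2):
    """k-grams of the query: pad with $, split on *, take grams of each piece."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= kgrams(piece, k)
    return grams

def candidates(index, query, k=2):
    """Terms matching the AND version of the query (may include false hits)."""
    grams = query_grams(query, k)
    if not grams:                      # no usable k-gram: every term is a candidate
        return set().union(*index.values())
    return set.intersection(*(index.get(g, set()) for g in grams))

def wildcard_terms(index, query, k=2):
    """Post-filtering step: keep only candidates that really match the query."""
    return sorted(t for t in candidates(index, query, k) if fnmatchcase(t, query))

terms = ["moon", "month", "monday", "among", "amortize", "along"]
bigram_index = build_kgram_index(terms, k=2)
print(sorted(candidates(bigram_index, "mon*")))   # still contains moon
print(wildcard_terms(bigram_index, "mon*"))       # moon removed by post-filter
```

The surviving terms would then be looked up in the term-document inverted index as usual.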