lec7

Wildcard queries are utilized when users are uncertain about spelling, seek multiple variants, or are unsure if stemming is applied. The document discusses techniques for processing wildcard queries, including the use of B-trees and permuterm indexes to efficiently handle queries with wildcards. Additionally, it introduces bigram indexes to facilitate searching for terms based on character sequences.

Uploaded by

menaahmed15200
Copyright
© All Rights Reserved

WILD-CARD QUERIES

WILD-CARD QUERIES: *
Wildcard queries are used in any of the following situations:
(1) User is uncertain of the spelling of a query term (e.g., Sydney vs. Sidney,
which leads to the wildcard query S*dney)
(2) User is aware of multiple variants of spelling a term and seeks
documents containing any of the variants (e.g., color vs. colour)
(3) User seeks documents containing variants of a term that would be
caught by stemming, but is unsure whether the search engine performs
stemming (e.g., judicial vs. judiciary, leading to the wildcard query
judicia*)
(4) User is uncertain of the correct rendition of a foreign word or phrase
(e.g., the query Universit* Stuttgart).

Sec. 3.2

Wild-card queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
• *mon: find words ending in “mon”: harder
– Maintain an additional B-tree for terms backwards (reverse B-tree), e.g., lemon is stored as nomel.
– Can retrieve all words in range: nom ≤ w < non.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent ?
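
The two range lookups above can be sketched as follows. This is a minimal illustration that assumes a sorted Python list as a stand-in for the B-tree; the helper `prefix_range` and the toy lexicon are hypothetical and not part of the lecture.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    # Upper bound: the prefix with its last character incremented ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Toy lexicon (hypothetical terms chosen for illustration).
lexicon = sorted(["lemon", "money", "monk", "month", "moon", "salmon", "sermon"])
# Reverse "B-tree": every term stored backwards, e.g. lemon -> nomel.
reversed_lexicon = sorted(t[::-1] for t in lexicon)

# mon*: range lookup in the forward lexicon.
starts_with_mon = prefix_range(lexicon, "mon")
# *mon: range nom <= w < non in the reversed lexicon, then undo the reversal.
ends_with_mon = sorted(t[::-1] for t in prefix_range(reversed_lexicon, "mon"[::-1]))

print(starts_with_mon)
print(ends_with_mon)
```

A real B-tree would give the same range semantics with logarithmic descent on disk pages; the sorted list with binary search is only the simplest in-memory equivalent.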

Sec. 3.2

B-trees handle *’s at the end of a query term
• How can we handle *’s in the middle of query
term?
– co*tion
• We could look up co* AND *tion in a B-tree
and intersect the two term sets
– Expensive
• The solution: transform wild-card queries so
that the *’s occur at the end
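
The "look up co* AND *tion and intersect" approach can be sketched as below, assuming sorted lists stand in for the forward and reverse B-trees and that the query has exactly one * with a non-empty part on each side. The function names and the toy lexicon are hypothetical.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """All terms in sorted_terms that start with prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return sorted_terms[bisect_left(sorted_terms, prefix):bisect_left(sorted_terms, upper)]

def match_middle_star(lexicon, query):
    """Enumerate terms matching X*Y by intersecting the X* and *Y term sets."""
    head, tail = query.split("*")
    forward = sorted(lexicon)
    backward = sorted(t[::-1] for t in lexicon)
    with_head = set(prefix_range(forward, head))                        # X*
    with_tail = {t[::-1] for t in prefix_range(backward, tail[::-1])}   # *Y
    # The length check rejects terms where head and tail would overlap.
    return sorted(t for t in with_head & with_tail
                  if len(t) >= len(head) + len(tail))

# Toy lexicon (hypothetical).
lexicon = ["coalition", "cognition", "collection", "comet", "motion", "nation"]
print(match_middle_star(lexicon, "co*tion"))
```

Both intermediate sets can be much larger than the final answer, which is the cost the slide calls expensive.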

Sec. 3.2

Query processing
• At this point, we have an enumeration of all terms
in the dictionary that match the wild-card query.
• We still have to look up the postings for each
enumerated term.
• E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.
This gives rise to the Permuterm Index.
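
To make the cost concrete, here is a small sketch of evaluating se*ate AND fil*er once the matching terms are known. It assumes a toy in-memory inverted index with made-up postings, and uses Python's `fnmatchcase` as a stand-in for the dictionary enumeration step; none of these names come from the lecture.

```python
from fnmatch import fnmatchcase

# Toy inverted index: term -> set of document IDs (hypothetical data).
postings = {
    "senate": {1, 4},
    "separate": {2, 4, 7},
    "sedate": {3},
    "filter": {2, 4},
    "filler": {1, 7},
    "file": {5},
}

def docs_for_wildcard(pattern):
    """Enumerate the matching terms, then union their postings lists."""
    terms = sorted(t for t in postings if fnmatchcase(t, pattern))
    docs = set()
    for t in terms:
        docs |= postings[t]
    return terms, docs

terms_a, docs_a = docs_for_wildcard("se*ate")
terms_b, docs_b = docs_for_wildcard("fil*er")

# A document matches if it contains some se*ate term AND some fil*er term.
result = docs_a & docs_b
# Evaluated term by term, this is one AND per pair of enumerated terms.
pairwise_ands = len(terms_a) * len(terms_b)

print(terms_a, terms_b, sorted(result), pairwise_ands)
```

Even in this tiny example there are six term pairs; with a real lexicon each wildcard can expand to hundreds of terms.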
Sec. 3.2.1

Permuterm index
• For term hello, index under:
– hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
• Queries:
– X lookup on X$
– X* lookup on $X*
– *X lookup on X$*
– *X* lookup on X*
– X*Y lookup on Y$X*
– X*Y*Z ??? Exercise!
• Query = hel*o: X=hel, Y=o, so lookup o$hel*
• Query = fi*mo*er:
1-Look up er$fi*
2-Filter terms to ensure mo is in the middle (fishmonger but not filibuster).
Sec. 3.2.1

Permuterm query processing


• Rotate query wild-card to the right
• Now use B-tree lookup as before.
• Permuterm problem: ≈ quadruples lexicon size (empirical observation for English).
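
A minimal sketch of permuterm indexing and lookup, assuming a sorted list of (rotation, term) pairs in place of the B-tree and queries with exactly one *. The function names and the toy lexicon are hypothetical; the fi*mo*er case follows the two-step recipe of a lookup followed by a filter.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$' (the permuterm entries for one term)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(lexicon):
    """Sorted (rotation, term) pairs; the sorted list stands in for the B-tree."""
    return sorted((rot, term) for term in lexicon for rot in rotations(term))

def rotate_query(query):
    """Rotate a single-* query so the * ends up last: X*Y -> prefix Y$X."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star]

def lookup(index, query):
    """Prefix lookup of the rotated query in the permuterm index."""
    prefix = rotate_query(query)
    # A 1-tuple sorts before every (rotation, term) pair sharing that rotation.
    i = bisect_left(index, (prefix,))
    found = set()
    while i < len(index) and index[i][0].startswith(prefix):
        found.add(index[i][1])
        i += 1
    return sorted(found)

# Toy lexicon (hypothetical).
lexicon = ["hello", "help", "halo", "hero", "fishmonger", "filibuster"]
index = build_permuterm(lexicon)

print(lookup(index, "hel*o"))   # rotated to o$hel*
print(lookup(index, "hel*"))    # rotated to $hel*
print(lookup(index, "*lo"))     # rotated to lo$*

# fi*mo*er: look up fi*er (er$fi*), then keep terms with "mo" in the middle.
candidates = lookup(index, "fi*er")
filtered = [t for t in candidates if "mo" in t[len("fi"):-len("er")]]
print(filtered)

# Every term of length n contributes n + 1 rotations, hence the size blow-up.
print(len(index), "rotations for", len(lexicon), "terms")
```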

Sec. 3.2.2

Bigram (k-gram) indexes


• Enumerate all k-grams (sequence of k chars)
occurring in any term
• e.g., from text “April is the cruelest month”
we get the 2-grams (bigrams)
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,
ue,el,le,es,st,t$,$m,mo,on,nt,th,h$

– $ is a special word boundary symbol


• Maintain a second inverted index from
bigrams to dictionary terms that match each bigram.
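
As a rough sketch, k-gram enumeration and the second inverted index (k-gram to terms) could look like this. The function names are hypothetical, and a dictionary of sets is assumed in place of real postings lists.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of a term padded with the boundary symbol $ on both sides."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(lexicon, k=2):
    """Map each k-gram to the set of dictionary terms that contain it."""
    index = defaultdict(set)
    for term in lexicon:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

print(kgrams("month"))

# Terms taken from the bigram index example in the slides.
index = build_kgram_index(["among", "amortize", "along", "mace", "madden"])
for gram in ("$m", "mo", "on"):
    print(gram, sorted(index[gram]))
```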
Sec. 3.2.2

Bigram index example


• The k-gram index finds terms based on a
query consisting of k-grams (here k=2).
$m → mace, madden

mo → among, amortize

on → along, among

Sec. 3.2.2

Processing wild-cards
• Query mon* can now be run as
– $m AND mo AND on
• Gets terms that match AND version of our
wildcard query.
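• But we’d also enumerate moon, which contains all three bigrams but does not match mon*.
• Must post-filter these terms against the query
(e.g., for red*, the bigram query $r AND re AND ed also retrieves retired) [post-filtering step]
• Surviving enumerated terms are then looked up
in the term-document inverted index.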
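
The enumerate-then-post-filter steps can be sketched as below. This assumes a bigram index held as a dictionary of sets and uses Python's `fnmatchcase` for the post-filter; the helper names and the toy lexicon are hypothetical, and the query is assumed to yield at least one k-gram.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def build_kgram_index(lexicon, k=2):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for term in lexicon:
        padded = f"${term}$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

def query_kgrams(query, k=2):
    """k-grams of each *-free piece of $query$ (mon* -> $m, mo, on)."""
    grams = []
    for piece in f"${query}$".split("*"):
        grams += [piece[i:i + k] for i in range(len(piece) - k + 1)]
    return grams

def wildcard_terms(index, query, k=2):
    """AND the k-gram postings, then post-filter candidates against the query."""
    grams = query_kgrams(query, k)
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    matches = sorted(t for t in candidates if fnmatchcase(t, query))
    return sorted(candidates), matches

# Toy lexicon (hypothetical).
index = build_kgram_index(["moon", "month", "monday", "red", "retired", "reduce"])

print(wildcard_terms(index, "mon*"))   # moon is a candidate but is filtered out
print(wildcard_terms(index, "red*"))   # retired is a candidate but is filtered out
```

Only the terms in the second list of each result would go on to the term-document inverted index.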
Sec. 3.2.2

Processing wild-card queries


• As before, we must execute a Boolean query
for each enumerated, filtered term.
• Wild-cards can result in expensive query
execution, e.g., pyth* AND prog*
• If you encourage “laziness” people will
respond!
[Search box] Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
