Lecture 4
Index construction
Plan
 Last lecture: Tolerant retrieval
 Wildcards
 Spell correction
 Soundex
 This time:
 Index construction
Index construction
 How do we construct an index?
 What strategies can we use with limited
main memory?
Our corpus for this lecture
 Number of docs = n = 1M
 Each doc has 1K terms
 Number of distinct terms = m = 500K
 667 million postings entries
 Documents are parsed to extract words and
these are saved with the Document ID.
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Recall index construction
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
i' 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
Term Doc # (sorted by term)
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
you 2
was 1
was 2
with 2
Key step
 After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step.
We have 667M items to sort.
Index construction
 As we build up the index, we cannot exploit compression tricks:
 Parse docs one at a time.
 Final postings for any term remain incomplete until the end.
 (Actually you can exploit compression, but this becomes a lot more complex.)
 At 10–12 bytes per postings entry, this demands several gigabytes of temporary space (667M entries × 12 bytes ≈ 8 GB).
System parameters for design
 Disk seek ~ 10 milliseconds
 Block transfer from disk ~ 1 microsecond per
byte (following a seek)
 All other ops ~ 10 microseconds
 E.g., compare two postings entries and decide
their merge order
Bottleneck
 Parse and build postings entries one doc at a
time
 Now sort postings entries by term (then by doc
within each term)
 Doing this with random disk seeks would be too
slow – must sort N=667M records
If every comparison took 2 disk seeks, and N items could be sorted with N log₂ N comparisons, how long would this take?
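A back-of-the-envelope answer to this question, assuming the cost model from the "System parameters for design" slide (10 ms per seek, two seeks per comparison) dominates everything else:

```python
import math

# Hedged estimate for the exercise above, under the assumed cost model:
# 10 ms per disk seek, 2 seeks per comparison, N log2 N comparisons.
SEEK_SECONDS = 10e-3
N = 667_000_000

comparisons = N * math.log2(N)              # ~2e10 comparisons
total_seconds = comparisons * 2 * SEEK_SECONDS

print(f"{total_seconds:.2e} s ≈ {total_seconds / (365 * 24 * 3600):.0f} years")
# ~4e8 seconds, i.e. on the order of a decade: seek-bound sorting is hopeless.
```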
Sorting with fewer disk seeks
 12-byte (4+4+4) records (term, doc, freq).
 These are generated as we parse docs.
 Must now sort 667M such 12-byte records by
term.
 Define a Block ~ 10M such records
 can “easily” fit a couple into memory.
 Will have 64 such blocks to start with.
 Will sort within blocks first, then merge the blocks
into one long sorted order.
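A minimal sketch of this block-sort-then-merge idea (illustrative only: the toy "runs" below are in-memory lists, whereas the real runs are 10M-record files on disk):

```python
import heapq

BLOCK_SIZE = 10_000_000  # records per block (illustrative)

def sort_block(records):
    """Sort one block of (term, doc, freq) records in memory."""
    return sorted(records)

def merge_runs(runs):
    """Merge any number of sorted runs into one sorted stream.

    The slides merge pairwise in a binary tree; heapq.merge does a
    multi-way merge in a single pass, one way the binary-tree estimate
    later in the lecture could be improved.
    """
    return heapq.merge(*runs)

# Toy usage: two tiny "blocks" instead of 64 blocks of 10M records.
runs = [sort_block([("caesar", 2, 1), ("brutus", 2, 1)]),
        sort_block([("caesar", 1, 1), ("ambitious", 2, 1)])]
for posting in merge_runs(runs):
    print(posting)
```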
Sorting 64 blocks of 10M records
 First, read each block and sort within:
 Quicksort takes 2N ln N expected steps
 In our case 2 x (10M ln 10M) steps

Exercise: estimate the total time to read each block from disk and quicksort it.
 64 times this estimate - gives us 64 sorted runs
of 10M records each.
 Need 2 copies of data on disk, throughout.
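One hedged way to evaluate this exercise, taking the slide's cost model at face value (1 μs per transferred byte, 10 μs per in-memory step, 2N ln N quicksort steps):

```python
import math

# Assumed cost model from the "System parameters for design" slide.
TRANSFER_SEC_PER_BYTE = 1e-6
OP_SECONDS = 10e-6
RECORD_BYTES = 12
BLOCK_RECORDS = 10_000_000
NUM_BLOCKS = 64

read_time = BLOCK_RECORDS * RECORD_BYTES * TRANSFER_SEC_PER_BYTE   # 120 s per block
sort_steps = 2 * BLOCK_RECORDS * math.log(BLOCK_RECORDS)           # 2N ln N
sort_time = sort_steps * OP_SECONDS                                 # ~3200 s per block

per_block = read_time + sort_time
print(f"per block: ~{per_block:.0f} s, all {NUM_BLOCKS} blocks: ~{NUM_BLOCKS * per_block / 3600:.0f} h")
```

Under these assumptions the 10 μs in-memory steps dominate the 120 s of disk transfer per block; whether every quicksort step really costs 10 μs is part of what the exercise asks you to question.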
Merging 64 sorted runs
 Merge tree of log₂ 64 = 6 layers.
 During each layer, read into memory runs in
blocks of 10M, merge, write back.
[Diagram: pairs of sorted runs are read from disk in 10M-record blocks, merged, and the merged run is written back to disk.]
Merge tree
[Diagram: the bottom level of the tree holds the 64 sorted runs (1, 2, …, 63, 64) of 10M records each; successive layers merge them pairwise:]
 32 runs, 20M/run
 16 runs, 40M/run
 8 runs, 80M/run
 4 runs … ?
 2 runs … ?
 1 run … ?
Merging 64 runs
 Time estimate for disk transfer:
 6 (# layers in merge tree) × (64 runs × 120 MB × 10⁻⁶ sec/byte disk block transfer time) × 2 (read + write) ≈ 25 hrs.
 Why is this an overestimate? Work out how these transfers are staged, and the total time for merging.
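The arithmetic behind that estimate, spelled out with the slide's assumed figures:

```python
# Disk-transfer estimate for the 6-layer binary merge (assumed figures
# from the slides: 64 runs of 120 MB each, 1e-6 s per byte, read + write).
LAYERS = 6
RUNS = 64
RUN_BYTES = 120e6
TRANSFER_SEC_PER_BYTE = 1e-6

seconds = LAYERS * (RUNS * RUN_BYTES * TRANSFER_SEC_PER_BYTE) * 2
print(f"{seconds / 3600:.0f} hours")   # ~26 hours, matching the ~25 hr figure
```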
Exercise - fill in this table
Step                                              Time
1. 64 initial quicksorts of 10M records each      ?
2. Read 2 sorted blocks for merging, write back   ?
3. Merge 2 sorted blocks                          ?
4. Add (2) + (3) = time to read/merge/write       ?
5. 64 times (4) = total merge time                ?
Large memory indexing
 Suppose instead that we had 16GB of memory
for the above indexing task.
 Exercise: What initial block sizes would we
choose? What index time does this yield?
 Repeat with a couple of values of n, m.
 In practice, spidering often interlaced with
indexing.
 Spidering bottlenecked by WAN speed and many
other factors - more on this later.
Distributed indexing
 For web-scale indexing (don’t try this at home!):
must use a distributed computing cluster
 Individual machines are fault-prone
 Can unpredictably slow down or fail
 How do we exploit such a pool of machines?
Distributed indexing
 Maintain a master machine directing the indexing
job – considered “safe”.
 Break up indexing into sets of (parallel) tasks.
 Master machine assigns each task to an idle
machine from a pool.
Parallel tasks
 We will use two sets of parallel tasks
 Parsers
 Inverters
 Break the input document corpus into splits
 Each split is a subset of documents
 Master assigns a split to an idle parser machine
 Parser reads a document at a time and emits
(term, doc) pairs
Parallel tasks
 Parser writes pairs into j partitions
 Each for a range of terms’ first letters
 (e.g., a-f, g-p, q-z) – here j=3.
 Now to complete the index inversion
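A toy sketch of the parser task just described (the a-f / g-p / q-z split is from the slide; tokenization and the writing of partitions to disk are simplified away):

```python
def partition_of(term):
    """Route a term to one of j=3 partitions by its first letter."""
    first = term[0].lower()
    if first <= 'f':
        return 'a-f'
    if first <= 'p':
        return 'g-p'
    return 'q-z'

def parse(split):
    """Parser task: read the docs in a split, emit (term, doc) pairs per partition."""
    partitions = {'a-f': [], 'g-p': [], 'q-z': []}
    for doc_id, text in split:
        for term in text.lower().split():
            partitions[partition_of(term)].append((term, doc_id))
    return partitions   # in a real system, each partition is written to disk

# Toy usage:
split = [(1, "I did enact Julius Caesar"), (2, "So let it be with Caesar")]
print(parse(split)['a-f'])
```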
Data flow
[Diagram: the master assigns splits to parser machines and partitions to inverter machines; each parser writes its (term, doc) pairs into the a-f, g-p, and q-z partitions, and each inverter collects one partition and writes its postings.]
Inverters
 Collect all (term, doc) pairs for a partition
 Sort and write to postings lists
 Each partition contains a set of postings
The above process flow is a special case of MapReduce (a general architecture for distributed computing).
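A matching toy sketch of the inverter task (names are illustrative; real postings would be written to disk rather than returned as a dict):

```python
from collections import defaultdict

def invert(pairs):
    """Inverter task: sort a partition's (term, doc) pairs and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):          # sort by term, then by doc
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)       # de-duplicate repeated (term, doc)
    return dict(postings)

# Toy usage, on the a-f partition from the parser sketch above:
print(invert([('did', 1), ('enact', 1), ('caesar', 1), ('be', 2), ('caesar', 2)]))
# {'be': [2], 'caesar': [1, 2], 'did': [1], 'enact': [1]}
```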
Dynamic indexing
 Docs come in over time
 postings updates for terms already in dictionary
 new terms added to dictionary
 Docs get deleted
Simplest approach
 Maintain “big” main index
 New docs go into “small” auxiliary index
 Search across both, merge results
 Deletions
 Invalidation bit-vector for deleted docs
 Filter search results using this invalidation bit-vector
 Periodically, re-index into one main index
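A minimal sketch of this scheme, with dicts standing in for the two indexes and a set standing in for the invalidation bit-vector (all names are illustrative):

```python
def search(term, main_index, aux_index, deleted):
    """Look up a term in both indexes, merge, and drop invalidated docs."""
    hits = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(d for d in hits if d not in deleted)

# Toy usage:
main_index = {'caesar': [1, 2]}
aux_index = {'caesar': [7], 'blog': [7]}    # newly added docs
deleted = {2}                                # invalidation "bit-vector"
print(search('caesar', main_index, aux_index, deleted))   # [1, 7]
```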
Issue with big and small indexes
 Corpus-wide statistics are hard to maintain
 E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to
the user?
 We said, pick the one with the most hits
 How do we maintain the top ones with multiple
indexes?
 One possibility: ignore the small index for such
ordering
 Will see more such statistics used in results
ranking
Building positional indexes
 Still a sorting problem (but larger – why?)
 Exercise: given 1GB of memory, how would you adapt the block merge described earlier?
Building n-gram indexes
 As text is parsed, enumerate n-grams.
 For each n-gram, need pointers to all dictionary
terms containing it – the “postings”.
 Note that the same “postings entry” can arise
repeatedly in parsing the docs – need efficient
“hash” to keep track of this.
 E.g., that the trigram uou occurs in the term
deciduous will be discovered on each text
occurrence of deciduous
Building n-gram indexes
 Once all (n-gramterm) pairs have been
enumerated, must sort for inversion
 Recall average English dictionary term is ~8
characters
 So about 6 trigrams per term on average
 For a vocabulary of 500K terms, this is about 3
million pointers – can compress
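A small sketch of building such a trigram index, with a set absorbing the repeated (n-gram, term) pairs that arise as the same term is seen again and again during parsing (boundary markers are omitted for simplicity):

```python
from collections import defaultdict

def build_trigram_index(term_stream):
    """Map each trigram to the dictionary terms containing it.

    term_stream may repeat terms (one occurrence per appearance in the
    parsed text), so a set of (trigram, term) pairs de-duplicates them --
    the efficient "hash" the slide asks for.
    """
    seen = set()
    index = defaultdict(list)
    for term in term_stream:
        for i in range(len(term) - 2):
            pair = (term[i:i + 3], term)
            if pair not in seen:
                seen.add(pair)
                index[pair[0]].append(term)
    return index

# 'deciduous' shows up twice in the stream, but 'uou' points to it only once.
idx = build_trigram_index(['deciduous', 'julius', 'deciduous'])
print(idx['uou'])   # ['deciduous']
```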
Index on disk vs. memory
 Most retrieval systems keep the dictionary in
memory and the postings on disk
 Web search engines frequently keep both in
memory
 massive memory requirement
 feasible for large web service installations
 less so for commercial usage where query loads
are lighter
Indexing in the real world
 Typically, don’t have all documents sitting on a
local filesystem
 Documents need to be spidered
 Could be dispersed over a WAN with varying
connectivity
 Must schedule distributed spiders
 Have already discussed distributed indexers
 Could be (secure content) in
 Databases
 Content management applications
 Email applications
Content residing in applications
 Mail systems/groupware and content management systems contain the most “valuable” documents
 HTTP is often not the most efficient way of fetching these documents – fetch via native APIs instead
 Specialized, repository-specific connectors
 These connectors also facilitate viewing a document when it is selected from the search results
Secure documents
 Each document is accessible to a subset of
users
 Usually implemented through some form of
Access Control Lists (ACLs)
 Search users are authenticated
 Query should retrieve a document only if user
can access it
 So if there are docs matching your search but
you’re not privy to them, “Sorry no results found”
 E.g., as a lowly employee in the company, I get
“No results” for the query “salary roster”
Users in groups, docs from groups
 Index the ACLs and filter results by them
 Often, user membership in an ACL group verified
at query time – slowdown
[Figure: users × documents access matrix; entry is 0 if the user can’t read the doc, 1 otherwise.]
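A toy sketch of ACL filtering at query time (a dict of sets stands in for the users × documents 0/1 matrix; user and doc names are illustrative):

```python
def filter_by_acl(results, user, readable_docs):
    """Keep only the hits this user is allowed to read."""
    allowed = readable_docs.get(user, set())
    return [doc for doc in results if doc in allowed]

# Toy usage: the "salary roster" doc (doc 9) is invisible to a regular employee.
readable_docs = {'ceo': {1, 2, 9}, 'employee': {1, 2}}
print(filter_by_acl([9, 2], 'employee', readable_docs))   # [2]
```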
Exercise
 Can spelling suggestion compromise such
document-level security?
 Consider the case when there are documents
matching my query, but I lack access to them.
Compound documents
 What if a doc consisted of components
 Each component has its own ACL.
 Your search should return a doc only if your query matches a component that you have access to.
 More generally: doc assembled from
computations on components
 e.g., in Lotus databases or in content
management systems
 How do you index such docs?
No good answers …
“Rich” documents
 (How) Do we index images?
 Researchers have devised Query By Image Content (QBIC) systems
 “show me a picture similar to this orange circle”
 watch for lecture on vector space retrieval
 In practice, image search usually based on meta-
data such as file name e.g., monalisa.jpg
 New approaches exploit social tagging
 E.g., flickr.com
Passage/sentence retrieval
 Suppose we want to retrieve not an entire
document matching a query, but only a
passage/sentence - say, in a very long document
 Can index passages/sentences as mini-
documents – what should the index units be?
 This is the subject of XML search
