
chap5-index-construction

The document discusses the structure and function of indexes in search engines, particularly focusing on inverted indexes which facilitate faster search and ranking of documents. It covers the construction of these indexes, including the importance of document IDs, postings, and the efficiency of proximity matches. Additionally, it addresses the challenges of scaling index construction for large collections and outlines methods for merging partial indexes to optimize performance.


Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


With changes by Crista Lopes
Indexes
• Indexes are data structures designed to make
search faster
• Text search has unique requirements, which
lead to unique data structures
• The most common data structure is the inverted index
– general name for a class of structures
– “inverted” because documents are associated
with words, rather than words with documents
• similar to a concordance
Indexes and Ranking
• Indexes are designed to support search
– faster response time, supports updates
• Text search engines use a particular form of
search: ranking
– documents are retrieved in sorted order according to
a score computed using the document
representation, the query, and a ranking algorithm
• What is a reasonable abstract model for ranking?
– enables discussion of indexes without details of
retrieval model
Abstract Model of Ranking
More Concrete Model
Inverted Index
• Each index term is associated with an inverted
list
– Contains lists of documents, or lists of word
occurrences in documents, and other information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a unique
number
– Lists are usually document-ordered (sorted by
document number)
Example “Collection”
Simple Inverted
Index
Inverted Index
with counts

• supports better
ranking algorithms
Inverted Index
with positions

• supports
proximity matches
Proximity Matches
• Matching phrases or words within a window
– e.g., "tropical fish", or “find tropical within 5
words of fish”
• Word positions in inverted lists make these
types of query features efficient
– e.g.,
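As a minimal sketch of how word positions make these query types efficient (the index contents and helper names below are made up for illustration, not from the slides):

```python
# Sketch: phrase and window matching on a positional index.
# Positional postings map term -> {docid: [word positions]}.
index = {
    "tropical": {1: [2, 7], 2: [1]},
    "fish":     {1: [3, 9], 3: [4]},
}

def phrase_match(index, w1, w2):
    """Docs where w2 occurs immediately after w1."""
    hits = []
    for doc in index.get(w1, {}).keys() & index.get(w2, {}).keys():
        pos2 = set(index[w2][doc])
        if any(p + 1 in pos2 for p in index[w1][doc]):
            hits.append(doc)
    return sorted(hits)

def within_window(index, w1, w2, k):
    """Docs where w1 and w2 occur within k words of each other."""
    hits = []
    for doc in index.get(w1, {}).keys() & index.get(w2, {}).keys():
        if any(abs(p1 - p2) <= k
               for p1 in index[w1][doc] for p2 in index[w2][doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_match(index, "tropical", "fish"))     # doc 1: positions 2 and 3 are adjacent
print(within_window(index, "tropical", "fish", 5))
```

Note that only documents appearing in both terms' inverted lists are ever examined, which is what makes positional postings efficient for these queries.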
Fields and Extents
• Document structure is useful in search
– field restrictions
• e.g., date, from:, etc.
– some fields more important
• e.g., title
• Options:
– separate inverted lists for each field type
– add information about fields to postings
– use extent lists
Extent Lists
• An extent is a contiguous region of a
document
– represent extents using word positions
– inverted list records all extents for a given field
type
– e.g., an extent list for a field such as title (figure: extent list)
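A minimal sketch of extent-list checking, assuming extents are stored as (start, end) word-position intervals with the end exclusive (the data and helper names are hypothetical):

```python
# Sketch: extent list for a "title" field, stored per document as
# (start, end) word-position intervals, end exclusive. Data is made up.
title_extents = {1: [(0, 4)], 2: [(0, 3)]}   # doc -> list of title extents
postings = {"fish": {1: [2, 9], 2: [5]}}     # positional postings

def in_field(extents, postings, term, doc):
    """True if any occurrence of term in doc falls inside a field extent."""
    spans = extents.get(doc, [])
    return any(s <= p < e
               for p in postings.get(term, {}).get(doc, [])
               for (s, e) in spans)

print(in_field(title_extents, postings, "fish", 1))  # True: position 2 lies in (0, 4)
print(in_field(title_extents, postings, "fish", 2))  # False: position 5 not in (0, 3)
```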
Other Issues
• Precomputed scores in inverted list
– e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is
total feature value for document 1
– improves speed but reduces flexibility
• Score-ordered lists
– query processing engine can focus only on the top
part of each inverted list, where the highest-scoring
documents are recorded
– very efficient for single-word queries
Index Construction
• Simple in-memory indexer

procedure BuildIndex(D)
    I ← HashTable()
    n ← 0
    for all documents d in D do
        n ← n + 1
        T ← Parse(d)
        RemoveDuplicates(T)
        for all tokens t in T do
            if t not in I then
                I[t] ← List<Posting>()
            I[t].append(Posting(n))
        end for
    end for
    return I
end procedure
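A minimal Python sketch of this simple in-memory indexer, matching the slide's fragments (the tokenizer and sample documents are stand-ins, not the book's implementation):

```python
# Sketch: simple in-memory indexer. I[t] is a list of Postings;
# documents are numbered in the order they are parsed (from 0 here).
from collections import namedtuple

Posting = namedtuple("Posting", "docid")

def parse(doc):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return doc.lower().split()

def build_index(docs):
    index = {}                        # term -> list of Postings
    for n, doc in enumerate(docs):
        for t in set(parse(doc)):     # RemoveDuplicates: one posting per doc
            index.setdefault(t, []).append(Posting(n))
    return index

idx = build_index(["tropical fish", "fish tank"])
print(idx["fish"])                    # postings for docs 0 and 1
```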
Index Construction: Doc Id
• Our doc ids are URLs:
– https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• That’s 77 characters
• 77 bytes in C++
• 126 bytes in Python3
• 352 bytes in Java
• Remember, they are used in postings lists
– Very wasteful!
Index Construction: Doc Id
• Map URLs to integers as you process docs:
• 0 → https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• 1 → https://www.ics.uci.edu/~goodrich/teach/ics247/notes
• 2 → https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html
• Etc.
• Size of integers:
– 4 bytes in C++ and Java
– 28 bytes in Python
– Always the same size independent of the string
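The cited Python sizes can be checked directly with `sys.getsizeof` (figures assume a 64-bit CPython build):

```python
# Checking the slide's Python object sizes (CPython, 64-bit build).
import sys

url = "https://www.ics.uci.edu/~goodrich/teach/ics247/notes"
print(sys.getsizeof(url))    # ~49 bytes of overhead + 1 byte per ASCII char
print(sys.getsizeof(1))      # 28 bytes for a small int
print(sys.getsizeof(10**9))  # still 28 bytes: same size regardless of value
```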
Index Construction: Doc Id
• Estimate postings size for Python3:
– 10,000 index terms
– Avg. 20 postings per term
– Avg. URL: 51 characters (= 100 bytes)
• Doc ids as URLs:
– 10,000 x 20 x 100 = 20,000,000 = 20M
• Doc ids as integers:
– 10,000 x 20 x 28 = 5,600,000 = 5.6M
– Reduced to 28% of string version
– Can fit ~3.6x more postings in memory
Index Construction: Doc Id
• Map URLs to integers as you process docs:
• 0 → https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• 1 → https://www.ics.uci.edu/~goodrich/teach/ics247/notes
• 2 → https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html
• Etc.
• Must keep that mapping stored somewhere
– You will need it for showing the search results
– Typically not part of inverted index itself, but auxiliary
bookkeeping file (e.g. docs.txt)
• More on this later
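A sketch of the auxiliary URL-to-id mapping; docs.txt is the slide's example file name, while the class and the tab-separated format here are assumptions:

```python
# Sketch: URL <-> integer doc id mapping, kept outside the inverted index.
class DocIdMap:
    def __init__(self):
        self.url_to_id = {}
        self.id_to_url = []

    def get_id(self, url):
        """Assign the next integer id on first sight, else reuse it."""
        if url not in self.url_to_id:
            self.url_to_id[url] = len(self.id_to_url)
            self.id_to_url.append(url)
        return self.url_to_id[url]

    def save(self, path="docs.txt"):
        """Persist the mapping for later use when showing search results."""
        with open(path, "w") as f:
            for i, url in enumerate(self.id_to_url):
                f.write(f"{i}\t{url}\n")

m = DocIdMap()
print(m.get_id("https://www.ics.uci.edu/~goodrich/teach/ics247/notes"))            # 0
print(m.get_id("https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html")) # 1
print(m.get_id("https://www.ics.uci.edu/~goodrich/teach/ics247/notes"))            # 0 again
```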
Index Construction: Postings
• Contain context of term occurrence in a
document
– Doc id (required for Project 3)
– Frequency count or TF-IDF
– Fields
– Positions
– ...
Index Construction: Postings

class Posting:
    def __init__(self, docid, tfidf, fields):
        self.docid = docid
        self.tfidf = tfidf  # use freq counts for now
        self.fields = fields

Think of what this data structure must contain.
You may model it as a class or as a structured tuple.
Index Construction
• Simple in-memory indexer

– What happens if you run out of memory?
Sec. 4.2

Scaling index construction


• In-memory index construction does not scale
– Can’t fit entire collection into memory
• How can we construct an index for very large
collections?
• Taking into account hardware constraints. . .
– Memory, disk, speed, etc.
Index Construction
• Simple in-memory indexer

– Could this be a file, directly?
Sec. 4.1

Hardware basics
• Servers used in IR systems typically have
several GB of main memory, sometimes tens
of GB.
• Available disk space is several (2–3) orders of
magnitude larger.
• But... Fault tolerance is very expensive: It’s
much cheaper to use many regular machines
rather than one fault tolerant machine.
– Regular machines → much smaller RAM
Sec. 4.1

Hardware basics
• Access to data in memory is much faster than
access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
– Transferring one large chunk of data from disk to
memory is faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
– Block sizes: 8KB to 256 KB.
Sec. 4.1

Jeff Dean’s (*)


“Latency Numbers Every Programmer Should Know”
• Latency Comparison Numbers (~2012)

• L1 cache reference: 0.5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 25 ns
• Main memory reference: 100 ns
• Compress 1K bytes with Zippy: 3,000 ns = 3 us
• Send 1K bytes over 1 Gbps network: 10,000 ns = 10 us
• Read 4K randomly from SSD*: 150,000 ns = 150 us
• Read 1 MB sequentially from memory: 250,000 ns = 250 us
• Round trip within same datacenter: 500,000 ns = 500 us
• Read 1 MB sequentially from SSD*: 1,000,000 ns = 1 ms
• Disk seek: 10,000,000 ns = 10 ms
• Read 1 MB sequentially from disk: 20,000,000 ns = 20 ms
• Send packet CA->Netherlands->CA: 150,000,000 ns = 150 ms

(*) https://ai.google/research/people/jeff/
Sec. 4.2

Sort using disk as “memory”?


• Can we use the same index construction
algorithm for larger collections, but by using
disk instead of memory?
• No: accessing/modifying T = 100,000,000
records on disk is too slow – too many disk
seeks.
• We need an external sorting algorithm.
Partial Indexes + Merging
• Build the inverted list structure until a certain
size
• Then write the partial index to disk, start
making a new one
• At the end of this process, the disk is filled
with many partial indexes, which are merged
• Partial lists must be designed so they can be
merged in small pieces
– e.g., storing in alphabetical order
Partial indexes
procedure BuildIndex(D)
    I ← HashTable()
    n ← 0
    while D is not empty do
        B ← GetBatch(D)  # batch of documents
        for all documents d in B do
            n ← n + 1
            T ← Parse(d)
            RemoveDuplicates(T)
            for all tokens t in T do
                if t not in I then
                    I[t] ← []
                I[t].append(Posting(n))
            end for
        end for
        SortAndWriteToDisk(I, name)  # name identifies this partial index
        I.clear()
    end while
end procedure
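A Python sketch of this procedure, flushing a sorted partial index to disk whenever a batch completes (the one-line-per-term file format and helper names are assumptions, not the book's implementation):

```python
# Sketch: build partial indexes in batches, writing each run to disk in
# sorted term order so the runs can later be merged. Format assumption:
# one "term docid,docid,..." line per term.
import os
import tempfile

def sort_and_write(index, path):
    with open(path, "w") as f:
        for term in sorted(index):            # alphabetical: mergeable runs
            f.write(term + " " + ",".join(map(str, index[term])) + "\n")

def build_partial_indexes(docs, batch_size=2, outdir=None):
    outdir = outdir or tempfile.mkdtemp()
    index, paths, n = {}, [], 0
    for doc in docs:
        for t in set(doc.lower().split()):    # RemoveDuplicates
            index.setdefault(t, []).append(n)
        n += 1
        if n % batch_size == 0:               # batch full: flush this run
            paths.append(os.path.join(outdir, f"run{len(paths)}.txt"))
            sort_and_write(index, paths[-1])
            index = {}
    if index:                                 # flush the last partial index
        paths.append(os.path.join(outdir, f"run{len(paths)}.txt"))
        sort_and_write(index, paths[-1])
    return paths

runs = build_partial_indexes(["tropical fish", "fish tank", "noble fish"])
print(runs)  # two partial-index files on disk
```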
Merging
Sec. 4.2

How to merge the sorted runs?


• Can do binary merges, 2 files at a time
• During each layer, read into memory in blocks of 10M, merge,
write back.
Postings lists to be merged (on disk):
  run 1: brutus d1,d3 | caesar d1,d2,d4 | noble d5 | with d1,d2,d3,d5
  run 2: brutus d6,d7 | caesar d8,d9 | julius d10 | killed d8

Merged postings list (on disk):
  brutus d1,d3,d6,d7 | caesar d1,d2,d4,d8,d9 | julius d10 |
  killed d8 | noble d5 | with d1,d2,d3,d5
Sec. 4.2

How to merge the sorted runs?


• But it is more efficient to do a multi-way merge, where you are
reading from all files simultaneously
– Open all partial index files simultaneously and
maintain a read buffer for each one and a write buffer
for the output file
– In each iteration, pick the lowest termID that hasn’t
been processed
– Merge all postings lists for that termID and write it out

• Provided you read decent-sized chunks of each block into memory
and then write out a decent-sized output chunk, you're not
killed by disk seeks
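A sketch of the multi-way merge using Python's `heapq.merge`, which keeps all runs open and always emits the lowest term next (the run file format, one "term d1,d2,..." line per term, is an assumption carried over from the partial-index sketch):

```python
# Sketch: multi-way merge of sorted partial-index files with heapq.merge.
# Assumed run format: one "term d1,d2,..." line per term, terms sorted.
import heapq

def read_run(path):
    """Stream (term, docids) pairs from one sorted run file."""
    with open(path) as f:
        for line in f:
            term, docs = line.split()
            yield term, docs

def merge_runs(run_paths, out_path):
    """Merge all runs simultaneously, concatenating postings per term."""
    streams = [read_run(p) for p in run_paths]
    with open(out_path, "w") as out:
        current_term, current_docs = None, []
        for term, docs in heapq.merge(*streams):  # globally sorted by term
            if term != current_term:
                if current_term is not None:
                    out.write(f"{current_term} {','.join(current_docs)}\n")
                current_term, current_docs = term, []
            current_docs.append(docs)  # earlier runs hold lower doc ids
        if current_term is not None:
            out.write(f"{current_term} {','.join(current_docs)}\n")
```

Because each run is read as a lazy stream, only one buffered line per run sits in memory at a time; the operating system's block-based buffered reads do the "decent-sized chunk" work described above.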
Project 3: indexer
• Milestone #1:
– Start with a small set of files (e.g. www-db_ics_uci_edu). Short development cycle for simple tasks:
• Traversing folders in search for JSON files
• Opening and reading one file at a time
• JSON & HTML parsing
• Tokenization & stemming
• Simple in-memory inverted index
• Simple index serialization to disk
– Expand gradually to as much of the dataset as
possible, until you hit memory limits
– No need to scale up yet...
Project 3: indexer
• Milestone #3:
– Scale up
• Use this lecture’s material
