chap5-index-construction
• supports better ranking algorithms
Inverted Index with positions
• supports proximity matches
Proximity Matches
• Matching phrases or words within a window
– e.g., “tropical fish”, or “find tropical within 5 words of fish”
• Word positions in inverted lists make these types of query features efficient
– e.g.,
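A minimal sketch of how word positions support these two query types, assuming each inverted list entry carries a sorted list of word positions for one document. The function names and the sample positions are hypothetical, not taken from the slides.

```python
def within_window(positions_a, positions_b, k):
    """Return True if some position in positions_a is within k words of one in positions_b.

    Both lists must be sorted word positions from the same document.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position: later positions
        # in the other list are only farther away from it.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


def is_phrase(positions_first, positions_second):
    """Return True if the second word directly follows the first somewhere."""
    following = set(positions_second)
    return any(p + 1 in following for p in positions_first)


# Hypothetical positions of each word in one document.
tropical = [1, 7, 40]
fish = [2, 18, 23]

phrase_match = is_phrase(tropical, fish)          # "tropical fish"
window_match = within_window(tropical, fish, 5)   # tropical within 5 words of fish
```

Both checks touch only the two position lists, so no document text has to be re-read at query time.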
Fields and Extents
• Document structure is useful in search
– field restrictions
• e.g., date, from:, etc.
– some fields more important
• e.g., title
• Options:
– separate inverted lists for each field type
– add information about fields to postings
– use extent lists
Extent Lists
• An extent is a contiguous region of a document
– represent extents using word positions
– inverted list records all extents for a given field type
– e.g., [figure: extent list]
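One way to picture an extent list is as (start, end) pairs of word positions per document. The sketch below assumes half-open extents (start included, end excluded) and uses made-up positions; it restricts occurrences of a term to the title field.

```python
def in_extent(position, extents):
    """Return True if a word position falls inside any (start, end) extent.

    Extents are treated as half-open: start is included, end is not.
    This convention is an assumption made for the sketch.
    """
    return any(start <= position < end for start, end in extents)


# Hypothetical data for one document.
fish_positions = [2, 9, 31]   # where "fish" occurs
title_extents = [(0, 4)]      # the title covers word positions 0..3

# Keep only the occurrences of "fish" that fall inside the title.
title_hits = [p for p in fish_positions if in_extent(p, title_extents)]
```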
Other Issues
• Precomputed scores in inverted list
– e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is total feature value for document 1
– improves speed but reduces flexibility
• Score-ordered lists
– query processing engine can focus only on the top part of each inverted list, where the highest-scoring documents are recorded
– very efficient for single-word queries
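A small sketch of why score ordering helps a single-word query, assuming postings are (doc id, precomputed score) pairs. The extra postings beyond the two on the slide are invented for illustration.

```python
# Hypothetical precomputed postings for "fish": (doc id, score), in doc id order.
fish = [(1, 3.6), (3, 2.2), (4, 5.1), (9, 0.7)]

# Score-ordered version of the same list: highest score first.
fish_by_score = sorted(fish, key=lambda posting: posting[1], reverse=True)


def top_k_single_term(score_ordered_list, k):
    """For a one-word query, the top k results are the first k postings."""
    return score_ordered_list[:k]


best_two = top_k_single_term(fish_by_score, 2)
```

With the doc-id-ordered list, the whole list would have to be scanned to find the same two documents.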
Index Construction
• Simple in-memory indexer
List<Posting>()
It.append(Posting(n))
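A minimal in-memory indexer in Python, as a sketch only. It assumes documents arrive as a dict of doc id to text, tokenizes with a plain lowercase split, and stores (doc id, count) tuples as postings; none of these choices come from the slides.

```python
from collections import defaultdict


def build_index(docs):
    """Build {term: [(doc_id, term_count), ...]} from {doc_id: text}.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Count each term within this document.
        counts = {}
        for token in docs[doc_id].lower().split():
            counts[token] = counts.get(token, 0) + 1
        # Append one posting per distinct term to that term's list.
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)


docs = {0: "tropical fish", 1: "fish tank fish"}
index = build_index(docs)
```

Because documents are processed in doc id order, every postings list comes out sorted by doc id.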
Index Construction: Doc Id
• Our doc ids are URLs:
– https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• That’s 77 characters
• 77 bytes in C++
• 126 bytes in Python3
• 352 bytes in Java
• Remember, they are used in postings lists
– Very wasteful!
Index Construction: Doc Id
• Map URLs to integers as you process docs:
• 0 https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• 1 https://www.ics.uci.edu/~goodrich/teach/ics247/notes
• 2 https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html
• Etc.
• Size of integers:
– 4 bytes in C++ and Java
– 28 bytes in Python
– Always the same size independent of the string
Index Construction: Doc Id
• Estimate postings size for Python3:
– 10,000 index terms
– Avg. 20 postings per term
– Avg. URL: 51 characters (= 100 bytes)
• Doc ids as URLs:
– 10,000 x 20 x 100 = 20,000,000 bytes = 20 MB
• Doc ids as integers:
– 10,000 x 20 x 28 = 5,600,000 bytes = 5.6 MB
– Reduced to 28% of string version
– Can fit about 3.6x more postings in memory
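The estimate can be reproduced with a few lines of arithmetic. The helper name is hypothetical; the per-object sizes (100 and 28 bytes) are the figures from the slide, which in CPython can be checked with sys.getsizeof and may vary by version.

```python
def postings_bytes(num_terms, avg_postings_per_term, bytes_per_doc_id):
    """Total bytes spent on doc ids across all postings."""
    return num_terms * avg_postings_per_term * bytes_per_doc_id


url_total = postings_bytes(10_000, 20, 100)  # doc ids stored as ~51-character URLs
int_total = postings_bytes(10_000, 20, 28)   # doc ids stored as Python ints

ratio = int_total / url_total    # fraction of the string version's size
capacity_gain = 1 / ratio        # how many times more postings fit in the same memory
```

Here ratio works out to 0.28, so the same memory holds roughly 3.6 times as many postings.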
Index Construction: Doc Id
• Map URLs to integers as you process docs
• Must keep that mapping stored somewhere
– You will need it for showing the search results
– Typically not part of inverted index itself, but auxiliary
bookkeeping file (e.g. docs.txt)
• More on this later
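A sketch of assigning integer doc ids and keeping the mapping in an auxiliary file. The tab-separated layout of docs.txt and the example.org URLs are assumptions made for illustration.

```python
import os
import tempfile


def assign_doc_ids(urls):
    """Give each distinct URL the next integer id, in order of first appearance."""
    url_to_id = {}
    for url in urls:
        if url not in url_to_id:
            url_to_id[url] = len(url_to_id)
    return url_to_id


def save_mapping(url_to_id, path):
    """Write one 'id<TAB>url' line per document (file layout is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        for url, doc_id in sorted(url_to_id.items(), key=lambda item: item[1]):
            f.write(f"{doc_id}\t{url}\n")


def load_mapping(path):
    """Read the file back as {id: url}, e.g. for showing search results."""
    id_to_url = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc_id, url = line.rstrip("\n").split("\t", 1)
            id_to_url[int(doc_id)] = url
    return id_to_url


# Hypothetical crawl order; the repeated URL keeps its first id.
urls = ["https://example.org/a", "https://example.org/b", "https://example.org/a"]
url_to_id = assign_doc_ids(urls)

path = os.path.join(tempfile.mkdtemp(), "docs.txt")
save_mapping(url_to_id, path)
id_to_url = load_mapping(path)
```

The index itself then stores only the integers, and the file is consulted when results are displayed.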
Index Construction: Postings
• Contain context of term occurrence in a
document
– Doc id
Required for Project 3
– Frequency count or TF-IDF
– Fields
– Positions
– ...
Index Construction: Postings
class Posting:
    def __init__(self, docid, tfidf, fields):
        self.docid = docid
        self.tfidf = tfidf  # use freq counts for now
        self.fields = fields
List<Posting>()
It.append(Posting(n))
• What happens if you run out of memory?
Sec. 4.2
List<Posting>()
It.append(Posting(n))
• Could this be a file, directly?
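One possible answer to the memory question, sketched under assumptions: keep the in-memory index small and write a partial index to disk whenever it holds too many postings. The threshold, the JSON file format, and the block file names are all hypothetical choices, and postings here are bare doc ids.

```python
import json
import os
import tempfile
from collections import defaultdict


def index_in_blocks(docs, max_postings, out_dir):
    """Index {doc_id: text}, writing a partial index file whenever the
    in-memory postings count reaches max_postings. Returns the file paths."""
    paths = []
    index = defaultdict(list)
    held = 0  # number of postings currently in memory

    def flush():
        nonlocal index, held
        if not index:
            return
        path = os.path.join(out_dir, f"block{len(paths)}.json")
        with open(path, "w", encoding="utf-8") as f:
            # Terms are written in sorted order so blocks can be merged later.
            json.dump({term: index[term] for term in sorted(index)}, f)
        paths.append(path)
        index = defaultdict(list)
        held = 0

    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
            held += 1
        # Check the threshold after each document, then start a fresh block.
        if held >= max_postings:
            flush()
    flush()  # write whatever is left
    return paths


docs = {0: "tropical fish", 1: "fish tank", 2: "tank"}
paths = index_in_blocks(docs, 3, tempfile.mkdtemp())
```

Each block is a complete small index; the blocks still have to be merged into one index afterwards.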
Sec. 4.1
Hardware basics
• Servers used in IR systems typically have
several GB of main memory, sometimes tens
of GB.
• Available disk space is several (2–3) orders of
magnitude larger.
• But... Fault tolerance is very expensive: It’s
much cheaper to use many regular machines
rather than one fault tolerant machine.
– Regular machines have much smaller RAM
Hardware basics
• Access to data in memory is much faster than
access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
– Transferring one large chunk of data from disk to
memory is faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
– Block sizes: 8 KB to 256 KB.
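A rough cost model makes the seek penalty concrete. The seek time and transfer rate below are hypothetical round numbers chosen for illustration, not measurements or values from the slides.

```python
def read_time_seconds(num_chunks, chunk_bytes, seek_seconds, bytes_per_second):
    """Simple model: every chunk costs one seek plus its own transfer time."""
    return num_chunks * (seek_seconds + chunk_bytes / bytes_per_second)


# Hypothetical disk: 5 ms per seek, 50 MB/s transfer rate.
SEEK = 0.005
RATE = 50_000_000

one_big = read_time_seconds(1, 100_000_000, SEEK, RATE)      # 100 MB in one chunk
many_small = read_time_seconds(10_000, 10_000, SEEK, RATE)   # same 100 MB in 10 KB chunks
```

Under these assumed numbers the transfer time is identical in both cases; the difference comes entirely from paying for 10,000 seeks instead of one.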
(*) https://ai.google/research/people/jeff/
[Figure: postings lists to be merged, read from disk, and the resulting merged postings list (example postings: d1, d2, d3, d5)]
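Merging two postings lists for the same term can be sketched as a standard two-pointer merge, assuming both lists are sorted by doc id. The function name and sample lists are hypothetical.

```python
def merge_postings(list_a, list_b):
    """Merge two sorted doc id lists into one sorted list without duplicates."""
    merged = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            # Same document in both lists: keep it once.
            merged.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            merged.append(list_a[i])
            i += 1
        else:
            merged.append(list_b[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    merged.extend(list_a[i:])
    merged.extend(list_b[j:])
    return merged


# Hypothetical partial lists for one term from two blocks.
merged = merge_postings([1, 3], [2, 5])
```

Each list is read once from front to back, which suits sequential reads from disk.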