
chap5-index-construction

The document discusses the structure and function of indexes in search engines, particularly focusing on inverted indexes which facilitate faster search and ranking of documents. It covers the construction of these indexes, including the importance of document IDs, postings, and the efficiency of proximity matches. Additionally, it addresses the challenges of scaling index construction for large collections and outlines methods for merging partial indexes to optimize performance.


Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


With changes by Crista Lopes
Indexes
• Indexes are data structures designed to make
search faster
• Text search has unique requirements, which
lead to unique data structures
• The most common data structure is the inverted index
– general name for a class of structures
– “inverted” because documents are associated
with words, rather than words with documents
• similar to a concordance
Indexes and Ranking
• Indexes are designed to support search
– faster response time, supports updates
• Text search engines use a particular form of
search: ranking
– documents are retrieved in sorted order according to
a score computed using the document
representation, the query, and a ranking algorithm
• What is a reasonable abstract model for ranking?
– enables discussion of indexes without details of
retrieval model
Abstract Model of Ranking
More Concrete Model
Inverted Index
• Each index term is associated with an inverted
list
– Contains lists of documents, or lists of word
occurrences in documents, and other information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a unique
number
– Lists are usually document-ordered (sorted by
document number)
Example “Collection”
Simple Inverted
Index
Inverted Index
with counts

• supports better
ranking algorithms
Inverted Index
with positions

• supports
proximity matches
Proximity Matches
• Matching phrases or words within a window
– e.g., "tropical fish", or “find tropical within 5
words of fish”
• Word positions in inverted lists make these
types of query features efficient
– e.g.,
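As a minimal sketch of how word positions make these query types efficient (the index contents and helper names below are made up for illustration, not from the slides):

```python
# Sketch: phrase and window matching on a positional index.
# Positional postings map term -> {docid: [word positions]}.
index = {
    "tropical": {1: [2, 7], 2: [1]},
    "fish":     {1: [3, 9], 3: [4]},
}

def phrase_match(index, w1, w2):
    """Docs where w2 occurs immediately after w1."""
    hits = []
    for doc in index.get(w1, {}).keys() & index.get(w2, {}).keys():
        pos2 = set(index[w2][doc])
        if any(p + 1 in pos2 for p in index[w1][doc]):
            hits.append(doc)
    return sorted(hits)

def within_window(index, w1, w2, k):
    """Docs where w1 and w2 occur within k words of each other."""
    hits = []
    for doc in index.get(w1, {}).keys() & index.get(w2, {}).keys():
        if any(abs(p1 - p2) <= k
               for p1 in index[w1][doc] for p2 in index[w2][doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_match(index, "tropical", "fish"))     # doc 1: positions 2 and 3 are adjacent
print(within_window(index, "tropical", "fish", 5))
```

Note that only documents appearing in both terms' inverted lists are ever examined, which is what makes positional postings efficient for these queries.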
Fields and Extents
• Document structure is useful in search
– field restrictions
• e.g., date, from:, etc.
– some fields more important
• e.g., title
• Options:
– separate inverted lists for each field type
– add information about fields to postings
– use extent lists
Extent Lists
• An extent is a contiguous region of a
document
– represent extents using word positions
– inverted list records all extents for a given field
type
– e.g., an extent list for a field such as title (figure: extent list)
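A minimal sketch of extent-list checking, assuming extents are stored as (start, end) word-position intervals with the end exclusive (the data and helper names are hypothetical):

```python
# Sketch: extent list for a "title" field, stored per document as
# (start, end) word-position intervals, end exclusive. Data is made up.
title_extents = {1: [(0, 4)], 2: [(0, 3)]}   # doc -> list of title extents
postings = {"fish": {1: [2, 9], 2: [5]}}     # positional postings

def in_field(extents, postings, term, doc):
    """True if any occurrence of term in doc falls inside a field extent."""
    spans = extents.get(doc, [])
    return any(s <= p < e
               for p in postings.get(term, {}).get(doc, [])
               for (s, e) in spans)

print(in_field(title_extents, postings, "fish", 1))  # True: position 2 lies in (0, 4)
print(in_field(title_extents, postings, "fish", 2))  # False: position 5 not in (0, 3)
```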
Other Issues
• Precomputed scores in inverted list
– e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is
total feature value for document 1
– improves speed but reduces flexibility
• Score-ordered lists
– query processing engine can focus only on the top
part of each inverted list, where the highest-scoring
documents are recorded
– very efficient for single-word queries
Index Construction
• Simple in-memory indexer

procedure BuildIndex(D)
    I ← HashTable()
    n ← 0
    for all documents d in D do
        n ← n + 1
        T ← Parse(d)
        RemoveDuplicates(T)
        for all tokens t in T do
            if t not in I then
                I[t] ← List<Posting>()
            I[t].append(Posting(n))
        end for
    end for
    return I
end procedure
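A minimal Python sketch of this simple in-memory indexer, matching the slide's fragments (the tokenizer and sample documents are stand-ins, not the book's implementation):

```python
# Sketch: simple in-memory indexer. I[t] is a list of Postings;
# documents are numbered in the order they are parsed (from 0 here).
from collections import namedtuple

Posting = namedtuple("Posting", "docid")

def parse(doc):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return doc.lower().split()

def build_index(docs):
    index = {}                        # term -> list of Postings
    for n, doc in enumerate(docs):
        for t in set(parse(doc)):     # RemoveDuplicates: one posting per doc
            index.setdefault(t, []).append(Posting(n))
    return index

idx = build_index(["tropical fish", "fish tank"])
print(idx["fish"])                    # postings for docs 0 and 1
```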
Index Construction: Doc Id
• Our doc ids are URLs:
– https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• That’s 77 characters
• 77 bytes in C++
• 126 bytes in Python3
• 352 bytes in Java
• Remember, they are used in postings lists
– Very wasteful!
Index Construction: Doc Id
• Map URLs to integers as you process docs:
• 0 → https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• 1 → https://www.ics.uci.edu/~goodrich/teach/ics247/notes
• 2 → https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html
• Etc.
• Size of integers:
– 4 bytes in C++ and Java
– 28 bytes in Python
– Always the same size independent of the string
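The cited Python sizes can be checked directly with `sys.getsizeof` (figures assume a 64-bit CPython build):

```python
# Checking the slide's Python object sizes (CPython, 64-bit build).
import sys

url = "https://www.ics.uci.edu/~goodrich/teach/ics247/notes"
print(sys.getsizeof(url))    # ~49 bytes of overhead + 1 byte per ASCII char
print(sys.getsizeof(1))      # 28 bytes for a small int
print(sys.getsizeof(10**9))  # still 28 bytes: same size regardless of value
```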
Index Construction: Doc Id
• Estimate postings size for Python3:
– 10,000 index terms
– Avg. 20 postings per term
– Avg. URL: 51 characters (= 100 bytes)
• Doc ids as URLs:
– 10,000 x 20 x 100 = 20,000,000 = 20M
• Doc ids as integers:
– 10,000 x 20 x 28 = 5,600,000 = 5.6M
– Reduced to 28% of string version
– Can fit ~3.6x more postings in memory
Index Construction: Doc Id
• Map URLs to integers as you process docs:
• 0 → https://www.ics.uci.edu/~aburtsev/238P/lectures/lecture03-calling-conventions
• 1 → https://www.ics.uci.edu/~goodrich/teach/ics247/notes
• 2 → https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html
• Etc.
• Must keep that mapping stored somewhere
– You will need it for showing the search results
– Typically not part of inverted index itself, but auxiliary
bookkeeping file (e.g. docs.txt)
• More on this later
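A sketch of the auxiliary URL-to-id mapping; docs.txt is the slide's example file name, while the class and the tab-separated format here are assumptions:

```python
# Sketch: URL <-> integer doc id mapping, kept outside the inverted index.
class DocIdMap:
    def __init__(self):
        self.url_to_id = {}
        self.id_to_url = []

    def get_id(self, url):
        """Assign the next integer id on first sight, else reuse it."""
        if url not in self.url_to_id:
            self.url_to_id[url] = len(self.id_to_url)
            self.id_to_url.append(url)
        return self.url_to_id[url]

    def save(self, path="docs.txt"):
        """Persist the mapping for later use when showing search results."""
        with open(path, "w") as f:
            for i, url in enumerate(self.id_to_url):
                f.write(f"{i}\t{url}\n")

m = DocIdMap()
print(m.get_id("https://www.ics.uci.edu/~goodrich/teach/ics247/notes"))            # 0
print(m.get_id("https://www.ics.uci.edu/~thornton/ics184/MidtermSolutions.html")) # 1
print(m.get_id("https://www.ics.uci.edu/~goodrich/teach/ics247/notes"))            # 0 again
```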
Index Construction: Postings
• Contain context of term occurrence in a
document
– Doc id (required for Project 3)
– Frequency count or TF-IDF
– Fields
– Positions
– ...
Index Construction: Postings

class Posting:
    def __init__(self, docid, tfidf, fields):
        self.docid = docid
        self.tfidf = tfidf  # use freq counts for now
        self.fields = fields

Think of what this data structure must contain.
You may model it as a class or as a structured tuple.
Index Construction
• Simple in-memory indexer

– What happens if you run out of memory?
Sec. 4.2

Scaling index construction


• In-memory index construction does not scale
– Can’t fit entire collection into memory
• How can we construct an index for very large
collections?
• Taking into account hardware constraints. . .
– Memory, disk, speed, etc.
Index Construction
• Simple in-memory indexer

– Could this be a file, directly?
Sec. 4.1

Hardware basics
• Servers used in IR systems typically have
several GB of main memory, sometimes tens
of GB.
• Available disk space is several (2–3) orders of
magnitude larger.
• But... Fault tolerance is very expensive: It’s
much cheaper to use many regular machines
rather than one fault tolerant machine.
– Regular machines → much smaller RAM
Sec. 4.1

Hardware basics
• Access to data in memory is much faster than
access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
– Transferring one large chunk of data from disk to
memory is faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
– Block sizes: 8KB to 256 KB.
Sec. 4.1

Jeff Dean’s (*)


“Latency Numbers Every Programmer Should Know”
• Latency Comparison Numbers (~2012)

• L1 cache reference: 0.5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 25 ns
• Main memory reference: 100 ns
• Compress 1K bytes with Zippy: 3,000 ns = 3 us
• Send 1K bytes over 1 Gbps network: 10,000 ns = 10 us
• Read 4K randomly from SSD*: 150,000 ns = 150 us
• Read 1 MB sequentially from memory: 250,000 ns = 250 us
• Round trip within same datacenter: 500,000 ns = 500 us
• Read 1 MB sequentially from SSD*: 1,000,000 ns = 1 ms
• Disk seek: 10,000,000 ns = 10 ms
• Read 1 MB sequentially from disk: 20,000,000 ns = 20 ms
• Send packet CA->Netherlands->CA: 150,000,000 ns = 150 ms

(*) https://ai.google/research/people/jeff/
Sec. 4.2

Sort using disk as “memory”?


• Can we use the same index construction
algorithm for larger collections, but by using
disk instead of memory?
• No: accessing/modifying T = 100,000,000
records on disk is too slow – too many disk
seeks.
• We need an external sorting algorithm.
Partial Indexes + Merging
• Build the inverted list structure until a certain
size
• Then write the partial index to disk, start
making a new one
• At the end of this process, the disk is filled
with many partial indexes, which are merged
• Partial lists must be designed so they can be
merged in small pieces
– e.g., storing in alphabetical order
Partial indexes
procedure BuildIndex(D)
    I ← HashTable()
    n ← 0
    while D is not empty do
        B ← GetBatch(D)  # batch of documents
        for all documents d in B do
            n ← n + 1
            T ← Parse(d)
            RemoveDuplicates(T)
            for all tokens t in T do
                if t not in I then
                    I[t] ← []
                I[t].append(Posting(n))
            end for
        end for
        SortAndWriteToDisk(I, name)  # name identifies this partial index
        I.clear()
    end while
end procedure
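A Python sketch of this procedure, flushing a sorted partial index to disk whenever a batch completes (the one-line-per-term file format and helper names are assumptions, not the book's implementation):

```python
# Sketch: build partial indexes in batches, writing each run to disk in
# sorted term order so the runs can later be merged. Format assumption:
# one "term docid,docid,..." line per term.
import os
import tempfile

def sort_and_write(index, path):
    with open(path, "w") as f:
        for term in sorted(index):            # alphabetical: mergeable runs
            f.write(term + " " + ",".join(map(str, index[term])) + "\n")

def build_partial_indexes(docs, batch_size=2, outdir=None):
    outdir = outdir or tempfile.mkdtemp()
    index, paths, n = {}, [], 0
    for doc in docs:
        for t in set(doc.lower().split()):    # RemoveDuplicates
            index.setdefault(t, []).append(n)
        n += 1
        if n % batch_size == 0:               # batch full: flush this run
            paths.append(os.path.join(outdir, f"run{len(paths)}.txt"))
            sort_and_write(index, paths[-1])
            index = {}
    if index:                                 # flush the last partial index
        paths.append(os.path.join(outdir, f"run{len(paths)}.txt"))
        sort_and_write(index, paths[-1])
    return paths

runs = build_partial_indexes(["tropical fish", "fish tank", "noble fish"])
print(runs)  # two partial-index files on disk
```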
Merging
Sec. 4.2

How to merge the sorted runs?


• Can do binary merges, 2 files at a time
• During each layer, read into memory in blocks of 10M, merge,
write back.
Postings lists to be merged (on disk):
  run 1: brutus d1,d3 | caesar d1,d2,d4 | noble d5 | with d1,d2,d3,d5
  run 2: brutus d6,d7 | caesar d8,d9 | julius d10 | killed d8

Merged postings list (on disk):
  brutus d1,d3,d6,d7 | caesar d1,d2,d4,d8,d9 | julius d10 |
  killed d8 | noble d5 | with d1,d2,d3,d5
Sec. 4.2

How to merge the sorted runs?


• But it is more efficient to do a multi-way merge, where you are
reading from all files simultaneously
– Open all partial index files simultaneously and
maintain a read buffer for each one and a write buffer
for the output file
– In each iteration, pick the lowest termID that hasn’t
been processed
– Merge all postings lists for that termID and write it out

• Provided you read decent-sized chunks of each block into memory
and then write out a decent-sized output chunk, you're not
killed by disk seeks
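A sketch of the multi-way merge using Python's `heapq.merge`, which keeps all runs open and always emits the lowest term next (the run file format, one "term d1,d2,..." line per term, is an assumption carried over from the partial-index sketch):

```python
# Sketch: multi-way merge of sorted partial-index files with heapq.merge.
# Assumed run format: one "term d1,d2,..." line per term, terms sorted.
import heapq

def read_run(path):
    """Stream (term, docids) pairs from one sorted run file."""
    with open(path) as f:
        for line in f:
            term, docs = line.split()
            yield term, docs

def merge_runs(run_paths, out_path):
    """Merge all runs simultaneously, concatenating postings per term."""
    streams = [read_run(p) for p in run_paths]
    with open(out_path, "w") as out:
        current_term, current_docs = None, []
        for term, docs in heapq.merge(*streams):  # globally sorted by term
            if term != current_term:
                if current_term is not None:
                    out.write(f"{current_term} {','.join(current_docs)}\n")
                current_term, current_docs = term, []
            current_docs.append(docs)  # earlier runs hold lower doc ids
        if current_term is not None:
            out.write(f"{current_term} {','.join(current_docs)}\n")
```

Because each run is read as a lazy stream, only one buffered line per run sits in memory at a time; the operating system's block-based buffered reads do the "decent-sized chunk" work described above.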
Project 3: indexer
• Milestone #1:
– Start with a small set of files (e.g. www-db_ics_uci_edu). Short development cycle for simple tasks:
• Traversing folders in search for JSON files
• Opening and reading one file at a time
• JSON & HTML parsing
• Tokenization & stemming
• Simple in-memory inverted index
• Simple index serialization to disk
– Expand gradually to as much of the dataset as
possible, until you hit memory limits
– No need to scale up yet...
Project 3: indexer
• Milestone #3:
– Scale up
• Use this lecture’s material
