Google Search Engine
1. Introduction
The authors: Lawrence Page, Sergey Brin
Started small at Stanford University during their graduate studies
2. System Features
PageRank: Bringing Order to the Web
(Link maps contain 518 million hyperlinks)
Description of PageRank calculation
Intuitive Justification
Anchor Text
Other Features
Description of PageRank
A = given page
T1 … Tn = pages that point to page A (i.e. citations)
d = damping factor, between 0 and 1
(usually d = 0.85)
C(A) = number of links going out of page A
PR(A) = the PageRank of a page A
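With these symbols, the formula in the original Brin and Page paper reads PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). A minimal sketch of iterating it on a toy graph follows; the function name `pagerank`, the dictionary representation of links, the fixed iteration count, and the handling of pages without outgoing links are assumptions for illustration, not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T that link to A.

    links maps each page to the list of pages it links to (hypothetical format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # assumption: pages with no out-links pass on no rank
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C is pointed to by both A and B, so it ends up with the highest rank, while B, which only receives half of A's rank, ends up lowest.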
Intuitive Justification
Assume there is a “random surfer”
The probability that the random surfer visits a page is its PageRank
1-d is the probability at each page that the “random
surfer” will get bored and jump to a random page
Variations of the formula
A page has a high rank:
If there are many pages that point to it
If some pages that point to it have a high PageRank
Anchor Text
Associate the text of a link not only with the page
the link is on but also with the page the link points to
Advantages:
Anchors often provide more accurate description
Anchors may exist for documents which cannot be
indexed (e.g. images, programs, and databases)
Propagating anchor text helps provide better
quality results
A crawl of 24 million pages yielded over 259 million anchors
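A minimal sketch of propagating anchor text to the page a link points to. The function name `propagate_anchor_text`, the (source, target, text) tuple format, and the example URLs are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def propagate_anchor_text(links):
    """Collect, for each target URL, the anchor texts of links pointing to it.

    links is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    anchors = defaultdict(list)
    for _source_url, target_url, text in links:
        anchors[target_url].append(text)
    return dict(anchors)


links = [
    ("http://a.example/", "http://img.example/logo.png", "company logo"),
    ("http://b.example/", "http://img.example/logo.png", "logo image"),
]
anchor_index = propagate_anchor_text(links)
```

The image itself contains no indexable text, yet it becomes searchable through the words other pages use to describe it.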
Other Features
Location information for all hits
Keep track of some visual presentation details
(e.g. font size of words)
Full raw HTML of pages is available in a
repository
3. Related Work
Information Retrieval
In the past, focused largely on small, controlled collections
(scientific papers, news stories, articles, etc.)
Ex. Assignment 1 (Vector Space Model)
4. System Anatomy
High-level discussion of the
architecture
In-depth description of the important data
structures
The major components:
crawling
indexing
searching
Implemented in C/C++ for Linux/Solaris
[Figure: high-level Google architecture. Components shown: URL Resolver, Repository, Indexer, Anchors, Links, Lexicon, Barrels, Doc Index, Sorters, PageRank, Searcher]
Major Data Structures
BigFiles
Virtual files addressable by 64-bit integers
Support rudimentary compression options
Repository
Contains full HTML of every web page
zlib compresses at 3 to 1, compared with 4 to 1 for bzip
Each record stores docID, length, and URL
Document Index
Keeps information about each document (fixed-width ISAM index, ordered by docID)
Current doc status, a pointer into the repository, a doc
checksum, and various statistics
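A minimal sketch of a repository record that stores a docID, URL, and zlib-compressed HTML. The byte layout in `HEADER` and the function names are assumptions made for illustration; the slide only states that records carry docID, length, and URL.

```python
import struct
import zlib

# Assumed header layout: docID, URL length, compressed page length (little-endian uint32 each).
HEADER = struct.Struct("<III")


def pack_record(doc_id, url, html):
    """Serialize one page into a single repository record."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(record):
    """Recover (docID, URL, HTML) from a record produced by pack_record."""
    doc_id, url_len, body_len = HEADER.unpack_from(record)
    start = HEADER.size
    url = record[start:start + url_len].decode("utf-8")
    body = record[start + url_len:start + url_len + body_len]
    return doc_id, url, zlib.decompress(body).decode("utf-8")


html = "<html><body>" + "hello web " * 200 + "</body></html>"
record = pack_record(42, "http://example.com/", html)
```

Because every record carries its own lengths, the repository can be scanned sequentially, and a document index entry only needs to hold a pointer to the start of a record.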
Major Data Structures (cont’d)
Lexicon
Repository of words
Implemented with a hash table of pointers (word ids) to
barrels (that are sorted lists)
14 million words, plus an extra file for rare words
Hit Lists
Stores occurrences of a particular word in a particular
document
Types of hits: plain, fancy, and anchor (2 bytes per hit)
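A minimal sketch of packing a plain hit into 2 bytes. The bit layout (1 capitalization bit, 3 font-size bits, 12 position bits) follows the description in the original Brin and Page paper rather than this slide; the function names and the clamping of large positions to 4095 are assumptions.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into a 16-bit integer: 1 bit cap, 3 bits font size, 12 bits position."""
    if not 0 <= font_size <= 6:
        # The value 7 (binary 111) is reserved to flag fancy and anchor hits.
        raise ValueError("font size must be between 0 and 6")
    position = min(position, 0xFFF)  # assumption: clamp positions that exceed 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 100)
```

Packing hits this tightly matters because hit lists account for most of the space used by both the forward and the inverted index.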
Inverted Index
Similar to Assignment 1 inverted index
Can be sorted by doc id or by ranking of word occurrence
Forward and Inverted Indexes and the Lexicon
Indexing the Web
Parsing
Parsers need to handle errors very well (typos, formatting,
etc.)
Indexing
After parsing, documents are placed into forward barrels
Words are converted into wordIDs and occurrences into hit lists
Sorting
Forward barrels are sorted by word id to produce an
inverted index
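A minimal sketch of the sorting step: forward-barrel entries are re-sorted by wordID to yield an inverted index. The (docID, wordID, hits) tuple format and the function name `invert` are hypothetical; real barrels are on-disk structures, not Python lists.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Turn (doc_id, word_id, hits) entries into {word_id: [(doc_id, hits), ...]}.

    Entries are sorted by word_id, then doc_id, so each doclist is ordered by docID.
    """
    postings = sorted(forward_barrel, key=lambda entry: (entry[1], entry[0]))
    inverted = defaultdict(list)
    for doc_id, word_id, hits in postings:
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


forward = [(2, 7, [4]), (1, 9, [0, 5]), (1, 7, [3])]
inverted = invert(forward)
```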
Searching
Goal of searching: provide quality search results
efficiently
Ranking system:
Every hit list includes position, font and
capitalization info
Each hit is one of several different
types, each of which has its own type-weight
Many parameters: type-weights, type-prox-weights
(for phrasal queries)
User feedback mechanism
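A minimal sketch of how type-weights could turn hit counts into a score for a single-word query. The hit types and weight values in `TYPE_WEIGHTS`, the cap of 8, and the function names are invented for illustration; the slide does not give actual parameter values.

```python
# Hypothetical type-weights: hits in more prominent places count for more.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow linearly with the number of hits, then stop growing past an assumed cap."""
    return float(min(count, cap))


def ir_score(hits_by_type):
    """Combine per-type hit counts into one score using the type-weights."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hits_by_type.items())


score = ir_score({"title": 1, "plain": 3})
```

Capping the count keeps a page from raising its score simply by repeating a word many times.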
Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for
every word.
4. Scan through the doclists until there is a document
that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any
doclist, seek to the start of the doclist in the full barrel for
every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and
return the top k.
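The steps above can be sketched as follows. This is a simplification under stated assumptions: barrels are plain dictionaries from wordID to a list of docIDs, matching uses set intersection instead of a sequential scan of the doclists, and `rank` is a caller-supplied scoring function. All names are hypothetical.

```python
def matching_docs(doclists):
    """Return the docIDs that appear in every doclist."""
    if not doclists:
        return set()
    result = set(doclists[0])
    for doclist in doclists[1:]:
        result &= set(doclist)
    return result


def evaluate(query, lexicon, short_barrels, full_barrels, rank, k=10):
    words = query.lower().split()                        # step 1: parse the query
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]               # step 2: words -> wordIDs
    matched = set()
    for barrels in (short_barrels, full_barrels):        # steps 3 and 6: short, then full
        doclists = [barrels.get(w, []) for w in word_ids]
        matched |= matching_docs(doclists)               # steps 4 and 7
    scored = [(rank(doc, word_ids), doc) for doc in matched]  # step 5
    scored.sort(key=lambda pair: (-pair[0], pair[1]))    # final step: sort by rank
    return [doc for _, doc in scored[:k]]


lexicon = {"web": 1, "search": 2}
short_barrels = {1: [10, 30], 2: [30]}
full_barrels = {1: [10, 20, 30], 2: [20, 30, 40]}
toy_scores = {20: 0.5, 30: 0.9}
top = evaluate("web search", lexicon, short_barrels, full_barrels,
               lambda doc, word_ids: toy_scores[doc])
```

Document 30 matches in the short barrels, document 20 only in the full barrels, and the final sort places 30 first because of its higher toy score.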
5. Results and Performance
Storage Requirements
System Performance
Crawling & Indexing efficiently
Search Performance
6. Conclusion
Designed to be a scalable search engine
Scalable Architecture
Efficient in both space and time
Bottlenecks in:
CPU
Memory capacity
Disk seeks
Disk throughput
Disk capacity
Network IO
Major data structures make efficient use of available
storage space (24 million pages processed in < 1 week)
Expected to build an index of 100 million pages in < 1 month