The Anatomy of a Large-Scale Hypertextual Web Search Engine

Article by: Larry Page and Sergey Brin

Computer Networks 30(1-7):107-117, 1998

1. Introduction
The authors: Lawrence Page, Sergey Brin
started small at Stanford Univ. during their graduate studies

Google index grew to over 1 billion pages in June 2000

At the time, Google served an average of 18 million queries/day
Google index grew to over 8 billion pages in 2003

Traditional keyword searches not accurate enough

Google makes use of HTML document structure

2. System Features
PageRank [1]: Bringing Order to the Web
(Link maps contain as many as 518 million hyperlinks)
Description of PageRank calculation
Intuitive Justification
Anchor Text
Other Features

[1] Demo available at google.stanford.edu

Description of PageRank
A = given page
T1 … Tn = pages that point to page A (i.e. citations)
d = damping factor which can be between 0 and 1
(usually we set d = 0.85)
C(A) = number of links going out of page A
PR(A) = the PageRank of a page A

PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

NOTE: the paper states that the sum of all web pages’ PageRank = 1; this holds when the (1-d) term is divided by the total number of pages N
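
The formula can be evaluated by repeated substitution. The sketch below is a minimal illustration, not the authors' implementation: the function name `pagerank`, the input format, the three-page toy graph, and the fixed iteration count are all assumptions made for the example, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical
    input format). Every page must be a key and have at least one out-link.
    """
    pr = {page: 1.0 for page in links}          # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Add up the contribution of every page T that points to `page`.
            incoming = sum(pr[t] / len(links[t])
                           for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

On this toy graph, page C (pointed to by both A and B) ends up with the highest value and page B (which receives only half of A's vote) with the lowest.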

Intuitive Justification
Assume there is a “random surfer”
Probability that random surfer visits a page is its PR
1-d is the probability at each page that the “random
surfer” will get bored and request another random page
Variations of the formula
A page has high rank:
If there are many pages that point to it
If there are some pages that point to it and have a high PageRank
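
The random-surfer picture can be checked with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_surfer`, the toy graph, the step count, and the fixed seed are hypothetical choices, and the visit frequencies it returns are estimates that sum to 1.

```python
import random


def simulate_surfer(links, d=0.85, steps=100_000, seed=0):
    """Estimate how often a random surfer visits each page of a small graph."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < d and links[current]:
            current = rng.choice(links[current])   # follow a random link
        else:
            current = rng.choice(pages)            # bored: jump to any page
    return {page: count / steps for page, count in visits.items()}


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = simulate_surfer(toy_web)
```

Pages with many or highly ranked in-links (A and C here) are visited more often than page B, which matches the two conditions listed above.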

Anchor Text
Associate the text of a link with the page that
the link is on and the page the link points to
Advantages:
Anchors often provide more accurate description
Anchors may exist for documents which cannot be
indexed (e.g. images, programs, and databases)
Propagating anchor text helps provide better
quality results
Crawl of 24 million pages yielded over 259 million indexed anchors
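
A minimal sketch of anchor propagation, using only the Python standard library. The class name `AnchorCollector`, the sample page, and the URLs are hypothetical; the point is only that anchor text gets indexed under the page the link points to, even when that target (an image, say) has no indexable text of its own.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []        # list of (absolute target URL, text)
        self._href = None        # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


page = ('<p>See the <a href="/logo.gif">Stanford logo</a> and the '
        '<a href="http://example.org/db">course database</a>.</p>')
collector = AnchorCollector("http://example.com/index.html")
collector.feed(page)

# Index each anchor text under the page the link points to.
anchor_index = defaultdict(list)
for target, text in collector.anchors:
    anchor_index[target].append(text)
```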

Other Features
Location information for all hits
Keep track of some visual presentation details
(e.g. font size of words)
Full raw HTML of pages is available in a
repository

3. Related Work

Information Retrieval
In the past, focused largely on small, well-controlled collections
(scientific papers, news stories, etc.)
Ex. Assignment 1 (Vector Space Model)

Differences between web and controlled collections

The web contains varying content (HTML, DOC, PDF, etc.)
No control of web content

4. System Anatomy
Provides a high-level discussion of the
architecture
Some in-depth descriptions of important data
structures
The major components:
crawling
indexing
searching
Mostly implemented in C/C++; runs on Linux or Solaris

Google Architecture overview

[Figure: high-level Google architecture. Components: URL Server, Crawlers, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Lexicon, Barrels, Doc Index, Sorters, PageRank, Searcher]

Major Data Structure
BigFiles
Virtual files spanning multiple file systems, addressable by 64-bit integers
Support rudimentary compression options
Repository
Contains full HTML of every web page
zlib compresses about 3 to 1, compared with about 4 to 1 for bzip; zlib was chosen for its speed
Each document is stored with its docID, length, and URL
Document Index
Keeps information about each document in a fixed-width ISAM index ordered by docID
Each entry holds the current doc status, a pointer into the repository, a doc
checksum, and various statistics
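
A sketch of how a repository record could be packed and read back, assuming a hypothetical header layout (docID, URL length, compressed length) followed by the URL and the zlib-compressed HTML. The field widths and the names `pack_record` and `unpack_record` are illustrative assumptions, not the format used by Google.

```python
import struct
import zlib

# Hypothetical header: docID (uint64), URL length (uint16),
# compressed page length (uint32), little-endian with no padding.
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(buf, offset=0):
    """Return (doc_id, url, html, offset of the next record)."""
    doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    body = buf[start + url_len:start + url_len + body_len]
    html = zlib.decompress(body).decode("utf-8")
    return doc_id, url, html, start + url_len + body_len


# Documents are stored one after the other in a single byte string.
repository = pack_record(1, "http://example.com/", "<html>hello</html>" * 20)
repository += pack_record(2, "http://example.org/", "<html>world</html>")
first = unpack_record(repository)
second = unpack_record(repository, first[3])
```

The offset returned for each record is the kind of value a document index entry could keep as its pointer into the repository.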

[Figure: Repository data structure]

Major Data Structure (cont’d)
Lexicon
Repository of words
Implemented with a hash table of pointers (word ids) to
barrels (that are sorted lists)
14 million words, plus an extra file for rare words

Hit Lists
Stores occurrences of a particular word in a particular
document
Types of hits: plain and fancy (fancy hits include anchor hits); 2 bytes per hit
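
A sketch of a 2-byte plain hit. The field widths follow the original paper (1 capitalization bit, 3 bits of font size, 12 bits of word position, with font value 7 reserved to flag fancy hits); the bit order, the clamping of large positions to 4095, and the function names are assumptions made for this example.

```python
def encode_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size <= 6:
        raise ValueError("font size 7 (binary 111) is reserved for fancy hits")
    position = min(position, 4095)       # clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Unpack a 16-bit plain hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = encode_plain_hit(True, 3, 42)
packed = hit.to_bytes(2, "big")          # exactly two bytes per hit
```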

Major Data Structure (cont’d)

Forward Index
Stored in barrels; each barrel holds a range of word ids
Each barrel entry: doc id followed by a list of word ids (with their hit lists)

Inverted Index
Similar to Assignment 1 inverted index
Can be sorted by doc id or by ranking of word occurrence

[Figure: Forward and Inverted Indexes and the Lexicon]

Crawling the Web

Crawlers need to be reliable, fast and robust
Some authors don’t want their pages to be crawled

Google (1998) had about 3-4 crawlers running

At peak speeds, the 4 crawlers processed over 100 web pages/sec.

Crawlers have different states: DNS lookup, connecting to host, sending request, receiving response

Crawlers and URL server were implemented in Python
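
The slides do not say how a crawler learns that an author does not want a page crawled. One common mechanism is the Robots Exclusion Protocol; the sketch below assumes that mechanism and uses Python's standard `urllib.robotparser`. The robots.txt content, the crawler name, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a site owner.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())    # parse rules without any network access

# Ask whether a crawler (hypothetical name) may fetch each URL.
allowed = parser.can_fetch("ExampleCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://example.com/private/notes.html")
```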

Indexing the Web
Parsing
Parsers need to handle errors very well (typos, formatting,
etc.)
Indexing
After parsing, placed into forward barrels
Words converted into word id and occurrences into hit lists
Sorting
Forward barrels are sorted by word id to produce an
inverted index
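
The sorting step can be sketched on a toy forward barrel. The dictionary layout and the sample ids are assumptions for illustration; a real barrel is a compact on-disk structure, not a Python dict.

```python
from collections import defaultdict

# Hypothetical forward barrel: docID -> {wordID: hit list (word positions)}.
forward_barrel = {
    1: {10: [0, 7], 12: [3]},
    2: {10: [5], 11: [1, 2]},
}

# Flatten to (wordID, docID, hits) postings and sort by wordID, then docID.
postings = sorted(
    (word_id, doc_id, hits)
    for doc_id, words in forward_barrel.items()
    for word_id, hits in words.items()
)

# Group the sorted postings into one doclist per wordID.
inverted_index = defaultdict(list)
for word_id, doc_id, hits in postings:
    inverted_index[word_id].append((doc_id, hits))
```

After sorting, all occurrences of a word sit next to each other, so looking up word id 10 returns its doclist directly instead of requiring a scan over every document.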

Searching
Goal of searching: provide quality search results
efficiently
Ranking system:
Every hit list includes position, font and
capitalization info
Consider each hit to be one of several different
types, each of which has its own type-weight
Many parameters: type-weights, type-prox-weights
(for multi-word queries)
User feedback mechanism
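
A sketch of a single-word score built from type-weights. The paper describes turning per-type hit counts into count-weights that stop growing after a certain count, then taking a dot product with the type-weights; the specific weights, the cap of 5, and the function names below are hypothetical values chosen for the example.

```python
# Hypothetical type-weights; the real values are tuning parameters.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "large_font": 2.0, "plain": 1.0}


def count_weight(count, cap=5):
    """Grow linearly with the hit count, then stop rewarding extra hits."""
    return float(min(count, cap))


def ir_score(hit_counts):
    """Dot product of the count-weight vector with the type-weight vector."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())


doc_a = {"title": 1, "plain": 3}     # one title hit, three plain hits
doc_b = {"plain": 30}                # many plain hits only
score_a = ir_score(doc_a)
score_b = ir_score(doc_b)
```

Because the count-weight is capped, repeating a word thirty times in plain text (doc_b) scores lower than a single title hit plus a few plain hits (doc_a).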

Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for
every word.
4. Scan through the doclists until there is a document
that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any
doclist, seek to the start of the doclist in the full barrel for
every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and
return the top k.
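
The steps above can be sketched in simplified form. This is not the authors' code: the doclists are plain Python lists, the short-to-full barrel fallback is collapsed into two passes, and the names `matching_docs` and `evaluate`, the toy lexicon, and the stand-in ranking signal are all assumptions.

```python
import heapq


def matching_docs(doclists):
    """Documents present in every doclist (each doclist is a list of docIDs)."""
    if not doclists:
        return []
    common = set(doclists[0])
    for doclist in doclists[1:]:
        common &= set(doclist)
    return sorted(common)


def evaluate(query, lexicon, short_barrels, full_barrels, rank, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    word_ids = [lexicon[word] for word in query.lower().split()]
    # Steps 3-4: find documents matching all terms in the short barrels.
    matches = matching_docs([short_barrels.get(w, []) for w in word_ids])
    # Steps 6-7, simplified: continue with the full barrels afterwards.
    for doc in matching_docs([full_barrels.get(w, []) for w in word_ids]):
        if doc not in matches:
            matches.append(doc)
    # Steps 5 and 8: rank every match and return the top k.
    return heapq.nlargest(k, matches, key=lambda doc: rank(doc, word_ids))


lexicon = {"web": 1, "search": 2}
short_barrels = {1: [3, 7], 2: [7]}              # e.g. title and anchor hits
full_barrels = {1: [2, 3, 7, 9], 2: [2, 7, 9]}   # all hits
signal = {2: 0.2, 7: 0.5, 9: 0.3}                # stand-in ranking values
top = evaluate("web search", lexicon, short_barrels, full_barrels,
               rank=lambda doc, words: signal[doc], k=2)
```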

5. Results and Performance
Storage Requirements

Google (1998) compressed its repository to about 53GB (24 million pages)

Additional storage used by Google for indexes, temporary storage, lexicon, etc. required about 55GB.

System Performance
Crawling & Indexing efficiently

Crawling took a week or more

To improve efficiency, the Indexer and Crawlers need to be capable of simultaneous execution

Sorting needs to be done in parallel on several machines

Search Performance

Disk I/O – Requires most time (seek time, transfer time, network delay, etc.)

Caching – Reduces the number of disk accesses

With caching, the time for a repeated query can improve from 2.13 sec. to 0.06 sec.

6. Conclusion
Designed to be a scalable search engine

Provide high quality search

Complete architecture for gathering web pages, indexing them, and performing search queries over them

High Quality Search

Heavy use of hypertextual info
Link structure
Link (anchor) text
Proximity and font info increase relevance
PageRank evaluates page quality
Link text helps return relevant results

Scalable Architecture
Efficient in both space and time
Bottlenecks in:
CPU
Memory capacity
Disk seeks
Disk throughput
Disk capacity
Network IO
Major data structures make efficient use of available storage space
Crawler, indexer, and sorter can build an index of 24 million pages in < 1 week
Expected to build an index of 100 million pages in < 1 month
