CommonCrawl
Building an open Web-Scale crawl using Hadoop.
Ahad Rana
Architect / Engineer at CommonCrawl
ahad@commoncrawl.org
Who is CommonCrawl?
• A 501(c)3 non-profit “dedicated to building, maintaining and
making widely available a comprehensive crawl of the
Internet for the purpose of enabling a new wave of
innovation, education and research.”
• Funded through a grant by Gil Elbaz, former Googler and
founder of Applied Semantics, and current CEO of Factual Inc.
• Board members include Carl Malamud and Nova Spivack.
Motivations Behind CommonCrawl
• The Internet is a massively disruptive force.
• Exponential advances in computing capacity, storage and bandwidth are creating constant flux and disequilibrium in the IT domain.
• Cloud computing makes large-scale, on-demand computing affordable for even the smallest startup.
• Hadoop provides the technology stack that enables us to crunch
massive amounts of data.
• Having the ability to “Map-Reduce the Internet” opens up lots of new opportunities for disruptive innovation, and we would like to reduce the cost of doing so by at least an order of magnitude.
• The trend among webmasters of whitelisting only the major search engines puts the future of the Open Web at risk and stifles future search innovation and evolution.
Our Strategy
• Crawl broadly and frequently across all TLDs.
• Prioritize the crawl based on simplified criteria (rank and
freshness).
• Upload the crawl corpus to S3.
• Make our S3 bucket widely accessible to as many users as
possible.
• Build support libraries to facilitate access to the S3 data via Hadoop (sketch below).
• Focus on doing a few things really well.
• Listen to customers and open up more metadata and services
as needed.
• We are not a comprehensive crawl, and may never be.
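As a hedged illustration of the kind of access the support libraries aim to make easy, the sketch below lists objects in an S3-hosted crawl directory through Hadoop's FileSystem API. The bucket path is a placeholder (not the real layout), and AWS credentials for the s3n:// scheme are assumed to be configured separately.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListCrawlFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical path: consult the CommonCrawl documentation for the real bucket layout.
    Path crawlRoot = new Path("s3n://commoncrawl-example-bucket/crawl-segment-001/");
    FileSystem fs = crawlRoot.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(crawlRoot)) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
  }
}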
Some Numbers
• URLs in Crawl DB – 14 billion
• URLs with inverse link graph – 1.6 billion
• URLs with content in S3 – 2.5 billion
• Recently crawled documents – 500 million
• Uploaded documents after deduping – 300 million
• Newly discovered URLs – 1.9 billion
• # of vertices in Page Rank graph (recent calculation) – 3.5 billion
• # of edges in Page Rank graph (recent calculation) – 17 billion
Current System Design
• Batch-oriented crawl list generation.
• High-volume crawling via independent crawlers.
• Crawlers dump data into HDFS.
• Map-Reduce jobs parse and extract metadata from crawled documents in bulk, independently of the crawlers (sketch below).
• Periodically, we ‘checkpoint’ the crawl, which involves, among
other things:
– Post-processing of crawled documents (deduping, etc.)
– ARC file generation
– Link graph updates
– Crawl database updates
– Crawl list regeneration
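The bulk parse/extract step above maps naturally onto a map-only Hadoop job. Below is a minimal sketch, not CommonCrawl's code: it assumes crawled documents sit in SequenceFiles as <url, rawContent> Text pairs, and the "metadata extraction" is only a toy title/length grab.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical layout: crawled documents stored as <url, rawContent> Text pairs.
public class ExtractMetadataJob {

  public static class ExtractMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text url, Text rawContent, Context context)
        throws IOException, InterruptedException {
      // Toy "metadata extraction": record content length and a naive <title> grab.
      String html = rawContent.toString();
      int start = html.indexOf("<title>");
      int end = html.indexOf("</title>");
      String title = (start >= 0 && end > start) ? html.substring(start + 7, end) : "";
      context.write(url, new Text("length=" + html.length() + "\ttitle=" + title));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "extract-metadata");
    job.setJarByClass(ExtractMetadataJob.class);
    job.setMapperClass(ExtractMapper.class);
    job.setNumReduceTasks(0);                       // map-only: parse in bulk, no join needed here
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}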
Our Cluster Config
• Modest internal cluster consisting of 24 Hadoop nodes, 4 crawler nodes, and 2 NameNode / database servers.
• Each Hadoop node has 6 x 1.5 TB drives and dual quad-core Xeons with 24 or 32 GB of RAM.
• 9 map tasks per node, on average 4 reducers per node, and BLOCK compression using LZO.
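For orientation, the slot counts and LZO block compression above correspond roughly to MRv1-era Hadoop settings like the following. This is only a sketch: property keys differ across Hadoop versions, these values normally live in mapred-site.xml rather than code, and the LZO codec class assumes the hadoop-lzo libraries are installed.

import org.apache.hadoop.conf.Configuration;

// Sketch of the MRv1-era settings implied by the slide; keys vary by Hadoop version.
public class ClusterSettingsSketch {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 9);     // 9 map slots per node
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);  // ~4 reduce slots per node
    conf.setBoolean("mapred.output.compress", true);            // compress job output
    conf.set("mapred.output.compression.type", "BLOCK");        // BLOCK compression for SequenceFiles
    conf.set("mapred.output.compression.codec",
        "com.hadoop.compression.lzo.LzoCodec");                 // requires hadoop-lzo to be installed
    return conf;
  }
}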
Crawler Design Overview
Crawler Design Details
• Java codebase.
• Asynchronous IO model using a custom NIO-based HTTP stack (sketch below).
• Many worker threads that synchronize with the main thread via asynchronous message queues.
• Can sustain a crawl rate of ~250 URLs per second.
• Up to 500 active HTTP connections at any one time.
• Currently, no document parsing in the crawler process.
• We currently run 8 crawlers and, while crawling, fetch on average ~100 million URLs per day.
• During the post-processing phase, we process on average 800 million documents.
• After deduping, we package and upload approximately 500 million documents to S3 on average.
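A heavily simplified sketch of the kind of NIO event loop the bullets above describe, not the CommonCrawl crawler itself: HTTP parsing, connection limits, robots.txt handling, politeness, and error handling are all omitted, and the hand-off queue stands in for the asynchronous message queues mentioned above.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal single-threaded NIO event loop: worker threads enqueue hosts,
// the IO thread multiplexes all sockets.
public class TinyFetcher implements Runnable {
  private final Selector selector;
  private final ConcurrentLinkedQueue<String> hostQueue = new ConcurrentLinkedQueue<>();

  public TinyFetcher() throws IOException {
    this.selector = Selector.open();
  }

  // Called from worker threads: hand a new host to the IO thread.
  public void submit(String host) {
    hostQueue.add(host);
    selector.wakeup();
  }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        // Drain work handed over from worker threads (stand-in for the async message queues).
        for (String host; (host = hostQueue.poll()) != null; ) {
          SocketChannel ch = SocketChannel.open();
          ch.configureBlocking(false);
          ch.connect(new InetSocketAddress(host, 80));
          ch.register(selector, SelectionKey.OP_CONNECT, host);
        }
        selector.select(1000);
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
          SelectionKey key = it.next();
          it.remove();
          SocketChannel ch = (SocketChannel) key.channel();
          if (key.isConnectable() && ch.finishConnect()) {
            // Connected: send a minimal request (a real stack would handle partial writes).
            String req = "GET / HTTP/1.0\r\nHost: " + key.attachment() + "\r\n\r\n";
            ch.write(ByteBuffer.wrap(req.getBytes(StandardCharsets.US_ASCII)));
            key.interestOps(SelectionKey.OP_READ);
          } else if (key.isReadable()) {
            ByteBuffer buf = ByteBuffer.allocate(8192);
            if (ch.read(buf) < 0) {
              ch.close();   // EOF: response complete; a real crawler hands the bytes to a parser/writer
            }
          }
        }
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}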
Crawl Database
• Primary keys are 128-bit URL fingerprints, consisting of a 64-bit domain fingerprint and a 64-bit URL fingerprint (Rabin hash).
• Keys are distributed across shards via a modulo operation on the URL portion of the fingerprint only (sketch below).
• Currently, we run 4 reducers per node, and there is one node down, so we have 92 unique shards.
• Keys in each shard are sorted by domain FP, then URL FP.
• We like the 64-bit domain ID, since it is a generated key, but it is wasteful.
• We may move to a 32-bit root domain ID / 32-bit domain ID + 64-bit URL fingerprint key scheme in the future, and then sort by root domain, domain, and then URL FP within each shard.
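A minimal sketch of the key scheme and shard assignment described above. The deck specifies 64-bit Rabin hashes; the stand-in hash below is for illustration only and is not what CommonCrawl uses.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: 128-bit key = 64-bit domain fingerprint + 64-bit URL fingerprint.
public class URLFingerprint {
  public final long domainFP;   // 64-bit fingerprint of the domain
  public final long urlFP;      // 64-bit fingerprint of the full URL

  public URLFingerprint(String domain, String url) {
    this.domainFP = hash64(domain);
    this.urlFP = hash64(url);
  }

  // Shards are chosen from the URL portion of the fingerprint only,
  // so a single domain's URLs spread across all shards.
  public int shard(int numShards) {
    return (int) Math.floorMod(urlFP, (long) numShards);
  }

  private static long hash64(String s) {
    // Stand-in for a 64-bit Rabin hash.
    CRC32 crc = new CRC32();
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    crc.update(bytes);
    long hi = crc.getValue();
    crc.reset();
    crc.update(Long.toString(hi).getBytes(StandardCharsets.UTF_8));
    crc.update(bytes);
    return (hi << 32) ^ crc.getValue();
  }
}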
Crawl Database – Continued
• Values in the Crawl Database consist of extensible metadata structures.
• We currently use our own DDL and compiler for generating
structures (vs. using Thrift/ProtoBuffers/Avro).
• Avro / ProtoBufs were not available when we started, and we added lots of Hadoop-friendly features to our version: multipart [key] attributes yield auto-generated WritableComparable-derived classes with built-in RawComparator support (sketch below), etc.
• Our compiler also generates RPC stubs, with Google ProtoBuf style
message passing semantics (Message w/ optional Struct In, optional
Struct Out) instead of Thrift style semantics (Method with multiple
arguments and a return type).
• We prefer the former because it fits better with our asynchronous style of RPC programming.
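To illustrate what built-in RawComparator support buys: during Hadoop's sort/merge phase, keys can be compared directly on their serialized bytes, with no deserialization. The sketch below assumes keys serialized as a big-endian 64-bit domain fingerprint followed by a 64-bit URL fingerprint; it is illustrative, not CommonCrawl's generated code.

import org.apache.hadoop.io.WritableComparator;

// Compares two serialized keys laid out as [64-bit domain FP][64-bit URL FP],
// byte-wise, without deserializing them (the point of raw-comparator support).
public final class URLFPKeyCompare {
  public static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
    long domain1 = WritableComparator.readLong(b1, s1);
    long domain2 = WritableComparator.readLong(b2, s2);
    if (domain1 != domain2) return Long.compareUnsigned(domain1, domain2);
    long url1 = WritableComparator.readLong(b1, s1 + 8);
    long url2 = WritableComparator.readLong(b2, s2 + 8);
    return Long.compareUnsigned(url1, url2);
  }
}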
Map-Reduce Pipeline – Parse/Dedupe/Arc Generation
(Diagrams: Phase 1 and Phase 2 of the pipeline)
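The dedupe step appears in the deck only as a diagram. As a hedged sketch of one common approach, not necessarily CommonCrawl's: group documents by a content hash and keep a single representative per hash. It assumes an upstream mapper that emits <contentHash, url> pairs.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes an upstream mapper emitted <contentHash, url> for every crawled document.
public class DedupeReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text contentHash, Iterable<Text> urls, Context context)
      throws IOException, InterruptedException {
    Text canonical = null;
    long duplicates = 0;
    for (Text url : urls) {
      if (canonical == null) {
        canonical = new Text(url);           // keep the first URL as the canonical copy
      } else {
        duplicates++;                        // everything else with the same hash is a duplicate
      }
    }
    if (canonical != null) {
      context.write(canonical, new Text("dupes=" + duplicates));
    }
  }
}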
Map-Reduce Pipeline – Link Graph Construction
(Diagrams: Link Graph Construction; Inverse Link Graph Construction)
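Both graph-construction steps are shown as diagrams in the deck. As an illustrative sketch under an assumed data layout (not CommonCrawl's code), inverting a link graph in MapReduce amounts to flipping each <source, target> edge in the mapper and collecting in-links per target in the reducer.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes input records of the form <sourceUrl, targetUrl> extracted from parsed pages.
public class InverseLinkGraph {

  public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text source, Text target, Context context)
        throws IOException, InterruptedException {
      context.write(target, source);   // flip the edge: group by link target
    }
  }

  public static class CollectSourcesReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text target, Iterable<Text> sources, Context context)
        throws IOException, InterruptedException {
      StringBuilder inlinks = new StringBuilder();
      for (Text source : sources) {
        if (inlinks.length() > 0) inlinks.append('\t');
        inlinks.append(source.toString());
      }
      context.write(target, new Text(inlinks.toString()));   // target -> list of in-linking pages
    }
  }
}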
Map-Reduce Pipeline – PageRank Edge Graph Construction
Page Rank Process
(Diagrams: Distribution Phase; Calculation Phase; Generate Page Rank Values)
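The distribution and calculation phases are diagrams in the deck. The sketch below shows one PageRank iteration in that two-phase shape (rank fan-out in the mapper, damped summation in the reducer). The record layout and 0.85 damping factor are assumptions, and structure propagation between iterations and dangling-vertex handling are omitted.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes each input record is <vertexUrl, "rank \t out1 \t out2 ..."> from the edge graph.
public class PageRankIteration {
  private static final double DAMPING = 0.85;

  // Distribution phase: each vertex sends rank/outDegree to every outlink.
  public static class DistributeMapper extends Mapper<Text, Text, Text, DoubleWritable> {
    @Override
    protected void map(Text vertex, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      double rank = Double.parseDouble(parts[0]);
      int outDegree = parts.length - 1;
      for (int i = 1; i < parts.length; i++) {
        context.write(new Text(parts[i]), new DoubleWritable(rank / outDegree));
      }
      // A full implementation would also re-emit the outlink list so the graph
      // structure survives into the next iteration.
    }
  }

  // Calculation phase: sum contributions and apply the damping factor.
  public static class CalculateReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text vertex, Iterable<DoubleWritable> contributions, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable c : contributions) sum += c.get();
      context.write(vertex, new DoubleWritable((1.0 - DAMPING) + DAMPING * sum));
    }
  }
}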
The Need For a Smarter Merge
• The pipelined nature of HDFS writes means each reducer writes its output to local disk first, and then to (replication factor - 1) other nodes.
• If the intermediate record sets are already sorted, having to run an identity mapper / shuffle / merge-sort phase just to join two sorted record sets is very expensive.
Our Solution:
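The solution slide itself is a diagram. Purely as an illustration of the general idea (streaming a merge of shard files that are already sorted on the same key and partitioned identically, with no identity-map/shuffle/sort pass), here is a sketch using SequenceFile readers; the file layout and key/value types are placeholders, not CommonCrawl's implementation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Streams two shard files that are already sorted by the same key and partitioned
// identically, writing a merged shard without any shuffle.
public class SortedShardMerge {
  public static void merge(Configuration conf, Path a, Path b, Path out) throws IOException {
    try (SequenceFile.Reader ra = new SequenceFile.Reader(conf, SequenceFile.Reader.file(a));
         SequenceFile.Reader rb = new SequenceFile.Reader(conf, SequenceFile.Reader.file(b));
         SequenceFile.Writer w = SequenceFile.createWriter(conf,
             SequenceFile.Writer.file(out),
             SequenceFile.Writer.keyClass(LongWritable.class),
             SequenceFile.Writer.valueClass(Text.class))) {

      LongWritable ka = new LongWritable(), kb = new LongWritable();
      Text va = new Text(), vb = new Text();
      boolean hasA = ra.next(ka, va), hasB = rb.next(kb, vb);

      while (hasA || hasB) {
        if (!hasB || (hasA && ka.compareTo(kb) <= 0)) {
          w.append(ka, va);            // emit the smaller (or only remaining) key from A
          hasA = ra.next(ka, va);
        } else {
          w.append(kb, vb);            // otherwise take the record from B
          hasB = rb.next(kb, vb);
        }
      }
    }
  }
}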