The Internet as a Single Database
Technologies Used & Lessons Learned
Houston Code Camp, August 2011
Shion Deysarkar
CEO, Datafiniti

What does that mean?
  • All web data in one, unified format
  • Places, people, news, URLs, products, etc.
  • Accessible as if you were querying a database

Why build such a thing?
  • Our users needed a better way of getting web data
  • Web crawling is kludgy and unintuitive
  • Developers deserve something better than current APIs
  • Because it would be awesome!

Not an easy task…
The Challenges

There’s a lot of data on the web
  • 100 million registered domains
  • Maybe only 100,000 have interesting stuff? (Which ones?)
  • Some sites have millions or billions of data points

It’s all structured differently!
  • Do we have to write web crawls for each website?
  • Writing 100,000 web crawlers seems... not fun

Data can conflict
  • How do we know which data is correct?

So let’s start at the beginning:
Data Collection

Building a scalable web crawler
  • Cloud or local data center? Neither.
  • Grid computing (think SETI@home)
  • 1000s of home PCs that exchange time & bandwidth for $
  • Crawl very fast for relatively little $

Building a scalable web crawler
Coding 1000s of extraction apps
  • Build a framework that handles all the kludgy work:
    • Link following & de-duplication
    • Result formatting & storage
    • Throttle rates & crawling behavior
    • Any other crawling activity not specific to a website’s structure
  • Abstract away everything but pattern matching and link generation
    • Load lightweight, website-specific apps into above framework

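The split between a generic framework and lightweight site-specific apps can be sketched roughly as follows. This is a minimal illustration, not Datafiniti's actual code: the `SiteApp` interface, the `ExampleShopApp` class, the `crawl` function, and the `shop.example` pages are all hypothetical names made up for the example, and throttling is left out to keep it short.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List, Protocol
from urllib.parse import urljoin


class SiteApp(Protocol):
    """Hypothetical interface for a lightweight, website-specific app."""

    start_urls: List[str]

    def extract(self, url: str, html: str) -> Iterable[Dict[str, str]]: ...

    def links(self, url: str, html: str) -> Iterable[str]: ...


class ExampleShopApp:
    """Only pattern matching and link generation for one made-up site."""

    start_urls = ["http://shop.example/page/1"]
    _item = re.compile(r'<li class="item">(?P<name>[^<]+)</li>')
    _next = re.compile(r'<a class="next" href="(?P<href>[^"]+)">')

    def extract(self, url, html):
        for match in self._item.finditer(html):
            yield {"source": url, "name": match.group("name").strip()}

    def links(self, url, html):
        for match in self._next.finditer(html):
            yield urljoin(url, match.group("href"))


def crawl(app: SiteApp, fetch: Callable[[str], str], max_pages: int = 100):
    """Generic part: queueing, link following, de-duplication, result storage."""
    seen = set(app.start_urls)
    queue = deque(app.start_urls)
    results = []
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages += 1
        results.extend(app.extract(url, html))
        for link in app.links(url, html):
            if link not in seen:  # de-duplicate before queueing
                seen.add(link)
                queue.append(link)
    return results


# Stand-in for real HTTP fetching: two pages that link to each other.
FAKE_SITE = {
    "http://shop.example/page/1":
        '<li class="item">Widget</li><a class="next" href="/page/2">',
    "http://shop.example/page/2":
        '<li class="item">Gadget</li><a class="next" href="/page/1">',
}

records = crawl(ExampleShopApp(), fetch=FAKE_SITE.__getitem__)
print(records)
```

Under this split, supporting a new website means writing another small class like `ExampleShopApp`; the `crawl` loop stays untouched.
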
Building a scalable web crawler
  • Current peak performance: 4.32 billion URLs per month
  • Deploying 20 new website crawls every month
  • Easy to scale crawling performance (just add grid nodes)
  • Easy to scale deployment (just add contractors)

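To put the peak figure in perspective, here is a quick back-of-the-envelope conversion to per-day, per-minute, and per-second rates. The 30-day month is an assumption; the slide does not say how the monthly number was derived.

```python
# Peak figure quoted on the slide.
urls_per_month = 4_320_000_000
# Assumption: a 30-day month of continuous crawling.
days_per_month = 30

urls_per_day = urls_per_month / days_per_month
urls_per_minute = urls_per_day / (24 * 60)
urls_per_second = urls_per_day / (24 * 60 * 60)

print(f"{urls_per_day:,.0f} URLs per day")
print(f"{urls_per_minute:,.0f} URLs per minute")
print(f"{urls_per_second:,.0f} URLs per second")
```

Under that assumption the figure works out to exactly 100,000 URLs per minute, or roughly 1,667 per second across the whole grid.
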
Now for step 2! (step 1 took us 3 years >_<)
Data Storage

Building a scalable data store
What we’re dealing with:
  • TBs (eventually PBs) of data
  • Billions of rows, Thousands of columns (maybe more)
  • Don’t want to deal with sharding
  • Don’t actually care about ACID
  • Do care about high-throughput and fault-tolerance

Building a scalable data store
NoSQL (Cassandra) >> MySQL (for us)
  • Can increase throughput and storage linearly by adding nodes
  • Virtually unlimited and variable # of columns
  • Much faster read/write
  • Some challenges
    • Doesn’t yet support all the select features you’re used to
    • Not a mature technology yet, expect frequent updates

Building a scalable data store
Choosing Cassandra over other NoSQL databases
  • More active community, seems to be gaining traction most quickly
  • Integrated with other relevant technologies
    • Solr for text search
    • Hadoop for batch-style processing
  • Impressive production-scale examples
    • Though it’s true it has had some high-profile cases of being scrapped
  • Backed by corporations (DataStax) and some really smart people

Building a unified database of everything
Normalizing separate data points that represent the same thing
  • Co-occurrence: most popular choice wins
  • Trusted sources: put more weight on sources that tend to be right

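The two conflict-resolution rules above can be sketched as a plain vote and a weighted vote. This is an illustration only: the source names, phone numbers, trust weights, and the default weight of 0.5 are all made up, and the slides do not say how trust is actually measured.

```python
from collections import Counter, defaultdict

# Hypothetical observations of one attribute (a phone number) for the same place.
observations = [
    ("site-a.example", "555-0100"),
    ("site-b.example", "555-0100"),
    ("site-c.example", "555-0199"),
    ("site-d.example", "555-0100"),
]

# Hypothetical trust weights: site-c tends to be right, the others less so.
trust = {
    "site-a.example": 0.2,
    "site-b.example": 0.2,
    "site-c.example": 0.9,
    "site-d.example": 0.2,
}


def by_co_occurrence(obs):
    """Most popular choice wins: every source counts as one vote."""
    counts = Counter(value for _, value in obs)
    return counts.most_common(1)[0][0]


def by_trusted_sources(obs, weights, default_weight=0.5):
    """Weighted vote: each source contributes its trust weight."""
    scores = defaultdict(float)
    for source, value in obs:
        scores[value] += weights.get(source, default_weight)
    return max(scores, key=scores.get)


print(by_co_occurrence(observations))
print(by_trusted_sources(observations, trust))
```

With this sample data the two rules disagree: three low-trust sources outvote the single trusted one under co-occurrence, while the trusted source wins the weighted vote.
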
Building a unified database of everything
  • Identifying interesting data on a random web page

Yay, step 3! (step 2 took us 3 months :D)
Data Retrieval


Building an easy way to get lots of data fast
Making the right choices for our API
  • Single channel for all data retrieval
    • RESTful API so anyone can develop with it
    • All external and internal functionality uses the same API (easier to manage)
  • As user-friendly and intuitive as possible
    • SQL-style querying on a NoSQL database
    • JSON default output, but will also support CSV and XML
    • SSL authentication with token
  • Briefly considered using a 3rd-party service like Mashery

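As a rough sketch of "JSON by default, CSV on request", the snippet below renders the same records in both formats. The `render` function and the sample records are hypothetical and say nothing about the real API's field names or endpoints; XML is left out for brevity. Because rows may have different columns, the CSV header is the union of all keys, with missing values left empty.

```python
import csv
import io
import json


def render(records, fmt="json"):
    """Serialize a list of dict records as JSON (default) or CSV."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        # Rows can have different columns, so use the union of all keys.
        columns = sorted({key for record in records for key in record})
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=columns, restval="", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


# Made-up records with a variable set of columns.
records = [
    {"name": "Cafe A", "city": "Houston"},
    {"name": "Cafe B", "phone": "555-0100"},
]

print(render(records))
print(render(records, "csv"))
```
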
Put it all together… (step 3 took 3 weeks!!!)
Sneak Peek

Launching Soon
  • Sign up for the beta at http://www.datafiniti.net
  • Follow us @Datafiniti