The Internet as a Single Database

Technologies Used & Lessons Learned
Houston Code Camp, August 2011
Shion Deysarkar
CEO, Datafiniti

What does that mean?
- All web data in one, unified format
- Places, people, news, URLs, products, etc.
- Accessible as if you were querying a database

Why build such a thing?
- Our users needed a better way of getting web data
- Web crawling is kludgy and unintuitive
- Developers deserve something better than current APIs

Why build such a thing?
- Because it would be awesome!

Not an easy task…
The Challenges

The Challenges
- There’s a lot of data on the web
  - 100 million registered domains
  - Maybe only 100,000 have interesting stuff? (Which ones?)
  - Some sites have millions or billions of data points

The Challenges
- It’s all structured differently!
  - Do we have to write a web crawler for each website?
  - Writing 100,000 web crawlers seems... not fun

The Challenges
- Data can conflict
  - How do we know which data is correct?

So let’s start at the beginning:
Data Collection

Data Collection
Building a scalable web crawler
- Cloud or local data center? Neither.
- Grid computing (think SETI@home)
  - 1000s of home PCs that exchange time & bandwidth for $
  - Crawl very fast for relatively little $

Data Collection
Building a scalable web crawler
- Coding 1000s of extraction apps
  - Build a framework that handles all the kludgy work:
    - Link following & de-duplication
    - Throttle rates & crawling behavior
    - Any other crawling activity not specific to a website’s structure
  - Abstract away everything but pattern matching and link generation
  - Load lightweight, website-specific apps into the above framework
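
The split between a generic framework and lightweight site-specific apps might look roughly like the sketch below. It is an illustration only, not the actual Datafiniti code: the names `SiteApp`, `crawl`, and `fake_fetch`, the regular expressions, and the in-memory `PAGES` dictionary are all hypothetical. The framework owns link following, de-duplication, and throttling; the app supplies only pattern matching and link generation.

```python
import re
import time
from collections import deque
from urllib.parse import urljoin


class SiteApp:
    """Hypothetical website-specific app: only patterns and link generation."""

    record_pattern = re.compile(r"<h1>(?P<name>[^<]+)</h1>")
    link_pattern = re.compile(r'href="([^"]+)"')

    def extract(self, html):
        # Pattern matching: turn a page into a list of records.
        return [m.groupdict() for m in self.record_pattern.finditer(html)]

    def links(self, base_url, html):
        # Link generation: decide which URLs are worth following.
        return [urljoin(base_url, href) for href in self.link_pattern.findall(html)]


def crawl(app, seeds, fetch, delay_seconds=0.0, max_pages=100):
    """Generic framework: link following, de-duplication, throttling."""
    queue = deque(seeds)
    seen = set(seeds)          # de-duplication of URLs
    records = []
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages += 1
        records.extend(app.extract(html))
        for link in app.links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)  # crude throttle between requests
    return records


# A stand-in for the network so the sketch runs offline (hypothetical pages).
PAGES = {
    "http://example.test/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<h1>Page A</h1><a href="/">home</a><a href="/b">B</a>',
    "http://example.test/b": '<h1>Page B</h1><a href="/a">A</a>',
}


def fake_fetch(url):
    return PAGES.get(url, "")


records = crawl(SiteApp(), ["http://example.test/"], fake_fetch)
print(records)
```

Under this arrangement, supporting a new website means writing another small class like `SiteApp`; the `crawl` loop stays untouched.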

Data Collection
Building a scalable web crawler
- Current peak performance: 4.32 billion URLs per month
- Deploying 20 new website crawls every month
- Easy to scale crawling performance (just add grid nodes)
- Easy to scale deployment (just add contractors)
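
To put the peak figure in per-second terms, here is a small back-of-the-envelope calculation. It assumes a 30-day month, and the node count used for the per-node rate is purely hypothetical (the slides only say "1000s of home PCs").

```python
URLS_PER_MONTH = 4_320_000_000          # peak figure quoted on the slide
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month


def urls_per_second(urls_per_month, seconds=SECONDS_PER_MONTH):
    """Average sustained crawl rate implied by a monthly total."""
    return urls_per_month / seconds


def per_node_rate(total_rate, node_count):
    """Average rate each grid node must sustain, if load is spread evenly."""
    return total_rate / node_count


total = urls_per_second(URLS_PER_MONTH)
print(f"{total:.0f} URLs/second across the grid")

HYPOTHETICAL_NODES = 1000  # illustrative only, not a figure from the talk
print(f"{per_node_rate(total, HYPOTHETICAL_NODES):.2f} URLs/second per node")
```

The per-node number is small, which fits the slide's point that throughput scales by adding grid nodes rather than by making any single machine faster.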

Now for step 2! (step 1 took us 3 years >_<)
Data Storage

Data Storage
Building a scalable data store
- What we’re dealing with:
  - TBs (eventually PBs) of data
  - Billions of rows, thousands of columns (maybe more)
  - Don’t want to deal with sharding
  - Don’t actually care about ACID
  - Do care about high throughput and fault tolerance

Data Storage
Building a scalable data store
- NoSQL (Cassandra) >> MySQL (for us)
  - Can increase throughput and storage linearly by adding nodes
  - Virtually unlimited and variable # of columns
  - Much faster read/write
- Some challenges
  - Doesn’t yet support all the select features you’re used to
  - Not a mature technology yet, expect frequent updates

Data Storage
Building a scalable data store
- Choosing Cassandra over other NoSQL databases
  - More active community, seems to be gaining traction most quickly
  - Integrated with other relevant technologies
    - Solr for text search
    - Hadoop for batch-style processing
  - Impressive production-scale examples
    - Though it’s true it has some high-profile scrappings
  - Backed by corporations (DataStax) and some really smart people

Data Storage
Building a unified database of everything
- Normalizing separate data points that represent the same thing
  - Co-occurrence: most popular choice wins
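
"Most popular choice wins" can be read as a simple majority vote over the values that different sources report for the same field. The sketch below is one possible interpretation, not the method used in production; the function name `most_common_value` and the sample phone numbers are made up for illustration.

```python
from collections import Counter


def most_common_value(observations):
    """Return the value reported most often for one field of one entity.

    Ties are broken in favor of the value seen first, which is how
    Counter.most_common orders equal counts.
    """
    if not observations:
        raise ValueError("need at least one observation")
    value, _count = Counter(observations).most_common(1)[0]
    return value


# Three crawled pages disagree about one business's phone number.
phones = ["713-555-0100", "713-555-0100", "713-555-0199"]
winner = most_common_value(phones)
print(winner)
```

In practice the values would likely need cleaning (whitespace, punctuation, casing) before counting, so that trivially different spellings of the same value are tallied together.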

Data Storage
Building a unified database of everything
- Normalizing separate data points that represent the same thing
  - Trusted sources: put more weight on sources that tend to be right
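
Weighting by trusted source can be sketched as a vote in which each source's ballot counts in proportion to a trust score. This is an illustrative assumption about how such weighting might work: the function `weighted_choice`, the source names, the trust scores, and the default weight are all hypothetical, and the slides do not say how trust is actually measured.

```python
from collections import defaultdict


def weighted_choice(observations, trust, default_trust=0.1):
    """Pick the value with the highest total trust.

    observations: list of (source, value) pairs for one field.
    trust: mapping of source name to a weight; unknown sources get
           default_trust (an assumed fallback).
    """
    if not observations:
        raise ValueError("need at least one observation")
    scores = defaultdict(float)
    for source, value in observations:
        scores[value] += trust.get(source, default_trust)
    return max(scores, key=scores.get)


# Hypothetical trust scores: site_a has tended to be right in the past.
trust = {"site_a": 0.9, "site_b": 0.3, "site_c": 0.3}
observations = [
    ("site_a", "123 Main St"),
    ("site_b", "123 Main Street, Suite 4"),
    ("site_c", "123 Main Street, Suite 4"),
]
best = weighted_choice(observations, trust)
print(best)
```

Here the single trusted source (0.9) outweighs two less reliable ones (0.3 + 0.3), so the result differs from what a plain majority vote would return.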

Data Storage
Building a unified database of everything
- Identifying interesting data on a random web page

Yay, step 3! (step 2 took us 3 months :D)
Data Retrieval
