HBaseCon 2012: Building a Large Search
Platform on a Shoestring Budget

Jacques Nadeau, CTO
jacques@yapmap.com
@intjesus


May 22, 2012
Agenda
What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
What is YapMap?
• A visual search technology
• Focused on threaded
  conversations
• Built to provide better
  context and ranking
• Built on Hadoop & HBase for
  massive scale
• Two self-funded guys
• Motoyap.com is the largest
  implementation, at 650 million
  automotive docs (www.motoyap.com)
Why do this?
• Discussion forums and mailing lists are the primary home for many
  hobbies
• Threaded search sucks
   – No context in the middle of the conversation
How does it work?
  Post 1
  Post 2
    Post 3
      Post 4
  Post 5
    Post 6
A YapMap Search Result Page
Conceptual data model
  Post 1
  Post 2
    Post 3
      Post 4
  Post 5
    Post 6

 • The entire thread is a MainDocGroup
 • For long threads, a single group may have multiple MainDocs
 • Each individual post is a DetailDoc
 • Threads are broken up among many web pages and don’t necessarily
   arrive in order
 • Longer threads are broken up
    – For short threads, MainDocGroup == MainDoc
General architecture

   Targeted Crawlers → Processing Pipeline → Indexing Engine → Results
   Presentation

   Messaging/compute: RabbitMQ (crawling + processing), MapReduce (indexing)
   Storage: HBase (crawling + processing), Riak (presentation), MySQL (at
   either end)
   Shared infrastructure: HDFS/MapRfs, Zookeeper
We match the tool to the use case
                     MySQL                HBase                      Riak
Primary use          Business             Storage of crawl data;     Storage of components
                     management           processing pipeline        directly related to
                     information                                     presentation
Key features that    Transactions,        Consistency, redundancy,   Predictable low
drove selection      SQL, JPA             memory-to-persistence      latency, full uptime,
                                          ratio                      max one IOP per object
Average object size  Small                20k                        2k
Object count         <1 million           500 million                1 billion
System count         2                    10                         8
Memory footprint     <1gb                 120gb                      240gb
Dataset size         10mb                 10tb                       2tb

We also evaluated Voldemort and Cassandra.
Agenda
• What is YapMap?
Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
HBase client is a power user interface
 • HBase client interface is low-level
   – Similar to JDBC/SQL
 • Most people start by using
   Bytes.to(String|Short|Long)
   – Spaghetti data layer
 • New developers have to learn a bunch of new
   concepts
 • Mistakes are easy to make
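The "spaghetti data layer" that grows out of raw byte conversions can be sketched roughly as follows. This is a hypothetical illustration, not YapMap's code: java.nio.ByteBuffer stands in for HBase's Bytes utility, and the field names are invented. The problem it shows is real, though: every call site has to re-state the type and width of each cell.

```java
// Hypothetical sketch of the raw-byte access pattern the slide warns about.
// ByteBuffer stands in for HBase's Bytes utility; names are illustrative.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RawBytesExample {
    // Every call site repeats the type knowledge: this is the "spaghetti".
    public static byte[] encodeCheckpointTime(long millis) {
        return ByteBuffer.allocate(8).putLong(millis).array();
    }

    public static long decodeCheckpointTime(byte[] cell) {
        // Caller must simply *know* this cell holds a long, not a String.
        return ByteBuffer.wrap(cell).getLong();
    }

    public static byte[] encodeJobName(String name) {
        return name.getBytes(StandardCharsets.UTF_8);
    }
}
```

A typed DTO layer, as described on the next slide, centralizes this knowledge in one generated model class instead of scattering it across every reader and writer.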
We built a DTO layer to simplify dev
 • Data Transfer Objects (DTO) & data access layer provide single point
   for code changes and data migration
 • First-class row key objects
 • Centralized type serialization
     – Standard data types
     – Complex object serialization layer via protobuf
 • Provide optimistic locking
 • Enable asynchronous operation
 • Minimize mistakes:
     – QuerySet abstraction (columns & column families)
     – Field state management (not queried versus null)
 • Newer tools have arrived to ease this burden
     – Kundera and Gora
Examples from our DTO abstraction
Model definition:

  <table name="crawlJob" row-id-class="example.CrawlJobId" >
    <column-family name="main" compression="LZO" blockCacheEnabled="false" versions="1">
      <column name="firstCheckpoint" type="example.proto.JobProtos$CrawlCheckpoint" />
      <column name="firstCheckpointTime" type="Long" />
      <column name="entryCheckpointCount" type="Long" />
      ...

Generated model:

  public class CrawlJobModel extends SparseModel<CrawlJobId> {
    public CrawlJobId getId(){…}
    public boolean hasFirstCheckpoint(){…}
    public CrawlCheckpoint getFirstCheckpoint(){…}
    public void setFirstCrawlCheckpoint(CrawlCheckpoint checkpoint){…}
    …

HBase interface:

  public interface HBaseReadWriteService<T> {
    public void putUnsafe(T model);
    public void putVersioned(T model);
    public T get(RowId<T> rowId, QuerySet<T> querySet);
    public void increment(RowId<T> rowId, IncrementPair<T>... pairs);
    public StructuredScanner<T> scanByPrefix(byte[] bytePrefix, QuerySet<T> querySet);
    …
Example Primary Keys
UrlId: reverse domain : optional port : client protocol (e.g. user name
+ http) : path + query string

  org.apache.hbase:80:x:/book/architecture.html

MainDocId: GroupId (row) followed by a 2-byte bucket number (part).
The GroupId is composed of:
  – 4-byte source id
  – 1-byte identifier-type enum (int, long, sha2, or generic 32)
  – additional identifier (4, 8 or 32 bytes depending on type)
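Composing a MainDocId-style key can be sketched with plain ByteBuffer. The field order and widths follow the slide; the class name, method names, and the enum value are illustrative assumptions, not the actual YapMap implementation.

```java
// Hedged sketch of a MainDocId-style composite row key: 4-byte source id,
// 1-byte identifier-type enum, then the identifier itself (8 bytes for a
// long here). Names and the enum value are assumptions for illustration.
import java.nio.ByteBuffer;

public class MainDocIdKey {
    public static final byte TYPE_LONG = 1; // assumed enum value

    public static byte[] groupId(int sourceId, long id) {
        return ByteBuffer.allocate(4 + 1 + 8)
                .putInt(sourceId)   // 4-byte source id
                .put(TYPE_LONG)     // 1-byte identifier-type enum
                .putLong(id)        // 8-byte identifier (long variant)
                .array();
    }

    // The full MainDocId appends the 2-byte bucket number to the GroupId row.
    public static byte[] mainDocId(byte[] group, short bucket) {
        return ByteBuffer.allocate(group.length + 2)
                .put(group).putShort(bucket).array();
    }
}
```

Because the source id and type lead the key, all documents from one source with one identifier type sort together, which is what makes prefix scans over a GroupId practical.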
Agenda
• What is YapMap?
• Interfacing with Data
Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
Processing pipeline is built on HBase
 •    Multiple steps with checkpoints to manage failures
 •    Out of order input assumed
 •    Idempotent operations at each stage of process
 •    Utilize optimistic locking to do coordinated merges
 •    Use regular cleanup scans to pick up lost tasks
 •    Control batch size of messages to control throughput versus latency

   Crawlers –message→ Build Main Docs –message→ Merge + Split Main Doc
   Groups –message→ Pre-index Main Docs –batch→ Batch Indexing / RT Indexing

   Backing storage along the pipeline: Cache, DFS, t1:cf1, t2:cf1, t2:cf2
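The "optimistic locking to do coordinated merges" above can be sketched as a retry loop: re-read the row, merge, and write only if the row is unchanged. This is an illustrative stand-in, with an in-memory compare-and-set in place of HBase's checkAndPut and a string in place of the real merged document.

```java
// Sketch of an optimistic-locking merge, with AtomicReference CAS standing
// in for HBase's checkAndPut. The "document" is just a string here; the
// point is the retry-on-conflict loop that keeps each stage safe under
// concurrent, out-of-order input.
import java.util.concurrent.atomic.AtomicReference;

public class OptimisticMerge {
    public static final AtomicReference<String> row = new AtomicReference<>("");

    // Re-read, merge, compare-and-swap; retry if another writer won the race.
    public static String mergePost(String post) {
        while (true) {
            String current = row.get();
            String merged = current.isEmpty() ? post : current + "|" + post;
            if (row.compareAndSet(current, merged)) { // checkAndPut analogue
                return merged;
            }
            // lost the race: loop and merge against the newer version
        }
    }
}
```

Combined with idempotent stage logic, a lost race costs only a retry, never a corrupted merge, which is why the cleanup scans that pick up lost tasks are safe to re-run.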
Migrating from messaging to coprocessors

 • Big challenges
    – Mixing system code and application code
    – Memory impact: we have a GC stable state
 • Exploring HBASE-4047 to solve

   Crawlers –message→ Build Main Docs –CP→ Merge + Split Main Doc
   Groups –CP→ Pre-index Main Docs –batch→ Batch Indexing / RT Indexing

   Coprocessors (CP) on t1:cf1 and t2:cf1 trigger the middle hops in
   place of messages; backing storage is otherwise unchanged.
Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
Learn to leverage NoSQL strengths
• Original structure was similar to a traditional RDBMS:
   – static column names (0, 1, 2, …), each holding a fully realized MainDoc
   – one new DetailDoc could cause a cascading regeneration of all MainDocs
• New structure uses a cell for each DetailDoc, with column prefixing
  (metadata, detailid1, detailid2, …)
   – split metadata maps MainDoc > DetailDocId
   – HBase handles cascading changes
   – MainDoc realized on application read
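The cell-per-DetailDoc layout can be sketched with a sorted map standing in for an HBase row. The qualifier names ("d:" prefix, "metadata") and the string payloads are assumptions for illustration; the slide only specifies the shape.

```java
// Minimal sketch of the new schema, with a TreeMap in place of an HBase
// row. One prefixed cell per DetailDoc plus a metadata cell for splits;
// qualifier names and payload formats are illustrative assumptions.
import java.util.Map;
import java.util.TreeMap;

public class ThreadRow {
    private final Map<String, String> cells = new TreeMap<>();

    // Adding a post touches exactly one cell; no MainDoc regeneration.
    public void putDetail(String detailId, String body) {
        cells.put("d:" + detailId, body);      // column-prefixed detail cell
    }

    public void putSplits(String splitMetadata) {
        cells.put("metadata", splitMetadata);  // maps MainDoc > DetailDocId
    }

    // The MainDoc is realized at read time from the sorted detail cells.
    public String realizeMainDoc() {
        StringBuilder doc = new StringBuilder();
        cells.forEach((q, v) -> {
            if (q.startsWith("d:")) doc.append(v).append('\n');
        });
        return doc.toString();
    }
}
```

This is the trade the slide describes: writes become single-cell and cheap, and HBase's sorted columns do the assembly work that the old schema paid for with cascading rewrites.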
Schema migration steps
 1. Disable application writes on OldTable
 2. Extract OldSplits from OldTable
 3. Create NewTable with appropriate column families and
    properties
 4. Split NewTable based on OldSplits
 5. Run MapReduce job that converts old objects into new
    objects
    –   Use HTableInputFormat as input on OldTable
    –   Use HFileOutputFormat as output format pointing at NewTable
 6. Bulk load output into NewTable
 7. Redeploy application to read on NewTable
 8. Enable writes in application layer
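The conversion at the heart of step 5 can be sketched in pure Java. The real job reads OldTable via a table input format and writes HFiles for bulk load; here maps stand in for rows, and the "post1|post2" serialization of the old fully realized MainDoc is a made-up format for illustration.

```java
// Pure-Java sketch of the step-5 conversion: each old-style cell holding a
// fully realized MainDoc becomes one cell per DetailDoc in the new layout.
// Maps stand in for HBase rows; the "|"-separated payload is hypothetical.
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaConvert {
    public static Map<String, String> convert(Map<String, String> oldRow) {
        Map<String, String> newRow = new LinkedHashMap<>();
        int n = 0;
        for (String mainDoc : oldRow.values()) {
            // Explode the old fully realized MainDoc into detail cells.
            for (String post : mainDoc.split("\\|")) {
                newRow.put("d:" + (n++), post); // one cell per DetailDoc
            }
        }
        newRow.put("metadata", "details=" + n); // split metadata for reads
        return newRow;
    }
}
```

Because the map phase only transforms and never writes back to OldTable, the job is safe to re-run, and pre-splitting NewTable on OldSplits (step 4) keeps the bulk load balanced across regions.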
Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
Index Construction
• HBase Operations
Index Shards loosely based on HBase regions
 • Indexing is split between major indices (batch) and minor (real time)
 • Primary key order is the same as index order
 • Shards are based on snapshots of splits
 • IndexedTableSplit allows cross-region shard splits to be integrated
   at index load time

   Tokenized Main Docs: region R1 → Shard 1, R2 → Shard 2, R3 → Shard 3
Batch indices are memory based, stored on DFS

 • Total of all shards is about 1tb
    – With ECC memory under $7/gb and systems easily reaching 128-256gb
      each, this is no problem
 • Each shard ~5gb in size to improve parallelism on search
    – Variable depending on needs and use case
 • Each shard is composed of multiple map and reduce parts
   along with MapReduce statistics from HBase
    – Integration of components is done in memory
    – Partitioner utilizes observed term distributions
    – New MR committer: FileAndPutOutputCommitter
        • Allows low volume secondary outputs from Map phase to be used
          during reduce phase
Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
HBase Operations
HBase Operations
• Getting GC right – 6 months
   – Machines have 32gb, 12gb for HBase, more was a problem
• Pick the right region size: With HFile v2, just start bigger
• Be cautious about using multiple CFs
• Consider Asynchbase Client
   – Benoit did some nice work at StumbleUpon
   – Ultimately we just leveraged EJB3.1 @Async capabilities to make our HBase
     service async
• Upgrade: typically on the first or second point release
   – Testing/research cluster first
• Hardware: 8 core low power chips, low power ddr3, 6x WD
  Black 2TB drives per machine, Infiniband
• MapR’s M3 distribution of Hadoop
Questions
• Why not Lucene/Solr/ElasticSearch/etc?
    – Data locality between main and detail documents to do document-at-once scoring
    – Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
• Why not store indices directly in HBase?
    – Single cell storage would be the only way to do it efficiently
    – No such thing as a single cell no-read append (HBASE-5993)
    – No single cell partial read
• Why use Riak for presentation side?
    – Hadoop SPOF
    – Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node
      failure (HBASE-2357)
    – Riak has more predictable latency
• Why did you switch to MapR?
    – Index load performance was substantially faster
    – Less impact on HBase performance
    – Snapshots in trial copy were nice for those 30 days

More Related Content

PPTX
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
Cloudera, Inc.
 
PPTX
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
HBaseCon
 
PPTX
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
Cloudera, Inc.
 
PDF
HBaseCon 2015- HBase @ Flipboard
Matthew Blair
 
PDF
Large-scale Web Apps @ Pinterest
HBaseCon
 
PPTX
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
Cloudera, Inc.
 
PPTX
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.
 
PPTX
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
Michael Stack
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
Cloudera, Inc.
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
HBaseCon
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
Cloudera, Inc.
 
HBaseCon 2015- HBase @ Flipboard
Matthew Blair
 
Large-scale Web Apps @ Pinterest
HBaseCon
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
Cloudera, Inc.
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
Michael Stack
 

What's hot (20)

PPTX
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon
 
PPTX
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
Cloudera, Inc.
 
PPTX
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBaseCon
 
PPTX
Keynote: The Future of Apache HBase
HBaseCon
 
PPTX
A Survey of HBase Application Archetypes
HBaseCon
 
PPTX
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
PDF
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBaseCon
 
PPT
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.
 
PPTX
HBase in Practice
DataWorks Summit/Hadoop Summit
 
PPTX
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Cloudera, Inc.
 
PDF
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon
 
PDF
HBaseCon 2013: Apache HBase Operations at Pinterest
Cloudera, Inc.
 
PPTX
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
Cloudera, Inc.
 
PPTX
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
PPTX
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon
 
PDF
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
 
PPTX
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
Michael Stack
 
PPTX
HBaseCon 2013: Compaction Improvements in Apache HBase
Cloudera, Inc.
 
PDF
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
PPTX
HBaseCon 2015: HBase and Spark
HBaseCon
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
Cloudera, Inc.
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBaseCon
 
Keynote: The Future of Apache HBase
HBaseCon
 
A Survey of HBase Application Archetypes
HBaseCon
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBaseCon
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.
 
HBase in Practice
DataWorks Summit/Hadoop Summit
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Cloudera, Inc.
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon
 
HBaseCon 2013: Apache HBase Operations at Pinterest
Cloudera, Inc.
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
Cloudera, Inc.
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
Michael Stack
 
HBaseCon 2013: Compaction Improvements in Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
HBaseCon 2015: HBase and Spark
HBaseCon
 
Ad

Viewers also liked (20)

PDF
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
Cloudera, Inc.
 
PDF
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
Cloudera, Inc.
 
PPTX
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
Cloudera, Inc.
 
PDF
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
Cloudera, Inc.
 
PDF
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
Cloudera, Inc.
 
PDF
HBaseCon 2013: Scalable Network Designs for Apache HBase
Cloudera, Inc.
 
PPTX
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
Cloudera, Inc.
 
PPTX
HBaseCon 2013: Full-Text Indexing for Apache HBase
Cloudera, Inc.
 
PPTX
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
Cloudera, Inc.
 
PPTX
HBaseCon 2013: Near Real Time Indexing for eBay Search
Cloudera, Inc.
 
PDF
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
PDF
Docker Monitoring Webinar
Sematext Group, Inc.
 
PDF
Solr Anti Patterns
Sematext Group, Inc.
 
PDF
Tuning Solr for Logs
Sematext Group, Inc.
 
PDF
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
Sematext Group, Inc.
 
PDF
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Sematext Group, Inc.
 
PDF
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Sematext Group, Inc.
 
PPTX
Tuning Elasticsearch Indexing Pipeline for Logs
Sematext Group, Inc.
 
PDF
Side by Side with Elasticsearch & Solr, Part 2
Sematext Group, Inc.
 
PDF
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
gethue
 
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
Cloudera, Inc.
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
Cloudera, Inc.
 
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
Cloudera, Inc.
 
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
Cloudera, Inc.
 
HBaseCon 2013: Scalable Network Designs for Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
Cloudera, Inc.
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
Cloudera, Inc.
 
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
Cloudera, Inc.
 
HBaseCon 2013: Near Real Time Indexing for eBay Search
Cloudera, Inc.
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
Docker Monitoring Webinar
Sematext Group, Inc.
 
Solr Anti Patterns
Sematext Group, Inc.
 
Tuning Solr for Logs
Sematext Group, Inc.
 
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
Sematext Group, Inc.
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Sematext Group, Inc.
 
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Sematext Group, Inc.
 
Tuning Elasticsearch Indexing Pipeline for Logs
Sematext Group, Inc.
 
Side by Side with Elasticsearch & Solr, Part 2
Sematext Group, Inc.
 
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
gethue
 
Ad

Similar to HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget (20)

PDF
Searching conversations with hadoop
DataWorks Summit
 
PDF
Introduction to Hadoop
Ovidiu Dimulescu
 
KEY
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Bradford Stephens
 
PPTX
Big data ppt
Thirunavukkarasu Ps
 
PDF
Transition from relational to NoSQL Philly DAMA Day
Dipti Borkar
 
PDF
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012
Dipti Borkar
 
PPTX
Big data hadoop ecosystem and nosql
Khanderao Kand
 
PDF
Facebook keynote-nicolas-qcon
Yiwei Ma
 
PDF
支撑Facebook消息处理的h base存储系统
yongboy
 
PDF
Facebook Messages & HBase
强 王
 
PPTX
Real time hadoop + mapreduce intro
Geoff Hendrey
 
PPTX
Drill njhug -19 feb2013
MapR Technologies
 
PPTX
Understanding the Value and Architecture of Apache Drill
DataWorks Summit
 
PPTX
Hadoop Summit - Hausenblas 20 March
MapR Technologies
 
PPTX
Seattle Scalability Meetup - Ted Dunning - MapR
clive boulton
 
PDF
Architecting the Future of Big Data & Search - Eric Baldeschwieler
lucenerevolution
 
PPTX
Introduction to Apache Drill
Swiss Big Data User Group
 
PDF
Hadoop on Azure, Blue elephants
Ovidiu Dimulescu
 
KEY
TriHUG - Beyond Batch
boorad
 
PDF
Omaha Java Users Group - Introduction to HBase and Hadoop
Shawn Hermans
 
Searching conversations with hadoop
DataWorks Summit
 
Introduction to Hadoop
Ovidiu Dimulescu
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Bradford Stephens
 
Big data ppt
Thirunavukkarasu Ps
 
Transition from relational to NoSQL Philly DAMA Day
Dipti Borkar
 
Navigating the Transition from relational to NoSQL - CloudCon Expo 2012
Dipti Borkar
 
Big data hadoop ecosystem and nosql
Khanderao Kand
 
Facebook keynote-nicolas-qcon
Yiwei Ma
 
支撑Facebook消息处理的h base存储系统
yongboy
 
Facebook Messages & HBase
强 王
 
Real time hadoop + mapreduce intro
Geoff Hendrey
 
Drill njhug -19 feb2013
MapR Technologies
 
Understanding the Value and Architecture of Apache Drill
DataWorks Summit
 
Hadoop Summit - Hausenblas 20 March
MapR Technologies
 
Seattle Scalability Meetup - Ted Dunning - MapR
clive boulton
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
lucenerevolution
 
Introduction to Apache Drill
Swiss Big Data User Group
 
Hadoop on Azure, Blue elephants
Ovidiu Dimulescu
 
TriHUG - Beyond Batch
boorad
 
Omaha Java Users Group - Introduction to HBase and Hadoop
Shawn Hermans
 

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
PPTX
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
PPTX
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
PPTX
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
PPTX
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
PPTX
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
PPTX
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
PPTX
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
PPTX
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

Recently uploaded (20)

PDF
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
PDF
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
PDF
How-Cloud-Computing-Impacts-Businesses-in-2025-and-Beyond.pdf
Artjoker Software Development Company
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PDF
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
PDF
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
Brief History of Internet - Early Days of Internet
sutharharshit158
 
PPTX
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
PPTX
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
sujalchauhan1305
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
PDF
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
PDF
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
PPTX
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
PDF
Doc9.....................................
SofiaCollazos
 
PDF
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PDF
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
How-Cloud-Computing-Impacts-Businesses-in-2025-and-Beyond.pdf
Artjoker Software Development Company
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
Cloud-Migration-Best-Practices-A-Practical-Guide-to-AWS-Azure-and-Google-Clou...
Artjoker Software Development Company
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
Brief History of Internet - Early Days of Internet
sutharharshit158
 
Dev Dives: Automate, test, and deploy in one place—with Unified Developer Exp...
AndreeaTom
 
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
sujalchauhan1305
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
AI in Daily Life: How Artificial Intelligence Helps Us Every Day
vanshrpatil7
 
Doc9.....................................
SofiaCollazos
 
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 

HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget

  • 1. Building a Large Search Platform on a find the talk Shoestring Budget Jacques Nadeau, CTO [email protected] @intjesus May 22, 2012
  • 2. Agenda What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
  • 3. What is YapMap? • A visual search technology • Focused on threaded conversations • Built to provide better context and ranking • Built on Hadoop & HBase for massive scale • Two self-funded guys • Motoyap.com is largest implementation at 650mm www.motoyap.com automotive docs
  • 4. Why do this? • Discussion forums and mailings list primary home for many hobbies • Threaded search sucks – No context in the middle of the conversation
  • 5. How does it work? Post 1 Post 2 Post 3 Post 4 Post 5 Post 6
  • 6. A YapMap Search Result Page
  • 7. Conceptual data model Entire Thread is MainDocGroup Post 1 Post 2 Post 3 For long threads, a single Post 4 group may have multiple MainDocs Post 5 Post 6 Each individual post is a DetailDoc • Threads are broken up among many web pages and don’t necessarily arrive in order • Longer threads are broken up – For short threads, MainDocGroup == MainDoc
  • 8. General architecture RabbitMQ MapReduce Targeted Processing Indexing Results Crawlers Pipeline Engine Presentation HBase Riak HDFS/MapRfs Zookeeper MySQL MySQL
  • 9. We match the tool to the use case MySQL HBase Riak Primary Use Business Storage of crawl data, Storage of management processing pipeline components information directly related to presentation Key features that Transactions, SQL, Consistency, redundancy, Predictable low drove selection JPA memory to persitence latency, full ratio uptime, max one IOP per object Average Object Size Small 20k 2k Object Count <1 million 500 million 1 billion System Count 2 10 8 Memory Footprint <1gb 120gb 240gb Dataset Size 10mb 10tb 2tb We also evaluated Voldemort and Cassandra
  • 10. Agenda • What is YapMap? Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
  • 11. HBase client is a power user interface • HBase client interface is low-level – Similar to JDBC/SQL • Most people start by using Bytes.to(String|Short|Long) – Spaghetti data layer • New developers have to learn a bunch of new concepts • Mistakes are easy to make
  • 12. We built a DTO layer to simplify dev • Data Transfer Objects (DTO) & data access layer provide single point for code changes and data migration • First-class row key objects • Centralized type serialization – Standard data types – Complex object serialization layer via protobuf • Provide optimistic locking • Enable asynchronous operation • Minimize mistakes: – QuerySet abstraction (columns & column families) – Field state management (not queried versus null) • Newer tools have arrived to ease this burden – Kundera and Gora
  • 13. Examples from our DTO abstraction <table name="crawlJob" row-id-class=“example.CrawlJobId" > <column-family name="main" compression="LZO" blockCacheEnabled="false" versions="1"> Definition Model <column name="firstCheckpoint" type=“example.proto.JobProtos$CrawlCheckpoint" /> <column name="firstCheckpointTime" type="Long" /> <column name="entryCheckpointCount" type="Long" /> ... public class CrawlJobModel extends SparseModel<CrawlJobId>{ public CrawlJobId getId(){…} Generated Model public boolean hasFirstCheckpoint(){…} public CrawlCheckpoint getFirstCheckpoint(){…} public void setFirstCrawlCheckpoint(CrawlCheckpoint checkpoint){…} … public interface HBaseReadWriteService{ public void putUnsafe(T model); public void putVersioned(T model); Interface HBase public T get(RowId<T> rowId, QuerySet<T> querySet); public void increment(RowId<T> rowId, IncrementPair<T>... pairs); public SutructuredScanner<T> scanByPrefix(byte[] bytePrefix, QuerySet<T> querySet); ….
• 14. Example primary keys
• UrlId: reverse domain, optional port, client protocol (e.g. user name + http), then path + query string
  – Example: org.apache.hbase:80:x:/book/architecture.html
• MainDocId: GroupId (row) + 2-byte bucket number (part)
  – GroupId layout: 4-byte source id, 1-byte identifier-type enum (int, long, sha2, or generic 32), then an additional identifier (4, 8, or 32 bytes depending on type)
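The UrlId layout can be sketched as a small key-building helper. This is a minimal illustration only: the class and method names are hypothetical, and the real system uses first-class row key objects rather than plain string concatenation.

```java
// Hypothetical sketch of composing a UrlId-style row key: reversed domain,
// optional port, protocol marker, then path + query string. Delimiters and
// names are illustrative, not YapMap's actual encoding.
public class UrlRowKey {
    // Reverse the domain so keys for the same site sort together
    // (hbase.apache.org -> org.apache.hbase).
    static String reverseDomain(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            if (sb.length() > 0) sb.append('.');
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    static String rowKey(String host, int port, String protocolMarker, String pathAndQuery) {
        return reverseDomain(host) + ":" + port + ":" + protocolMarker + ":" + pathAndQuery;
    }
}
```

Reversing the domain is what makes prefix scans over a single site's pages cheap, since all of a site's URLs become one contiguous key range.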
  • 15. Agenda • What is YapMap? • Interfacing with Data Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
• 16. Processing pipeline is built on HBase
• Multiple steps with checkpoints to manage failures
• Out-of-order input assumed
• Idempotent operations at each stage of the process
• Utilize optimistic locking to do coordinated merges
• Use regular cleanup scans to pick up lost tasks
• Control batch size of messages to trade throughput against latency

[Pipeline diagram: Crawlers → Build Main Docs → Merge + Split Main Doc Groups → Pre-index Main Docs → Batch Indexing and RT Indexing, with messages between stages; stages persist to HBase column families (t1:cf1, t2:cf1, t2:cf2), a cache, and DFS]
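The optimistic-locking merge pattern above can be sketched in memory. A ConcurrentHashMap stands in for the HBase table (where checkAndPut plays this role in practice); all class and method names here are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for an optimistically locked merge: read the current
// value and its version, compute the merged value, and commit only if nothing
// changed underneath us; otherwise re-read and retry. Because the merge is
// idempotent, retries are safe.
public class OptimisticMerge {
    static class Versioned {
        final long version; final String value;
        Versioned(long v, String s) { version = v; value = s; }
    }

    final ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();

    // Returns true once the merge is committed; loops on concurrent update.
    boolean mergeDetail(String rowKey, String detail) {
        while (true) {
            Versioned current = store.get(rowKey);
            long expected = current == null ? 0 : current.version;
            String merged = current == null ? detail : current.value + "," + detail;
            Versioned next = new Versioned(expected + 1, merged);
            boolean committed = current == null
                ? store.putIfAbsent(rowKey, next) == null
                : store.replace(rowKey, current, next);
            if (committed) return true;
            // Another writer won the race; re-read and re-merge.
        }
    }
}
```

The same shape works against HBase by comparing a version cell instead of an object reference; losing writers simply retry, which is why each pipeline stage must be idempotent.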
• 17. Migrating from messaging to coprocessors
• Big challenges
  – Mixing system code and application code
  – Memory impact: we have a GC stable state
• Exploring HBASE-4047 to solve this

[Same pipeline diagram as before, with coprocessors (CP) replacing some of the messaging hops between stages]
  • 18. Agenda • What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
• 19. Learn to leverage NoSQL strengths
• Original structure was similar to a traditional RDBMS
  – Static column names
  – Fully realized MainDoc
  – One new DetailDoc could cause a cascading regeneration of all MainDocs
• New structure utilizes a cell for each DetailDoc
  – Split metadata maps MainDoc → DetailDocId
  – HBase handles cascading changes
  – MainDoc realized on app read
  – Use column prefixing

[Diagram contrasting the two schemas: old rows holding fully realized MainDocs versus new rows with a metadata cell plus one cell per detail (metadata, detailid1, detailid2, …)]
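The column-prefixing idea above can be sketched as a qualifier scheme: metadata and per-detail cells share one row but get distinct qualifier prefixes, so scans can cheaply pick out one kind or the other. The prefix characters and method names are illustrative, not the actual schema.

```java
// Sketch of a column-prefixing scheme: one row per MainDoc, a metadata cell
// plus one cell per DetailDoc, distinguished by a one-character prefix on the
// column qualifier. Prefixes here are hypothetical.
public class QualifierScheme {
    static final char META = 'm';
    static final char DETAIL = 'd';

    static String metadataQualifier() { return String.valueOf(META); }

    static String detailQualifier(String detailId) { return DETAIL + ":" + detailId; }

    // A ColumnPrefixFilter-style check: is this cell a detail cell?
    static boolean isDetail(String qualifier) { return qualifier.charAt(0) == DETAIL; }
}
```

With this layout, adding one DetailDoc is a single-cell put instead of a rewrite of the whole MainDoc, which is what eliminates the cascading regeneration.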
• 20. Schema migration steps
1. Disable application writes on OldTable
2. Extract OldSplits from OldTable
3. Create NewTable with appropriate column families and properties
4. Split NewTable based on OldSplits
5. Run a MapReduce job that converts old objects into new objects
  – Use HTableInputFormat as the input format on OldTable
  – Use HFileOutputFormat as the output format pointing at NewTable
6. Bulk load the output into NewTable
7. Redeploy the application to read from NewTable
8. Enable writes in the application layer
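The heart of step 5 is a pure row-conversion function, which is what makes the migration safely re-runnable. A minimal sketch, assuming the old schema used static column names like detailid1, detailid2 and the new schema uses prefixed qualifiers; all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative map-side conversion: an old row with static columns becomes a
// new row with prefixed qualifiers ("m:" for metadata, "d:" + id for details).
// In the real job this would run inside a mapper reading from OldTable and
// writing HFiles for NewTable.
public class RowConverter {
    static Map<String, String> convert(Map<String, String> oldRow) {
        Map<String, String> newRow = new LinkedHashMap<>();
        for (Map.Entry<String, String> cell : oldRow.entrySet()) {
            String qualifier = cell.getKey();
            if (qualifier.startsWith("detailid")) {
                // New schema: one detail cell per DetailDoc, keyed by "d:" + id.
                newRow.put("d:" + qualifier.substring("detailid".length()), cell.getValue());
            } else {
                // Everything else becomes a prefixed metadata cell.
                newRow.put("m:" + qualifier, cell.getValue());
            }
        }
        return newRow;
    }
}
```

Because the function has no side effects, a failed or repeated MapReduce attempt produces the same HFiles, which is what steps 5 and 6 rely on.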
  • 21. Agenda • What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating Index Construction • HBase Operations
• 22. Index shards loosely based on HBase regions
• Indexing is split between major (batch) and minor (real time) indices
• Primary key order is the same as index order
• Shards are based on snapshots of splits
• IndexedTableSplit allows cross-region shard splits to be integrated at index load time

[Diagram: tokenized MainDocs flowing from regions R1, R2, R3 into Shard 1, Shard 2, Shard 3]
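Because primary key order matches index order, locating the shard for a row key is just a binary search over a snapshot of split points, exactly as regions bound key ranges. A minimal sketch with illustrative split values:

```java
import java.util.Arrays;

// Map a row key to an index shard using a snapshot of sorted split points.
// Shard 0 covers keys below the first split; shard i covers keys from
// splitPoints[i-1] (inclusive) up to splitPoints[i] (exclusive).
public class ShardLocator {
    final String[] splitPoints;

    ShardLocator(String[] splits) {
        splitPoints = splits;
        Arrays.sort(splitPoints);
    }

    int shardFor(String rowKey) {
        int idx = Arrays.binarySearch(splitPoints, rowKey);
        // Exact match on a split point belongs to the shard starting there;
        // otherwise the insertion point counts the splits preceding the key.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }
}
```

A snapshot matters because regions keep splitting while the index is built; freezing the split list keeps every document's shard assignment stable for the life of that index generation.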
• 23. Batch indices are memory based, stored on DFS
• Total of all shards is about 1tb
  – With ECC memory under $7/gb, systems easily reach 128-256gb each, so this is not a problem
• Each shard is ~5gb in size to improve parallelism on search
  – Variable depending on needs and use case
• Each shard is composed of multiple map and reduce parts along with MapReduce statistics from HBase
  – Integration of components is done in memory
  – Partitioner utilizes observed term distributions
  – New MR committer: FileAndPutOutputCommitter
    • Allows low-volume secondary outputs from the map phase to be used during the reduce phase
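A partitioner driven by observed term distributions can be sketched as a greedy balancer: assign the heaviest terms first, each to the currently least-loaded partition, instead of trusting hash order to spread load. This is an illustration of the idea, not the actual partitioner; class names and counts are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Frequency-aware partitioner sketch: spread heavy terms greedily across
// partitions so reduce-side index shards stay roughly balanced.
public class TermPartitioner {
    final long[] load;                       // posting volume per partition
    final Map<String, Integer> assignment = new HashMap<>();

    TermPartitioner(Map<String, Long> termCounts, int partitions) {
        load = new long[partitions];
        // Process terms heaviest-first so big terms land on distinct partitions.
        PriorityQueue<Map.Entry<String, Long>> byCount = new PriorityQueue<>(
            (a, b) -> Long.compare(b.getValue(), a.getValue()));
        byCount.addAll(termCounts.entrySet());
        while (!byCount.isEmpty()) {
            Map.Entry<String, Long> term = byCount.poll();
            int target = 0;                  // pick the least-loaded partition
            for (int i = 1; i < load.length; i++)
                if (load[i] < load[target]) target = i;
            load[target] += term.getValue();
            assignment.put(term.getKey(), target);
        }
    }

    int partitionFor(String term) { return assignment.getOrDefault(term, 0); }
}
```

The observed counts would come from the HBase-sourced MapReduce statistics mentioned above; plain hashing works too, but skewed term frequencies then skew reduce times.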
  • 24. Agenda • What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction HBase Operations
• 25. HBase Operations
• Getting GC right took 6 months
  – Machines have 32gb; 12gb for HBase, more was a problem
• Pick the right region size: with HFile v2, just start bigger
• Be cautious about using multiple column families
• Consider the asynchbase client
  – Benoit did some nice work at SU
  – Ultimately we just leveraged EJB 3.1 @Async capabilities to make our HBase service async
• Upgrade: typically on the first or second point release
  – Testing/research cluster first
• Hardware: 8-core low-power chips, low-power DDR3, 6x WD Black 2TB drives per machine, Infiniband
• MapR's M3 distribution of Hadoop
• 26. Questions
• Why not Lucene/Solr/ElasticSearch/etc.?
  – Data locality between main and detail documents to do document-at-once scoring
  – Not built to work well with Hadoop and HBase (Blur.io is the first to tackle this head on)
• Why not store indices directly in HBase?
  – Single-cell storage would be the only way to do it efficiently
  – No such thing as a single-cell no-read append (HBASE-5993)
  – No single-cell partial read
• Why use Riak for the presentation side?
  – Hadoop SPOF
  – Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
  – Riak has more predictable latency
• Why did you switch to MapR?
  – Index load performance was substantially faster
  – Less impact on HBase performance
  – Snapshots in the trial copy were nice for those 30 days