Hadoop and Vertica
The Data Analytics Platform at Twitter
               Bill Graham - @billgraham
     Data Systems Engineer, Analytics Infrastructure
              Hadoop Summit, June 2012
About that pony giveaway...




Outline
  • Architecture
  • Data flow
  • Job coordination
  • Resource management
  • Vertica integration
  • Gotchas
  • Future work




We count things

  • 140 characters
  • 140M active users
  • 400M tweets per day
  • 80-100 TB ingested daily (uncompressed)
  • Tens of thousands of Hadoop jobs daily




Heterogeneous stack
  • Many job execution applications
    • Crane - Java ETL
    • Oink - Pig scheduler
    • Rasvelg - SQL aggregations
    • Scalding - Cascading via Scala
    • PyCascading - Cascading via Python
    • Indexing jobs
  • Our users
    • Analytics, Revenue, Growth, Search, Recommendations, etc.
    • PMs, Sales!


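Data flow: Analytics

  [Diagram, reduced to its labels]
  • Sources
    • Production Hosts: log events, sent through Scribe Aggregators to HDFS on the Staging Hadoop Cluster
    • Application data: MySQL/Gizzard (social graph, Tweets, user profiles)
    • Third Party Imports
    • Distributed Crawler
  • Stores
    • Main Hadoop DW (fed from the staging cluster by Log Mover)
    • HCatalog (shown beside Main Hadoop DW)
    • HBase
    • Vertica
    • MySQL
  • Movers and schedulers labeled on the arrows: Crane, Oink, Rasvelg
  • Consumers
    • Analytics Web Tools
    • Analysts, Engineers, PMs, Sales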
Chaotic? Actually, no.




System concepts


  • Loose coupling
  • Job coordination as a service
  • Resource management as a service
  • Idempotence




Loose coupling


  • Multiple job frameworks
  • Right tool for the job
  • Common dependency management




Job coordination

  • Shared batch table for job state
    • batch table: (id, description, state, start_time, end_time,
      job_start_time, job_end_time)
  • Access via client libraries
  • Jobs & data are time-based
  • 3 types of preconditions
    1. other job success (e.g., predecessor job complete)
    2. existence of data (e.g., HDFS input exists)
    3. user-defined (e.g., MySQL slave lag)
  • Failed jobs get retried (usually)
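The three precondition types could be checked along these lines. This is a minimal sketch, not the client library from the talk: the `Job` class, `ready_to_run`, and the example path are hypothetical, and plain Python sets stand in for the batch table and for HDFS.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, List


@dataclass
class Job:
    """Hypothetical job description; not the real client library."""
    name: str
    depends_on: List[str] = field(default_factory=list)    # 1. other job success
    input_paths: List[str] = field(default_factory=list)   # 2. existence of data
    custom_checks: List[Callable[[], bool]] = field(default_factory=list)  # 3. user-defined


def ready_to_run(job, period_start, succeeded, existing_paths):
    """Return True only when all three kinds of preconditions hold.

    succeeded:      set of (job name, period start) pairs, standing in for
                    successful rows in the batch table.
    existing_paths: set of paths, standing in for an HDFS existence check.
    """
    jobs_ok = all((dep, period_start) in succeeded for dep in job.depends_on)
    data_ok = all(path in existing_paths for path in job.input_paths)
    custom_ok = all(check() for check in job.custom_checks)
    return jobs_ok and data_ok and custom_ok


period = datetime(2012, 6, 14)
job = Job(
    name="user_sessions",
    depends_on=["log_mover"],
    input_paths=["/logs/web_events/2012/06/14"],  # hypothetical path
    custom_checks=[lambda: True],  # e.g. "MySQL slave lag is acceptable"
)
print(ready_to_run(job, period, set(), set()))
```

Because jobs and data are time-based, the sketch keys predecessor success on the pair (job name, period start) rather than on the job name alone.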
Resource management

  • Analytics Resource Manager - ARM!
  • Library on top of ZooKeeper
  • Throttles jobs and workers
    • Only 1 job of this name may run at once
    • Only N jobs may be run by this app at once
    • Only M mappers may write to Vertica at once
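
The throttling rules above amount to named counters with an upper bound. The sketch below illustrates that idea only: `ResourceLimiter` and the resource names are hypothetical, and an in-process lock stands in for the ZooKeeper coordination that ARM actually relies on, so it would not work across machines as written.

```python
import threading


class ResourceLimiter:
    """In-memory stand-in for a ZooKeeper-backed limiter (hypothetical API)."""

    def __init__(self, limits):
        # limits maps a resource name to its maximum number of concurrent holders.
        self._limits = dict(limits)
        self._in_use = {name: 0 for name in self._limits}
        self._lock = threading.Lock()

    def try_acquire(self, name):
        """Grant the resource if a slot is free; otherwise deny without blocking."""
        with self._lock:
            if self._in_use[name] >= self._limits[name]:
                return False
            self._in_use[name] += 1
            return True

    def release(self, name):
        """Give a slot back; releasing an unheld resource is a no-op."""
        with self._lock:
            if self._in_use[name] > 0:
                self._in_use[name] -= 1


arm = ResourceLimiter({
    "job:active_users": 1,   # only 1 job of this name may run at once
    "app:oink": 3,           # only N jobs may be run by this app at once
    "vertica_writers": 2,    # only M mappers may write to Vertica at once
})
if arm.try_acquire("vertica_writers"):
    try:
        pass  # write to Vertica here
    finally:
        arm.release("vertica_writers")
```

A denied request returns immediately rather than blocking, which leaves the caller free to stay idle and ask again later.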




Job DAG & state transition

  “Local View”
  • Is it time for me to run yet?
  • Are my dependencies satisfied?
  • Any resource constraints?

  State transitions
  • Idle -> Execution when the request is granted; a denied request stays Idle
  • Execution -> Execution Complete? no: keep executing; yes: Completion
  • Completion: insert entry into batch table

  batch table:
  (id, description, state,
   start_time, end_time,
   job_start_time, job_end_time)
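
One way to read the state diagram is as a three-state machine driven by the "local view" questions. The sketch below assumes that reading; `State`, `can_start`, `next_state`, and `record_completion` are hypothetical names, and a list of dicts stands in for the batch table.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


def can_start(time_to_run, deps_satisfied, resource_granted):
    """The job's local view: all three questions must be answered yes."""
    return time_to_run and deps_satisfied and resource_granted


def next_state(state, start_granted=False, execution_complete=False):
    """Advance one step; a denied start keeps the job idle."""
    if state is State.IDLE:
        return State.EXECUTION if start_granted else State.IDLE
    if state is State.EXECUTION:
        return State.COMPLETION if execution_complete else State.EXECUTION
    return state  # COMPLETION is terminal in this sketch


def record_completion(batch_table, entry):
    """On completion, insert an entry into the (stand-in) batch table."""
    batch_table.append(entry)


batch_table = []
state = State.IDLE
state = next_state(state, start_granted=can_start(True, True, False))  # denied
state = next_state(state, start_granted=can_start(True, True, True))   # granted
state = next_state(state, execution_complete=True)
if state is State.COMPLETION:
    record_completion(batch_table, {"description": "active_users", "state": "complete"})
```

Each job only evaluates its own conditions, so no central scheduler needs to hold the whole DAG; the DAG emerges from the dependencies recorded in the batch table.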
Example: active users

  [Diagram, reduced to its labels]
  • Data flow
    • Production Hosts -> Scribe -> web_events, sms_events
    • Log mover (via staging cluster) -> Main Hadoop DW
    • Oink/Pig on Main Hadoop DW: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
    • Oink: user_sessions -> Vertica
    • Crane: user_profiles from MySQL/Gizzard -> Vertica
    • Rasvelg on Vertica: Join, Group, Count
      • Aggregations: active_by_geo, active_by_device, active_by_client, ...
    • Crane: active_by_* from Vertica -> MySQL -> Analytics Dashboards
  • Job DAG nodes: Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...
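
The union, distinct, join, group, and count steps named in this example can be illustrated on in-memory data. This is only a sketch of the logic: in the talk the first half runs as Pig on Hadoop and the second half as SQL in Vertica, and the function names and row fields (`user_id`, `geo`) are assumptions.

```python
from collections import defaultdict


def build_user_sessions(web_events, sms_events):
    """Union both event sources and keep distinct (user_id, geo) pairs."""
    distinct = {(e["user_id"], e["geo"]) for e in list(web_events) + list(sms_events)}
    return [{"user_id": user_id, "geo": geo} for user_id, geo in sorted(distinct)]


def active_by_geo(user_sessions, user_profiles):
    """Join sessions to profiles, group by geo, and count distinct users."""
    users_per_geo = defaultdict(set)
    for session in user_sessions:
        if session["user_id"] in user_profiles:  # inner join on user_id
            users_per_geo[session["geo"]].add(session["user_id"])
    return {geo: len(users) for geo, users in users_per_geo.items()}


web_events = [
    {"user_id": 1, "geo": "US"},
    {"user_id": 1, "geo": "US"},  # duplicate, removed by the distinct step
    {"user_id": 2, "geo": "JP"},
]
sms_events = [
    {"user_id": 3, "geo": "US"},
    {"user_id": 4, "geo": "JP"},  # no profile row, dropped by the join
]
user_profiles = {1: {}, 2: {}, 3: {}}

sessions = build_user_sessions(web_events, sms_events)
print(active_by_geo(sessions, user_profiles))
```

The other aggregations listed (by device, by client) would follow the same shape with a different grouping field.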
Vertica or Hadoop?
  • Vertica
    • Loads hundreds of thousands of rows per second
    • Aggregates hundreds of millions of rows in seconds
    • Used for low latency queries and aggregations
    • Keep a sliding window of data
  • Hadoop
    • Excels when data size is massive
    • Flexible and powerful
    • Great with nested data structures and unstructured data
    • Used for complex functions and ML



Vertica import options
  • Direct import via Crane
    • Load into dest table, single thread
  • Atomic import via Crane/Rasvelg
    • Crane loads to temp table, single thread
    • Rasvelg moves to dest table
  • Parallel import via Oink/Pig
    • Pig job via VerticaStorer
    • ARM throttles active DB connections
  [Diagram labels: MySQL/Gizzard and Main Hadoop DW load into Vertica via Crane and Oink; Rasvelg runs against Vertica]
Vertica imports - pros/cons
  • Crane & Rasvelg
    • Good for smaller datasets, DB to DB transfers
    • Single threaded
    • Easy on Vertica
    • Hadoop not required
  • Pig
    • Great for larger datasets
    • More complex, not atomic
    • DDoS potential
VerticaStorer
  • PigStorage implementation
  • From Vertica’s Hadoop connector suite
  • Out of the box
    • Easy to get Hello World working
    • Well documented
    • Pig/Vertica data bindings work well
    • Fast!
    • Transaction-aware tasks
    • No bugs found
    • Open source?



                                            17
Pig VerticaStorage
  • Our enhancements
    • Connection credential management
    • Truncate before load option
    • Throttle concurrent writers via ZK
  • Future features
    • Counters for rows inserted/rejected
    • Name-based tuple-column bindings
    • Atomic load via temp table




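         -- Disable speculative map tasks so each partition is written once (see Gotcha #2)
         SET mapred.map.tasks.speculative.execution false

         user_sessions = LOAD '/processed/user_sessions/2012/06/14';

         -- Arguments, as their names suggest: DB config file, database name, ARM resource name
         STORE user_sessions INTO '{db_schema.user_sessions}' USING
               com.twitter.twadoop.pig.store.VerticaStorage(
               'config/db.yml', 'db_name', 'arm_resource_name');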
                                                                       18
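The "throttle concurrent writers" idea can be illustrated in a single process. The sketch below is a stand-in under stated assumptions: a local threading.BoundedSemaphore plays the role that the slides assign to ARM on top of ZooKeeper, and MAX_WRITERS and write_partition are hypothetical names, not part of VerticaStorage.

```python
import threading
import time

MAX_WRITERS = 2  # hypothetical limit: the "M" in "only M mappers may write"

writer_slots = threading.BoundedSemaphore(MAX_WRITERS)
lock = threading.Lock()
active = 0  # writers currently holding a slot
peak = 0    # highest number of simultaneous writers observed


def write_partition(task_id):
    """Simulated mapper: take a writer slot before touching the database."""
    global active, peak
    with writer_slots:        # blocks while MAX_WRITERS tasks are writing
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)      # stand-in for the actual bulk load
        with lock:
            active -= 1


threads = [threading.Thread(target=write_partition, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight tasks start, but no more than MAX_WRITERS hold a slot at any moment. A distributed version would need the slot count to live in a shared coordinator rather than in process memory.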
Gotcha #1


  • MR data load is not atomic
    • Avoid partial reads
    • Option 1: load to temp table, then insert direct
    • Option 2: add job dependency concept




                                                         19
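Option 2 might look like the following sketch, in which a reader job checks the loader's state before querying. The dictionary stands in for the shared batch table. The key layout, the state values "COMPLETE" and "RUNNING", and the job name load_user_sessions are all assumptions made for illustration.

```python
def can_read(batch_table, loader_job, batch_time):
    """Return True only if the loader job for this time slice has finished."""
    entry = batch_table.get((loader_job, batch_time))
    return entry is not None and entry["state"] == "COMPLETE"


# Hypothetical batch-table rows keyed by (job name, time slice).
batch_table = {
    ("load_user_sessions", "2012-06-13"): {"state": "COMPLETE"},
    ("load_user_sessions", "2012-06-14"): {"state": "RUNNING"},
}

ready_13 = can_read(batch_table, "load_user_sessions", "2012-06-13")  # loader finished
ready_14 = can_read(batch_table, "load_user_sessions", "2012-06-14")  # still loading
ready_15 = can_read(batch_table, "load_user_sessions", "2012-06-15")  # never started
```

A reader that waits for this precondition never sees a half-loaded table, at the cost of making every consumer aware of the loader job.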
Gotcha #2



  • Speculative execution is not always your friend
    • Launch more tasks than needed, just in case
    • For non-idempotent jobs, extra tasks == BAD




                                                      20
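A toy simulation shows why extra task attempts hurt a non-idempotent load. It assumes the worst case, in which both the original and the speculative attempt reach the database and commit. The list stands in for a destination table, and run_task is a hypothetical helper.

```python
def run_task(table, task_rows, attempts):
    """Simulate `attempts` attempts of one task that all reach the database."""
    for _ in range(attempts):
        table.extend(task_rows)  # plain INSERT semantics: not idempotent


rows = [("alice", 3), ("bob", 5)]

without_speculation = []
run_task(without_speculation, rows, attempts=1)

with_speculation = []
run_task(with_speculation, rows, attempts=2)  # original + speculative attempt
```

With one attempt the table holds two rows. With a speculative duplicate it holds four, and nothing in the data marks the extra rows as duplicates.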
Gotcha #3


  • isIdempotent() must be a first-class concept
    • Loader jobs will fail
    • Failure after first task success == not good
    • Can’t automate retry without cleanup




                                                    21
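One way to make idempotence a first-class property is to have the retry logic consult it. The sketch below is an assumption about how that could look, not the coordinator's actual code. Job, run_with_retry, and the cleanup hook are hypothetical names.

```python
class Job:
    def __init__(self, name, action, is_idempotent, cleanup=None):
        self.name = name
        self.action = action
        self.is_idempotent = is_idempotent
        self.cleanup = cleanup  # undoes partial writes, if provided


def run_with_retry(job, max_attempts=3):
    """Retry a failed job, but only when a retry is known to be safe."""
    for attempt in range(1, max_attempts + 1):
        try:
            job.action()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            if not job.is_idempotent:
                if job.cleanup is None:
                    raise      # unsafe to retry automatically
                job.cleanup()  # remove partial writes before retrying


table = []
calls = {"n": 0}


def flaky_load():
    """Writes one row, fails on the first call, succeeds afterwards."""
    calls["n"] += 1
    table.append("row-1")
    if calls["n"] == 1:
        raise RuntimeError("task failed after a partial write")
    table.append("row-2")


loader = Job("load_user_sessions", flaky_load, is_idempotent=False, cleanup=table.clear)
attempts_used = run_with_retry(loader)
```

Without the cleanup hook, the same non-idempotent job would surface its first failure instead of retrying, which matches the slide's point that retry cannot be automated without cleanup.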
Gotcha #4

  • Vendor code only gets you so far
    • Nice to haves == have to write
    • Favor the decorator pattern
    • Pig’s StoreFuncWrapper can help
    • Vendor open sourcing is ideal




                                        22
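The decorator pattern can be sketched as a wrapper that adds row counters around a storer it does not own. VendorStorer, CountingStorer, and put_next are hypothetical Python stand-ins. They illustrate the wrapping idea only and do not reproduce the interface of Pig's StoreFuncWrapper or of VerticaStorer.

```python
class VendorStorer:
    """Hypothetical stand-in for a vendor-supplied storer we cannot modify."""

    def __init__(self):
        self.written = []

    def put_next(self, row):
        if None in row:
            raise ValueError("rejected row")
        self.written.append(row)


class CountingStorer:
    """Decorator: same interface as the wrapped storer, plus row counters."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.inserted = 0
        self.rejected = 0

    def put_next(self, row):
        try:
            self._wrapped.put_next(row)  # delegate the real work
            self.inserted += 1
        except ValueError:
            self.rejected += 1


storer = CountingStorer(VendorStorer())
for row in [("alice", 3), ("bob", None), ("carol", 7)]:
    storer.put_next(row)
```

The vendor class stays untouched, so a new vendor release can be dropped in while the added behavior lives entirely in the wrapper.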
Future work
  • More VerticaStorer features
  • Multiple Vertica clusters
  • Atomic DB loads with Pig/Oink
  • Better DAG visibility
  • Better job history visibility
  • MR job optimizations via historic stats
  • HCatalog data registry
  • Job push events


                                              23
Acknowledgements




                   24
Questions?

 Bill Graham - @billgraham




                             25


Editor's Notes

  • #6 Point out differences more: which ones move from where.
  • #7-#12 Describe colo.
  • #14 Point out the "develop your own tools" pattern more; opt in to common services like screech-owl.
  • #16-#19 Expand on the time-based aspect more (jobs and data).
  • #22-#23 Point out that the batch table is updated for all state changes.
  • #24-#29 Talk about when we use Vertica and when we use Hadoop.
  • #30 Writes are fast because they bypass the Vertica write buffer (copy direct).
  • #39 Two Vertica clusters: one for just queries.