Accelerated Analytics for the Big Data Fabric
       Bay Area Hadoop User Group




       © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555
AGENDA



• The Big Data Fabric
• Big Data Preparation – An Everyday Challenge
• Use-Case Scenario – Call Volume Analysis
   – Solution Requirements
   – Solution Workflow
   – Phase I - Data Preparation & Visualization
   – Phase II - Pentaho MapReduce & Orchestration
• Summary




The Big Data Fabric




[Diagram: the layers of the Big Data Fabric]

• Big Analytics
   – Pentaho Business Analytics: Visualization, Dashboards, Interactive Analysis, Reports
   – 3rd Party Tools: R, 3rd Party BI Tools, Applications
• Data Integration
   – Job Orchestration, Workflow, Scheduling, High Performance, Visual IDE
• Big Data Mgmt
   – Hadoop, NoSQL Databases, Analytic Databases
Preparing Big Data for Analysis is an Everyday Challenge

• Highly technical skills required
• Divide between MapReduce developers & analysts
• Beyond the reach of many organizations




Pentaho Visual MapReduce




• Accessible by any ETL developer, business analyst or data scientist
• Executes inside Hadoop as a native Java MapReduce task
Pentaho Reporting & Analytics




• Batch Reporting and Ad Hoc Query
• Data Visualization, Discovery and Analysis

[Diagram labels: Hadoop, NoSQL, Hybrid]
Use Case Scenario – Call Volume Analysis

• VOIP service provider has excess capacity and is
  considering expansion to consumer markets
• Business Analyst: what are the top 10 states for
  inbound calls on Fridays, Saturdays and Sundays?
• Research data available:
   – Call records – date/timestamp & destination phone #
   – NANP (North American Numbering Plan) data – area
     code by country, state & time zone




Solution Requirements

• Data Preparation
   – Access the call records in HDFS
   – Extract the destination area code for each call
   – Read the area code reference data
   – Lookup country, state and time zone by area code, append to each
     record
   – Filter out records (non-U.S. calls and calls made Monday through Thursday)
   – Load to a relational database
   – Generate metadata
• Analysis
   – Explore data multi-dimensionally
   – Find the top-10 states by inbound call volume
   – Navigate via a geospatial interface
• Deployment
   – Deploy in MapReduce to handle larger data volumes
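
The data preparation and analysis steps above can be sketched in plain Python. This is only an illustration of the logic, not Pentaho code: the sample call records, the tiny NANP table, the field layout and the helper names (`area_code`, `prepare`, `top_states`) are all assumptions, and the database load and metadata steps are left out.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample data; the real call records and NANP extract are not in the slides.
CALLS = [
    ("2012-03-02 18:05:00", "4155550101"),  # Friday, California
    ("2012-03-03 09:30:00", "2125550102"),  # Saturday, New York
    ("2012-03-04 11:00:00", "4155550103"),  # Sunday, California
    ("2012-03-05 14:45:00", "2125550104"),  # Monday, filtered out
    ("2012-03-02 20:15:00", "4165550105"),  # Friday, Canada, filtered out
]
NANP = {  # area code -> (country, state/province, time zone)
    "415": ("US", "CA", "Pacific"),
    "212": ("US", "NY", "Eastern"),
    "416": ("CA", "ON", "Eastern"),
}
WANTED_DAYS = {4, 5, 6}  # datetime.weekday(): Friday=4, Saturday=5, Sunday=6


def area_code(phone):
    """Extract the destination area code (assumes 10-digit numbers, optional leading 1)."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits[:3]


def prepare(calls, nanp):
    """Look up country/state/time zone, append them, and filter the records."""
    for timestamp, phone in calls:
        code = area_code(phone)
        ref = nanp.get(code)
        if ref is None:
            continue  # unknown area code: nothing to append
        country, state, time_zone = ref
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        if country == "US" and when.weekday() in WANTED_DAYS:
            yield {"timestamp": when, "area_code": code,
                   "country": country, "state": state, "time_zone": time_zone}


def top_states(rows, n=10):
    """Rank states by inbound call volume."""
    return Counter(row["state"] for row in rows).most_common(n)


print(top_states(prepare(CALLS, NANP)))
```

With this sample data, California has two qualifying calls and New York has one; the Monday call and the Canadian call are dropped by the filter.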

Solution Workflow


• Phase I - Business Analysts
   – Use a data extract to prepare and validate their analyses
   – Iterate over requirements with executives and stakeholders


• Phase II - MapReduce Developers/Analysts
   – Create production Pentaho MapReduce transformations
   – Manage the deployment and orchestration between the
     Hadoop cluster and the production database




Data Preparation (Phase I)




• The data pipeline implements the data preparation logic
• Each component has a “personality”: access, calculate, join, filter …
• Free-form design
    – As many or as few inputs, transformations and outputs as needed
• Schema contract exists only between connected components
• Pipelined, multi-threaded for performance
• 100% Java-based for deployment flexibility
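
As a rough analogy for the component model described above, the sketch below chains small Python generators, each with one "personality". It is not how Pentaho Data Integration is implemented (the slide describes a multi-threaded Java engine; this is single-threaded and lazy), and the component names and sample rows are hypothetical.

```python
def calculator(rows, new_field, fn):
    """'Calculate' personality: append a derived field to every row."""
    for row in rows:
        yield {**row, new_field: fn(row)}


def stream_lookup(rows, reference, key_field, value_field):
    """'Join' personality: append a value looked up by key (None if missing)."""
    for row in rows:
        yield {**row, value_field: reference.get(row[key_field])}


def row_filter(rows, predicate):
    """'Filter' personality: pass through only rows matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row


# Hypothetical input rows and reference data.
source = iter([{"phone": "4155550101"}, {"phone": "2125550102"}, {"phone": "9995550103"}])
states = {"415": "CA", "212": "NY"}

# Each component relies only on the fields produced by the one feeding it:
# stream_lookup needs "area_code", which exists because calculator is upstream.
pipeline = row_filter(
    stream_lookup(
        calculator(source, "area_code", lambda r: r["phone"][:3]),
        states, "area_code", "state"),
    lambda r: r["state"] is not None)

result = list(pipeline)
print(result)
```

Rows flow through one at a time, so no stage waits for the previous stage to finish the whole data set, which is the sense in which the design is "pipelined".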



Data Pipeline – Input from HDFS




Data Pipeline - Calculator




Data Pipeline – Stream Lookup




Data Pipeline – Row Filter




Data Pipeline – Table Output




Visualization – Multi-Dimensional UX




Visualization – Geographic




Visualization - Heatmap




Deployment to Hadoop (Phase II)


• To process a larger set of data we can deploy the data pipeline via
  MapReduce
    – Input and output streams are encoded in key-value pairs
    – Two specialized components provide an interface:




    – A special job component deploys the data pipeline to the Hadoop
      cluster:




Pentaho MapReduce – Inputs/Outputs



      The core logic of the data pipeline is
       identical … only the ends change
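
To illustrate the idea that only the ends change, here is a minimal plain-Python sketch in which the per-record logic is kept in one function and wrapped by a key-value mapper and reducer. It does not use Pentaho's components or any Hadoop API: `state_if_wanted`, `mapper`, `reducer`, `run_local`, the comma-separated line format and the lookup table are all assumptions, and `run_local` merely imitates the shuffle-and-sort step in memory.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

AREA_CODE_TO_STATE = {"415": "CA", "212": "NY"}  # hypothetical U.S.-only reference data


def state_if_wanted(timestamp, phone):
    """Core logic: return the state for a Friday-Sunday U.S. call, else None."""
    when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    if when.weekday() < 4:  # Monday=0 .. Thursday=3
        return None
    return AREA_CODE_TO_STATE.get(phone[:3])


def mapper(key, value):
    """Input end: value is one text line 'timestamp,phone'; the key is ignored."""
    timestamp, phone = value.split(",")
    state = state_if_wanted(timestamp, phone)
    if state is not None:
        yield state, 1


def reducer(key, values):
    """Output end: emit one (state, call count) pair per key."""
    yield key, sum(values)


def run_local(lines):
    """In-memory stand-in for map, shuffle-and-sort, and reduce."""
    mapped = [kv for offset, line in enumerate(lines) for kv in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))
    out = []
    for state, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(state, (count for _, count in group)))
    return out


LINES = [
    "2012-03-02 18:05:00,4155550101",  # Friday
    "2012-03-03 09:30:00,2125550102",  # Saturday
    "2012-03-04 11:00:00,4155550103",  # Sunday
    "2012-03-05 14:45:00,2125550104",  # Monday, dropped
]
print(run_local(LINES))
```

The function `state_if_wanted` would be the same whether rows arrive from a file extract or from a mapper; only the wrapping that reads and emits key-value pairs differs.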








Pentaho MapReduce – Orchestration




Instant Analytics (Roadmap)




1. Choose a Big Data Source, Answer a Few Questions, Publish to Pentaho
2. Report, Explore and Analyze
3. Customize Model (Optional)
SUMMARY



1. The Big Data Fabric encompasses a large collection of Hadoop
   distributions, NoSQL and analytical databases
2. A component-based approach to data access and integration can:
   – Allow business analysts and data scientists to perform their own data
     preparation
   – Result in more rapid validation of business requirements & metrics
   – Be used to create data pipelines that can be deployed directly to a
     cluster, enabling analytics against much larger data sets
   – Support orchestration across environments




Thank You
Join the conversation. You can find us on:

     http://blog.pentaho.com

     @Pentaho

     Facebook.com/Pentaho

     Pentaho Business Analytics




Editor's Notes

  • #7: Leveraging PDI to incorporate Big Data into your data fabric provides immediate access to analytics. Examples:
      – Batch and ad hoc reporting directly against Big Data sources using familiar BI tools with no coding (Report Designer, Interactive Reporting)
      – An agile framework to quickly generate, house and manage data marts for interactive analysis, data discovery, etc.