Trends and usage of
Apache Hadoop
Eric Baldeschwieler
CEO Hortonworks
Twitter: @jeric14, @hortonworks



January 2012




© Hortonworks Inc. 2011           Page 1
Agenda
• Define terms
  – What is Hadoop? Why does Hadoop matter?


• What drives Hadoop adoption?

• Observed Trends




     Architecting the Future of Big Data
Hortonworks Vision


 We believe that by 2015, more than
    half the world's data will be
   processed by Apache Hadoop


                         How to achieve that vision???
                                 Enable ecosystem around
                                 enterprise-viable platform.




What is Apache Hadoop?
•  Solution for big data
    –  Deals with complexities of high
       volume, velocity & variety of data

•  Set of open source projects

•  Transforms commodity hardware
   into a service that:
    –  Stores petabytes of data reliably
    –  Allows huge distributed computations

•  Key attributes:
    –  Redundant and reliable (no data loss)
    –  Extremely powerful
    –  Batch processing centric
    –  Easy to program distributed apps
    –  Runs on commodity hardware

One of the best examples of open source driving innovation and creating a market
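
To make the "easy to program distributed apps" point concrete, here is a rough in-memory Python sketch of the map, shuffle, and reduce steps behind a MapReduce word count. It does not use any Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and a real job would run these steps in parallel across many machines rather than in one process.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data", "Big clusters"]))
```

The programmer writes only the map and reduce functions. Grouping by key, distribution, and recovery from failures are the parts the framework is meant to take over.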



Hortonworks Data Platform (HDP)
Key Components of “Standard Hadoop” Open Source Stack


[Diagram: the stack, grouped into Core Apache Hadoop and Related Hadoop Projects]
•  Pig (Data Flow)
•  Hive (SQL)
•  HBase (Columnar NoSQL Store)
•  Zookeeper (Coordination)
•  MapReduce (Distributed Programming Framework)
•  HCatalog (Table & Schema Management)
•  HDFS (Hadoop Distributed File System)

Open APIs for:
•  Data Integration
•  Data Movement
•  App Job Management
•  System Management




Big Data Trailblazers and Use Cases


•  data analytics
•  analyzing web logs
•  advertising optimization
•  machine learning
•  mail anti-spam
•  text mining
•  web search
•  content optimization
•  customer trend analysis
•  ad selection
•  video & audio processing
•  data mining
•  user interest prediction
•  social media




Yahoo!, Apache Hadoop & Hortonworks
http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop

      Yahoo! embraced Apache Hadoop, an open source platform, to
   crunch epic amounts of data using an army of dirt-cheap servers

                                         2006




                                  Hadoop at Yahoo!
                                    40K+ Servers
                                    170PB Storage
                                  5M+ Monthly Jobs
                                  1000+ Active Users



                                         2011




  Yahoo! spun off 22+ engineers into Hortonworks, a company focused on
    advancing open source Apache Hadoop for the broader market

What drives Hadoop adoption?




Market Drivers for Apache Hadoop
• Business drivers
  – High-value projects that require use of more data
  – Belief that there is great ROI in mastering big data

• Financial drivers
  – Growing cost of data systems as percentage of IT spend
  – Cost advantage of commodity hardware + open source
  – Enables departmental-level big data strategies

• Technical drivers
  – Existing solutions failing under growing requirements
       – 3Vs - Volume, velocity, variety
  – Proliferation of unstructured data

Gartner predicts 800% data growth over next 5 years
80-90% of data produced today is unstructured

Every Market has Big Data
       Digital data is personal, everywhere, increasingly
      accessible, and will continue to grow exponentially




Source: McKinsey & Company report. Big data: The next frontier for innovation, competition, and productivity. May 2011.


Broader Use Case Opportunities
Financial Services                            Healthcare
•  Detect/prevent fraud                       •  Patient monitoring
•  Model and manage risk                      •  Predictive modeling
•  Personalize banking/insurance products     •  Compliance, Archival, text search
•  Compliance, Archival, …                    •  Data driven research
Retail                                        Web / Social / Mobile
•  Behavior analysis                          •  Sentiment analysis
•  Cross selling, recommendation engines      •  Web log, image, and video analysis
•  Optimize pricing, placement, design        •  Personalization
•  Optimize inventory and distribution        •  Billing, Reporting, Network Analysis

Manufacturing                                 Government
•  Simulation, Analysis, Design               •  Detect/prevent fraud
•  Improve service via product sensor data    •  Security & Intelligence
•  “Digital factory” for lean manufacturing   •  Support open data initiatives



Observed Trends




Trend: Agile Data
• The old way
  – Operational systems keep only current records, short history
  – Analytics systems keep only conformed / cleaned / digested data
  – Unstructured data locked away in operational silos
  – Archives offline
       – Inflexible, new questions require system redesigns

• The new trend
  – Keep raw data in Hadoop for a long time
  – Able to produce a new analytics view on-demand
  – Keep a new copy of data that was previously only in silos
  – Can directly do new reports, experiments at low incremental cost
  – New products / services can be added very quickly
  – Agile outcome justifies new infrastructure
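
The idea of producing a new analytics view on demand from retained raw data can be sketched as follows. This is a minimal in-memory Python illustration, not Hadoop code. The JSON record layout (`user`, `page`) and the helper `build_view` are assumptions made up for the example.

```python
import json
from collections import Counter

# Hypothetical raw events, kept exactly as they were logged.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home"}',
    '{"user": "u2", "page": "/home"}',
    '{"user": "u1", "page": "/mail"}',
]


def build_view(raw_lines, key_fn):
    """Scan the raw records and count them under a key chosen at query time."""
    counts = Counter(key_fn(json.loads(line)) for line in raw_lines)
    return dict(counts)


# Two different questions answered from the same raw data, with no redesign.
hits_per_page = build_view(RAW_EVENTS, lambda event: event["page"])
hits_per_user = build_view(RAW_EVENTS, lambda event: event["user"])
```

Because the records are stored unprocessed, a question nobody anticipated at load time only needs a new `key_fn`, not a new schema or a reload.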

Traditional Enterprise Data Architecture
  Data Silos

[Diagram]
• Serving Applications: Web Serving, NoSQL, RDBMS, …
• Traditional ETL & Message buses
• Traditional Data Warehouses, BI & Analytics: EDW, Data Marts, BI / Analytics
• Unstructured Systems: Serving Logs, Social Media, Sensor Data, Text Systems, …
Agile Data Architecture w/Hadoop
  Connecting All of Your Big Data

[Diagram]
• Serving Applications: Web Serving, NoSQL, RDBMS, …
• Traditional ETL & Message buses
• Traditional Data Warehouses, BI & Analytics: EDW, Data Marts, BI / Analytics
• EsTsL (s = Store), Custom Analytics
• Unstructured Systems: Serving Logs, Social Media, Sensor Data, Text Systems, …
Trend: Data driven development
• Limited runtime logic driven by huge lookup tables

• Data computed offline on Hadoop
  – Machine learning, other expensive computation offline
  – Personalization, classification, fraud, value analysis…


• Application development requires data science
  – Huge amounts of actually observed data key to modern services
  – Hadoop used as the science platform
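
The pattern of limited runtime logic driven by lookup tables computed offline might look like this in miniature. It is a sketch under stated assumptions: the names `build_lookup_table` and `serve`, the `(user, category)` click-log format, and the `top_stories` default are all hypothetical. In the setting the slide describes, the offline step would run as a Hadoop job over far larger data.

```python
from collections import Counter, defaultdict


def build_lookup_table(click_log):
    """Offline step: map each user to the category they clicked most often."""
    per_user = defaultdict(Counter)
    for user, category in click_log:
        per_user[user][category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


def serve(user, table, default="top_stories"):
    """Runtime step: a single dictionary lookup, with a fallback for unknown users."""
    return table.get(user, default)


# Hypothetical click log of (user, category) pairs.
CLICK_LOG = [("u1", "sports"), ("u1", "sports"), ("u1", "finance"), ("u2", "news")]
TABLE = build_lookup_table(CLICK_LOG)
```

All of the expensive computation sits in `build_lookup_table`. The serving path stays a constant-time lookup, so the table can be rebuilt and swapped in without changing runtime code.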




CASE STUDY
     YAHOO! HOMEPAGE

[Diagram: SCIENCE HADOOP CLUSTER, PRODUCTION HADOOP CLUSTER, SERVING SYSTEMS, ENGAGED USERS]
  •  SCIENCE HADOOP CLUSTER
     » Machine learning to build ever better categorization models
  •  PRODUCTION HADOOP CLUSTER
     » Identify user interests using Categorization models
  •  SERVING SYSTEMS, ENGAGED USERS
     » Build customized home pages with latest data (thousands / second)
  •  Data labels in the diagram: USER BEHAVIOR, CATEGORIZATION MODELS (weekly), SERVING MAPS (every 5 minutes)

  •  Serving Maps
       •  Users - Interests
  •  Five Minute Production
  •  Weekly Categorization models

Copyright Yahoo 2011
CASE STUDY
     YAHOO! HOMEPAGE

      Personalized
      for each visitor

      Result:
      twice the engagement

  •  Recommended links: +79% clicks vs. randomly selected
  •  News Interests: +160% clicks vs. one size fits all
  •  Top Searches: +43% clicks vs. editor selected
Trend: Specialization of Data Systems
• Hadoop does not replace existing systems
  – It adds new capabilities to the enterprise
  – It can offload things that are not done efficiently in current systems
       – Especially in scale out situations


• Specialization of traditional data components
  – Use OLTP systems just for transactions
  – Use OLAP systems for interactive analysis


• Hadoop has LOTS of bandwidth to storage and CPU
  – Pull reporting out of OLTP systems
  – Pull ELT out of OLAP systems


Hadoop and OLTP Systems
MPP Processing of Online Transactions
•    Mission critical
•    Manages transactions & serves reports

Hadoop used to Process Reports
•     Free up 50+% processing power for transaction processing system
•     Significant cost savings due to commodity nature of Hadoop

[Diagram: Web Sites, Transaction Processing Systems ($$$), Transaction Logs, Reports]




Hadoop and OLAP Systems
Fast loading, raw data staging, ELT & long-term archival (The Agile Data Zone)
Allow analysts to use tools they know (Take advantage of huge ecosystem of BI and Analytics tooling)

[Diagram: Web, Mobile, Social, Other logs; Hadoop; Online Archival; EDW]


TRENDS: Instrument Clouds of Things
Clouds of things logging to Hadoop
•  Websites
•  Mobile phones, Enterprise devices…

HDFS + Map-Reduce Or HBase + Analysis

[Diagram: clouds of Things]




Trend: Many POCs, Few Production Systems

• The problem
  – Hadoop is still a young technology
  – Hard to find knowledgeable staff
  – Integration with existing systems


• Hadoop market is maturing at speed
  – Emerging ecosystem of Hadoop platform solutions providers
  – Apache Hadoop continues to get better
  – Hadoop training and support available from several vendors




Growth in Hadoop Ecosystem
• Hardware vendors, Public Cloud (IAAS, PAAS)
  – Storage, Appliances, Preloaded commodity boxes, cloud

• Data Systems
  – All the major vendors announced Hadoop plans / products in 2011

• BI, Analytics and ETL
  – Hadoop integrations emerging

• Dedicated Hadoop Applications
  – Datameer, Karmasphere, Platfora, …

• Systems Integrators
  – Regional and Global providers available

Hadoop Continues to Improve
Apache community, including Hortonworks investing to improve Hadoop:
•  Make Hadoop an Open, Extensible, and Enterprise Viable Platform
•  Enable More Applications to Run on Apache Hadoop

“Hadoop.Now” (Hadoop 1.0)
•  Most stable version ever
•  HBase, security, WebHDFS

“Hadoop.Next” (Hadoop 0.23)
•  HA, Next-gen HDFS & MapReduce
•  Extension & Integration APIs

“Hadoop.Beyond”
•  Platform actively evolving




Hortonworks – Approachable Hadoop
•  Apache Hadoop Leadership
   –  Delivered every major release since 0.1
   –  Driving innovation across entire stack
   –  Experience managing world’s largest
      deployment
   –  Access to Yahoo’s 1,000+ Hadoop users
      and 40k+ nodes for testing, QA, etc.


•  Business Focus
   –  Provide 100% open source product
        –  Hortonworks Data Platform
   –  Help customers and partners overcome
      Hadoop knowledge gaps
   –  Help organizations successfully develop
      and deploy solutions based on Hadoop

Expert Role-based Training
Full Lifecycle Support and Services: Evaluate, Pilot, Production


Trend: Finding More Value Over Time
• Hadoop is usually brought in to solve a specific
  problem
  – Build search indexes for Yahoo
  – Manage web site logs for Facebook
  – Users using EC2 to do data processing at Amazon
  – Simple reporting when existing tools don’t scale


• Once your data is in Hadoop more users find value

• Once you have Hadoop, folks add more data




Thank You! Questions?
Eric Baldeschwieler
@jeric14 @hortonworks





Hadoop Trends

  • 1. Trends and usage of Apache Hadoop Eric Baldeschwieler CEO Hortonworks Twitter: @jeric14, @hortonworks January 2012 © Hortonworks Inc. 2011 Page 1
  • 2. Agenda • Define terms – What is Hadoop? Why does Hadoop matter? • What drives Hadoop adoption? • Observed Trends Architecting the Future of Big Data Page 2 © Hortonworks Inc. 2011
  • 3. Hortonworks Vision We believe that by 2015, more than half the world's data will be processed by Apache Hadoop How to achieve that vision??? Enable ecosystem around enterprise-viable platform. Page 3 © Hortonworks Inc. 2011
  • 4. What is Apache Hadoop? •  Solution for big data –  Deals with complexities of high volume, velocity & variety of data •  Set of open source projects •  Transforms commodity hardware into a service that: –  Stores petabytes of data reliably –  Allows huge distributed computations •  Key attributes: –  Redundant and reliable (no data loss) One of the best examples of –  Extremely powerful open source driving innovation –  Batch processing centric and creating a market –  Easy to program distributed apps –  Runs on commodity hardware Page 4 © Hortonworks Inc. 2011
  • 5. Hortonworks Data Platform (HDP) Key Components of “Standard Hadoop” Open Source Stack Core Apache Hadoop Related Hadoop Projects Open APIs for: •  Data Integration •  Data Movement •  App Job Management •  System Management Pig Hive (Data Flow) (SQL) (Columnar NoSQL Store) HBase MapReduce Zookeeper (Coordination) (Distributed Programing Framework) HCatalog (Table & Schema Management) HDFS (Hadoop Distributed File System) Page 5 © Hortonworks Inc. 2011
  • 6. Big Data Trailblazers and Use Cases data analyzing web logs analytics advertising optimization machine learning mail anti-spam text mining web search content optimization customer trend analysis ad selection video & audio processing data mining user interest prediction social media Page 6 © Hortonworks Inc. 2011
  • 7. Yahoo!, Apache Hadoop & Hortonworks http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop Yahoo! embraced Apache Hadoop, an open source platform, to crunch epic amounts of data using an army of dirt-cheap servers 2006 Hadoop at Yahoo! 40K+ Servers 170PB Storage 5M+ Monthly Jobs 1000+ Active Users 2011 Yahoo! spun off 22+ engineers into Hortonworks, a company focused on advancing open source Apache Hadoop for the broader market Page 7 © Hortonworks Inc. 2011
  • 8. What drives Hadoop adoption? Architecting the Future of Big Data Page 8 © Hortonworks Inc. 2011
  • 9. Market Drivers for Apache Hadoop • Business drivers – High-value projects that require use of more data Gartner predicts 800% data growth – Belief that there is great ROI in mastering big data over next 5 years • Financial drivers – Growing cost of data systems as percentage of IT spend – Cost advantage of commodity hardware + open source – Enables departmental-level big data strategies 80-90% of data produced today is unstructured • Technical drivers – Existing solutions failing under growing requirements – 3Vs - Volume, velocity, variety – Proliferation of unstructured data © Hortonworks Inc. 2011 9 © Hortonworks Inc. 2011
  • 10. Every Market has Big Data Digital data is personal, everywhere, increasingly accessible, and will continue to grow exponentially Source: McKinsey & Company report. Big data: The next frontier for innovation, competition, and productivity. May 2011. Page 10 © Hortonworks Inc. 2011
  • 11. Broader Use Case Opportunities. Financial Services: •  Detect/prevent fraud •  Model and manage risk •  Personalize banking/insurance products •  Compliance, Archival, … Healthcare: •  Patient monitoring •  Predictive modeling •  Compliance, Archival, text search •  Data driven research. Retail: •  Behavior analysis •  Cross selling, recommendation engines •  Optimize pricing, placement, design •  Optimize inventory and distribution. Web / Social / Mobile: •  Sentiment analysis •  Web log, image, and video analysis •  Personalization •  Billing, Reporting, Network Analysis. Manufacturing: •  Simulation, Analysis, Design •  Improve service via product sensor data •  “Digital factory” for lean manufacturing. Government: •  Detect/prevent fraud •  Security & Intelligence •  Support open data initiatives
  • 12. Observed Trends
  • 13. Trend: Agile Data • The old way – Operational systems keep only current records, short history – Analytics systems keep only conformed / cleaned / digested data – Unstructured data locked away in operational silos – Archives offline – Inflexible, new questions require system redesigns • The new trend – Keep raw data in Hadoop for a long time – Able to produce a new analytics view on-demand – Keep a new copy of data that was previously only in silos – Can directly do new reports, experiments at low incremental cost – New products / services can be added very quickly – Agile outcome justifies new infrastructure
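The "keep raw data, produce a new analytics view on-demand" idea can be sketched in a few lines. This is a toy, in-memory illustration: the records in `RAW_EVENTS` and all function names are hypothetical, and on a real cluster the raw records would sit in HDFS and the views would be computed by batch jobs.

```python
import json
from collections import Counter

# Hypothetical raw log records, kept exactly as they arrived.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/search", "ms": 340}',
    '{"user": "a", "page": "/search", "ms": 95}',
]


def read_raw(lines):
    """Parse raw records at read time; nothing is discarded up front."""
    return [json.loads(line) for line in lines]


def views_per_page(events):
    """First question asked of the data: how often is each page viewed?"""
    return dict(Counter(e["page"] for e in events))


def avg_latency_per_user(events):
    """A later question, answered from the same raw data without a redesign."""
    totals, counts = Counter(), Counter()
    for e in events:
        totals[e["user"]] += e["ms"]
        counts[e["user"]] += 1
    return {user: totals[user] / counts[user] for user in totals}


if __name__ == "__main__":
    events = read_raw(RAW_EVENTS)
    print(views_per_page(events))
    print(avg_latency_per_user(events))
```

Because the second view reads fields that the first one ignored, it only works if the raw records were kept rather than a digested summary, which is the contrast the slide draws with "the old way".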
  • 14. Traditional Enterprise Data Architecture: Data Silos. [Diagram labels: Serving Applications (Web Serving, NoSQL, RDBMS, …); Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems, …); Traditional ETL & Message buses; Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics)]
  • 15. Agile Data Architecture w/Hadoop: Connecting All of Your Big Data. [Diagram labels: Serving Applications (Web Serving, NoSQL, RDBMS, …); Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems, …); Traditional ETL & Message buses; EsTsL (s = Store); Custom Analytics; Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics)]
  • 16. Trend: Data driven development • Limited runtime logic driven by huge lookup tables • Data computed offline on Hadoop – Machine learning, other expensive computation offline – Personalization, classification, fraud, value analysis… • Application development requires data science – Huge amounts of actually observed data key to modern services – Hadoop used as the science platform
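The pattern of "limited runtime logic driven by huge lookup tables" that are "computed offline" can be illustrated with a small sketch. Everything here is an assumption made for illustration: the click log format, the function names `build_lookup_table` and `serve`, and the `top_stories` default are hypothetical, and the offline step stands in for what would be a batch job on Hadoop.

```python
from collections import Counter, defaultdict


def build_lookup_table(click_log):
    """Offline batch step: map each user to their most-clicked category."""
    per_user = defaultdict(Counter)
    for user, category in click_log:
        per_user[user][category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


def serve(user, lookup_table, default="top_stories"):
    """Online step: no model is evaluated per request, only a table lookup."""
    return lookup_table.get(user, default)


if __name__ == "__main__":
    # Hypothetical observed behavior: (user, clicked category) pairs.
    log = [("u1", "sports"), ("u1", "sports"), ("u1", "finance"), ("u2", "tech")]
    table = build_lookup_table(log)
    print(serve("u1", table), serve("u3", table))
```

The design choice is to move the expensive computation out of the request path: the serving code stays trivial, and improving the service means recomputing the table from more observed data.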
  • 17. CASE STUDY: YAHOO! HOMEPAGE •  Serving Maps: Users - Interests •  Five Minute Production •  Weekly Categorization models. SCIENCE HADOOP CLUSTER: machine learning on user behavior to build ever better categorization models (weekly). PRODUCTION HADOOP CLUSTER: identify user interests from user behavior using categorization models; produces serving maps (every 5 minutes). SERVING SYSTEMS: build customized home pages with latest data (thousands / second) for engaged users. Copyright Yahoo 2011
  • 18. CASE STUDY: YAHOO! HOMEPAGE. Personalized for each visitor. Result: twice the engagement. Recommended links: +79% clicks vs. randomly selected. News Interests: +160% clicks vs. one size fits all. Top Searches: +43% clicks vs. editor selected. Copyright Yahoo 2011
  • 19. Trend: Specialization of Data Systems • Hadoop does not replace existing systems – It adds new capabilities to the enterprise – It can offload things that are not done efficiently in current systems – Especially in scale out situations • Specialization of traditional data components – Use OLTP systems just for transactions – Use OLAP systems for interactive analysis • Hadoop has LOTS of bandwidth to storage and CPU – Pull reporting out of OLTP systems – Pull ELT out of OLAP systems
  • 20. Hadoop and OLTP Systems. MPP Processing of Online Transactions: •  Mission critical •  Manages transactions & serves reports. Hadoop used to Process Reports: •  Free up 50+% processing power for transaction processing system •  Significant cost savings due to commodity nature of Hadoop. [Diagram labels: Web Sites; Transaction Processing Systems ($$$); Transaction Logs; Reports]
  • 21. Hadoop and OLAP Systems. Fast loading, raw data staging, ELT & long-term archival (The Agile Data Zone). Allow analysts to use tools they know (take advantage of huge ecosystem of BI and Analytics tooling). [Diagram labels: Web, Mobile, Social, Other logs; Hadoop; EDW; Online Archival]
  • 22. Trend: Instrument Clouds of Things. Clouds of things logging to Hadoop: websites, mobile phones, enterprise devices… [Diagram labels: Things; HDFS + Map-Reduce or HBase; + Analysis]
  • 23. Trend: Many POCs, Few Production Systems • The problem – Hadoop is still a young technology – Hard to find knowledgeable staff – Integration with existing systems • Hadoop market is maturing at speed – Emerging ecosystem of Hadoop platform solutions providers – Apache Hadoop continues to get better – Hadoop training and support available from several vendors
  • 24. Growth in Hadoop Ecosystem • Hardware vendors, Public Cloud (IaaS, PaaS) – Storage, Appliances, Preloaded commodity boxes, cloud • Data Systems – All the major vendors announced Hadoop plans / products in 2011 • BI, Analytics and ETL – Hadoop integrations emerging • Dedicated Hadoop Applications – Datameer, Karmasphere, Platfora, … • Systems Integrators – Regional and Global providers available
  • 25. Hadoop Continues to Improve. Apache community, including Hortonworks, investing to improve Hadoop: •  Make Hadoop an Open, Extensible, and Enterprise Viable Platform •  Enable More Applications to Run on Apache Hadoop. “Hadoop.Now” (Hadoop 1.0): most stable version ever; HBase, security, WebHDFS. “Hadoop.Next” (Hadoop 0.23): HA, next-gen HDFS & MapReduce; Extension & Integration APIs. “Hadoop.Beyond”: platform actively evolving.
  • 26. Hortonworks – Approachable Hadoop •  Apache Hadoop Leadership –  Delivered every major release since 0.1 –  Driving innovation across entire stack –  Experience managing world’s largest deployment –  Access to Yahoo’s 1,000+ Hadoop users and 40k+ nodes for testing, QA, etc. •  Business Focus –  Provide 100% open source product (Hortonworks Data Platform) –  Help customers and partners overcome Hadoop knowledge gaps (Expert Role-based Training) –  Help organizations successfully develop and deploy solutions based on Hadoop (Full Lifecycle Support and Services: Evaluate, Pilot, Production)
  • 27. Trend: Finding More Value Over Time • Hadoop is usually brought in to solve a specific problem – Build search indexes for Yahoo – Manage web site logs for Facebook – Users using EC2 to do data processing at Amazon – Simple reporting when existing tools don’t scale • Once your data is in Hadoop, more users find value • Once you have Hadoop, folks add more data
  • 28. Thank You! Questions? Eric Baldeschwieler @jeric14 @hortonworks