ANALYTICS ON HADOOP

Donald Miner
Solutions Architect
Advanced Technologies Group




© Copyright 2012 EMC Corporation. All rights reserved.                       1
Large Retailer and Pregnancy

“As Pole’s computers crawled through the data, he was able to identify
about 25 products that, when analyzed together, allowed him to assign each
shopper a “pregnancy prediction” score. More important, he could also
estimate her due date to within a small window, so they could send coupons
timed to very specific stages of her pregnancy.”



Hadoop Origins
• Open source system based on papers published by Google
• MapReduce was used by Google to parse and index web pages
  and calculate PageRank
• Grew out of the need for a system that is:
        – Linearly and horizontally scalable
        – Able to store massive amounts of data
        – Fault tolerant
        – Ready to analyze HTML files
        – Cheap to build and maintain


What is Hadoop?
Two Core Components

• HDFS: scalable storage in the Hadoop Distributed File System
• MapReduce: compute via the MapReduce distributed processing platform

• Open source system developed by the Apache Software Foundation
• Storage and compute in one framework
• Massively scalable



Why is Hadoop Important?
• Business analytics require new approaches
        – Data size
        – Data growth
• The new nature of data
        – Unstructured
        – Numerous sources
• Hadoop makes analytics on large data sets
  more cost-effective




Structured and Unstructured Data
[Diagram: a spectrum running from STRUCTURED to UNSTRUCTURED.
 Labels for Greenplum DB: SQL, RDBMS, Tables and Schemas, Partitioning,
 Indexing, BI Tools, GP MapReduce]




Structured and Unstructured Data
[Diagram: a spectrum running from STRUCTURED to UNSTRUCTURED.
 Labels for Hadoop: Hive, Pig, SequenceFile, XML, JSON, …, Directories,
 Flat files, Schema on load, No ETL, MapReduce, Java]




Leverage Both in a Unified Platform
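[Diagram: a spectrum running from STRUCTURED to UNSTRUCTURED, with both
 platforms shown together.
 Labels for Greenplum DB: SQL, RDBMS, Tables and Schemas, Partitioning,
 Indexing, BI Tools, GP MapReduce
 Labels for Hadoop: Hive, Pig, SequenceFile, XML, JSON, …, Directories,
 Flat files, Schema on load, No ETL, MapReduce, Java]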




Hadoop Use Case
Launching our new product: The Marshmallow House




Marshmallow House Release Analysis
            Greenplum Party




Website Logs
• 15 web servers, 5 application servers
• Problem: cross-correlation across servers
• Problem: 500 TB of data, growing by 1 TB/day
• Problem: extracting insights from text




Current System
• SQL database
        – ETL process to collect and parse logs
        – Analyzes transactions on the website
        – Can’t work comfortably with the text
• Perl scripts parsing the logs
        – Don’t scale
        – Hard to correlate across systems
        – Hard to deploy




Augmenting Capabilities with Hadoop
• Hadoop helps us extract value in more ways
• Particular analytics we have in mind:
        – Interest in product by location
        – Sessionizing our disparate data
        – Building behavior models of our customers
        – Analyzing customers’ sentiment toward our products
• Why? To target Marshmallow House purchasers




Geographical Distribution
• Problem: We don’t know how much interest there is
  in each location
• Value: This will allow us to justify and scope
  additional marketing efforts
• Why Hadoop: Text search, log parsing,
  custom data structures




Geographical Distribution
• Solution: Find IP addresses interested in our
  product, then count them by location
• MapReduce job:
        – map: extract IP addresses from all data, enrich with
          IP geolocation information
        – reduce: group by geographical location, count the
          number of records
        – output: location, count
• Result: Lots of interest in Virginia
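
The map and reduce steps above can be sketched in plain Java, without the Hadoop API. This is an illustration under stated assumptions, not the job from the talk: the class name GeoInterest, the product path "/marshmallow-house", and the tiny prefix-to-location table are all hypothetical stand-ins (a real job would consult an IP geolocation database).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GeoInterest {
    // Hypothetical lookup table from IP prefix to location (assumption for this sketch).
    static final Map<String, String> GEO = Map.of(
        "192.0.2.", "Virginia",
        "198.51.100.", "Maryland");

    // Hypothetical URL fragment that marks interest in the product.
    static final String PRODUCT_PATH = "/marshmallow-house";

    // Captures the first three octets (with trailing dot) of an IPv4 address.
    static final Pattern IPV4 =
        Pattern.compile("\\b(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.)\\d{1,3}\\b");

    // map: keep only lines about the product, extract each IP, emit its location.
    static List<String> map(String line) {
        List<String> locations = new ArrayList<>();
        if (!line.contains(PRODUCT_PATH)) {
            return locations;
        }
        Matcher m = IPV4.matcher(line);
        while (m.find()) {
            locations.add(GEO.getOrDefault(m.group(1), "Unknown"));
        }
        return locations;
    }

    // reduce: group the emitted locations and count the records for each one.
    static Map<String, Integer> countByLocation(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String location : map(line)) {
                counts.merge(location, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "192.0.2.10 GET /marshmallow-house",
            "198.51.100.7 GET /marshmallow-house",
            "192.0.2.44 GET /marshmallow-house/pricing",
            "192.0.2.45 GET /about");
        // output: location, count
        countByLocation(logs).forEach((loc, n) -> System.out.println(loc + "\t" + n));
    }
}
```

In a real Hadoop job the grouping by location would be done by the framework's shuffle between the map and reduce phases; here a single in-memory map plays that role.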




Sample MapReduce Java Code
• A MapReduce job consists of a Mapper,
  a Reducer, and a Driver
• The Mapper parses, filters, transforms,
  enriches, and extracts
• The Reducer aggregates, counts, and outputs
• The Driver sets up and submits the job for
  execution
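
The three roles can be illustrated with a small in-memory imitation written in plain Java. It does not use Hadoop's own classes; MiniMapReduce, run, wordMapper, and sumReducer are names invented for this sketch, and word counting is only a stand-in workload chosen because it is short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

class MiniMapReduce {
    // Driver role: wires a mapper and a reducer together and runs them over the input.
    static <K extends Comparable<K>, V, R> Map<K, R> run(
            List<String> input,
            Function<String, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map phase plus shuffle: collect every emitted value under its key.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (String record : input) {
            for (Map.Entry<K, V> kv : mapper.apply(record)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // Reduce phase: one reducer call per key, with all values for that key.
        Map<K, R> output = new TreeMap<>();
        grouped.forEach((key, values) -> output.put(key, reducer.apply(key, values)));
        return output;
    }

    // Mapper role: parse a line and emit a (word, 1) pair for each word.
    static List<Map.Entry<String, Integer>> wordMapper(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reducer role: aggregate the values for one key into a total.
    static Integer sumReducer(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(
            List.of("marshmallow house", "house of marshmallow"),
            MiniMapReduce::wordMapper,
            MiniMapReduce::sumReducer);
        System.out.println(counts);
    }
}
```

In Hadoop itself the driver configures and submits a job to the cluster rather than looping over records locally, and the framework performs the grouping between the two phases across machines.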




Mapper Code




Reducer Code




Driver Code




Sessionizing
• Problem: Data is scattered across systems
• Value: Analyze a user’s experience at the session
  level, which shows a bigger picture
• Why Hadoop: Hadoop handles heterogeneous
  and hierarchical data well




Sessionizing
• Solution: Load the data sets and group by IP and
  temporal locality, then output as a hierarchical data
  structure
• MapReduce job:
        – map: extract IP and date/time, keep the record
        – reduce: group by IP, then group into sessions; format
          into JSON documents and output
• Result: 1 million sessions a day
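
A minimal plain-Java sketch of this grouping follows. The slides do not say how a session boundary is decided, so the 30-minute inactivity gap is an assumption, as are the Sessionizer and Hit names, the record fields, and the JSON layout. The JSON is built by hand and assumes field values need no escaping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Sessionizer {
    // Hypothetical log record: client IP, timestamp in epoch seconds, requested path.
    static final class Hit {
        final String ip;
        final long epochSeconds;
        final String path;

        Hit(String ip, long epochSeconds, String path) {
            this.ip = ip;
            this.epochSeconds = epochSeconds;
            this.path = path;
        }
    }

    // Assumed inactivity gap that closes a session (30 minutes).
    static final long GAP_SECONDS = 30 * 60;

    // Returns one JSON document per session.
    static List<String> sessionize(List<Hit> hits) {
        // map + shuffle: key every record by its IP address.
        Map<String, List<Hit>> byIp = new TreeMap<>();
        for (Hit h : hits) {
            byIp.computeIfAbsent(h.ip, k -> new ArrayList<>()).add(h);
        }
        // reduce: per IP, sort by time and start a new session after a long gap.
        List<String> sessions = new ArrayList<>();
        for (Map.Entry<String, List<Hit>> e : byIp.entrySet()) {
            List<Hit> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.comparingLong((Hit h) -> h.epochSeconds));
            List<Hit> current = new ArrayList<>();
            for (Hit h : sorted) {
                if (!current.isEmpty()) {
                    long last = current.get(current.size() - 1).epochSeconds;
                    if (h.epochSeconds - last > GAP_SECONDS) {
                        sessions.add(toJson(e.getKey(), current));
                        current = new ArrayList<>();
                    }
                }
                current.add(h);
            }
            if (!current.isEmpty()) {
                sessions.add(toJson(e.getKey(), current));
            }
        }
        return sessions;
    }

    // Formats one session as a hierarchical JSON document (no string escaping).
    static String toJson(String ip, List<Hit> hits) {
        StringBuilder sb = new StringBuilder("{\"ip\":\"" + ip + "\",\"hits\":[");
        for (int i = 0; i < hits.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"t\":").append(hits.get(i).epochSeconds)
              .append(",\"path\":\"").append(hits.get(i).path).append("\"}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("192.0.2.10", 0L, "/"),
            new Hit("192.0.2.10", 600L, "/marshmallow-house"),
            new Hit("192.0.2.10", 4000L, "/"),
            new Hit("198.51.100.7", 100L, "/"));
        sessionize(hits).forEach(System.out::println);
    }
}
```

Grouping by IP alone treats every client behind a shared address as one visitor; a production job would likely add a cookie or user agent to the key.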




Unstructured and Semi-Structured Data
• Unnatural to store in an RDBMS
• Unstructured: text, documents, media,
  raw sensor data
• Semi-structured: mixed structured/unstructured;
  hierarchical
• Hadoop’s ability to leverage Java gives flexibility
• “Schema on load”
• Data stored as “rich documents”




Behavioral Model
• Problem: We don’t understand our visitors’ typical
  behavior patterns
• Value: Optimize our interface for usability;
  understand our customers
• Why Hadoop: Advanced analytics and machine
  learning are possible because of the flexibility of the
  framework




Behavioral Model
• Solution: Run over the sessions and build a generic
  model from them
• MapReduce job: Use clustering to group users into
  stereotypes, then use frequent itemset analysis to find
  correlations between our users’ actions
• Results: We have three major types of buyers; casual
  buyers usually visit the Marshmallow House from the
  main page
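
The frequent itemset step can be sketched for the simplest case, pairs of actions that occur together in a session. This plain-Java example is a simplified stand-in rather than the analysis from the talk: it skips clustering entirely, and the FrequentPairs name, the action labels, and the minimum support threshold are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class FrequentPairs {
    // Counts how many sessions contain each unordered pair of actions and
    // keeps only the pairs seen in at least minSupport sessions.
    static Map<String, Integer> frequentPairs(List<List<String>> sessions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> session : sessions) {
            // Deduplicate and sort so each pair is counted once per session,
            // always in the same order.
            List<String> actions = new ArrayList<>(new TreeSet<>(session));
            for (int i = 0; i < actions.size(); i++) {
                for (int j = i + 1; j < actions.size(); j++) {
                    counts.merge(actions.get(i) + " + " + actions.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop pairs below the support threshold.
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sessions, each a list of action labels.
        List<List<String>> sessions = List.of(
            List.of("main", "house", "buy"),
            List.of("main", "house"),
            List.of("search", "house"));
        System.out.println(frequentPairs(sessions, 2));
    }
}
```

Counting every pair grows quadratically with session length, which is one reason larger itemsets are usually mined with dedicated algorithms such as those shipped in machine learning libraries.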




Apache Mahout
• Machine learning library built on Hadoop
• Scalable machine learning
• Open source project
• Data mining, advanced analytics, predictive
  modeling
• Main use cases: recommendation engines,
  clustering, classification, frequent itemset
  mining



Hadoop Makes These Possible
• Analysis of unstructured data is possible in Java and
  Hadoop
• Advanced data mining and machine learning
  techniques fit naturally
• Data analysis can be done on the data in its
  original form
• Large amounts of heterogeneous data can be
  analyzed



Thank You



