The Infrastructure of Tomorrow, Today –
  Integrating Supermicro, Greenplum and SAS
           to enable Big Data Analytics




                            Jeff Tsai 蔡穎碩
                           Solution Manager


                                              © Supermicro 2012
Agenda


Big Data Analytics Platform & Infrastructure
EMC + Supermicro
   1,000-Node Hadoop Cluster
THE ERA OF BIG DATA IS HERE…

• “Big Data Is Less About Size, And More About Freedom” - TechCrunch
• “Findings: ‘Big Data’ Is More Extreme Than Volume” - Gartner
• “Total data: ‘bigger’ than big data” - 451 Group
• “Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World” - IDB
Data Sources Are Expanding




THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS

Source: 2011 IDC Digital Universe Study
BIG Data is Just a Bunch of Data to Store…?                                               OR
[Chart: Big Data Sources by year, 2009-2014. File Based: 60.7% CAGR; Block Based: 21.8% CAGR]

                  By 2012, 80% of all storage capacity sold will be for file-based data

                     Source: IDC
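The growth figures quoted on these slides can be related to each other with the standard compound-growth formula. The sketch below is only an illustration and is not part of the original deck: it converts the "44X in 10 years" multiple into an implied annual rate, and shows what the quoted CAGRs compound to over the five yearly steps from 2009 to 2014. The function names are hypothetical.

```python
def implied_cagr(multiple: float, years: int) -> float:
    """Annual growth rate that yields `multiple`-fold growth over `years`."""
    return multiple ** (1.0 / years) - 1.0


def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


# "Grow 44X in the next 10 years" (IDC Digital Universe figure from the slide)
digital_universe_rate = implied_cagr(44, 10)

# CAGRs quoted on the chart, compounded over 2009 -> 2014 (5 yearly steps)
file_based_multiple = growth_multiple(0.607, 5)
block_based_multiple = growth_multiple(0.218, 5)

print(f"Implied annual growth for 44X over 10 years: {digital_universe_rate:.1%}")
print(f"File-based growth 2009-2014:  {file_based_multiple:.1f}x")
print(f"Block-based growth 2009-2014: {block_based_multiple:.1f}x")
```

Under these assumptions, a 44-fold increase over ten years corresponds to roughly 46% growth per year, and a 60.7% CAGR compounds to more than a tenfold increase over five years.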
To Create Significant value to your business…




                                      HOW?...
Make BIG Data
Accessible
   Identify the data source
   Store the data
   Connect applications and users
   Utilize the data in different views
EMC UAP Solutions – Analytics Platform



This is what my analytics environment looks like…
Building The Big Data Analytics “Stack”

Layers, top to bottom:
• Analytic Toolsets (Business Analytics, BI, Statistics, etc.)
• Greenplum Chorus: Enterprise Collaboration Platform for Data
• Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics
• Greenplum Database (Enterprise & Community Editions): World’s Most Scalable MPP Database Platform
• Greenplum HD (Hadoop Enterprise & Community Editions): Enterprise Analytics Platform for Unstructured Data
Greenplum Becomes the Foundation of EMC’s Data Computing Division

EMC ACQUIRES GREENPLUM IN JULY 2010

“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant….”
                         – Gartner
SAS at a Glance
Company Highlights:
•   Founded 1976: 11,000+ employees in 400+ offices
•   2010 worldwide revenue: $2.43 B
•   IDC: SAS is the leader in analytics with a 34.5% market share (Analytics and Reporting)
•   4.5 million users worldwide
•   50,000+ sites in 114 countries
•   From Tools to Vertical Solutions

[Pie chart: breakdown by industry]
•   Financial Services: 42%
•   Government: 14%
•   Services: 11%
•   Communications: 8%
•   Healthcare & Life Sciences: 8%
•   Manufacturing: 6%
•   Retail: 4%
•   Education: 3%
•   Energy & Utilities: 2%
•   Other: 2%
Overview

Locations: SMC Inc., HQ, San Jose, CA; SMC BV, The Netherlands; SMC TW, Taiwan

Founded in 1993; HQ in San Jose, CA; listed on NASDAQ as SMCI since 2007

Revenues:           FY09 $500M, FY10 $721M, FY11 ~$1B
Global Footprint:   >100 countries
Production:         Facilities in the US, EU and Asia
Engineering:        70% of workforce in engineering (30% growth through recession)
Market Share:       #1 server channel (SMCI enables ~10% of global server market)
Brand Equity:       Growing public profile since 2007 IPO
Corporate Focus:    Energy efficiency, earth-friendly, green technology innovation
Product Family
• Resource Optimized (WIO/UIO)
• Twin Architecture
• GPU SuperComputing
• Data Center Optimized
• Embedded
• Application Optimized: Multi I/O
• SuperBlade
• Workstation
• Mainstream Business Solutions
• Storage Server
In-House Design and Server Building Block Solutions®

[Diagram: Technology Partners and Customer Requirements feed into Server Building Block Solutions® (Application Optimized)]

Customer Requirements:
• OEM Specs
• Tri-Lab
• Optimized Data Center

In-House Design

Server Building Block Solutions®:
• >550 Motherboards
• >1300 Chassis
• >350 Cooling Modules
• >140 Power Supplies
• Open CPU/Memory
• Operating Systems / Applications

(1) As of Q2, 2009
Big Data Analytics on Hadoop
Internet companies are not built on SQL but are building Analytics on Hadoop/NoSQL


Existing Hadoop Users (Internet)

This is what I think my analytics environment looks like…

[Diagram, top to bottom]
• ETL Tools, BI & Reporting, Web Apps
• Pig, Hive, HBase (with Management & Coordination alongside)
• Hadoop System: MapReduce Layer
• Hadoop Storage
• Data sources: Web Portal, Social Networks
Hadoop Components (hadoop.apache.org)
• HDFS: Hadoop Distributed File System
• MapReduce: Framework for writing scalable data applications
• Pig: Procedural language that abstracts lower-level MapReduce
• ZooKeeper: Highly reliable distributed coordination
• Hive: Data warehouse infrastructure built on top of Hadoop
• HBase: Database for random, real-time read/write access
• Oozie: Workflow/coordination to manage jobs
• Mahout: Scalable machine learning libraries
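To make the MapReduce entry above more concrete, here is a minimal sketch of the programming model as a word count. It does not use Hadoop or any Hadoop API: it mimics the map, shuffle, and reduce steps in a single Python process, and all function names are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    emitted = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


sample = ["big data is here", "big data is real"]
counts = word_count(sample)
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes and the framework performs the shuffle across the network; the division of work between the three steps is the same.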
What can Hadoop do for you?

Financial Services
• Knowing customers better
• Risk analysis and management
• Fraud detection and security analytics

Web & e-Tailing
• Web usage, click stream behavior
• Market & customer segmentation
• Ad customer targeting
• On-line fraud detection

Telecommunications
• Customer churn prevention
• Price optimization and marketing
• Network analysis and optimization
• Customer experience management

Government
• Fraud detection
• Compliance and regulatory analytics

Retail
• Market and consumer segmentation
• Merchandising and cross-selling
• Promotion and campaign analysis

Healthcare
• Patient care quality
• Drug development

Data Source: Cloudera
Hadoop Use Cases

• LinkedIn – “People You May Know” and other facts
• Yahoo! – Hadoop to support ad systems and web search
• Visa – Credit card fraud detection and analysis
• T-Mobile – Churn analysis, user experience
• Amazon, Baidu, AOL, eBay, Facebook, Twitter, …

Data Source: Cloudera
Hadoop Cluster HW selection
What’s the HW configuration for Hadoop clusters? It depends; workloads matter.

CPU Intensive
• Machine learning
• Natural language processing
• Complex data mining
• Feature extraction

I/O Intensive
• Data importing and exporting
• Indexing
• Searching
• Grouping
• Decoding/decompressing

Data Storage
• Capacity
• Number of data mirrors (replicas)

TCO
• Rack space
• Power consumption
• Different workloads

General Configuration
• 2 Quad Core CPUs
• 16-96GB Memory
• 2 x GE (Gigabit Ethernet)
• 1TB-2TB Disk x n
• 1U/2U Rack mount
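The storage factors above (capacity, number of mirrors, disk size and count per node) combine into a simple sizing estimate. The following is a rough sketch under stated assumptions, not a sizing method from the deck: the replication factor of 3 is the HDFS default, while the 12 disks per node and the 25% reserve for temporary data are hypothetical values picked only for illustration (the slide says "Disk x n" without giving n).

```python
import math


def data_nodes_needed(
    dataset_tb: float,
    replication: int = 3,               # HDFS default replication factor
    disks_per_node: int = 12,           # hypothetical: the slide only says "x n"
    disk_tb: float = 2.0,               # within the 1TB-2TB range on the slide
    temp_space_fraction: float = 0.25,  # hypothetical reserve for temporary data
) -> int:
    """Estimate how many data nodes are needed to hold `dataset_tb` of data."""
    # Every block is stored `replication` times, so raw demand is a multiple.
    raw_tb = dataset_tb * replication
    # Only part of each node's raw disk space is available for stored blocks.
    usable_per_node_tb = disks_per_node * disk_tb * (1.0 - temp_space_fraction)
    return math.ceil(raw_tb / usable_per_node_tb)


nodes_for_100_tb = data_nodes_needed(100)
print(f"Data nodes for 100 TB of data: {nodes_for_100_tb}")
```

Changing the replication factor or the disk count shifts the result roughly in proportion, which is one reason the answer to the configuration question is "it depends".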
Proven at Scale with Worldwide Support
Production-scale testing of Apache Trunk & hosted environment for customer POCs

• Industry’s largest Hadoop support team
    - Industry’s most accomplished Hadoop talent (from Yahoo!, LinkedIn, Talend, etc.)
• Tested at scale on the Greenplum Analytics Workbench
    - 1,000-node, 24-petabyte cluster
    - Multi-million dollar investment by EMC and partners
    - Reduced risk for EMC customers
    - Certification of partner products

Bringing Rapid Innovation to Hadoop
Supermicro Server Functions in the Cluster
Supermicro Data Nodes: 2U Storage Server
Supermicro Infrastructure Nodes: 2U Twin2 Server

• 1,000+ Physical Supermicro Server Nodes (10k virtual nodes)
• 12,000 Processor Cores
• 24 Petabytes of Storage Capacity (6Gbps SATA)
• 48 Terabytes RAM
• 56 Gbps InfiniBand Connectivity
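Dividing the cluster totals above by the node count gives a feel for the average per-node configuration. This is a back-of-the-envelope sketch only: it assumes exactly 1,000 nodes (the slide says "1,000+"), decimal unit prefixes, and an even spread of resources, although data nodes and infrastructure nodes presumably differ.

```python
NODES = 1_000            # slide says "1,000+"; exactly 1,000 is assumed here
TOTAL_CORES = 12_000
TOTAL_STORAGE_PB = 24
TOTAL_RAM_TB = 48

# Decimal prefixes assumed: 1 PB = 1,000 TB and 1 TB = 1,000 GB
cores_per_node = TOTAL_CORES / NODES
storage_tb_per_node = TOTAL_STORAGE_PB * 1_000 / NODES
ram_gb_per_node = TOTAL_RAM_TB * 1_000 / NODES

print(f"Average cores per node:   {cores_per_node:g}")
print(f"Average storage per node: {storage_tb_per_node:g} TB")
print(f"Average RAM per node:     {ram_gb_per_node:g} GB")
```

The averages come out to 12 cores, 24 TB of storage, and 48 GB of RAM per node. That storage figure would be consistent with, for example, twelve 2 TB drives per node, though the slides do not state the drive count.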
Supermicro Multi-Node Server Solutions




                Switch Data Center, Las Vegas, NV
Initial Benchmark Data

[Chart: initial benchmark results; axis in minutes]

…Results before fine-tuning.
World record performance results expected to be announced before 2013.
Other testing programs – Supermicro & Intel
              CPU Benchmark
Supermicro Advantages
Why Supermicro…

Building Blocks for Different Workloads & Requirements
  - Meet any Hadoop workload by model: I/O, CPU, Disks, Density
  - Customize by specific workload requirement

High Efficiency, High Quality
  - Green IT
  - High Efficiency Power
  - High Quality for highest system availability and best utilization

Proven Solutions
  - EMC Greenplum proven solutions
  - 100% Apache Hadoop Compatible
  - Benchmark and testing programs with partners

TCO
  - Solutions for Cost-Effective Hadoop Clusters
  - Best choice of Hadoop Hardware platforms
Turnkey Hadoop: Supermicro Complete Rack Solutions

• One Stop Shop for Hardware, End to End Total Solutions
• Speed Up Deployment With Ready to Run Rack Systems
• Single Source, Consistent Build Quality and Delivery Time
• Multi-Vendor Compatibility Test, Zero Compatibility Issues
• Premium Service With Competitive Pricing
• Shipped Directly From US, NL, TW
Broad Product Portfolios and Building Blocks




    Best platform for your Hadoop cluster
  Q&A
Thank You

101 ab 1415-1445
