The Next Generation of
 Hadoop Map-Reduce
        Sharad Agarwal
     sharadag@yahoo-inc.com
        sharad@apache.org
About Me

   Hadoop Committer and PMC member
   Architect at Yahoo!
Hadoop Map-Reduce Today
   JobTracker
    - Manages cluster resources
      and job scheduling
   TaskTracker
    - Per-node agent
    - Manages tasks
Current Limitations
   Scalability
    - Maximum Cluster size – 4,000 nodes
    - Maximum concurrent tasks – 40,000
    - Coarse synchronization in JobTracker
   Single point of failure
    - Failure kills all queued and running jobs
    - Jobs need to be re-submitted by users
   Restart is very tricky due to complex state
   Hard partition of resources into map and reduce
    slots
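
The slot-partition problem above can be illustrated with a toy model (not Hadoop code; the slot counts are made-up): a node statically split into map and reduce slots cannot repurpose idle reduce slots during a map-heavy phase, even though the machine has free capacity.

```python
# Illustrative sketch (not Hadoop code): fixed map/reduce slots waste capacity.
# During a map-heavy phase, reduce slots sit idle despite free machine resources.

MAP_SLOTS, REDUCE_SLOTS = 8, 4   # hypothetical per-node configuration

def usable_slots(pending_maps, pending_reduces):
    """Slots that can actually run work under the fixed partition."""
    return (min(pending_maps, MAP_SLOTS), min(pending_reduces, REDUCE_SLOTS))

# Map-heavy phase: 20 map tasks queued, no reduces ready yet.
maps_running, reduces_running = usable_slots(20, 0)
idle = (MAP_SLOTS - maps_running) + (REDUCE_SLOTS - reduces_running)
print(maps_running, reduces_running, idle)   # 8 0 4 — a third of the node idle
```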
Current Limitations

   Lacks support for alternate paradigms
    - Iterative applications implemented using Map-Reduce
      are 10x slower.
    - Example: K-Means, PageRank
   Lack of wire-compatible protocols
     - Client and cluster must be of the same version
    - Applications and workflows cannot migrate to
      different clusters
Next Generation Map-Reduce Requirements
   Reliability
   Availability
   Scalability - Clusters of 6,000 machines
    - Each machine with 16 cores, 48G RAM, 24TB disks
    - 100,000 concurrent tasks
    - 10,000 concurrent jobs
   Wire Compatibility
   Agility & Evolution – Ability for customers to
    control upgrades to the grid software stack.
Next Generation Map-Reduce – Design Centre

   Split up the two major functions of JobTracker
    - Cluster resource management
    - Application life-cycle management
   Map-Reduce becomes user-land library
Architecture
   Resource Manager
    - Global resource scheduler
    - Hierarchical queues
   Node Manager
    - Per-machine agent
     - Manages the life-cycle of containers
    - Container resource monitoring
   Application Master
    - Per-application
    - Manages application scheduling and task execution
    - E.g. Map-Reduce Application Master
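
The interaction between these components can be sketched as a small conceptual model — this is not the actual YARN API, just a toy (the class names and container counts are invented for illustration): the Resource Manager grants containers from a global pool, and a per-application Application Master requests them and drives task execution.

```python
# Conceptual model only (not YARN's real API): RM grants containers,
# a per-application AM requests them and runs one task per container.

class ResourceManager:
    """Global scheduler: tracks free cluster capacity and grants containers."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, wanted):
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

class MapReduceAppMaster:
    """Per-application: asks the RM for containers, tracks task completion."""
    def __init__(self, num_tasks):
        self.pending = num_tasks
        self.done = 0

    def run(self, rm):
        while self.pending:
            got = rm.allocate(self.pending)
            if got == 0:
                break                # would wait for resources (omitted here)
            self.pending -= got
            self.done += got         # pretend each container ran one task
            rm.free += got           # containers released on task completion

rm = ResourceManager(total_containers=10)
am = MapReduceAppMaster(num_tasks=25)
am.run(rm)
print(am.done)   # 25 — all tasks eventually run through granted containers
```

Note how application life-cycle state lives entirely in the Application Master; the Resource Manager only tracks capacity — which is the split the previous slide describes.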
Improvements vis-à-vis current Map-Reduce
     Scalability
      - Application life-cycle management is very
        expensive
      - Partition resource management and application
        life-cycle management
      - Application management is distributed
      - Hardware trends – clusters of 4,000 machines today
          • 6,000 machines in 2012 > 12,000 machines in 2009
          • <8 cores, 16G, 4TB> vs. <16+ cores, 48/96G, 24TB>
Improvements vis-à-vis current Map-Reduce
     Availability
      - Application Master
          • Optional failover via application-specific checkpoint
          • Map-Reduce applications pick up where they left off
      - Resource Manager
          • No single point of failure - failover via ZooKeeper
          • Application Masters are restarted automatically
Improvements vis-à-vis current Map-Reduce
     Wire Compatibility
      - Protocols are wire-compatible
      - Old clients can talk to new servers
      - Rolling upgrades
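
The idea behind wire compatibility can be sketched as follows — a hedged toy, not Hadoop's actual RPC layer (the message fields are invented): a server that defaults missing fields and ignores unknown ones can serve both old and new clients at once, which is what makes rolling upgrades possible.

```python
# Sketch of version-tolerant message handling (not Hadoop's real protocol):
# default missing fields, ignore unknown ones, and old/new clients coexist.

def handle_submit(request: dict) -> dict:
    # "priority" is a hypothetical new optional field: old clients omit it.
    priority = request.get("priority", "NORMAL")
    # Extra fields from a *newer* client (e.g. "tags") are simply ignored.
    return {"job": request["job"], "priority": priority}

old_client = {"job": "wordcount"}                           # v1 message
new_client = {"job": "wordcount", "priority": "HIGH",
              "tags": ["adhoc"]}                            # v2 message

print(handle_submit(old_client))   # {'job': 'wordcount', 'priority': 'NORMAL'}
print(handle_submit(new_client))   # {'job': 'wordcount', 'priority': 'HIGH'}
```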
Improvements vis-à-vis current Map-Reduce
     Agility / Evolution
      - Map-Reduce now becomes a user-land library
      - Multiple versions of Map-Reduce can run in the
        same cluster (à la Apache Pig)
          • Faster deployment cycles for improvements
      - Customers upgrade Map-Reduce versions on their
        schedule
Improvements vis-à-vis current Map-Reduce
     Utilization
      - Generic resource model
          •   Memory
          •   CPU
          •   Disk bandwidth
          •   Network bandwidth
      - Remove fixed partition of map and reduce slots
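
Under a generic resource model, a container request is a vector of resources, and any node with enough headroom in every dimension can run it — no map or reduce slots. A minimal sketch (the node capacity and request sizes are hypothetical):

```python
# Illustrative sketch of a generic resource model: requests are resource
# vectors, and placement only checks per-dimension headroom — no slot types.

NODE = {"memory_gb": 48, "cores": 16}          # hypothetical node capacity

def fits(request, used, capacity=NODE):
    """A request fits if every resource dimension stays within capacity."""
    return all(used[r] + request[r] <= capacity[r] for r in request)

used  = {"memory_gb": 40, "cores": 8}          # already allocated on the node
small = {"memory_gb": 4,  "cores": 2}
big   = {"memory_gb": 16, "cores": 4}

print(fits(small, used))   # True  — fits in the remaining headroom
print(fits(big, used))     # False — would exceed memory
```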
Improvements vis-à-vis current Map-Reduce
     Support for programming paradigms other
      than Map-Reduce
      - MPI
      - Master-Worker
      - Machine Learning
      - Iterative processing
      - Enabled by allowing use of paradigm-specific
        Application Master
      - Run all on the same Hadoop cluster
Summary
   The next generation of Map-Reduce takes
    Hadoop to the next level
    -   Scale-out even further
    -   High availability
    -   Cluster Utilization
    -   Support for paradigms other than Map-Reduce
Questions?

YARN Hadoop Summit Bangalore 2011
