Copyright©2014 NTT corp. All Rights Reserved. 
Apache Hadoop - What’s next? - @ db tech showcase 2014 
Tsuyoshi Ozawa 
ozawa.tsuyoshi@lab.ntt.co.jp
•Tsuyoshi Ozawa 
•Researcher & Engineer @ NTT / Twitter: @oza_x86_64 
•A Hadoop developer 
•Merged patches: 53 patches! 
•Author of “Hadoop 徹底入門 2nd Edition”, Chapter 22 (YARN) 
About me
Quiz!!
Does Hadoop have a SPoF? 
Quiz
Quiz 
All master nodes in Hadoop can run in highly available mode
Is Hadoop only for MapReduce? 
Quiz
Quiz 
Hadoop is not only for MapReduce but also for Spark/Tez/Storm and so on…
•Current Status of Hadoop - New features since Hadoop 2 - 
•HDFS 
•No SPoF with Namenode HA + JournalNode 
•Scaling out the Namenode with Namenode Federation 
•YARN 
•Resource Management with YARN 
•No SPoF with ResourceManager HA 
•MapReduce 
•No SPoF with ApplicationMaster restart 
•What’s next? - Coming features in the 2.6 release - 
•HDFS 
•Heterogeneous Storage 
•Memory as Storage Tier 
•YARN 
•Label-based scheduling 
•RM HA Phase 2 
Agenda
HDFS IN HADOOP 2
•Once upon a time, the NameNode was a SPoF 
•In Hadoop 2, the NameNode has a QuorumJournalManager (a configuration sketch follows below) 
•Replication is done by a Paxos-based protocol 
See also: 
http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1/ 
NameNode with JournalNode 
[Figure: The NameNode’s QuorumJournalManager replicates edits to three JournalNodes, each backed by a local disk] 
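As a rough reference (not from the slides): the shared edits directory for QJM is declared in hdfs-site.xml along these lines, where the JournalNode hosts jn1-jn3, the nameservice name “mycluster”, and the local path are placeholder values. 
<property> 
  <!-- NameNode writes its edit log to the JournalNode quorum --> 
  <name>dfs.namenode.shared.edits.dir</name> 
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value> 
</property> 
<property> 
  <!-- Local disk where each JournalNode stores its copy of the edits --> 
  <name>dfs.journalnode.edits.dir</name> 
  <value>/var/hadoop/journal</value> 
</property> 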
•Once upon a time, the scalability of the NameNode was limited by its memory 
•In Hadoop 2, the NameNode has a Federation feature (a configuration sketch follows below) 
•Distributing metadata per namespace 
NameNode Federation 
Figures from: 
https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html
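As a rough reference (not from the slides): federated NameNodes are declared in hdfs-site.xml roughly as below, where the nameservice IDs ns1/ns2 and the hostnames are placeholder values. 
<property> 
  <!-- Two independent namespaces, each served by its own NameNode --> 
  <name>dfs.nameservices</name> 
  <value>ns1,ns2</value> 
</property> 
<property> 
  <name>dfs.namenode.rpc-address.ns1</name> 
  <value>nn1.example.com:8020</value> 
</property> 
<property> 
  <name>dfs.namenode.rpc-address.ns2</name> 
  <value>nn2.example.com:8020</value> 
</property> 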
RESOURCE MANAGEMENT IN HADOOP 2
YARN 
•Generic resource management framework 
•YARN = Yet Another Resource Negotiator 
•Proposed by Arun C Murthy in 2011 
•Container-level resource management 
•A container is a more generic unit of resource than slots 
•Separates the JobTracker’s role: 
•Job Scheduling/Resource Management/Isolation 
•Task Scheduling 
What’s YARN? 
[Figure: MRv1 architecture (a JobTracker with TaskTrackers holding map/reduce slots) vs. MRv2 and YARN architecture (the YARN ResourceManager with Impala/Spark/MRv2 masters and YARN NodeManagers holding generic containers)] 
•Running various processing frameworks on the same cluster 
•Batch processing with MapReduce 
•Interactive query with Impala 
•Interactive deep analytics (e.g. Machine Learning) with Spark 
Why YARN? (Use case) 
[Figure: MRv2/Tez, Impala, and Spark running side by side on YARN over HDFS, serving periodic long batch queries, interactive aggregation queries, and interactive machine-learning queries] 
•More effective resource management for multiple processing frameworks 
•It is difficult to use the cluster’s entire resources without thrashing 
•*Real* big data cannot be moved out of HDFS/S3 
Why YARN? (Technical reason) 
[Figure: Separate masters for MapReduce and Impala schedule onto the same slaves; each framework has its own scheduler, so Job1 and Job2 overlap on a slave and cause thrashing, while HDFS slaves hold the data] 
•Resources are managed by the JobTracker 
•Job-level Scheduling 
•Resource Management 
MRv1 Architecture 
[Figure: A Master for MapReduce schedules onto map/reduce slots on its MapReduce slaves, while a separate Master for Impala runs alongside; each scheduler only knows its own resource usage] 
•Idea 
•One global resource manager (ResourceManager) 
•A common resource pool for all frameworks (NodeManager and Container) 
•A scheduler for each framework (AppMaster) 
YARN Architecture 
[Figure: A client submits a job to the ResourceManager (1); the ResourceManager launches the application’s Master in a container on a NodeManager (2); the Master then launches its slaves in containers across the NodeManagers (3)] 
YARN and Mesos 
YARN 
•An AppMaster is launched for each job 
•More scalability 
•Higher latency 
•One container per request 
•One Master per job 
Mesos 
•An AppMaster is launched for each app (framework) 
•Less scalability 
•Lower latency 
•A bundle of containers per request 
•One Master per framework 
[Figure: YARN’s ResourceManager launches Master1 and Master2 (one per job) onto NodeManagers, while Mesos’ ResourceMaster hosts Master1 and Master2 (one per framework) over its slaves] 
Policy/Philosophy is different
•MapReduce 
•Of course, it works 
•DAG-style processing framework 
•Spark on YARN 
•Hive on Tez on YARN 
•Interactive Query 
•Impala on YARN (via llama) 
•Users 
•Yahoo! 
•Twitter 
•LinkedIn 
•Hadoop 2 @ Twitter http://www.slideshare.net/Hadoop_Summit/t-235p210-cvijayarenuv2 
YARN Eco-system
YARN COMPONENTS
•Master Node of YARN 
•Role 
•Accepting requests from 
1.Application Masters for allocating containers 
2.Clients for submitting jobs 
•Managing Cluster Resources 
•Job-level Scheduling (a scheduler configuration sketch follows the figure below) 
•Container Management 
•Launching the Application-level Master (e.g. for MapReduce) 
ResourceManager (RM) 
[Figure: A client submits a job to the ResourceManager (1), which launches the job’s Master in a container on a NodeManager (2); the Master sends container allocation requests to the RM (3), which forwards the allocations to the NodeManager (4)] 
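As a rough reference (not from the slides): the job-level scheduler the RM uses is pluggable and set in yarn-site.xml; the CapacityScheduler shown here is the usual Apache Hadoop 2 default, and the hostname is a placeholder. 
<property> 
  <!-- Pluggable job-level scheduler used by the ResourceManager --> 
  <name>yarn.resourcemanager.scheduler.class</name> 
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> 
</property> 
<property> 
  <name>yarn.resourcemanager.hostname</name> 
  <value>rm.example.com</value> 
</property> 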
•Slave Node of YARN 
•Role 
•Accepting requests from RM 
•Monitoring the local machine and reporting it to the RM 
•Health Check (a health-check configuration sketch follows the figure below) 
•Managing local resources 
NodeManager (NM) 
[Figure: Clients or a Master request containers from the ResourceManager (1); the RM allocates containers on the NodeManager (2); the NM launches the containers (3) and returns container information such as host and port (4), while periodically reporting its health to the RM via heartbeat] 
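As a rough reference (not from the slides): the health check can be backed by an admin-supplied script configured in yarn-site.xml; the script path and interval below are example values. 
<property> 
  <!-- Script the NM runs periodically; if it reports ERROR, the node is marked unhealthy --> 
  <name>yarn.nodemanager.health-checker.script.path</name> 
  <value>/etc/hadoop/health-check.sh</value> 
</property> 
<property> 
  <!-- Run the health-check script every 10 minutes --> 
  <name>yarn.nodemanager.health-checker.interval-ms</name> 
  <value>600000</value> 
</property> 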
•Master of Applications (e.g. the Master of MapReduce, Tez, Spark, etc.) 
•Runs in Containers 
•Roles 
•Getting containers from the ResourceManager 
•Application-level Scheduling 
•How many Map tasks run, and where? 
•When will Reduce tasks be launched? (a related MapReduce setting is sketched below) 
ApplicationMaster (AM) 
[Figure: The Master of MapReduce, running in a container on a NodeManager, requests containers from the ResourceManager (1) and receives the list of allocated containers (2)] 
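As one concrete example of application-level scheduling (not from the slides): the MapReduce AM decides when to start reducers based on map progress via a mapred-site.xml setting; 0.80 below is just an illustrative value. 
<property> 
  <!-- Launch reduce tasks only after 80% of the map tasks have completed --> 
  <name>mapreduce.job.reduce.slowstart.completedmaps</name> 
  <value>0.80</value> 
</property> 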
RESOURCE MANAGER HA
•What happens when the ResourceManager fails? 
•New jobs cannot be submitted 
•NOTE: 
•Already-launched apps continue to run 
•AppMaster recovery is done by each framework 
•e.g. MRv2 
ResourceManager High Availability 
[Figure: With the ResourceManager down, the client cannot submit new jobs, but each Master and its containers on the NodeManagers continue to run] 
•Approach 
•Storing RM state in ZooKeeper 
•Automatic failover by the EmbeddedElector 
•Manual failover via RMHAUtils 
•NodeManagers use a local RMProxy to reach whichever RM is active (a configuration sketch follows the figure below) 
ResourceManager High Availability 
[Figure: The active ResourceManager stores all of its state in the RMStateStore on ZooKeeper (1); when the active RM fails (2), the EmbeddedElector detects the failure (3), the standby RM fails over and becomes active (4), and it loads the state back from the RMStateStore (5)] 
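As a rough reference (not from the slides): the corresponding yarn-site.xml settings look roughly like this; the RM IDs rm1/rm2, the hostnames, and the ZooKeeper quorum are placeholder values. 
<property> 
  <name>yarn.resourcemanager.ha.enabled</name> 
  <value>true</value> 
</property> 
<property> 
  <name>yarn.resourcemanager.ha.rm-ids</name> 
  <value>rm1,rm2</value> 
</property> 
<property> 
  <name>yarn.resourcemanager.hostname.rm1</name> 
  <value>rm1.example.com</value> 
</property> 
<property> 
  <name>yarn.resourcemanager.hostname.rm2</name> 
  <value>rm2.example.com</value> 
</property> 
<property> 
  <!-- Persist RM state so the standby can take over --> 
  <name>yarn.resourcemanager.recovery.enabled</name> 
  <value>true</value> 
</property> 
<property> 
  <name>yarn.resourcemanager.store.class</name> 
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value> 
</property> 
<property> 
  <name>yarn.resourcemanager.zk-address</name> 
  <value>zk1:2181,zk2:2181,zk3:2181</value> 
</property> 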
CAPACITY PLANNING ON YARN
•Define resources with XML (etc/hadoop/yarn-site.xml) 
Resource definition on NodeManager 
[Figure: A NodeManager offering 8 CPU cores and 8 GB of memory] 
<property> 
<name>yarn.nodemanager.resource.cpu-vcores</name> 
<value>8</value> 
</property> 
<property> 
<name>yarn.nodemanager.resource.memory-mb</name> 
<value>8192</value> 
</property> 
This declares 8 CPU cores and 8 GB of memory as the container resources of this NodeManager.
Container allocation on ResourceManager 
•The RM accepts a container request and sends it to an NM, but the request can be rewritten 
•Small requests are rounded up to minimum-allocation-mb 
•Large requests are rounded down to maximum-allocation-mb 
<property> 
<name>yarn.scheduler.minimum-allocation-mb</name> 
<value>1024</value> 
</property> 
<property> 
<name>yarn.scheduler.maximum-allocation-mb</name> 
<value>8192</value> 
</property> 
[Figure: The client’s Master requests a 512 MB container from the ResourceManager, which rewrites the request to 1024 MB before passing it to a NodeManager]
•Define how much resource MapTasks or ReduceTasks use (a JVM heap-sizing note follows below) 
•MapReduce: etc/hadoop/mapred-site.xml 
Container allocation at framework side 
[Figure: A NodeManager offering 8 CPU cores and 8 GB of memory] 
<property> 
<name>mapreduce.map.memory.mb</name> 
<value>1024</value> 
</property> 
<property> 
<name>mapreduce.reduce.memory.mb</name> 
<value>4096</value> 
</property> 
[Figure: The Master asks the NodeManager for containers of 1024 MB memory and 1 CPU core for its map tasks, and a matching container is launched]
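Note (not from the slides): the container sizes above only cap the container; the task JVM heap is set separately and should fit inside the container. A rough mapred-site.xml sketch, with illustrative -Xmx values: 
<property> 
  <!-- JVM heap for map tasks; keep it below mapreduce.map.memory.mb (1024 MB above) --> 
  <name>mapreduce.map.java.opts</name> 
  <value>-Xmx800m</value> 
</property> 
<property> 
  <!-- JVM heap for reduce tasks; keep it below mapreduce.reduce.memory.mb (4096 MB above) --> 
  <name>mapreduce.reduce.java.opts</name> 
  <value>-Xmx3276m</value> 
</property> 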
WHAT’S NEXT? - HDFS -
•HDFS-2832, HDFS-5682 
•Handling various storage types in HDFS 
•SSD, memory, disk, and so on. 
•Setting quotas per storage type, e.g.: 
•Setting the SSD quota on /home/user1 to 10 TB 
•Setting the SSD quota on /home/user2 to 10 TB 
•Not configuring any SSD quota on the remaining user directories (i.e. leaving them at the defaults) 
Heterogeneous Storages for HDFS Phase 2 
<configuration> 
... 
<property> 
<name>dfs.datanode.data.dir</name> 
<value>[DISK]/mnt/sdc2/,[DISK]/mnt/sdd2,[SSD]/mnt/sde2</value> 
</property> 
... 
</configuration>
•HDFS-5851 
•Introducing an explicit “Cache” layer in HDFS 
•Discardable Distributed Memory (DDM) 
•Applications can accelerate their processing by using memory (a configuration sketch follows below) 
•Discardable Memory and Materialized Queries is one example 
•Differences between RDD and DDM: 
•Multi-tenancy aware 
•Whether data is handled in the processing layer or in the storage layer 
Support memory as a storage medium
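As a rough sketch (not from the slides, and the exact support in 2.6 may differ): a memory tier surfaces on the DataNode as a RAM_DISK storage type alongside normal disks; the paths and the lock limit below are placeholder values. 
<property> 
  <!-- Add a memory-backed directory alongside a normal disk --> 
  <name>dfs.datanode.data.dir</name> 
  <value>[DISK]/mnt/sdc2,[RAM_DISK]/mnt/ramdisk</value> 
</property> 
<property> 
  <!-- Upper bound (bytes) on memory the DataNode may lock for in-memory storage --> 
  <name>dfs.datanode.max.locked.memory</name> 
  <value>4294967296</value> 
</property> 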
•Archival storage 
•HDFS-6584 
•Transparent encryption 
•HDFS-6134 
And, more!
WHAT’S NEXT? - YARN -
•Non-stop YARN upgrades (YARN-666) 
•NodeManager, ResourceManager, Applications 
•Before 2.6.0 
•Restarting the RM -> the RM restarts all AMs -> all jobs restart 
•Restarting NMs -> the NMs are removed from the cluster -> their containers are restarted! 
•After 2.6.0 (a configuration sketch follows the figure below) 
•Restarting the RM -> AMs continue to run 
•Restarting an NM -> the NM restores its state from local data 
Support for rolling upgrades in YARN 
[Figure: The ResourceManager, the NodeManagers, and the application Masters with their containers all keep running across a rolling upgrade]
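As a rough reference (not from the slides): the work-preserving restart behavior in 2.6 is driven by yarn-site.xml settings along these lines; the recovery directory is a placeholder path. 
<property> 
  <!-- RM restart without killing running applications --> 
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name> 
  <value>true</value> 
</property> 
<property> 
  <!-- NM persists container state locally and reacquires containers after restart --> 
  <name>yarn.nodemanager.recovery.enabled</name> 
  <value>true</value> 
</property> 
<property> 
  <name>yarn.nodemanager.recovery.dir</name> 
  <value>/var/hadoop/yarn-nm-recovery</value> 
</property> 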
•Now we can run various subsystems on YARN 
•Interactive query engines: Spark, Impala, … 
•Batch processing engines: MapReduce, Tez, … 
•Problem 
•Interactive query engines allocate resources at the same time, which can delay the daily batch 
•Time-based reservation scheduling (a configuration sketch follows the figure below) 
•8:00am - 6:00pm: allocate resources to Impala 
•6:00pm - 0:00am: allocate resources to MapReduce 
YARN reservation-subsystem 
[Figure: Timeline with 8:00am-6:00pm reserved for the interactive query engine and 6:00pm-0:00am for batch processing for the next day]
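As a very rough sketch (property names are taken from the later reservation-system documentation and may differ in 2.6; the queue name “batch” is a placeholder): the reservation system is enabled in yarn-site.xml and individual queues are marked reservable in capacity-scheduler.xml. 
<!-- yarn-site.xml --> 
<property> 
  <name>yarn.resourcemanager.reservation-system.enable</name> 
  <value>true</value> 
</property> 
<!-- capacity-scheduler.xml --> 
<property> 
  <!-- Allow capacity in this queue to be reserved for specific time windows --> 
  <name>yarn.scheduler.capacity.root.batch.reservable</name> 
  <value>true</value> 
</property> 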
•YARN-796 
•Handling heterogeneous machines in one YARN cluster 
•GPU cluster 
•High-memory cluster 
•40 Gbps network cluster 
•Labeling nodes and scheduling based on labels (a configuration sketch follows the figure below) 
•Admins can add/remove labels via yarn rmadmin commands 
Support for admin-specified labels in YARN 
[Figure: A client submits a job that requests the GPU label; the ResourceManager schedules it onto the NodeManagers labeled GPU rather than those labeled 40G network]
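As a rough sketch based on how node labels later shipped (exact names may differ in 2.6; the HDFS path and label names are placeholders): labels are enabled in yarn-site.xml, and then registered and attached to nodes with yarn rmadmin subcommands such as -addToClusterNodeLabels and -replaceLabelsOnNode. 
<property> 
  <name>yarn.node-labels.enabled</name> 
  <value>true</value> 
</property> 
<property> 
  <!-- Where the RM persists the cluster's label-to-node mappings --> 
  <name>yarn.node-labels.fs-store.root-dir</name> 
  <value>hdfs://namenode:8020/yarn/node-labels</value> 
</property> 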
•Timeline service security 
•YARN-1935 
•Minimal support for running long-running services on YARN 
•YARN-896 
•Support for automatic, shared cache for YARN application artifacts 
•YARN-1492 
•And more! 
•Please check Wiki http://wiki.apache.org/hadoop/Roadmap 
And, more!
•Hadoop 2 is evolving rapidly 
•I would appreciate it if you can catch up via this presentation! 
•New components since v2 
•HDFS 
•Quorum Journal Manager 
•Namenode Federation 
•ResourceManager 
•NodeManager 
•Application Master 
•New features in 2.6: 
•Discardable memory store on HDFS, and so on 
•Rolling upgrades, labels for heterogeneous clusters on YARN, the reservation system, and so on… 
•Questions or feedback -> user@hadoop.apache.org 
•Issues -> https://issues.apache.org/jira/browse/{HDFS,YARN,HADOOP,MAPREDUCE} 
Summary
•YARN-666 
•https://www.youtube.com/watch?v=O4Q73e2ua9Y&feature=youtu.be 
•http://www.slideshare.net/Hadoop_Summit/t-145p230avavilapalli-mac
