June 14, 2012

Optimizing MapReduce Job Performance
Todd Lipcon [@tlipcon]
Introductions

    •  Software Engineer at Cloudera since 2009
    •  Committer and PMC member on HDFS,
       MapReduce, and HBase
    •  Spend lots of time looking at full stack
       performance

    •  This talk is to help you develop faster jobs
      –  If you want to hear about how we made Hadoop
         faster, see my Hadoop World 2011 talk on
         cloudera.com

Aspects of Performance

    •  Algorithmic performance
      –  big-O, join strategies, data structures,
         asymptotes
    •  Physical performance
      –  Hardware (disks, CPUs, etc)
    •  Implementation performance
      –  Efficiency of code, avoiding extra work
      –  Make good use of available physical perf


Performance fundamentals

    •  You can’t tune what you don’t
       understand
      –  MR’s strength as a framework is its black-box
         nature
      –  To get optimal performance, you have to
         understand the internals

    •  This presentation: understanding the
       black box

Performance fundamentals (2)

    •  You can’t improve what you can’t
       measure
      –  Ganglia/Cacti/Cloudera Manager/etc a must
      –  Top 4 metrics: CPU, Memory, Disk, Network
      –  MR job metrics: slot-seconds, CPU-seconds,
         task wall-clocks, and I/O


    •  Before you start: run jobs, gather data


Graphing bottlenecks
     [Graphs of cluster CPU, memory, disk, and network utilization, annotated:]
       •  Most jobs are not CPU-bound – but this job might be CPU-bound in the map phase
       •  Plenty of free RAM – perhaps we can make better use of it?
       •  Fairly flat-topped network graph – a bottleneck?
Performance tuning cycle


     Run job  →  Identify bottleneck  →  Address bottleneck  →  (repeat)
                 -  Graphs                 -  Tune configs
                 -  Job counters           -  Improve code
                 -  Job logs               -  Rethink algos
                 -  Profiler results

                  In order to understand these metrics and make
                  changes, you need to understand MR internals.




MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Map-side sort/spill overview
   •  Goal: when complete, the map task outputs one sorted file
   •  What happens when you call OutputCollector.collect(K,V)?
       1.  An in-memory buffer (MapOutputBuffer) holds the serialized,
           unsorted key-value pairs.
       2.  When the output buffer fills up, its contents are sorted,
           partitioned, and spilled to disk as an IFile.
       3.  When the map task finishes, all spilled IFiles are merged
           into a single IFile per task (the map-side merge).


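  For orientation, here is a minimal old-API (org.apache.hadoop.mapred) mapper; each
  output.collect() call below is what feeds the sort/spill machinery described above. This
  is a generic sketch (the WordLengthMapper name is made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordLengthMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      // Reuse Writable instances to avoid per-record allocation.
      private final LongWritable length = new LongWritable();

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        length.set(line.getLength());
        // Each collect() serializes the pair into the in-memory MapOutputBuffer.
        output.collect(line, length);
      }
    }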
Zooming further: MapOutputBuffer (Hadoop 1.0)

  •  io.sort.mb of buffer space, split into two regions:
      –  Metadata buffers (io.sort.record.percent * io.sort.mb),
         16 bytes per record in total:
           •  kvoffsets: one indirect-sort index per record (4 bytes/rec)
           •  kvindices: (Partition, KOff, VOff) per record (12 bytes/rec)
      –  kvbuffer ((1 - io.sort.record.percent) * io.sort.mb):
         the raw, serialized (Key, Val) pairs (R bytes/rec)
MapOutputBuffer spill behavior

 •  Memory is limited: must spill
     –  If either the kvbuffer or the metadata
        buffers fills up, “spill” to disk
     –  In fact, we spill (in another thread) before
        the buffers are full: configure io.sort.spill.percent
 •  Performance impact
     –  If we spill more than one time, we must re-
        read and re-write all data: 3x the IO!
     –  #1 goal for map task optimization: spill once!

Spill counters on map tasks

 •  Ratio of Spilled Records to Map Output
    Records
     –  if they are unequal, you are doing more than
        one spill
 •  FILE: Number of bytes read/written
     –  get a sense of I/O amplification due to spilling
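  A sketch of checking these counters from the driver with the old API. The counter group
  and names are the Hadoop 1.0 internal strings and are an assumption (check your version);
  in practice the per-task view on the JobTracker web UI is often more useful:

    // After the job finishes, inspect the job-level counters.
    RunningJob job = JobClient.runJob(conf);
    Counters counters = job.getCounters();
    long spilled = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "SPILLED_RECORDS").getCounter();
    long mapOutput = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getCounter();
    if (spilled > mapOutput) {
      // Note: job-level SPILLED_RECORDS also includes reduce-side spills.
      System.out.println("Spilling more than once: " + spilled + " spilled vs "
          + mapOutput + " map output records");
    }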




Spill logs on map tasks
  (“record full = true” indicates that the metadata buffers
   filled up before the data buffer)

  2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
  2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
  2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
  2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
  2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
  2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
  2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
  2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
  2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
  2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2

  (3 spills total! Maybe we can do better?)


Tuning to reduce spills

 •  Parameters:
     –  io.sort.mb: total buffer space
     –  io.sort.record.percent: proportion between
        metadata buffers and key/value data
     –  io.sort.spill.percent: threshold at which
        spill is triggered
     –  Total map output generated: can you use
        more compact serialization?
 •  Optimal settings depend on your data and
    available RAM!
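  A hedged sketch of setting these three knobs on an old-API job; the numbers are
  placeholders, and the right values come from your own record sizes and heap (MyJob is a
  stand-in driver class):

    JobConf conf = new JobConf(MyJob.class);
    conf.setInt("io.sort.mb", 256);                  // total map-side sort buffer; must fit in the task heap
    conf.setFloat("io.sort.record.percent", 0.15f);  // share of io.sort.mb reserved for per-record metadata
    conf.setFloat("io.sort.spill.percent", 0.80f);   // how full the buffer gets before the background spill starts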

Setting io.sort.record.percent

 •  Common mistake: metadata buffers fill up
    way before kvdata buffer
 •  Optimal setting:
     –  io.sort.record.percent = 16/(16 + R)
     –  R = average record size: divide “Map Output
        Bytes” counter by “Map Output Records” counter
 •  Default (0.05) is usually too low (optimal for
    ~300-byte records)
 •  Hadoop 2.0: this is no longer necessary!
     –  see MAPREDUCE-64 for gory details
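  A quick worked version of that formula, using illustrative counter values (the 300-byte
  record size here is only an example):

    // R = average map output record size, from the job counters.
    long mapOutputBytes = 402653100L;        // "Map Output Bytes" (example value)
    long mapOutputRecords = 1342177L;        // "Map Output Records" (example value)
    double r = (double) mapOutputBytes / mapOutputRecords;   // 300 bytes/record here
    double recordPercent = 16.0 / (16.0 + r);                // 16/(16+300) ~= 0.051, close to the 0.05 default
    conf.setFloat("io.sort.record.percent", (float) recordPercent);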

Tuning Example (terasort)

 •  Map input size = output size
     –  128MB block = 1,342,177 records, each 100
        bytes
     –  metadata: 16 * 1342177 = 20.9MB
 •  io.sort.mb
     –  128MB data + 20.9MB meta = 148.9MB
 •  io.sort.record.percent
     –  16/(16+100)=0.138
 •  io.sort.spill.percent = 1.0
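  The same worked example expressed as configuration (a sketch; remember io.sort.mb also
  has to fit inside the map task's Java heap):

    // TeraSort-style map output: 128MB of 100-byte records, aiming for exactly one spill.
    conf.setInt("io.sort.mb", 149);                  // 128MB data + ~21MB metadata, rounded up
    conf.setFloat("io.sort.record.percent", 0.138f); // 16 / (16 + 100)
    conf.setFloat("io.sort.spill.percent", 1.0f);    // spill only when the buffer is completely full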

More tips on spill tuning
 •  Biggest win is going from 2 spills to 1 spill
     –  3 spills is approximately the same speed as 2 spills
        (same IO amplification)
 •  Calculate if it’s even possible, given your heap
    size
     –  io.sort.mb has to fit within your Java heap (plus
        whatever RAM your Mapper needs, plus ~30% for
        overhead)
 •  Only bother if this is the bottleneck!
     –  Look at map task logs: if the merge step at the end is
        taking a fraction of a second, not worth it!
     –  Typically most impact on jobs with big shuffle (sort/
        dedup)


MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Reducer fetch tuning

 •  Reducers fetch map output via HTTP
 •  Tuning parameters:
     –  Server side: tasktracker.http.threads
     –  Client side:
      mapred.reduce.parallel.copies
 •  Turns out this is not so interesting
     –  follow the best practices from Hadoop: The
        Definitive Guide


Improving fetch bottlenecks

 •  Reduce intermediate data
     –  Implement a Combiner: less data transfers faster
     –  Enable intermediate compression: Snappy is
        easy to enable; trades off some CPU for less IO/
        network
 •  Double-check for network issues
     –  Frame errors, NICs auto-negotiated to 100mbit,
        etc: one or two slow hosts can bottleneck a job
     –  Tell-tale sign: all maps are done, and reducers sit
        in fetch stage for many minutes (look at logs)
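  Both suggestions as old-API configuration; a sketch that assumes your Hadoop build has
  native Snappy support, and MyReducer is a stand-in for a combiner-safe reducer of your own:

    import org.apache.hadoop.io.compress.SnappyCodec;

    conf.setCombinerClass(MyReducer.class);               // combine map output before it crosses the network
    conf.setCompressMapOutput(true);                      // compress intermediate (map output) data
    conf.setMapOutputCompressorClass(SnappyCodec.class);  // trade a little CPU for less shuffle IO/network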


MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Reducer merge (Hadoop 1.0)

  •  Remote map outputs are fetched via HTTP. For each one: does it fit in RAM?
      –  Yes: fetch into RAM (tracked by the RAMManager)
      –  No: fetch straight to local disk as an IFile
  1.  Data accumulated in RAM is merged to disk files (RAM-to-disk merges)
  2.  If too many disk files accumulate, they are re-merged (disk-to-disk merges)
  3.  Segments from RAM and disk are merged, through a single merged iterator,
      into the reducer code (the Reduce Task)
Reducer merge triggers
 •  RAMManager
     –  Total buffer size:
       mapred.job.shuffle.input.buffer.percent
       (default 0.70, percentage of reducer heapsize)
 •  Mem-to-disk merge triggers:
     –  RAMManager is
        mapred.job.shuffle.merge.percent % full
        (default 0.66)
     –  Or mapred.inmem.merge.threshold segments
        accumulated (default 1000)
 •  Disk-to-disk merge
     –  io.sort.factor on-disk segments pile up (fairly rare)
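  The same triggers written out as configuration, at their default values (shown only for
  orientation; these are the property names quoted above):

    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // share of reducer heap for fetched map output
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);        // mem-to-disk merge when the RAMManager is this full
    conf.setInt("mapred.inmem.merge.threshold", 1000);               // ...or after this many in-memory segments
    conf.setInt("io.sort.factor", 10);                               // merge factor; also bounds on-disk segment pile-up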



Final merge phase

 •  MR assumes that reducer code needs the
    full heap’s worth of RAM
     –  Spills all in-RAM segments before running
        user code, to free memory
 •  This isn’t true if your reducer is simple
     –  e.g. sort, simple aggregation, etc. with no state
 •  Configure
     mapred.job.reduce.input.buffer.percent to
     0.70 to keep reducer input data in RAM


Reducer merge counters

 •  FILE: number of bytes read/written
     –  Ideally close to 0 if you can fit in RAM
 •  Spilled records:
     –  Ideally close to 0. If significantly more than
        reduce input records, the job is hitting a
        multi-pass merge, which is quite expensive




Tuning reducer merge

 •  Configure
     mapred.job.reduce.input.buffer.percent
    to 0.70 to keep data in RAM if you don’t
    have any state in reducer
 •  Experiment with setting
    mapred.inmem.merge.threshold to 0 to
    avoid spills
 •  Hadoop 2.0: experiment with
     mapreduce.reduce.merge.memtomem.enabled
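  Those suggestions as a config sketch for a stateless reducer; the memtomem property is
  the Hadoop 2.0 name mentioned above:

    conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.70f);  // keep reduce input segments in RAM while reducing
    conf.setInt("mapred.inmem.merge.threshold", 0);                  // disable the segment-count spill trigger
    // Hadoop 2.0 (mapreduce.* property names):
    // conf.setBoolean("mapreduce.reduce.merge.memtomem.enabled", true);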


Rules of thumb for # maps/reduces

 •  Aim for map tasks running 1-3 minutes each
     –  Too small: wasted startup overhead, less efficient
        shuffle
     –  Too big: not enough parallelism, harder to share
        cluster
 •  Reduce task count:
     –  Large reduce phase: base on cluster slot count (a
        few GB per reducer)
     –  Small reduce phase: fewer reducers will result in
        more efficient shuffle phase
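  One way to turn the large-shuffle rule of thumb into code; a sketch in which
  estimatedShuffleBytes and clusterReduceSlots are values you would supply yourself
  (e.g. from a previous run's counters and your cluster configuration):

    // Aim for a few GB of shuffle data per reducer, capped near the cluster's reduce slot count.
    long bytesPerReducer = 2L * 1024 * 1024 * 1024;                    // target ~2GB per reducer
    long wanted = Math.max(1L, estimatedShuffleBytes / bytesPerReducer);
    int reducers = (int) Math.min((long) clusterReduceSlots, wanted);  // don't exceed the cluster's reduce slots
    conf.setNumReduceTasks(reducers);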


MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Tuning Java code for MR
 •  Follow general Java best practices
     –  String parsing and formatting is slow
     –  Guard debug statements with isDebugEnabled()
     –  StringBuffer.append vs repeated string concatenation
 •  For CPU-intensive jobs, make a test harness/
    benchmark outside MR
     –  Then use your favorite profiler
 •  Check for GC overhead: -XX:+PrintGCDetails
    -verbose:gc
 •  Easiest profiler: add -Xprof to
    mapred.child.java.opts, then look at the
    stdout task log
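  Two of those tips in code form (LOG is the usual commons-logging logger; the heap size
  in the child opts is just an example):

    // Guarded debug logging: the message string is only built when debug logging is on.
    if (LOG.isDebugEnabled()) {
      LOG.debug("processing key=" + key + ", value=" + value);
    }

    // Cheap profiling: pass -Xprof to the task JVMs, then read each task's stdout log.
    conf.set("mapred.child.java.opts", "-Xmx1024m -Xprof");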

Other tips for fast MR code

 •  Use the most compact and efficient data
    formats
     –  LongWritable is way faster than parsing text
     –  BytesWritable instead of Text for SHA1
        hashes/dedup
     –  Avro/Thrift/Protobuf for complex data, not JSON!
 •  Write a Combiner and RawComparator
 •  Enable intermediate compression (Snappy/
    LZO)
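  A small illustration of the compact-format point: emit a SHA-1 hash as 20 raw bytes in a
  BytesWritable instead of a 40-character hex Text. A sketch: recordBytes, output, and ONE
  come from the surrounding mapper, and MessageDigest.getInstance throws a checked
  exception you would handle:

    import java.security.MessageDigest;
    import org.apache.hadoop.io.BytesWritable;

    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    byte[] digest = sha1.digest(recordBytes);          // 20 raw bytes, vs. 40 hex characters as Text
    output.collect(new BytesWritable(digest), ONE);    // smaller to serialize, cheaper to sort/compare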

Summary

 •  Understanding MR internals helps understand
    configurations and tuning
 •  Focus your tuning effort on things that are
    bottlenecks, following a scientific approach
 •  Don’t forget that you can always just add nodes!
     –  Spending 1 month of engineer time to make your job
        20% faster is not worth it if you have a 10 node
        cluster!
 •  We’re working on simplifying this where we can,
    but deep understanding will always allow more
    efficient jobs


Questions?

    @tlipcon
todd@cloudera.com
