June 14, 2012

Optimizing MapReduce Job Performance
Todd Lipcon [@tlipcon]
Introductions

    •  Software Engineer at Cloudera since 2009
    •  Committer and PMC member on HDFS,
       MapReduce, and HBase
    •  Spend lots of time looking at full stack
       performance

    •  This talk is to help you develop faster jobs
      –  If you want to hear about how we made Hadoop
         faster, see my Hadoop World 2011 talk on
         cloudera.com

Aspects of Performance

    •  Algorithmic performance
      –  big-O, join strategies, data structures,
         asymptotes
    •  Physical performance
      –  Hardware (disks, CPUs, etc)
    •  Implementation performance
      –  Efficiency of code, avoiding extra work
      –  Make good use of available physical perf


Performance fundamentals

    •  You can’t tune what you don’t
       understand
      –  MR’s strength as a framework is its black-box
         nature
      –  To get optimal performance, you have to
         understand the internals

    •  This presentation: understanding the
       black box

Performance fundamentals (2)

    •  You can’t improve what you can’t
       measure
      –  Ganglia/Cacti/Cloudera Manager/etc a must
      –  Top 4 metrics: CPU, Memory, Disk, Network
      –  MR job metrics: slot-seconds, CPU-seconds,
         task wall-clocks, and I/O


    •  Before you start: run jobs, gather data


Graphing bottlenecks
     [Graphs of cluster CPU, memory, disk, and network utilization, annotated:]
       •  Most jobs are not CPU-bound – but this job might be CPU-bound in the map phase
       •  Plenty of free RAM – perhaps we can make better use of it?
       •  Fairly flat-topped network graph – a bottleneck?
Performance tuning cycle


     Run job  →  Identify bottleneck  →  Address bottleneck  →  (repeat)
                 -  Graphs                 -  Tune configs
                 -  Job counters           -  Improve code
                 -  Job logs               -  Rethink algos
                 -  Profiler results

                  In order to understand these metrics and make
                  changes, you need to understand MR internals.




MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Map-side sort/spill overview
   •  Goal: when complete, the map task outputs one sorted file
   •  What happens when you call OutputCollector.collect(K,V)?
       1.  An in-memory buffer (MapOutputBuffer) holds the serialized,
           unsorted key-value pairs.
       2.  When the output buffer fills up, its contents are sorted,
           partitioned, and spilled to disk as an IFile.
       3.  When the map task finishes, all spilled IFiles are merged
           into a single IFile per task (the map-side merge).


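  For orientation, here is a minimal old-API (org.apache.hadoop.mapred) mapper; each
  output.collect() call below is what feeds the sort/spill machinery described above. This
  is a generic sketch (the WordLengthMapper name is made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordLengthMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      // Reuse Writable instances to avoid per-record allocation.
      private final LongWritable length = new LongWritable();

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        length.set(line.getLength());
        // Each collect() serializes the pair into the in-memory MapOutputBuffer.
        output.collect(line, length);
      }
    }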
Zooming further: MapOutputBuffer (Hadoop 1.0)

  •  io.sort.mb of buffer space, split into two regions:
      –  Metadata buffers (io.sort.record.percent * io.sort.mb),
         16 bytes per record in total:
           •  kvoffsets: one indirect-sort index per record (4 bytes/rec)
           •  kvindices: (Partition, KOff, VOff) per record (12 bytes/rec)
      –  kvbuffer ((1 - io.sort.record.percent) * io.sort.mb):
         the raw, serialized (Key, Val) pairs (R bytes/rec)
MapOutputBuffer spill behavior

 •  Memory is limited: must spill
     –  If either the kvbuffer or the metadata
        buffers fills up, “spill” to disk
     –  In fact, we spill (in another thread) before
        the buffers are full: configure io.sort.spill.percent
 •  Performance impact
     –  If we spill more than one time, we must re-
        read and re-write all data: 3x the IO!
     –  #1 goal for map task optimization: spill once!

Spill counters on map tasks

 •  Ratio of Spilled Records to Map Output
    Records
     –  if they are unequal, you are doing more than
        one spill
 •  FILE: Number of bytes read/written
     –  get a sense of I/O amplification due to spilling
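  A sketch of checking these counters from the driver with the old API. The counter group
  and names are the Hadoop 1.0 internal strings and are an assumption (check your version);
  in practice the per-task view on the JobTracker web UI is often more useful:

    // After the job finishes, inspect the job-level counters.
    RunningJob job = JobClient.runJob(conf);
    Counters counters = job.getCounters();
    long spilled = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "SPILLED_RECORDS").getCounter();
    long mapOutput = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getCounter();
    if (spilled > mapOutput) {
      // Note: job-level SPILLED_RECORDS also includes reduce-side spills.
      System.out.println("Spilling more than once: " + spilled + " spilled vs "
          + mapOutput + " map output records");
    }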




Spill logs on map tasks
  (“record full = true” indicates that the metadata buffers
   filled up before the data buffer)

  2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
  2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
  2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
  2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
  2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
  2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
  2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
  2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
  2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
  2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2

  (3 spills total! Maybe we can do better?)


Tuning to reduce spills

 •  Parameters:
     –  io.sort.mb: total buffer space
     –  io.sort.record.percent: proportion between
        metadata buffers and key/value data
     –  io.sort.spill.percent: threshold at which
        spill is triggered
     –  Total map output generated: can you use
        more compact serialization?
 •  Optimal settings depend on your data and
    available RAM!
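  A hedged sketch of setting these three knobs on an old-API job; the numbers are
  placeholders, and the right values come from your own record sizes and heap (MyJob is a
  stand-in driver class):

    JobConf conf = new JobConf(MyJob.class);
    conf.setInt("io.sort.mb", 256);                  // total map-side sort buffer; must fit in the task heap
    conf.setFloat("io.sort.record.percent", 0.15f);  // share of io.sort.mb reserved for per-record metadata
    conf.setFloat("io.sort.spill.percent", 0.80f);   // how full the buffer gets before the background spill starts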

Setting io.sort.record.percent

 •  Common mistake: metadata buffers fill up
    way before kvdata buffer
 •  Optimal setting:
     –  io.sort.record.percent = 16/(16 + R)
     –  R = average record size: divide “Map Output
        Bytes” counter by “Map Output Records” counter
 •  Default (0.05) is usually too low (optimal for
    ~300-byte records)
 •  Hadoop 2.0: this is no longer necessary!
     –  see MAPREDUCE-64 for gory details
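  A quick worked version of that formula, using illustrative counter values (the 300-byte
  record size here is only an example):

    // R = average map output record size, from the job counters.
    long mapOutputBytes = 402653100L;        // "Map Output Bytes" (example value)
    long mapOutputRecords = 1342177L;        // "Map Output Records" (example value)
    double r = (double) mapOutputBytes / mapOutputRecords;   // 300 bytes/record here
    double recordPercent = 16.0 / (16.0 + r);                // 16/(16+300) ~= 0.051, close to the 0.05 default
    conf.setFloat("io.sort.record.percent", (float) recordPercent);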

Tuning Example (terasort)

 •  Map input size = output size
     –  128MB block = 1,342,177 records, each 100
        bytes
     –  metadata: 16 * 1342177 = 20.9MB
 •  io.sort.mb
     –  128MB data + 20.9MB meta = 148.9MB
 •  io.sort.record.percent
     –  16/(16+100)=0.138
 •  io.sort.spill.percent = 1.0
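  The same worked example expressed as configuration (a sketch; remember io.sort.mb also
  has to fit inside the map task's Java heap):

    // TeraSort-style map output: 128MB of 100-byte records, aiming for exactly one spill.
    conf.setInt("io.sort.mb", 149);                  // 128MB data + ~21MB metadata, rounded up
    conf.setFloat("io.sort.record.percent", 0.138f); // 16 / (16 + 100)
    conf.setFloat("io.sort.spill.percent", 1.0f);    // spill only when the buffer is completely full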

More tips on spill tuning
 •  Biggest win is going from 2 spills to 1 spill
     –  3 spills is approximately the same speed as 2 spills
        (same IO amplification)
 •  Calculate if it’s even possible, given your heap
    size
     –  io.sort.mb has to fit within your Java heap (plus
        whatever RAM your Mapper needs, plus ~30% for
        overhead)
 •  Only bother if this is the bottleneck!
     –  Look at map task logs: if the merge step at the end is
        taking a fraction of a second, not worth it!
     –  Typically most impact on jobs with big shuffle (sort/
        dedup)


MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Reducer fetch tuning

 •  Reducers fetch map output via HTTP
 •  Tuning parameters:
     –  Server side: tasktracker.http.threads
     –  Client side:
      mapred.reduce.parallel.copies
 •  Turns out this is not so interesting
     –  follow the best practices from Hadoop: The
        Definitive Guide


Improving fetch bottlenecks

 •  Reduce intermediate data
     –  Implement a Combiner: less data transfers faster
     –  Enable intermediate compression: Snappy is
        easy to enable; trades off some CPU for less IO/
        network
 •  Double-check for network issues
     –  Frame errors, NICs auto-negotiated to 100mbit,
        etc: one or two slow hosts can bottleneck a job
     –  Tell-tale sign: all maps are done, and reducers sit
        in fetch stage for many minutes (look at logs)
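  Both suggestions as old-API configuration; a sketch that assumes your Hadoop build has
  native Snappy support, and MyReducer is a stand-in for a combiner-safe reducer of your own:

    import org.apache.hadoop.io.compress.SnappyCodec;

    conf.setCombinerClass(MyReducer.class);               // combine map output before it crosses the network
    conf.setCompressMapOutput(true);                      // compress intermediate (map output) data
    conf.setMapOutputCompressorClass(SnappyCodec.class);  // trade a little CPU for less shuffle IO/network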


MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Reducer merge (Hadoop 1.0)

  •  Remote map outputs are fetched via HTTP. For each one: does it fit in RAM?
      –  Yes: fetch into RAM (tracked by the RAMManager)
      –  No: fetch straight to local disk as an IFile
  1.  Data accumulated in RAM is merged to disk files (RAM-to-disk merges)
  2.  If too many disk files accumulate, they are re-merged (disk-to-disk merges)
  3.  Segments from RAM and disk are merged, through a single merged iterator,
      into the reducer code (the Reduce Task)
Reducer merge triggers
 •  RAMManager
     –  Total buffer size:
       mapred.job.shuffle.input.buffer.percent
       (default 0.70, percentage of reducer heapsize)
 •  Mem-to-disk merge triggers:
     –  RAMManager is
        mapred.job.shuffle.merge.percent % full
        (default 0.66)
     –  Or mapred.inmem.merge.threshold segments
        accumulated (default 1000)
 •  Disk-to-disk merge
     –  io.sort.factor on-disk segments pile up (fairly rare)
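  The same triggers written out as configuration, at their default values (shown only for
  orientation; these are the property names quoted above):

    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // share of reducer heap for fetched map output
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);        // mem-to-disk merge when the RAMManager is this full
    conf.setInt("mapred.inmem.merge.threshold", 1000);               // ...or after this many in-memory segments
    conf.setInt("io.sort.factor", 10);                               // merge factor; also bounds on-disk segment pile-up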



Final merge phase

 •  MR assumes that reducer code needs the
    full heap’s worth of RAM
     –  Spills all in-RAM segments before running
        user code, to free memory
 •  This isn’t true if your reducer is simple
     –  e.g. sort, simple aggregation, etc. with no state
 •  Configure
     mapred.job.reduce.input.buffer.percent to
     0.70 to keep reducer input data in RAM


Reducer merge counters

 •  FILE: number of bytes read/written
     –  Ideally close to 0 if you can fit in RAM
 •  Spilled records:
     –  Ideally close to 0. If significantly more than
        reduce input records, the job is hitting a
        multi-pass merge, which is quite expensive




Tuning reducer merge

 •  Configure
     mapred.job.reduce.input.buffer.percent
    to 0.70 to keep data in RAM if you don’t
    have any state in reducer
 •  Experiment with setting
    mapred.inmem.merge.threshold to 0 to
    avoid spills
 •  Hadoop 2.0: experiment with
     mapreduce.reduce.merge.memtomem.enabled
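  Those suggestions as a config sketch for a stateless reducer; the memtomem property is
  the Hadoop 2.0 name mentioned above:

    conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.70f);  // keep reduce input segments in RAM while reducing
    conf.setInt("mapred.inmem.merge.threshold", 0);                  // disable the segment-count spill trigger
    // Hadoop 2.0 (mapreduce.* property names):
    // conf.setBoolean("mapreduce.reduce.merge.memtomem.enabled", true);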


Rules of thumb for # maps/reduces

 •  Aim for map tasks running 1-3 minutes each
     –  Too small: wasted startup overhead, less efficient
        shuffle
     –  Too big: not enough parallelism, harder to share
        cluster
 •  Reduce task count:
     –  Large reduce phase: base on cluster slot count (a
        few GB per reducer)
     –  Small reduce phase: fewer reducers will result in
        more efficient shuffle phase
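  One way to turn the large-shuffle rule of thumb into code; a sketch in which
  estimatedShuffleBytes and clusterReduceSlots are values you would supply yourself
  (e.g. from a previous run's counters and your cluster configuration):

    // Aim for a few GB of shuffle data per reducer, capped near the cluster's reduce slot count.
    long bytesPerReducer = 2L * 1024 * 1024 * 1024;                    // target ~2GB per reducer
    long wanted = Math.max(1L, estimatedShuffleBytes / bytesPerReducer);
    int reducers = (int) Math.min((long) clusterReduceSlots, wanted);  // don't exceed the cluster's reduce slots
    conf.setNumReduceTasks(reducers);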


MR from 10,000 feet
      InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




Tuning Java code for MR
 •  Follow general Java best practices
     –  String parsing and formatting is slow
     –  Guard debug statements with isDebugEnabled()
     –  StringBuffer.append vs repeated string concatenation
 •  For CPU-intensive jobs, make a test harness/
    benchmark outside MR
     –  Then use your favorite profiler
 •  Check for GC overhead: -XX:+PrintGCDetails
    -verbose:gc
 •  Easiest profiler: add -Xprof to
    mapred.child.java.opts, then look at the
    stdout task log
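  Two of those tips in code form (LOG is the usual commons-logging logger; the heap size
  in the child opts is just an example):

    // Guarded debug logging: the message string is only built when debug logging is on.
    if (LOG.isDebugEnabled()) {
      LOG.debug("processing key=" + key + ", value=" + value);
    }

    // Cheap profiling: pass -Xprof to the task JVMs, then read each task's stdout log.
    conf.set("mapred.child.java.opts", "-Xmx1024m -Xprof");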

Other tips for fast MR code

 •  Use the most compact and efficient data
    formats
     –  LongWritable is way faster than parsing text
     –  BytesWritable instead of Text for SHA1
        hashes/dedup
     –  Avro/Thrift/Protobuf for complex data, not JSON!
 •  Write a Combiner and RawComparator
 •  Enable intermediate compression (Snappy/
    LZO)
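  A small illustration of the compact-format point: emit a SHA-1 hash as 20 raw bytes in a
  BytesWritable instead of a 40-character hex Text. A sketch: recordBytes, output, and ONE
  come from the surrounding mapper, and MessageDigest.getInstance throws a checked
  exception you would handle:

    import java.security.MessageDigest;
    import org.apache.hadoop.io.BytesWritable;

    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    byte[] digest = sha1.digest(recordBytes);          // 20 raw bytes, vs. 40 hex characters as Text
    output.collect(new BytesWritable(digest), ONE);    // smaller to serialize, cheaper to sort/compare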

Summary

 •  Understanding MR internals helps understand
    configurations and tuning
 •  Focus your tuning effort on things that are
    bottlenecks, following a scientific approach
 •  Don’t forget that you can always just add nodes!
     –  Spending 1 month of engineer time to make your job
        20% faster is not worth it if you have a 10 node
        cluster!
 •  We’re working on simplifying this where we can,
    but deep understanding will always allow more
    efficient jobs


Questions?

    @tlipcon
todd@cloudera.com
