Introduction to Map Reduce
Map Reduce: Motivation
We realized that most of our computations involved applying a map
operation to each logical record in our input in order to compute a set
of intermediate key/value pairs, and then applying a reduce operation
to all the values that shared the same key in order to combine the
derived data appropriately.
The issues of how to parallelize the computation, distribute the data,
and handle failures conspire to obscure the original simple
computation with large amounts of complex code to deal with these
issues.
Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008).
Problem Scope
Need to scale to 100s or 1000s of computers, each with several
processor cores
How large is the amount of work?
Web-Scale data on the order of 100s of GBs to TBs or PBs
It is likely that the input data set will not fit on a single computer's hard drive
Hence, a distributed file system (e.g., Google File System- GFS) is typically
required
Problem Scope
Scalability to large data volumes:
Scan 100 TB on 1 node @ 50 MB/s ≈ 24 days
Scan on a 1000-node cluster ≈ 35 minutes
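These figures can be sanity-checked directly (a quick back-of-the-envelope calculation, assuming 100 TB of data and a 50 MB/s per-node sequential scan rate):

```python
# Back-of-the-envelope check of the scan times above.
# Assumptions: 100 TB of input, 50 MB/s sequential read rate per node.
TB = 10**12
MB = 10**6

data = 100 * TB          # bytes to scan
rate = 50 * MB           # bytes/second per node

one_node_days = data / rate / 86_400
cluster_minutes = data / (rate * 1000) / 60   # 1000 nodes scan in parallel

print(round(one_node_days), "days")       # 23 days (~ the quoted 24)
print(round(cluster_minutes), "minutes")  # 33 minutes (~ the quoted 35)
```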
Required functions
Automatic parallelization & distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
Functional programming meets
distributed computing
A batch data processing system
Commodity Clusters
Need to efficiently process large volumes of data by connecting many
commodity computers together to work in parallel
A theoretical 1000-CPU machine would cost far more than 1000 single-CPU or
250 quad-core machines
MapReduce & Hadoop - History
2003: Google publishes about its cluster architecture & distributed file
system (GFS)
2004: Google publishes about its MapReduce model used on top of GFS
Both GFS and MapReduce are written in C++ and are closed-source, with Python
and Java APIs available to Google programmers only
2006: Apache & Yahoo! -> Hadoop & HDFS
open-source, Java implementations of Google MapReduce and GFS with a diverse
set of APIs available to the public
Evolved from Apache Lucene/Nutch open-source web search engine
2008: Hadoop becomes an independent Apache project
Yahoo! uses Hadoop in production
Today: Hadoop is used as a general-purpose storage and analysis platform
for big data
Other Hadoop distributions from several vendors including EMC, IBM, Microsoft,
Oracle, Cloudera, etc.
Many users (http://wiki.apache.org/hadoop/PoweredBy)
Research and development actively continues...
Google Cluster Architecture: Key Ideas
Single-thread performance doesn't matter
For large problems, total throughput/$ is more important than peak performance
Stuff breaks
If you have 1 server, it may stay up three years (1,000 days).
If you have 10,000 servers, expect to lose 10 per day.
Ultra-reliable hardware doesn't really help
At large scales, the most reliable hardware still fails, albeit less often
Software still needs to be fault-tolerant
Commodity machines without fancy hardware give better performance/$
Have a reliable computing infrastructure from clusters of unreliable
commodity PCs.
Replicate services across many machines to increase request
throughput and availability.
Favor price/performance over peak performance.
What Makes MapReduce Unique?
Its simplified programming model which allows the user to quickly write
and test distributed systems
Its efficient and automatic distribution of data and workload across
machines
Its flat scalability curve. Specifically, after a MapReduce program is
written and functioning on 10 nodes, very little (if any) work is required
to make that same program run on 1000 nodes.
MapReduce ties smaller and more reasonably priced machines together
into a single cost-effective commodity cluster
Isolated Tasks
MapReduce divides the workload into multiple independent tasks and
schedules them across cluster nodes
The work performed by each task is done in isolation from the others
The amount of communication which can be performed by tasks is
mainly limited for scalability reasons
The communication overhead required to keep the data on the nodes
synchronized at all times would prevent the model from performing
reliably and efficiently at large scale
MapReduce in a Nutshell
Given:
a very large dataset
a well-defined computation task to be performed on elements of this dataset
(preferably, in a parallel fashion on a large cluster)
Map Reduce framework:
Just express what you want to compute (map() & reduce()).
Don't worry about parallelization, fault tolerance, data distribution, load
balancing (MapReduce takes care of these).
What changes from one application to another is the actual computation; the
programming structure stays similar.
In simple terms
Read lots of data.
Map: extract something that you care about from each record.
Shuffle and sort.
Reduce: aggregate, summarize, filter, or transform.
Write the results.
One can use as many Maps and Reduces as needed to model a given
problem.
Functional programming foundations
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
map in MapReduce ≈ map in FP
map :: (a → b) → [a] → [b]
Example: Double all numbers in a list.
> map ((*) 2) [1, 2, 3]
> [2, 4, 6]
In a purely functional setting, an element of a list being computed by
map cannot see the effects of the computations on other elements.
If the result does not depend on the order in which f is applied to the
list elements, then we can reorder or parallelize execution.
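The same analogy in Python: because each application of the function is independent of the others, the calls could run in any order, or on different machines.

```python
# Double all numbers in a list. No call depends on any other call's
# result, so the work could be reordered or parallelized freely.
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```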
Functional programming foundations
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
reduce in MapReduce ≈ fold in FP
foldl :: (b → a → b) → b → [a] → b
Move over the list, apply f to each element and an accumulator; f returns the
next accumulator value, which is combined with the next element.
Example: Sum of all numbers in a list.
> foldl (+) 0 [1, 2, 3]
> 6
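A Python counterpart of the fold (again, only an analogy): `functools.reduce` threads an accumulator through the list just as `foldl` does.

```python
from functools import reduce

# foldl (+) 0 [1, 2, 3]: the accumulator starts at 0 and absorbs one
# element at a time: ((0 + 1) + 2) + 3 = 6.
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  # 6
```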
MapReduce Basic Programming Model
Transform a set of input key-value pairs to a set of output values:
Map: (k1, v1) → list(k2, v2)
MapReduce library groups all
intermediate pairs with same key together.
Reduce: (k2, list(v2)) → list(v2)
Word Count
map(k1, v1) → list(k2, v2)
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, 1);

reduce(k2, list(v2)) → list(v2)
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
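The pseudocode above can be exercised in miniature. A minimal in-memory sketch in Python (not Hadoop code; the grouping step stands in for the framework's shuffle):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    return sum(counts)

docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the"}

# Map phase: one map call per input (key, value) pair.
intermediate = [kv for name, text in docs.items() for kv in map_fn(name, text)]

# Shuffle: the library groups all intermediate values by key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: one reduce call per distinct key.
counts = {word: reduce_fn(word, vals) for word, vals in groups.items()}
print(counts)  # {'the': 3, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```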
Parallel processing model
Execution overview
Read as part of this lecture! Jeffrey Dean and Sanjay Ghemawat. 2008.
MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1
(January 2008), 107-113.
[Figure: the master coordinates many workers; map workers write intermediate
output to local disk, and reduce workers read it remotely (local write /
remote reads).]
MapReduce Scheduling
One master, many workers
Input data split into M map tasks (typically 64 MB (~ chunk size in GFS))
Reduce phase partitioned into R reduce tasks (hash(k) mod R)
Tasks are assigned to workers dynamically
Master assigns each map task to a free worker
Considers locality of data to worker when assigning a task
Worker reads task input (often from local disk)
Worker produces R local files containing intermediate k/v pairs
Master assigns each reduce task to a free worker
Worker reads intermediate k/v pairs from map workers
Worker sorts & applies user's reduce operation to produce the output
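This division of intermediate data can be simulated in a few lines (a toy sketch with made-up keys; `crc32` stands in for the framework's hash function, since Python's built-in string hash is salted per process):

```python
import zlib

M, R = 3, 2  # map tasks and reduce tasks (toy sizes)

def partition(key):
    # Stand-in for hash(k) mod R; crc32 is deterministic across runs.
    return zlib.crc32(key.encode()) % R

# Each map task writes R "local files" of intermediate (k, v) pairs.
map_outputs = []
for m in range(M):
    local_files = [[] for _ in range(R)]
    for key, value in [(f"key{m}", 1), ("shared", m)]:
        local_files[partition(key)].append((key, value))
    map_outputs.append(local_files)

# Reduce task r fetches partition r from every map worker, so all
# pairs with the same key end up at the same reducer.
for r in range(R):
    fetched = [kv for files in map_outputs for kv in files[r]]
    print(r, fetched)
```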
Data Distribution
In a MapReduce cluster, data is distributed to all the nodes of the
cluster as it is being loaded in
An underlying distributed file system (e.g., GFS) splits large data files
into chunks which are managed by different nodes in the cluster
[Figure: a large input file is split into chunks, with one chunk of input
data on each of Node 1, Node 2, and Node 3.]
Even though the file chunks are distributed across several machines,
they form a single namespace
Partitions
In MapReduce, intermediate output values are not usually reduced together
All values with the same key are presented to a single Reducer together
More specifically, a different subset of intermediate key space is assigned to
each Reducer
These subsets are known as partitions
[Figure: different colors represent different keys (potentially) from
different Mappers.]
Partitions are the input to Reducers
Word count again
Choosing M and R
M = number of map tasks, R = number of reduce tasks
Larger M, R: creates smaller tasks, enabling easier load balancing and
faster recovery (many small tasks from failed machine)
Limitation: O(M+R) scheduling decisions and O(M*R) in-memory
state at master
Very small tasks not worth the startup cost
Recommendation:
Choose M so that split size is approximately 64 MB
Choose R a small multiple of the number of workers; alternatively, choose R a
little smaller than the number of workers to finish the reduce phase in one wave
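Applied to concrete (assumed) numbers — 1 TB of input and 200 workers — the recommendation is simple arithmetic:

```python
import math

input_bytes = 10**12        # assumption: 1 TB of input
split_size = 64 * 2**20     # ~64 MB per split, as recommended
workers = 200               # assumption: cluster size

M = math.ceil(input_bytes / split_size)  # one map task per split
R = 2 * workers                          # a small multiple of the workers

print(M, R)  # 14902 400
```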
MapReduce Fault Tolerance
On worker failure:
Master detects failure via periodic heartbeats.
Both completed and in-progress map tasks on that worker should be
re-executed (output stored on local disk).
Only in-progress reduce tasks on that worker should be re-executed
(output stored in the global file system).
All reduce workers will be notified about any map re-executions.
On master failure:
State is check-pointed to GFS: new master recovers & continues.
Robustness:
Example: Lost 1600 of 1800 machines once, but finished fine.
MapReduce Data Locality
Goal: To conserve network bandwidth.
In GFS, data files are divided into 64MB blocks and 3 copies of each
are stored on different machines.
Master program schedules map() tasks based on the location of these
replicas:
Put map() tasks physically on the same machine as one of the input replicas
(or, at least on the same rack / network switch).
This way, thousands of machines can read input at local disk speed.
Otherwise, rack switches would limit read rate.
Stragglers & Backup Tasks
Problem: Stragglers (i.e., slow workers) significantly lengthen the
completion time.
Solution: Close to completion, spawn backup copies of the remaining
in-progress tasks.
Whichever one finishes first, wins.
Additional cost: a few percent more resource usage.
Example: A sort program ran 44% longer without backup tasks.
Other Practical Extensions
User-specified combiner functions for partial combination within a
map task can save network bandwidth (~ mini-reduce)
Example: WordCount
User-specified partitioning functions for mapping intermediate key
values to reduce workers (by default: hash(key) mod R)
Example: hash(Hostname(urlkey)) mod R
Ordering guarantees: Processing intermediate k/v pairs in increasing
order
Example: reduce of WordCount outputs ordered results.
Custom input and output format handlers
Single-machine execution option for testing & debugging
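The combiner's bandwidth saving is easy to see in miniature (a sketch only, not Hadoop's Combiner API): partially summing counts inside the map task shrinks what must cross the network.

```python
from collections import Counter

# One map task's raw WordCount output: a ("word", 1) pair per word.
words = "to be or not to be".split()
raw = [(w, 1) for w in words]

# Combiner: a per-map-task mini-reduce applied before the shuffle.
combined = sorted(Counter(words).items())

print(len(raw), "->", len(combined))  # 6 -> 4 pairs sent over the network
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```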
Basic MapReduce Program Design
Tasks that can be performed independently on a data object, large
number of them: Map
Tasks that require combining of multiple data objects: Reduce
Sometimes it is easier to start program design with Map, sometimes
with Reduce
Select keys and values such that the right objects end up together in
the same Reduce invocation
Might have to partition a complex task into multiple MapReduce sub-tasks
MapReduce vs. Traditional RDBMS
            MapReduce                    Traditional RDBMS
Data size   Petabytes                    Gigabytes
Access      Batch                        Interactive and batch
Updates     Write once, read many times  Read and write many times
Structure   Dynamic schema               Static schema
Integrity   Low                          High (normalized data)
Scaling     Linear                       Non-linear (general SQL)
More Hadoop details
Hadoop
Since its debut on the computing stage, MapReduce has
frequently been associated with Hadoop
Hadoop is an open source implementation of MapReduce and is
currently enjoying wide popularity
Hadoop presents MapReduce as an analytics engine and under
the hood uses a distributed storage layer referred to as Hadoop
Distributed File System (HDFS)
HDFS mimics Google File System (GFS)
Hadoop MapReduce: A Closer Look
[Diagram: on each node, files are loaded from the local HDFS store and passed
to an InputFormat, which breaks them into Splits; RecordReaders (RR) turn each
split into input (K, V) pairs; Map tasks produce intermediate (K, V) pairs; a
Partitioner assigns the pairs to partitions, which are exchanged by all nodes
during shuffling and then sorted; Reduce tasks produce the final (K, V) pairs,
which an OutputFormat writes back to the local HDFS store.]
Input Files
Input files are where the data for a MapReduce task is initially stored
The input files typically reside in a distributed file system (e.g. HDFS)
The format of input files is arbitrary
Line-based log files
Binary files
Multi-line input records
Or something else entirely
InputFormat
How the input files are split up and read is defined by the InputFormat
InputFormat is a class that does the following:
Selects the files that should be used for input
Defines the InputSplits that break a file
Provides a factory for RecordReader objects that read the file
InputFormat Types
Several InputFormats are provided with Hadoop:

InputFormat              Description                       Key                 Value
TextInputFormat          Default format; reads lines       The byte offset     The line contents
                         of text files                     of the line
KeyValueInputFormat      Parses lines into (K, V)          Everything up to    The remainder of
                         pairs                             the first tab       the line
                                                           character
SequenceFileInputFormat  A Hadoop-specific high-           user-defined        user-defined
                         performance binary format
Input Splits
An input split describes a unit of work that comprises a single map task in a
MapReduce program
By default, the InputFormat breaks a file up into 64 MB splits
By dividing the file into splits, we allow several map tasks to operate on a
single file in parallel
If the file is very large, this can improve performance significantly through
parallelism
Each map task corresponds to a single input split
RecordReader
The input split defines a slice of work but does not describe how to access it
The RecordReader class actually loads data from its source and converts it
into (K, V) pairs suitable for reading by Mappers
The RecordReader is invoked repeatedly on the input until the entire split is
consumed
Each invocation of the RecordReader leads to another call of the map function
defined by the programmer
Mapper and Reducer
The Mapper performs the user-defined work of the first phase of the MapReduce
program
A new instance of Mapper is created for each split
The Reducer performs the user-defined work of the second phase of the
MapReduce program
A new instance of Reducer is created for each partition
For each key in the partition assigned to a Reducer, the Reducer is called once
Partitioner
Each mapper may emit (K, V) pairs to any partition
Therefore, the map nodes must all agree on where to send different pieces of
intermediate data
The partitioner class determines which partition a given (K, V) pair will go to
The default partitioner computes a hash value for a given key and assigns it
to a partition based on this result
Sort
Each Reducer is responsible for reducing the values associated with (several)
intermediate keys
The set of intermediate keys on a single node is automatically sorted by
MapReduce before they are presented to the Reducer
OutputFormat
The OutputFormat class defines the way (K, V) pairs produced by Reducers are
written to output files
The instances of OutputFormat provided by Hadoop write to files on the local
disk or in HDFS
Several OutputFormats are provided by Hadoop:

OutputFormat              Description
TextOutputFormat          Default; writes lines in "key \t value" format
SequenceFileOutputFormat  Writes binary files suitable for reading into
                          subsequent MapReduce jobs
NullOutputFormat          Generates no output files
Questions?
Exercise
Exercise
Read the original MapReduce paper
Answer some questions
Implement friends count
Fill word length (why fill, anyway?)
Understand and run inverted indexes
Code available as a Maven or Eclipse project: just run locally
MapReduce Use Case: Word Length
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
Big 37
Medium 148
Small 200
Tiny 9
MapReduce Use Case: Word Length
Split the document into
chunks and process
each chunk
on a different computer
MapReduce Use Case: Word Length
[Diagram: each map task emits one (bucket, 1) pair per word, e.g. Big 1,
Medium 1, Small 1, Tiny 1; the shuffle groups the pairs by bucket, e.g.
Big 1,1,1,1,... Medium 1,1,1,... Small 1,1,1,1,... Tiny 1,1,1,1,...; the
reduce sums each group into the final counts Big 37, Medium 148, Small 200,
Tiny 9.]
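The whole word-length pipeline fits in a few lines (an in-memory sketch; the buckets follow the slide's definitions, and each chunk could be mapped on a different computer):

```python
from collections import defaultdict

def bucket(word):
    # Length buckets as defined on the slide.
    n = len(word)
    if n >= 10: return "Big"
    if n >= 5:  return "Medium"
    if n >= 2:  return "Small"
    return "Tiny"

def word_length_counts(chunks):
    # Map: emit one (bucket, 1) pair per word in each chunk.
    pairs = [(bucket(w), 1) for chunk in chunks for w in chunk.split()]
    # Shuffle + Reduce: group the pairs by bucket and sum each group.
    totals = defaultdict(int)
    for b, one in pairs:
        totals[b] += one
    return dict(totals)

result = word_length_counts(["a tiny word", "considerably lengthier vocabulary"])
print(result)  # {'Tiny': 1, 'Small': 2, 'Big': 2, 'Medium': 1}
```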
MapReduce Use Case: Inverted Indexing
Construction of inverted lists for document search
Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..
Output: (term, [docid, docid, ])
E.g., (apple, [Foo.txt, Bar.txt, Boo.txt, ])
2010, Jamie Callan
Inverted Index: Data flow
Document Foo: "This page contains so much text"
Foo map output: contains: Foo; much: Foo; page: Foo; so: Foo; text: Foo; This: Foo
Document Bar: "My page contains text too"
Bar map output: contains: Bar; My: Bar; page: Bar; text: Bar; too: Bar
Reduced output: contains: Foo, Bar; much: Foo; My: Bar; page: Foo, Bar;
so: Foo; text: Foo, Bar; This: Foo; too: Bar
MapReduce Use Case: Inverted Indexing
A simple approach to creating inverted lists
Each Map task is a document parser
Input: A stream of documents
Output: A stream of (term, docid) tuples
(long, Foo.txt) (ago, Foo.txt) (and, Foo.txt) (once, Bar.txt) (upon, Bar.txt)
We may create internal IDs for words.
Shuffle sorts tuples by key and routes tuples to Reducers
Reducers convert streams of keys into streams of inverted lists
Input: (long, Foo.txt) (long, Bar.txt) (long, Boo.txt) (long, )
The reducer sorts the values for a key and builds an inverted list
Output: (long, [Foo.txt, Bar.txt, ])
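End to end, the approach above looks like this (a toy in-memory sketch; the document contents are invented):

```python
from collections import defaultdict

def invert(docs):
    # Map: each document parser emits a stream of (term, docid) tuples.
    tuples = [(term, docid)
              for docid, text in docs.items()
              for term in text.split()]
    # Shuffle + Reduce: group docids by term; the reducer sorts each
    # group into an inverted (posting) list.
    index = defaultdict(set)
    for term, docid in tuples:
        index[term].add(docid)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {"Foo.txt": "long ago and far", "Bar.txt": "once upon a long time"}
index = invert(docs)
print(index["long"])  # ['Bar.txt', 'Foo.txt']
```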
Questions?
Sources & References
Excellent intro to MapReduce:
https://websci.informatik.uni-freiburg.de/teaching/ws201213/infosys/slides/m3_l1_mapreduce.pdf
http://www.systems.ethz.ch/sites/default/files/file/BigData_Fall2012/BigData-2012-M3.pdf
MapReduce & Functional Programming:
https://courses.cs.washington.edu/courses/cse490h/08au/lectures/mapred.ppt
For the introductory part:
http://www.cs.ucsb.edu/~tyang/class/140s14/slides/CS140TopicMapReduce.pdf
A lot of details about the Hadoop case:
www.qatar.cmu.edu/~msakr/15440-f11/.../Lecture18_15440_MHH_9Nov_2011.ppt