Unit 5 - MapReduce
MapReduce anatomy
MapReduce is a programming model for data processing. Hadoop can run MapReduce
programs written in Java, Ruby and Python.
MapReduce programs are inherently parallel, so very large-scale data analysis can be done quickly.
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks.
A map task takes care of loading, parsing, transforming and filtering the data. A reduce task is responsible for grouping and aggregating the data produced by the map tasks to generate the final output. Each map task is broken down into the following phases:
1. RecordReader 2. Mapper
3. Combiner 4. Partitioner
The output produced by a map task is known as intermediate <key, value> pairs. These intermediate <key, value> pairs are sent to the reducer.
Each reduce task is broken down into the following phases:
1. Shuffle 2. Sort
3. Reducer 4. Output format
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; only the computational code is moved to where the data resides, which saves network bandwidth.
Mapper Phases:
The Mapper maps the input <key, value> pairs into a set of intermediate <key, value> pairs.
Each map task is broken into the following phases:
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and
presents it to the Mapper task as keys and values.
i. InputFormat: reads the given input file and splits it using the method getSplits().
ii. It then defines a RecordReader using createRecordReader(), which is
responsible for generating <key, value> pairs.
2. Mapper: the map function works on the <key, value> pairs produced by the RecordReader and
generates intermediate <key, value> pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called
once for each key-value pair in the input split.
- void run(Context context): can be overridden by the user for complete control
over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform
any initialization required by the map() method.
3. Combiner: takes the intermediate <key, value> pairs produced by a single mapper and applies a
user-specified aggregation function to them. It is also known as a local Reducer. We can
optionally specify a combiner using Job.setCombinerClass(ReducerClass) to perform local
aggregation on the intermediate outputs.
4. Partitioner: takes the intermediate <key, value> pairs produced by the mapper and splits
them into partitions using a user-defined condition.
The default behavior is to hash the key to determine the reducer. The user can control the
partitioning by overriding the method (a sketch of a custom Partitioner follows this list):
int getPartition(KEY key, VALUE value, int numPartitions)
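For illustration, a custom Partitioner might look like the following sketch. Only the Partitioner base class and the getPartition() signature come from Hadoop; the class name AlphabetPartitioner and the a-m routing rule are hypothetical. It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words starting with a-m go to reducer 0,
// all other words are spread over the remaining reducers by hash.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (numPartitions <= 1 || word.isEmpty()) {
            return 0;                                    // only one place to send it
        }
        char first = Character.toLowerCase(word.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;                                    // first reducer handles a-m
        }
        // remaining keys are hashed over reducers 1 .. numPartitions-1
        return 1 + (word.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}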
Reducer Phases:
1. Shuffle & Sort:
Downloads the intermediate key-value pairs from the mappers onto the local machine where the
Reducer is running.
The individual <key, value> pairs are sorted by key into a larger data list.
The data list groups equivalent keys together so that their values can be
iterated easily in the Reducer task.
2. Reducer:
The Reducer takes the grouped key-value data as input and runs the reduce
function on each group.
Here, the data can be aggregated, filtered, and combined in a number of ways,
depending on the processing required.
Once the execution is over, it emits zero or more key-value pairs to the final
step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called
once for each key.
- void run(Context context): can be overridden by the user for complete control
over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform
any initialization required by the reduce() method.
3. Output format:
In the output phase, an output formatter translates the final key-value
pairs from the Reducer function and writes them to a file using a record writer.
Compression of the job output can be enabled through the configuration, for example:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, a codec is the implementation of a compression-decompression algorithm;
GzipCodec is the codec for gzip compression. (An equivalent using the newer mapreduce API is sketched after this list.)
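The two properties above belong to the older mapred configuration; with the newer org.apache.hadoop.mapreduce API, output compression is usually enabled through FileOutputFormat on the Job object. A minimal sketch, assuming the Job has already been created in the driver (the class and method names other than the Hadoop API calls are illustrative):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputCompressionConfig {
    // Enable gzip compression of the final job output using the new-API helpers.
    public static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);                    // turn on output compression
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // use the gzip codec
    }
}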
There is one JobTracker process running per Hadoop cluster, and it runs in its own Java
Virtual Machine (JVM) process.
Each TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails
to receive a heartbeat message from a TaskTracker, it assumes that the
TaskTracker has failed and resubmits its tasks to another available node in the cluster.
MapReduce Framework
Phases:
- Map: converts input into key-value pairs.
- Reduce: combines the output of the mappers and produces a reduced result set.
Daemons:
- JobTracker: master; schedules tasks.
- TaskTracker: slave; executes tasks.
MapReduce working:
MapReduce divides a data analysis task into two parts, Map and Reduce. In the example
given below, there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node and the reducer
combines the output from the mappers to produce the reduced result set.
Steps:
1. First, the input dataset is split into multiple pieces of data.
2. Next, the framework creates a master and several slave processes and executes the worker
processes remotely.
3. Several map tasks work simultaneously and read pieces of data that were assigned to each
map task.
4. Each map worker uses the partitioner function to divide its output data into regions.
5. When the map slaves complete their work, the master instructs the reduce slaves to begin
their work.
6. When all the reduce slaves complete their work, the master transfers the control to the
user program.
Fig. MapReduce Programming Architecture
public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
    final static IntWritable one = new IntWritable(1);
    Text word = new Text();

    // Emits <word, 1> for every token in the input line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sums all the counts received for a word and emits <word, total>.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
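For completeness, a minimal driver that wires the two classes together might look like the sketch below. The enclosing class name WordCount and the mapper class name WCMapper are assumptions (only WCReducer appears by name in this unit), and the mapper and reducer above would additionally need java.io.IOException, java.util.StringTokenizer and the org.apache.hadoop.io and org.apache.hadoop.mapreduce imports when compiled as a complete program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // ... WCMapper and WCReducer from above would be nested here ...

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name is illustrative
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);              // map phase
        job.setCombinerClass(WCReducer.class);           // optional local aggregation (Job.setCombinerClass)
        job.setReducerClass(WCReducer.class);            // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from the command line
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing WCReducer as the combiner works here because summing word counts is both associative and commutative, so local aggregation on the map side does not change the final result.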