
Unit 5: Processing Your Data with MapReduce

MapReduce anatomy
MapReduce is a programming model for data processing. Hadoop can run MapReduce
programs written in Java, Ruby and Python.
MapReduce programs are inherently parallel, so very large-scale data analysis can be done
quickly.
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce
tasks.
A map task takes care of loading, parsing, transforming and filtering the input data. The
responsibility of a reduce task is grouping and aggregating the data produced by the map
tasks to generate the final output. Each map task is broken down into the following phases:

1. Record Reader 2. Mapper 3. Combiner 4. Partitioner.

The output produced by the map task is known as intermediate <key, value> pairs. These
intermediate <key, value> pairs are sent to the reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle 2. Sort
3. Reducer 4. Output format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; only the computational code is moved to the data, which saves network bandwidth.

Mapper Phases:
The mapper maps the input <key, value> pairs into a set of intermediate <key, value> pairs.
Each map task is broken into the following phases:

1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and
presents it to the Mapper task as keys and values.

i. InputFormat: reads the given input file and splits it using the method
getSplits().
ii. It then defines a RecordReader using createRecordReader(), which is
responsible for generating <key, value> pairs.
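
As a small driver-side illustration (an assumption for this sketch, not part of the original notes), the input format can be set explicitly on the Job; Hadoop's default TextInputFormat computes the splits (typically one per HDFS block) and supplies a LineRecordReader that emits the byte offset of each line as the key and the line itself as the value.

// Driver-side sketch; assumes a Job object named job, as in the WordCount driver later in this unit.
// import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
// TextInputFormat.getSplits() computes the input splits and createRecordReader()
// returns a LineRecordReader producing <LongWritable offset, Text line> pairs.
job.setInputFormatClass(TextInputFormat.class);   // the default, set here for clarity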

2. Mapper: the map function works on the <key, value> pairs produced by the RecordReader and
generates intermediate <key, value> pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called
once for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control
over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform
any initialization required by the map() method.
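
A minimal Mapper sketch (illustrative only; the class name LineLengthMapper and its logic are assumptions, not from the notes) showing where setup(), map() and cleanup() fit:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits <"line", length of line> for every input line.
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text outKey = new Text("line");

    @Override
    protected void setup(Context context) {
        // Called once per task before any map() call; initialize state here.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once for each <byte offset, line> pair in the input split.
        context.write(outKey, new IntWritable(value.getLength()));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last map() call; release resources here.
    }
}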
3. Combiner: takes the intermediate <key, value> pairs produced by a mapper and applies a
user-specified aggregate function to the output of that single mapper. It is also known as a
local Reducer. We can optionally specify a combiner using Job.setCombinerClass(ReducerClass)
to perform local aggregation on the intermediate outputs.

Fig. MapReduce without Combiner class

Fig. MapReduce with Combiner class
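
As an illustration (a sketch under the assumption of a word-count style job, not part of the original notes), a combiner is written exactly like a Reducer; here it sums the partial counts emitted by a single mapper before they cross the network. The class name SumCombiner is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical local combiner: sums the partial counts produced by one mapper.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Registered in the driver with: job.setCombinerClass(SumCombiner.class);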

4. Partitioner: takes the intermediate <key, value> pairs produced by the mapper and splits
them into partitions, one per reducer, using a user-defined condition.

The default behavior is to hash the key to determine the reducer. The user can take control by
overriding the method:
int getPartition(KEY key, VALUE value, int numPartitions)
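
A minimal Partitioner sketch (the class name VowelPartitioner and its rule are illustrative assumptions, not from the notes), showing how getPartition() decides which reducer receives a key:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys starting with a vowel go to partition 0,
// all other keys are hash-partitioned across the remaining reducers.
public class VowelPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;
        }
        String word = key.toString().toLowerCase();
        if (!word.isEmpty() && "aeiou".indexOf(word.charAt(0)) >= 0) {
            return 0;
        }
        // Mimic the default hash behavior for the remaining partitions.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

// Enabled in the driver with: job.setPartitionerClass(VowelPartitioner.class);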
Reducer Phases:
1. Shuffle & Sort:
• Downloads the grouped key-value pairs onto the local machine where the
Reducer is running.
• The individual <key, value> pairs are sorted by key into a larger data list.
• The data list groups equal keys together so that their values can be
iterated easily in the Reducer task.
2. Reducer:
• The Reducer takes the grouped key-value data as input and runs a
reduce function on each group.
• Here, the data can be aggregated, filtered, and combined in a number of ways,
and it can require a wide range of processing.
• Once the execution is over, it emits zero or more key-value pairs to the final
step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called
once for each key with its list of values.
- void run(Context context): the user can override this method for complete control
over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform
any initialization required by the reduce() method.

3. Output format:
In the output phase, an output formatter translates the final key-value pairs from the
Reducer function and writes them onto a file using a record writer.
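
A small driver-side sketch of this step (assumes a Job object named job and a Path argument, as in the WordCount driver later in this unit); Hadoop's default TextOutputFormat writes each final pair as the key, a tab character, and the value on its own line:

// import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
// import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
job.setOutputFormatClass(TextOutputFormat.class);        // the default; its RecordWriter emits "key<TAB>value" lines
FileOutputFormat.setOutputPath(job, new Path(args[1]));  // directory where the part files are written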

Compression: In MapReduce programming we can compress the output file. Compression
provides two benefits:
• Reduces the space needed to store files.
• Speeds up data transfer across the network.
We can specify the compression format in the Driver program as below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

Here, a codec is the implementation of a compression-decompression algorithm;
GzipCodec is the codec for gzip.
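
With the newer org.apache.hadoop.mapreduce API (the one used by the WordCount example later in this unit), the same effect can usually be obtained through FileOutputFormat's static helpers; a driver-side sketch, assuming a Job object named job:

// import org.apache.hadoop.io.compress.GzipCodec;
// import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
FileOutputFormat.setCompressOutput(job, true);                    // turn on compression of the job output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // compress with the gzip codec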

The importance of MapReduce in the Hadoop environment for processing data:

• MapReduce programming helps to process massive amounts of data in parallel.
• The input data set is split into independent chunks. Map tasks process these independent
chunks completely in parallel.
• The reduce task provides the reduced output by combining the output of the various mappers.
There are two daemons associated with MapReduce programming: JobTracker and
TaskTracker.
JobTracker:
JobTracker is the master daemon responsible for executing a MapReduce job. It provides
connectivity between Hadoop and the application.
Whenever code is submitted to a cluster, JobTracker creates the execution plan by deciding
which task to assign to which node.
It also monitors all the running tasks. When a task fails, it automatically re-schedules the task
to a different node after a predefined number of retries.

There is one JobTracker process running on a single Hadoop cluster. The JobTracker
process runs in its own Java Virtual Machine process.

Fig. Job Tracker and Task Tracker interaction


TaskTracker:
This daemon is responsible for executing the individual tasks that are assigned by the JobTracker.

The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker
fails to receive a heartbeat message from a TaskTracker, the JobTracker assumes that the
TaskTracker has failed and resubmits the task to another available node in the cluster.
Map Reduce Framework
Phases:
Map: converts input into key-value pairs.
Reduce: combines the output of the mappers and produces a reduced result set.
Daemons:
JobTracker: master, schedules tasks.
TaskTracker: slave, executes tasks.

MapReduce working:
MapReduce divides a data analysis task into two parts – Map and Reduce. In the example
given below there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on its node, and the reducer
combines the output from the mappers to produce the reduced result set.
Steps:
1. First, the input dataset is split into multiple pieces of data.
2. Next, the framework creates a master and several slave processes and executes the worker
processes remotely.
3. Several map tasks work simultaneously and read pieces of data that were assigned to each
map task.
4. Each map worker uses the partitioner function to divide its intermediate output into regions.
5. When the map slaves complete their work, the master instructs the reduce slaves to begin
their work.
6. When all the reduce slaves complete their work, the master transfers the control to the
user program.
Fig. MapReduce Programming Architecture

A MapReduce program using Java requires three classes:

1. Driver Class: specifies the Job configuration details.
2. Mapper Class: overrides the map function based on the problem statement.
3. Reducer Class: overrides the reduce function based on the problem
statement.

Write a MapReduce program for the WordCount problem.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {

        final static IntWritable one = new IntWritable(1);
        Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into tokens and emit <word, 1> for each token.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts emitted for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Fig. MapReduce paradigm for WordCount
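
Assuming the class is compiled and packaged (for example into a jar named wordcount.jar), the job is typically launched with the hadoop jar command, passing an HDFS input path and a not-yet-existing output directory as the two command-line arguments; the word counts then appear in the part files of that output directory.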
