Unit 5 - MapReduce
MapReduce anatomy
MapReduce is a programming model for data processing. Hadoop can run MapReduce
programs written in Java, Ruby and Python.
MapReduce programs are inherently parallel, so very large-scale data analysis can be done quickly.
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks.
A map task takes care of loading, parsing, transforming and filtering the data. A reduce task is responsible for grouping and aggregating the data produced by the map tasks to generate the final output. Each map task is broken down into the following phases:
1. RecordReader 2. Mapper
3. Combiner 4. Partitioner
The output produced by a map task is known as intermediate <key, value> pairs. These intermediate <key, value> pairs are sent to the reducer.
Each reduce task is broken down into the following phases:
1. Shuffle 2. Sort
3. Reducer 4. Output format
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; only the computational code is moved to where the data resides, which saves network bandwidth.
Mapper Phases:
The Mapper maps the input <key, value> pairs into a set of intermediate <key, value> pairs.
Each map task is broken into the following phases:
1. RecordReader: converts the byte-oriented view of the input into a record-oriented view and
presents it to the Mapper task as keys and values.
i. InputFormat: reads the given input file and splits it using the method getSplits().
ii. It then defines a RecordReader using createRecordReader(), which is
responsible for generating <key, value> pairs.
2. Mapper: the map function works on the <key, value> pairs produced by the RecordReader and
generates intermediate <key, value> pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called
once for each key-value pair in the input split.
- void run(Context context): can be overridden by the user for complete control
over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to perform
any initialization required by the map() method.
3. Combiner: takes the intermediate <key, value> pairs produced by a single mapper and applies a
user-specified aggregation function to them. It is also known as a local Reducer. We can
optionally specify a combiner using Job.setCombinerClass(ReducerClass) to perform local
aggregation on the intermediate outputs.
4. Partitioner: takes the intermediate <key, value> pairs produced by the mapper and splits
them into partitions using a user-defined condition.
The default behavior is to hash the key to determine the reducer. The user can control the
partitioning by overriding the method (a sketch of a custom Partitioner follows this list):
int getPartition(KEY key, VALUE value, int numPartitions)
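For illustration, a custom Partitioner might look like the following sketch. Only the Partitioner base class and the getPartition() signature come from Hadoop; the class name AlphabetPartitioner and the a-m routing rule are hypothetical. It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words starting with a-m go to reducer 0,
// all other words are spread over the remaining reducers by hash.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (numPartitions <= 1 || word.isEmpty()) {
            return 0;                                    // only one place to send it
        }
        char first = Character.toLowerCase(word.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;                                    // first reducer handles a-m
        }
        // remaining keys are hashed over reducers 1 .. numPartitions-1
        return 1 + (word.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}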
Reducer Phases:
1. Shuffle & Sort:
Downloads the intermediate key-value pairs from the mappers onto the local machine where the
Reducer is running.
The individual <key, value> pairs are sorted by key into a larger data list.
The data list groups equivalent keys together so that their values can be
iterated easily in the Reducer task.
2. Reducer:
The Reducer takes the grouped key-value data as input and runs the reduce
function on each group.
Here, the data can be aggregated, filtered, and combined in a number of ways,
depending on the processing required.
Once the execution is over, it emits zero or more key-value pairs to the final
step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context): called
once for each key.
- void run(Context context): can be overridden by the user for complete control
over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to perform
any initialization required by the reduce() method.
3. Output format:
In the output phase, an output formatter translates the final key-value
pairs from the Reducer function and writes them to a file using a record writer.
Compression of the job output can be enabled through the configuration, for example:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, a codec is the implementation of a compression-decompression algorithm;
GzipCodec is the codec for gzip compression. (An equivalent using the newer mapreduce API is sketched after this list.)
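The two properties above belong to the older mapred configuration; with the newer org.apache.hadoop.mapreduce API, output compression is usually enabled through FileOutputFormat on the Job object. A minimal sketch, assuming the Job has already been created in the driver (the class and method names other than the Hadoop API calls are illustrative):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputCompressionConfig {
    // Enable gzip compression of the final job output using the new-API helpers.
    public static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);                    // turn on output compression
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // use the gzip codec
    }
}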
There is one JobTracker process running per Hadoop cluster, and it runs in its own Java
Virtual Machine (JVM) process.
Each TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails
to receive a heartbeat message from a TaskTracker, it assumes that the
TaskTracker has failed and resubmits its tasks to another available node in the cluster.
MapReduce Framework
Phases:
- Map: converts input into key-value pairs.
- Reduce: combines the output of the mappers and produces a reduced result set.
Daemons:
- JobTracker: master; schedules tasks.
- TaskTracker: slave; executes tasks.
MapReduce working:
MapReduce divides a data analysis task into two parts, Map and Reduce. In the example
given below, there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node and the reducer
combines the output from the mappers to produce the reduced result set.
Steps:
1. First, the input dataset is split into multiple pieces of data.
2. Next, the framework creates a master and several slave processes and executes the worker
processes remotely.
3. Several map tasks work simultaneously and read pieces of data that were assigned to each
map task.
4. Each map worker uses the partitioner function to divide its output data into regions.
5. When the map slaves complete their work, the master instructs the reduce slaves to begin
their work.
6. When all the reduce slaves complete their work, the master transfers the control to the
user program.
Fig. MapReduce Programming Architecture
public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
    final static IntWritable one = new IntWritable(1);
    Text word = new Text();

    // Emits <word, 1> for every token in the input line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sums all the counts received for a word and emits <word, total>.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
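For completeness, a minimal driver that wires the two classes together might look like the sketch below. The enclosing class name WordCount and the mapper class name WCMapper are assumptions (only WCReducer appears by name in this unit), and the mapper and reducer above would additionally need java.io.IOException, java.util.StringTokenizer and the org.apache.hadoop.io and org.apache.hadoop.mapreduce imports when compiled as a complete program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // ... WCMapper and WCReducer from above would be nested here ...

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name is illustrative
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);              // map phase
        job.setCombinerClass(WCReducer.class);           // optional local aggregation (Job.setCombinerClass)
        job.setReducerClass(WCReducer.class);            // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from the command line
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing WCReducer as the combiner works here because summing word counts is both associative and commutative, so local aggregation on the map side does not change the final result.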