Unit 3
MapReduce API Framework:
The MapReduce API framework, primarily associated with Apache Hadoop, provides a
programming model for processing large datasets in a distributed and parallel manner across a
cluster of machines. It simplifies the complexities of distributed programming by abstracting
away details like data distribution, fault tolerance, and inter-process communication.
The core components of the MapReduce API framework include:
• JobContext Interface
• Job Class
• Mapper Class
• Reducer Class
• InputFormat
• OutputFormat
• Partitioner
• Combiner (Optional)
JobContext Interface
The JobContext interface is the super-interface of all the classes that define different jobs in
MapReduce. It gives the tasks a read-only view of the job while they are running.
The following are the sub-interfaces of the JobContext interface.
S.No.  Sub-interface                                    Description
1.     MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>     Defines the context that is given to the Mapper.
2.     ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>  Defines the context that is passed to the Reducer.
The Job class is the main class that implements the JobContext interface.
Job Class
The Job class is the most important class in the MapReduce API. It allows the user to configure
the job, submit it, control its execution, and query its state. The set methods work only until the
job is submitted; afterwards they throw an IllegalStateException.
Following is the constructor summary of the Job class.
S.No  Constructor
1     Job()
2     Job(Configuration conf)
3     Job(Configuration conf, String jobName)
Methods
Some of the important methods of the Job class are as follows −
S.No  Method                       Description
1     getJobName()                 Returns the user-specified job name.
2     getJobState()                Returns the current state of the Job.
3     isComplete()                 Checks whether the job is finished or not.
4     setInputFormatClass()        Sets the InputFormat for the job.
5     setJobName(String name)      Sets the user-specified job name.
6     setOutputFormatClass()       Sets the OutputFormat for the job.
7     setMapperClass(Class)        Sets the Mapper for the job.
8     setReducerClass(Class)       Sets the Reducer for the job.
9     setPartitionerClass(Class)   Sets the Partitioner for the job.
10    setCombinerClass(Class)      Sets the Combiner for the job.
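To make these methods concrete, the following is a minimal driver sketch for a word-count style job. The class names WordCountDriver, WordCountMapper and IntSumReducer are user-defined placeholders (the mapper and reducer are sketched later in this unit), not part of the Hadoop API; everything else comes from Hadoop's org.apache.hadoop.mapreduce packages.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance(conf, name) is the recommended factory method in
        // current Hadoop releases; the constructors listed above are deprecated.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire up the user-supplied classes (placeholder names, defined by the user).
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        // job.setPartitionerClass(...) could be called here to override the default partitioner.

        // Declare the final output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input/output formats (these are the defaults, set explicitly here for clarity).
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input and output locations, typically HDFS paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish. Any set*() call made after
        // submission throws IllegalStateException, as noted above; getJobState()
        // and isComplete() can be used to monitor progress after submission.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the word-count reduce function (summation) is associative and commutative, the same reducer class can safely be registered as the combiner, as shown above.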
Mapper Class:
Users implement the map method within this class. The map method processes a single input
key-value pair and generates zero or more intermediate key-value pairs. This phase focuses on
data transformation and filtering.
Reducer Class:
Users implement the reduce method within this class. The reduce method receives a key and an
iterable list of values associated with that key (which have been grouped and sorted by the
framework). It then aggregates or combines these values to produce the final output key-value
pairs.
InputFormat:
This defines how the input data is read and split into records that are fed to the mappers. It
determines the input key-value pairs for the map phase.
OutputFormat:
This defines how the output of the reduce phase is written to the desired location, typically
HDFS.
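As an illustration (a sketch, not tied to any particular job in this unit), the choice of InputFormat determines the key-value types the map phase receives: the default TextInputFormat delivers the byte offset of each line (LongWritable) as the key and the line itself (Text) as the value, while KeyValueTextInputFormat splits each line into a Text key and a Text value. Both the input format and the matching output format are set on the Job:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfig {
    // Configures a job to read tab-separated "key<TAB>value" lines and to
    // write its final results back out as plain text lines.
    public static void configure(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class); // map input: (Text, Text)
        job.setOutputFormatClass(TextOutputFormat.class);       // reduce output written as text
    }
}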
Partitioner:
This optional component determines which reducer receives which intermediate key-value pair
from the mappers. By default, it uses hash-based partitioning (Hadoop's HashPartitioner) to
distribute data evenly across reducers, but custom partitioners can be implemented for specific needs.
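As a sketch of a custom Partitioner (a hypothetical example, not part of the Hadoop distribution), the class below routes words beginning with a–m to one reducer and all other words to another, overriding the default hash-based behaviour:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        // With a single reducer (or an empty key) there is only one valid partition.
        if (numPartitions == 1 || word.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        // Words starting with a-m go to partition 0, everything else to partition 1.
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

Such a class is registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class); the value returned by getPartition must always lie between 0 and numPartitions - 1.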
Combiner (Optional):
This is an optional "mini-reducer" that runs on the mapper side before the data is shuffled to the
reducers. It performs local aggregation to reduce the amount of data transferred over the
network, improving efficiency.
The framework handles the entire workflow, including splitting input data, distributing tasks to
nodes, managing communication and data transfers between map and reduce phases (shuffle and
sort), and ensuring fault tolerance by re-executing failed tasks. The key and value classes used
throughout the process must implement the Writable interface for serialization, and key classes
also need to implement WritableComparable for sorting.
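For example, the following is a sketch of a hypothetical composite key type, YearTempKey, that can be used as a MapReduce key because it implements WritableComparable: write() and readFields() give the framework serialization, and compareTo() gives it the sort order used during the shuffle.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }                        // no-arg constructor required by the framework

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {                 // sort order used during shuffle and sort
        int cmp = Integer.compare(year, other.year);
        return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
    }

    @Override
    public int hashCode() {                                    // used by the default hash partitioning
        return 31 * year + temperature;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearTempKey)) return false;
        YearTempKey k = (YearTempKey) o;
        return year == k.year && temperature == k.temperature;
    }

    @Override
    public String toString() {
        return year + "\t" + temperature;
    }
}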
Features of MapReduce:
MapReduce is a programming model and software framework used for processing large datasets
in a distributed and parallel manner across a cluster of computers. Its core features contribute to
its effectiveness in big data processing:
Key Features of MapReduce:
Scalability:
MapReduce can handle massive datasets by distributing computations across a large number of
commodity machines. As data volume grows, more nodes can be added to the cluster to
maintain performance.
Fault Tolerance:
It is inherently designed to handle failures. If a node or task fails, the framework automatically
detects the failure and re-executes the task on another available node, ensuring job completion
without manual intervention.
Data Locality:
MapReduce aims to process data where it resides (on the same node or rack), minimizing
network traffic and improving efficiency. This is achieved by scheduling tasks on nodes that
store the relevant data blocks.
Parallel Processing:
The framework enables parallel execution of tasks. The "Map" phase processes data
independently across multiple nodes, and the "Reduce" phase aggregates the results in parallel.
Simplicity (for Developers):
Developers primarily need to implement two functions: map() and reduce(). The complex
details of distributed processing, fault tolerance, and scheduling are handled by the MapReduce
framework.
Cost-Effectiveness:
It leverages commodity hardware, making it a cost-efficient solution for large-scale data
processing compared to traditional high-performance computing systems.
Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate
key-value pairs. Maps are the individual tasks that transform the input records into intermediate
records. The transformed intermediate records need not be of the same type as the input records.
A given input pair may map to zero or many output pairs.
Method: map is the most prominent method of the Mapper class.
The syntax is defined below −
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
This method is called once for each key-value pair in the input split.
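A minimal Mapper sketch that follows this signature is shown below. It assumes the default TextInputFormat, so the input key is the byte offset of a line (LongWritable) and the input value is the line itself (Text); it emits an intermediate (word, 1) pair for every token, which may be zero pairs for an empty line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record: split the line into tokens and emit
        // an intermediate (word, 1) pair for each one.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}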
Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values
that share a key to a smaller set of values. Reducer implementations can access the Configuration
for a job via the JobContext.getConfiguration() method.
A Reducer has three primary phases − Shuffle, Sort, and Reduce.
Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the
network.
Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may
have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs
are being fetched, they are merged.
Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each <key,
(collection of values)> pair in the sorted inputs.
Method: reduce is the most prominent method of the Reducer class.
The syntax is defined below −
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
This method is called once for each key, with the grouped collection of values associated with that key.
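A minimal Reducer sketch that follows this signature is shown below; it sums the 1s emitted for each word by the mapper sketched earlier and writes the final (word, count) pair.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of that key's grouped, sorted values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);     // final (word, count) output pair
    }
}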