
Madhav Institute of Technology and Science

Deemed to be University

CLOUD COMPUTING AND VIRTUALIZATION

Name: Nivesh Garg


Submitted to: Dr. Smita Parte
Class: 6th Semester (CSE)
Enrollment number: 0901CS211078
MAPREDUCE
Proficiency Presentation
What is MapReduce?
MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implement: 1) Map and 2) Reduce. In the Map step, data is split between parallel processing tasks, and transformation logic can be applied to each chunk of data. Once the Map step completes, the Reduce phase takes over to aggregate the data from the Map output. In general, MapReduce uses the Hadoop Distributed File System (HDFS) for both input and output.
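
To make the two steps concrete, below is the classic word-count example written against the Hadoop Java API. This is a minimal sketch following the standard Hadoop tutorial; the class names TokenizerMapper and IntSumReducer are illustrative and not part of these slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map step: emit a <word, 1> pair for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum the counts for each word after the shuffle.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

The framework calls map() once per input record and reduce() once per distinct key; all the data movement in between is handled by Hadoop.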
How does MapReduce work?
A MapReduce job is usually composed of three steps, even though it is often generalized as just the combination of the Map and Reduce operations. The three MapReduce operations are:

1. Map
2. Shuffle, combine and partition
3. Reduce
1. Map: The input data is first split into smaller blocks. The Hadoop framework then decides how many mappers to use, based on the size of the data to be processed and the memory available on each mapper server. Each block is then assigned to a mapper for processing. Each worker node applies the map function to its local data and writes the output to temporary storage. The primary (master) node ensures that only a single copy of any redundant input data is processed.

2. Shuffle, combine and partition: Worker nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key is located on the same worker node. As an optional step, a combiner (essentially a local reducer) can run on each mapper server to pre-aggregate that mapper's output, which shrinks the data footprint and makes shuffling and sorting cheaper. Partitioning, which is not optional, is the process that decides how the data is presented to the reducers and assigns each key to a particular reducer.

3. Reduce: A reducer cannot start while a mapper is still in progress. Worker nodes process each group of <key, value> pairs in parallel to produce new <key, value> pairs as output. All map output pairs that have the same key are assigned to a single reducer, which then aggregates the values for that key. Unlike the map function, which is mandatory to filter and sort the initial data, the reduce function is optional.
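
To illustrate the partition step, the sketch below shows a custom Partitioner written against the Hadoop Java API. The class name WordPartitioner is hypothetical; the logic simply mirrors the behaviour of Hadoop's default HashPartitioner.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives each map-output key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the partition index is never negative,
    // then spread keys evenly across the available reducers.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Because every pair with the same key gets the same partition number, a single reducer is guaranteed to see all the values for that key.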
Components of MapReduce Architecture:
Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Master.

Job: The MapReduce job is the actual work that the client wants done, which is comprised of many smaller tasks that the client wants to process or execute.

Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.

Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.

Input Data: The data set that is fed to MapReduce for processing.

Output Data: The final result obtained after processing.
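
These components come together in the driver program that the client runs. The sketch below, which assumes the TokenizerMapper and IntSumReducer classes from the earlier word-count example, shows a client building a Job, pointing it at the input and output data in HDFS, and submitting it to the framework.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count"); // the Job the client submits

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // optional combine step
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // Input Data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output Data

    // The master divides the submitted job into map and reduce
    // job-parts and schedules them on the worker nodes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}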
Advantages of MapReduce
Scalability
Flexibility
Security and authentication
Faster processing of data
Very simple programming model
Availability and resilient nature

MapReduce, as a programming model and framework, offers several key
features that make it useful for processing large-scale data:

Scalability: MapReduce is designed to handle massive datasets by distributing the processing across
multiple nodes in a cluster. This distributed nature allows it to scale horizontally, accommodating
increasing data volumes without requiring significant changes to the underlying infrastructure.

Fault Tolerance: MapReduce provides built-in fault tolerance mechanisms to ensure that
computations continue in the event of node failures. It achieves this through data replication and
task re-execution, allowing jobs to recover from failures without data loss or interruption.

Parallel Processing: MapReduce divides data processing tasks into smaller units, which can be
executed independently and in parallel across multiple nodes. This parallel processing capability
enables efficient utilization of cluster resources and accelerates data processing tasks.

Ease of Programming: MapReduce abstracts away the complexities of distributed computing,
allowing developers to focus on writing simple map and reduce functions to process data. The
framework handles data distribution, task scheduling, and fault tolerance transparently, making it
easier to develop and debug distributed applications.

Data Locality: MapReduce leverages data locality to minimize data movement across the cluster. By
processing data where it resides, MapReduce reduces network overhead and improves overall
performance. This locality-aware processing is crucial for optimizing performance in distributed
environments.

Flexibility: While MapReduce's primary programming model involves the map and reduce phases, it
can be extended and customized to support various data processing tasks. Developers can define
custom input/output formats, partitioning strategies, and combiner functions to tailor MapReduce
jobs to specific requirements.
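
As a brief illustration of these customization points, the snippet below is a sketch that reuses the hypothetical WordPartitioner and the word-count classes from the earlier examples, swapping in a different input format, partitioning strategy, and combiner on a Job before it is submitted.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class JobCustomization {
  // Illustrative customization hooks applied to an existing Job.
  static void customize(Job job) {
    job.setInputFormatClass(KeyValueTextInputFormat.class); // custom input format
    job.setPartitionerClass(WordPartitioner.class);         // custom partitioning strategy
    job.setCombinerClass(WordCount.IntSumReducer.class);    // combiner to shrink map output
    job.setNumReduceTasks(4);                               // tune reducer parallelism
  }
}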
THANK YOU FOR LISTENING!
