L04-MapReduce

The document provides an overview of MapReduce, a big data processing technique that utilizes parallel processing to handle large datasets. It outlines the MapReduce algorithm, detailing its phases: split, map, partition, shuffle and sort, and reduce, with examples illustrating how to count occurrences, calculate averages, and filter data. The lecture emphasizes the divide-and-conquer principle and the sequential execution of map and reduce tasks in a cluster environment.

Uploaded by

abdullah.oudaa

Lecture 4

Big Data Processing Techniques

(MapReduce)

Dr. Lydia Wahid

Agenda

MapReduce Definition

MapReduce Algorithm

MapReduce Examples

MapReduce Definition
➢ MapReduce is a widely used Big Data processing technique.

➢ It processes large datasets using parallel processing deployed over clusters of hardware.

➢ It is based on the principle of divide-and-conquer: it divides a big problem into a collection of smaller problems that can each be solved quickly.

➢ A dataset is broken down into multiple smaller parts, and operations are performed on each part independently and in parallel.

➢ The results from all operations are then combined to arrive at the result for the whole dataset.
➢ Each MapReduce job is composed of a map phase and a reduce phase, and each phase consists of multiple stages.

➢ The Map and Reduce phases run sequentially in a cluster.

➢ The Map phase is executed first, then the Reduce phase.

➢ The output of the Map phase becomes the input of the Reduce phase.

➢ MapReduce does not require that the input data conform to any particular data model.
[Figure: data flow in MapReduce]

➢ In MapReduce, map tasks run in parallel with one another, and so do reduce tasks.

➢ First, all map tasks run independently.

➢ Meanwhile, reduce tasks wait until their respective map tasks have finished.

➢ Then, reduce tasks process their data concurrently and independently.
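
The ordering described here (parallel map tasks, a wait, then parallel reduce tasks) can be sketched on a single machine with a thread pool. This is only an illustration, not how a cluster framework is implemented: `run_job`, `map_fn`, `reduce_fn` and the sample splits are hypothetical names introduced for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_job(splits, map_fn, reduce_fn, workers=4):
    """Run every map task in parallel, wait, then run every reduce task in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one independent task per split.
        # list(...) blocks until all map tasks have finished.
        map_outputs = list(pool.map(map_fn, splits))

        # Collect the intermediate (key, value) pairs by key.
        groups = defaultdict(list)
        for pairs in map_outputs:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: one independent task per key group,
        # started only after the map phase is complete.
        keys = sorted(groups)
        results = pool.map(lambda k: reduce_fn(k, groups[k]), keys)
        return dict(results)


# Hypothetical usage: count words across two splits.
counts = run_job(
    ["a b a", "b c"],
    map_fn=lambda split: [(word, 1) for word in split.split()],
    reduce_fn=lambda key, values: (key, sum(values)),
)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

The barrier between the two phases is the `list(pool.map(...))` call: no reduce task is submitted until every map task has returned.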

MapReduce Algorithm

[Figure: stages of a MapReduce job from INPUT to OUTPUT, beginning with Split and Map, grouped into a Map Phase and a Reduce Phase]

➢ We will now apply and explain each stage using the following example:
• Problem statement: count the number of occurrences of each word in a dataset.

[Figure: input dataset and required output]

➢ Split stage:
• Takes the input dataset and divides it into smaller sub-datasets called splits.
• Each split is parsed into its constituent records as key-value pairs. The key is usually the ordinal position of the record, and the value is the actual record.
• A common example reads a directory full of text files and returns each line as a record.
• The key-value pairs for each split are then sent to a map function (or mapper).
➢ Applying this stage to our example gives the following:

[Figure: split stage applied to the input dataset]

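
As a rough illustration of the split stage, the sketch below turns a list of text lines into splits of (ordinal position, line) records. The function name `split_records`, the fixed `split_size` and the sample lines are assumptions made for this sketch; real frameworks typically split by data size, not by a fixed record count.

```python
def split_records(lines, split_size):
    """Divide the input into splits; each record is (ordinal position, line)."""
    # The key is the ordinal position of the record, the value is the record itself.
    records = list(enumerate(lines))
    # Cut the list of records into consecutive chunks of at most split_size records.
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


# Hypothetical input: three lines of text, two records per split.
splits = split_records(["big data", "map reduce", "big cluster"], split_size=2)
print(splits)
# [[(0, 'big data'), (1, 'map reduce')], [(2, 'big cluster')]]
```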
➢ Map stage:
• This is the map function, or mapper, which executes user-defined logic.
• The mapper processes each key-value pair according to the user-defined logic and generates key-value pairs as its output.
• The output key can be the same as the input key, a substring of the input value, or another user-defined object.
• Similarly, the output value can be the same as the input value, a substring of the input value, or another user-defined object.
• When all records of the split have been processed, the output is a list of key-value pairs in which multiple pairs can exist for the same key.
➢ Applying this stage to our example gives the following:

[Figure: map stage (mapper) applied to the splits]

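
For the word-count example, a mapper could look roughly like the sketch below. The names `word_count_mapper` and `run_mapper` and the sample split are hypothetical; the mapper ignores the input key (the record position) and uses each word of the value as the output key.

```python
def word_count_mapper(key, value):
    """User-defined map logic: emit (word, 1) for every word in the record."""
    # The input key (record position) is not needed for counting words.
    for word in value.split():
        yield (word, 1)


def run_mapper(split):
    """Apply the mapper to every (key, value) record of one split."""
    output = []
    for key, value in split:
        output.extend(word_count_mapper(key, value))
    return output


# Hypothetical split with two records.
print(run_mapper([(0, "big data big"), (1, "data")]))
# [('big', 1), ('data', 1), ('big', 1), ('data', 1)]
```

Note that the same key can appear several times in the output list; grouping happens later, in the shuffle and sort stage.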
➢ Partition stage:
• During the partition stage, if more than one reducer is involved, a partitioner divides the output from the mapper into partitions, one per reducer instance.
• All records for a particular key are assigned to the same reducer.
• The partitioner aims for a fair, roughly even distribution of keys between reducers, while making sure that records with the same key, even when produced by different mappers, end up at the same reducer. An even spread is not guaranteed when some keys occur far more often than others.
➢ In our example, assume that we have only one reducer.
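
One common way to meet both requirements is to hash the key and take the remainder modulo the number of reducers. The sketch below assumes string keys and uses `zlib.crc32` as a deterministic hash; `partition` and `partition_output` are hypothetical names, and other partitioning schemes are possible.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    # crc32 is deterministic across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Divide one mapper's output into one partition per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions


# With a single reducer, as in the running example, everything lands in partition 0.
print(partition_output([("big", 1), ("data", 1), ("big", 1)], num_reducers=1))
# [[('big', 1), ('data', 1), ('big', 1)]]
```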

➢ Shuffle and Sort stage:
• During the first stage of the reduce task, output from all partitioners is copied across the network to the nodes running the reduce task. This is known as shuffling.
• The list of key-value pairs from each partitioner can contain the same key multiple times, so the pairs are sorted and merged by key. The result is a sorted list of all input keys and their values, with identical keys appearing together.
• This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values.
• The way in which keys are sorted and merged can be customized.
➢ Applying this stage to our example gives the following:

[Figure: shuffle and sort stage]

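
The sort-and-merge behaviour of this stage can be sketched in memory as follows. The network copy is replaced here by simply concatenating lists, and `shuffle_and_sort` is a hypothetical name for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(partition_outputs):
    """Merge the outputs of all partitioners and group the values by key."""
    # "Shuffle": gather every (key, value) pair in one place.
    merged = [pair for output in partition_outputs for pair in output]
    # Sort by key so that identical keys become adjacent.
    merged.sort(key=itemgetter(0))
    # Merge: one (key, [values]) pair per group of identical keys.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical outputs from two mappers.
print(shuffle_and_sort([[("data", 1), ("big", 1)], [("big", 1)]]))
# [('big', [1, 1]), ('data', [1])]
```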
➢ Reduce stage:
• Reduce is the final stage of the reduce phase.
• Depending on the user-defined logic in the reduce function (or reducer), the reducer either further summarizes its input or emits it unchanged.
• The output key can be the same as the input key, a substring of the input value, or another user-defined object.
• The output value can be the same as the input value, a substring of the input value, or another user-defined object.
➢ Applying this stage to our example gives the following:

[Figure: reduce stage (reducer) producing the final output]

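
The two reducer behaviours mentioned for this stage, summarizing or passing the input through, might be sketched like this for the word-count example. `word_count_reducer`, `identity_reducer` and `run_reducer` are hypothetical names.

```python
def word_count_reducer(key, values):
    """Summarizing reducer: collapse the list of values into a single count."""
    yield (key, sum(values))


def identity_reducer(key, values):
    """Pass-through reducer: emit every value unchanged."""
    for value in values:
        yield (key, value)


def run_reducer(reducer, grouped):
    """Apply a reducer to every (key, [values]) pair from shuffle and sort."""
    output = []
    for key, values in grouped:
        output.extend(reducer(key, values))
    return output


grouped = [("big", [1, 1]), ("data", [1])]
print(run_reducer(word_count_reducer, grouped))  # [('big', 2), ('data', 1)]
print(run_reducer(identity_reducer, grouped))    # [('big', 1), ('big', 1), ('data', 1)]
```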
➢ Consider another example:
• The input is product information, and the required output is the total quantity of each product.

[Figure: Map Phase and Reduce Phase for the product quantity example]

1. The input (sales.txt) is divided into two splits.
2. Two map tasks running on two different nodes, Node A and Node B, extract product and quantity from the respective split’s records in parallel. The output from each map function is a key-value pair where product is the key and quantity is the value.
3. The combiner then performs local summation of product quantities. (A combiner is essentially a reducer function that locally groups a mapper’s output on the same node as the mapper.)
4. As there is only one reduce task, no partitioning is performed.
5. The output from the two map tasks is then copied to a third node, Node C, that runs the shuffle stage as part of the reduce task.
6. The sort stage then groups all quantities of the same product together as a list.
7. The reduce function then sums up the quantities of each unique product to create the output.
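
The seven steps can be imitated in a few lines of Python. The contents of sales.txt are not shown in the slides, so the "product,quantity" record format and the sample records below are assumptions, as are the function names.

```python
from collections import defaultdict


def sales_mapper(record):
    """Step 2: extract (product, quantity) from one record (assumed 'product,quantity')."""
    product, quantity = record.split(",")
    return (product.strip(), int(quantity))


def combine(pairs):
    """Step 3: local summation of one mapper's output, on the mapper's node."""
    totals = defaultdict(int)
    for product, quantity in pairs:
        totals[product] += quantity
    return sorted(totals.items())


def reduce_all(combined_outputs):
    """Steps 5 to 7: gather both outputs, group by product, sum each group."""
    grouped = defaultdict(list)
    for output in combined_outputs:          # shuffle: copy to the reducer
        for product, quantity in output:
            grouped[product].append(quantity)  # sort/group: list per product
    return {product: sum(q) for product, q in sorted(grouped.items())}


# Step 1: two hypothetical splits of sales.txt.
split_a = ["apple,2", "pear,1", "apple,3"]
split_b = ["pear,4", "apple,1"]

node_a = combine(sales_mapper(r) for r in split_a)  # [('apple', 5), ('pear', 1)]
node_b = combine(sales_mapper(r) for r in split_b)  # [('apple', 1), ('pear', 4)]
totals = reduce_all([node_a, node_b])
print(totals)  # {'apple': 6, 'pear': 5}
```

Because the combiner already sums locally, Node A sends two pairs across the network instead of three; the final totals are the same with or without it.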
MapReduce Examples
➢ For the examples in this section, we will use data similar to that collected by a web analytics service, which shows various statistics for page visits to a website.
➢ Each page has some tracking code that sends the visitor’s IP address along with a timestamp to the web analytics service. The service keeps a record of all page visits and visitor IP addresses, and uses MapReduce programs to compute various statistics.
➢ Each visit to a page is logged as one row in the log. The log file contains the following columns:
Date (YYYY-MM-DD), Time (HH:MM:SS), URL, IP, Visit-Length.
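
A mapper first has to parse each log row into these five columns. The slides do not state the field separator or the unit of Visit-Length, so the sketch below assumes whitespace-separated fields and an integer visit length; `Visit`, `parse_visit` and the sample row are hypothetical.

```python
from typing import NamedTuple


class Visit(NamedTuple):
    date: str          # YYYY-MM-DD
    time: str          # HH:MM:SS
    url: str
    ip: str
    visit_length: int  # assumed to be an integer


def parse_visit(line):
    """Parse one log row, assuming the five columns are separated by whitespace."""
    date, time, url, ip, visit_length = line.split()
    return Visit(date, time, url, ip, int(visit_length))


# Hypothetical log row.
visit = parse_visit("2014-12-01 10:15:00 http://example.com/contact.html 10.0.0.1 30")
print(visit.url, visit.visit_length)  # http://example.com/contact.html 30
```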

1. Count: compute the number of visits to each page of the given website.

[Figure: part of the input, showing its format]

1. Count computation explanation:
• To compute a count, the mapper function emits key-value pairs where the key is the field to group by.
• The mapper function in this example parses each line of the input and emits key-value pairs where the key is the URL and the value is ‘1’.
• The reducer function receives the key-value pairs grouped by key and adds up the values for each group to compute the count.
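
A minimal in-memory sketch of the count computation follows. It assumes whitespace-separated log rows in the column order given earlier; the function names and the sample rows are hypothetical, and the grouping loop stands in for the shuffle and sort stage.

```python
from collections import defaultdict


def count_mapper(line):
    """Emit (URL, 1) for one log row (assumed whitespace-separated)."""
    _date, _time, url, _ip, _visit_length = line.split()
    return (url, 1)


def count_reducer(url, ones):
    """Add up the 1s for one URL."""
    return (url, sum(ones))


def count_visits(lines):
    groups = defaultdict(list)
    for line in lines:                 # map
        url, one = count_mapper(line)
        groups[url].append(one)        # stand-in for shuffle and sort
    return dict(count_reducer(url, ones) for url, ones in groups.items())  # reduce


log = [
    "2014-12-01 10:15:00 /index.html 10.0.0.1 30",
    "2014-12-01 10:16:00 /contact.html 10.0.0.2 12",
    "2014-12-02 09:00:00 /index.html 10.0.0.3 45",
]
print(count_visits(log))  # {'/index.html': 2, '/contact.html': 1}
```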

2. Average: find the average time spent on each page of the given website.

[Figure: part of the input, showing its format]

2. Average computation explanation:
• To compute the average, the mapper function emits key-value pairs where the key is the field to group by and the value contains the items required to compute the average.
• The mapper function in this example parses each line of the input and emits key-value pairs where the key is the URL and the value is the visit length.
• The reducer receives the list of values grouped by the key (which is the URL) and finds the average of these values.
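
The average computation differs from the count only in the emitted value and the reducer. The sketch below again assumes whitespace-separated rows with a numeric visit length; the names and sample rows are hypothetical.

```python
from collections import defaultdict


def average_mapper(line):
    """Emit (URL, visit length) for one log row (assumed whitespace-separated)."""
    _date, _time, url, _ip, visit_length = line.split()
    return (url, float(visit_length))


def average_reducer(url, visit_lengths):
    """Average the visit lengths received for one URL."""
    return (url, sum(visit_lengths) / len(visit_lengths))


def average_visit_length(lines):
    groups = defaultdict(list)
    for line in lines:                        # map
        url, visit_length = average_mapper(line)
        groups[url].append(visit_length)      # stand-in for shuffle and sort
    return dict(average_reducer(url, v) for url, v in groups.items())  # reduce


log = [
    "2014-12-01 10:15:00 /index.html 10.0.0.1 10",
    "2014-12-01 10:16:00 /contact.html 10.0.0.2 12",
    "2014-12-02 09:00:00 /index.html 10.0.0.3 20",
]
print(average_visit_length(log))  # {'/index.html': 15.0, '/contact.html': 12.0}
```

The reducer never sees an empty list, because a key only exists if at least one value was emitted for it.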

3. Top-N: find the top 3 most visited pages in the given website.

[Figure: part of the input, showing its format]

a. What will be the output of the shuffle and sort stage?

b. In Reduce-2, how many reducers do we have?

3. Top-N computation explanation:
• The mapper function in this example parses each line of the input and emits key-value pairs where the key is the URL and the value is ‘1’.
• The first reducer receives the list of values grouped by key and sums them to count the visits for each page.
• The first reducer emits None as the key and a tuple comprising the page visit count and the page URL as the value.
• The second reducer receives a list of (visit count, URL) pairs all grouped together (as the key is None). It sorts the pairs by visit count and emits the top 3 visit counts along with their page URLs.
• In this example, a two-step job is required because the page visit counts must be computed before the top 3 visited pages can be found.
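
The two-step job can be sketched as two reducers chained together. As before, the log format (whitespace-separated rows), the function names and the sample data are assumptions made for illustration; ties are broken here by URL order, which the slides do not specify.

```python
from collections import defaultdict


def step1_mapper(line):
    """Emit (URL, 1) for one log row (assumed whitespace-separated)."""
    _date, _time, url, _ip, _visit_length = line.split()
    return (url, 1)


def step1_reducer(url, ones):
    """Count visits, then emit None as key so all counts meet in one group."""
    return (None, (sum(ones), url))


def step2_reducer(_key, count_url_pairs, n=3):
    """Sort the (visit count, URL) pairs and keep the n largest."""
    return sorted(count_url_pairs, reverse=True)[:n]


def top_n(lines, n=3):
    visits = defaultdict(list)
    for line in lines:
        url, one = step1_mapper(line)
        visits[url].append(one)
    step1_output = [step1_reducer(url, ones) for url, ones in visits.items()]

    grouped = defaultdict(list)  # every pair has key None, so there is one group
    for key, value in step1_output:
        grouped[key].append(value)
    return step2_reducer(None, grouped[None], n)


urls = ["/a"] * 3 + ["/b"] * 2 + ["/c"] + ["/d"] * 4
log = [f"2014-12-01 10:00:00 {url} 10.0.0.1 5" for url in urls]
print(top_n(log))  # [(4, '/d'), (3, '/a'), (2, '/b')]
```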

4. Filtering:
• Filter out a subset of the records based on a filtering criterion.
• For example: filtering all page visits for the page ‘contact.html’ in December 2014.

[Figure: part of the input, showing its format]

4. Filtering computation explanation:
• Filtering is useful when you want to get a subset of the data for further processing.
• Filtering requires only a map task.
• Each mapper filters its local records based on the filtering criterion in the map function.
• The mapper function in this example parses each line of the input, extracts the month, year and page URL, and emits a key-value pair if the month and year are Dec 2014 and the page URL is ‘http://example.com/contact.html’.
• The key is the URL, and the value is a tuple containing the rest of the parsed fields.
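
A map-only filter for this criterion might look like the sketch below. It assumes whitespace-separated log rows with the date in YYYY-MM-DD form; `filter_mapper`, `run_filter` and the sample rows are hypothetical.

```python
TARGET_URL = "http://example.com/contact.html"


def filter_mapper(line):
    """Emit (URL, remaining fields) only for visits to TARGET_URL in Dec 2014."""
    date, time, url, ip, visit_length = line.split()
    year, month, _day = date.split("-")
    if (year, month) == ("2014", "12") and url == TARGET_URL:
        yield (url, (date, time, ip, visit_length))
    # Rows that do not match produce no output at all.


def run_filter(lines):
    """Map-only job: no shuffle, sort or reduce is needed."""
    return [pair for line in lines for pair in filter_mapper(line)]


log = [
    "2014-12-05 08:00:00 http://example.com/contact.html 10.0.0.1 20",
    "2014-11-30 08:00:00 http://example.com/contact.html 10.0.0.2 15",
    "2014-12-06 09:30:00 http://example.com/index.html 10.0.0.3 40",
]
print(run_filter(log))
# [('http://example.com/contact.html', ('2014-12-05', '08:00:00', '10.0.0.1', '20'))]
```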
Thank You
