
L04-MapReduce

The document provides an overview of MapReduce, a big data processing technique that utilizes parallel processing to handle large datasets. It outlines the MapReduce algorithm, detailing its phases: split, map, partition, shuffle and sort, and reduce, with examples illustrating how to count occurrences, calculate averages, and filter data. The lecture emphasizes the divide-and-conquer principle and the sequential execution of map and reduce tasks in a cluster environment.

Uploaded by

abdullah.oudaa

Lecture 4

Big Data Processing Techniques


(MapReduce)

Dr. Lydia Wahid

Agenda

MapReduce Definition

MapReduce Algorithm

MapReduce Examples

MapReduce Definition

➢ MapReduce is a widely used Big data processing technique.

➢ It processes large datasets using parallel processing deployed over clusters of hardware.

➢ It is based on the principle of divide-and-conquer. It divides a big problem into a collection of smaller problems that can each be solved quickly.

➢ A dataset is broken down into multiple smaller parts, and operations are performed on each part independently and in parallel.

➢ The results from all operations are then combined to arrive at the result of the whole dataset.
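
As a minimal sketch of the divide-and-conquer idea described above, the following Python snippet breaks a small list into parts, processes each part independently, and combines the partial results. The data, the chunk size, and the helper name `chunks` are hypothetical and chosen only for illustration; on a real cluster the per-part work would run on different machines.

```python
def chunks(data, size):
    """Yield consecutive parts of `data`, each with at most `size` items."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

dataset = [3, 1, 4, 1, 5, 9, 2, 6]            # hypothetical dataset

parts = list(chunks(dataset, 3))              # divide into smaller parts
partial_sums = [sum(part) for part in parts]  # solve each part independently
total = sum(partial_sums)                     # combine the partial results
```

Because each partial sum depends only on its own part, the middle step could run in parallel without changing the final result.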
➢ Each MapReduce job is composed of a map phase and a reduce phase, and each phase consists of multiple stages.

➢ The Map and Reduce phases run sequentially in a cluster.

➢ The Map phase is executed first, then the Reduce phase.

➢ The output of the Map phase becomes the input of the Reduce phase.

➢ MapReduce does not require that the input data conform to any particular
data model.
➢The following figure shows the data flow in MapReduce:

➢ In MapReduce, the tasks within each phase run in parallel.

➢ First, all map tasks run independently of one another.

➢ Meanwhile, reduce tasks wait until their respective map tasks have finished.

➢ Then, the reduce tasks process their data concurrently and independently.

MapReduce Algorithm

[Figure: stages of a MapReduce job from INPUT to OUTPUT, grouped into the Map Phase (including Split and Map) and the Reduce Phase.]

➢ We will now explain each stage and apply it to the following example:
• Problem Statement:
Count the number of occurrences of each word in a dataset.

[Figure: the input dataset and the required output.]

➢ Split stage:
• Takes the input dataset and divides it into smaller sub-datasets called splits.
• Each split is parsed into its constituent records, each represented as a key-value pair. The key is usually the ordinal position of the record, and the value is the actual record.
• A common example reads a directory full of text files and returns each line as a record.
• The key-value pairs for each split are then sent to a map function (or mapper).
➢ Applying this stage to our example, we get the following:


[Figure: the split stage applied to the input dataset.]
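
The split stage can be sketched in Python as follows. The lecture's actual input dataset is in a figure that is not reproduced here, so the four lines below are a hypothetical stand-in, and `make_splits` is an illustrative helper rather than part of any framework. Each record becomes a key-value pair whose key is the record's ordinal position and whose value is the line itself.

```python
# Hypothetical input dataset: one record per line of text.
lines = [
    "deer bear river",
    "car car river",
    "deer car bear",
    "river bear car",
]

def make_splits(records, num_splits):
    """Divide records into contiguous splits of (key, value) pairs.

    The key is the record's ordinal position; the value is the record.
    """
    pairs = list(enumerate(records))
    size = -(-len(pairs) // num_splits)  # ceiling division
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]

splits = make_splits(lines, 2)
```

Each element of `splits` would then be handed to its own mapper.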

➢ Map stage:
• The map function (or mapper) executes user-defined logic.
• The mapper processes each key-value pair according to the user-defined logic and generates zero or more key-value pairs as its output.
• The output key can be the same as the input key, a substring of the input value, or another user-defined object.
• Similarly, the output value can be the same as the input value, a substring of the input value, or another user-defined object.
• When all records of the split have been processed, the output is a list of key-value pairs in which multiple key-value pairs can exist for the same key.
➢ Applying this stage to our example, we get the following:

[Figure: the map stage (mapper) applied to the example.]
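
For the word-count problem, a mapper might look like the sketch below. The split contents are hypothetical (the lecture's figure is not available), and the function name `mapper` is only illustrative. The input key (the line position) is ignored, and one `(word, 1)` pair is emitted per word, so the same key can appear several times in the output.

```python
def mapper(key, value):
    """Word-count mapper: ignore the line position and emit (word, 1) per word."""
    for word in value.split():
        yield (word, 1)

# Hypothetical split: (ordinal position, line) pairs.
split = [(0, "deer bear river"), (1, "car car river")]

map_output = [pair for key, value in split for pair in mapper(key, value)]
```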

➢ Partition stage:
• During the partition stage, if more than one reducer is involved, a partitioner divides the output from the mapper into partitions, one per reducer instance.
• All records for a particular key are assigned to the same reducer.
• The partitioner aims for a fair distribution of keys between reducers (commonly by hashing the key), while making sure that records with the same key from multiple mappers end up with the same reducer. An even load is not guaranteed if some keys have far more records than others.
➢ Assume that in our example we have only one reducer.
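
To illustrate what a partitioner does when there is more than one reducer, here is a minimal sketch that assumes hash-based partitioning. The lecture does not say which partitioning scheme is used, so this is one common choice rather than the definitive one; `partition` and the sample pairs are hypothetical. A stable hash (`zlib.crc32`) is used instead of Python's built-in `hash`, which can differ between processes for strings.

```python
import zlib

def partition(key, num_reducers):
    """Return the index of the reducer responsible for `key`.

    A stable hash ensures every mapper sends a given key to the same reducer.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical mapper output.
map_output = [("deer", 1), ("bear", 1), ("car", 1), ("car", 1), ("river", 1)]

num_reducers = 2
partitions = {r: [] for r in range(num_reducers)}
for key, value in map_output:
    partitions[partition(key, num_reducers)].append((key, value))
```

With `num_reducers = 1`, every key maps to reducer 0, which matches the single-reducer assumption made for the running example.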

➢ Shuffle and Sort stage:
• During the first stage of the reduce task, output from all partitioners is copied across the network to the nodes running the reduce task. This is known as shuffling.
• The output list of key-value pairs from each partitioner can contain the same key multiple times, so the key-value pairs are sorted and merged by key. The result is a sorted list of all input keys and their values, with identical keys appearing together.
• This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values.
• The way in which keys are sorted and merged can be customized.
➢ Applying this stage to our example, we get the following:


[Figure: the shuffle and sort stage applied to the example.]
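
The shuffle, sort, and merge steps can be sketched on a single machine as follows. The two mapper outputs are hypothetical, and the network copy is simulated by simply concatenating the lists; the sort-then-group approach shown here is one straightforward way to obtain one `(key, list of values)` pair per key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical outputs of two map tasks.
map_outputs = [
    [("deer", 1), ("bear", 1), ("river", 1)],
    [("car", 1), ("car", 1), ("river", 1)],
]

# Shuffle: gather all pairs at the node running the reduce task.
shuffled = [pair for output in map_outputs for pair in output]

# Sort by key so identical keys sit next to each other.
shuffled.sort(key=itemgetter(0))

# Merge: one (key, [values]) pair per group.
grouped = [
    (key, [value for _, value in group])
    for key, group in groupby(shuffled, key=itemgetter(0))
]
```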

➢ Reduce stage:
• Reduce is the final stage of the reduce phase.
• Depending on the user-defined logic specified in the reduce function (or reducer), the reducer either further summarizes its input or emits it without making any changes.
• The output key can be the same as the input key, a substring of the input value, or another user-defined object.
• The output value can be the same as the input value, a substring of the input value, or another user-defined object.
➢ Applying this stage to our example, we get the following:


[Figure: the reduce stage (reducer) applied to the example, producing the final output.]
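
For word count, the reducer summarizes each group by adding up its values. The grouped input below is hypothetical (the figure with the lecture's actual data is not available), and `reducer` is an illustrative name.

```python
def reducer(key, values):
    """Word-count reducer: sum the list of 1s collected for a word."""
    return (key, sum(values))

# Hypothetical output of the shuffle and sort stage.
grouped = [
    ("bear", [1, 1]),
    ("car", [1, 1, 1]),
    ("deer", [1, 1]),
    ("river", [1, 1]),
]

final_output = [reducer(key, values) for key, values in grouped]
```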

➢ Consider another example as follows:
• We have product information as input, and we need the quantity of each product as output.

[Figure: the Map Phase and Reduce Phase of this example.]

1. The input (sales.txt) is divided into two splits.
2. Two map tasks running on two different nodes, Node A and Node B, extract product and quantity from the respective split’s records in parallel. The output from each map function is a key-value pair where product is the key while quantity is the value.
3. The combiner then performs local summation of product quantities. (A combiner is essentially a reducer function that locally groups a mapper’s output on the same node as the mapper.)
4. As there is only one reduce task, no partitioning is performed.
5. The output from the two map tasks is then copied to a third node, Node C, that runs the shuffle stage as part of the reduce task.
6. The sort stage then groups all quantities of the same product together as a list.
7. The reduce function then sums up the quantities of each unique product in order to create the output.
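
The seven steps above can be sketched in a few lines of Python. The contents of sales.txt are not shown in the source, so the records and their `product,quantity` format are assumptions, as are the helper names `map_sale` and `sum_by_key`. For brevity, `sum_by_key` folds the sort/group and sum steps into one function; it can serve as both combiner and reducer because addition is associative and commutative.

```python
from collections import defaultdict

def map_sale(record):
    """Extract (product, quantity) from a 'product,quantity' record (assumed format)."""
    product, quantity = record.split(",")
    return (product.strip(), int(quantity))

def sum_by_key(pairs):
    """Sum values per key; used as the combiner (per node) and as the reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical splits of sales.txt.
split_a = ["apple,2", "pear,1", "apple,3"]   # processed on Node A
split_b = ["pear,4", "apple,1", "plum,5"]    # processed on Node B

# Map followed by the local combiner on each node.
combined_a = sum_by_key(map_sale(r) for r in split_a)
combined_b = sum_by_key(map_sale(r) for r in split_b)

# Shuffle to Node C, then reduce.
result = sum_by_key(combined_a + combined_b)
```

The combiner shrinks what has to be copied to Node C: Node A sends two pairs instead of three.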
MapReduce Examples

➢For the examples in this section, we will use data similar to the data
collected by a web analytics service that shows various statistics for page
visits for a website.
➢Each page has some tracking code which sends the visitor’s IP address
along with a timestamp to the web analytics service. The web analytics
service keeps a record of all page visits and the visitor IP addresses and
uses MapReduce programs for computing various statistics.
➢Each visit to a page is logged as one row in the log. The log file contains
the following columns:
Date (YYYY-MM-DD), Time (HH:MM:SS), URL, IP, Visit-Length.

1. Count: Compute the number of visits to each page of the given website:
[Figure: part of the input, showing its format.]

1. Count computation Explanation:
• To compute count, the mapper function emits key-value pairs where the key is the field to group-by.
• The mapper function in this example parses each line of the input and emits key-value pairs where the key is the URL and value is ‘1’.
• The reducer function receives the key-value pairs grouped by the same key and adds up the values for each group to compute count.
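
A minimal local sketch of the count computation is shown below. The log rows are hypothetical, and a comma-separated layout is assumed for the five columns (Date, Time, URL, IP, Visit-Length); the source does not specify the delimiter. The grouping loop stands in for the shuffle and sort stage that a real framework would perform.

```python
from collections import defaultdict

def mapper(line):
    """Parse one log row and emit (URL, 1)."""
    date, time, url, ip, visit_length = line.split(",")
    yield (url, 1)

def reducer(url, counts):
    """Add up the 1s for a URL to get its visit count."""
    return (url, sum(counts))

# Hypothetical log rows: Date, Time, URL, IP, Visit-Length.
log = [
    "2014-12-01,10:00:00,http://example.com/index.html,10.0.0.1,30",
    "2014-12-01,10:05:00,http://example.com/contact.html,10.0.0.2,20",
    "2014-12-02,09:30:00,http://example.com/index.html,10.0.0.3,50",
]

# Stand-in for shuffle and sort: group mapper output by key.
groups = defaultdict(list)
for line in log:
    for url, one in mapper(line):
        groups[url].append(one)

visit_counts = dict(reducer(url, ones) for url, ones in groups.items())
```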

2. Average: Find the average time spent on each page in the given website:
[Figure: part of the input, showing its format.]

2. Average computation Explanation:
• To compute the average, the mapper function emits key-value pairs where the key is the field to group-by and value contains related items required to compute the average.
• The mapper function in this example parses each line of the input and emits key-value pairs where the key is the URL and value is the visit length.
• The reducer receives the list of values grouped by the key (which is the URL) and finds the average of these values.
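
The average computation can be sketched in the same way. As before, the log rows are hypothetical, the comma-separated layout is an assumption, and the visit length is assumed to be an integer number of seconds. The reducer receives every visit length for a URL, so it can divide the sum by the number of values.

```python
from collections import defaultdict

def mapper(line):
    """Parse one log row and emit (URL, visit length)."""
    date, time, url, ip, visit_length = line.split(",")
    yield (url, int(visit_length))

def reducer(url, visit_lengths):
    """Average the visit lengths collected for one URL."""
    return (url, sum(visit_lengths) / len(visit_lengths))

# Hypothetical log rows: Date, Time, URL, IP, Visit-Length.
log = [
    "2014-12-01,10:00:00,http://example.com/index.html,10.0.0.1,30",
    "2014-12-01,10:05:00,http://example.com/contact.html,10.0.0.2,20",
    "2014-12-02,09:30:00,http://example.com/index.html,10.0.0.3,50",
]

# Stand-in for shuffle and sort: group mapper output by key.
groups = defaultdict(list)
for line in log:
    for url, length in mapper(line):
        groups[url].append(length)

average_visit = dict(reducer(url, lengths) for url, lengths in groups.items())
```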

3. Top-N: Find the top 3 most visited pages in the given website:
[Figure: part of the input, showing its format.]

a. What will be the output of the shuffle and sort stage?

b. In Reduce-2, how many reducers do we have?

3. Top-N computation Explanation:
• The mapper function in this example parses each line of the input and emits key-value pairs where the key is the URL and value is ‘1’.
• The first reducer receives the list of values grouped by the key and sums up the values to count the visits for each page.
• The first reducer emits None as the key and a tuple comprising the page visit count and the page URL as the value.
• The second reducer receives a list of (visit count, URL) pairs all grouped together (as the key is None). The reducer sorts the pairs by visit count and emits the top 3 visit counts along with the page URLs.
• In this example, a two-step job is required because we need to compute the page visit counts before finding the top 3 visited pages.
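
The two-step job can be sketched locally as below. `run_step` is a hypothetical stand-in for one map, group, and reduce pass, not a real framework API; the log rows and their comma-separated layout are assumptions; and the identity mapper in the second step is an illustrative choice for passing the first step's output through unchanged. Visit counts are chosen to be distinct so that the ranking is unambiguous.

```python
from collections import defaultdict

def run_step(records, mapper, reducer):
    """Local stand-in for one MapReduce step: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

def map_visits(line):
    """Step 1 mapper: emit (URL, 1) for each log row."""
    yield (line.split(",")[2], 1)

def reduce_count(url, ones):
    """Step 1 reducer: emit None as key and (visit count, URL) as value."""
    yield (None, (sum(ones), url))

def map_identity(pair):
    """Step 2 mapper: pass (None, (count, URL)) through unchanged."""
    yield pair

def reduce_top3(_key, count_url_pairs):
    """Step 2 reducer: sort by visit count and emit the top 3 pages."""
    for count, url in sorted(count_url_pairs, reverse=True)[:3]:
        yield (url, count)

# Hypothetical log: a.html visited 4 times, b.html 3, c.html 2, d.html 1.
pages = ["a.html"] * 4 + ["b.html"] * 3 + ["c.html"] * 2 + ["d.html"]
log = [f"2014-12-01,10:00:00,http://example.com/{p},10.0.0.1,30" for p in pages]

step1 = run_step(log, map_visits, reduce_count)
top3 = run_step(step1, map_identity, reduce_top3)
```

Because every pair from the first step shares the key None, the second step has a single group, so a single reducer sees all the counts and can rank them.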

4. Filtering:
• Select a subset of the records based on a filtering criterion.
• For example: selecting all page visits for the page ‘contact.html’ in the month of Dec 2014.

[Figure: part of the input, showing its format.]

4. Filtering computation Explanation:
• Filtering is useful when you want to get a subset of the data for further processing.
• Filtering requires only a Map task.
• Each mapper filters its local records based on the filtering criteria in the map function.
• The mapper function in this example parses each line of the input, extracts the month, year and page URL, and emits a key-value pair if the month and year are Dec 2014 and the page URL is ‘http://example.com/contact.html’.
• The key is the URL, and the value is a tuple containing the rest of the parsed fields.
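
A map-only filtering job might look like the following sketch. The log rows are hypothetical, the comma-separated layout is an assumption, and `TARGET_URL` is an illustrative constant. The mapper emits a pair only for rows that match the criteria, so no reducer is needed.

```python
TARGET_URL = "http://example.com/contact.html"

def mapper(line):
    """Emit (URL, remaining fields) only for Dec 2014 visits to the target page."""
    date, time, url, ip, visit_length = line.split(",")
    year, month, _day = date.split("-")
    if (year, month) == ("2014", "12") and url == TARGET_URL:
        yield (url, (date, time, ip, visit_length))

# Hypothetical log rows: Date, Time, URL, IP, Visit-Length.
log = [
    "2014-12-01,10:05:00,http://example.com/contact.html,10.0.0.2,20",
    "2014-11-30,23:59:00,http://example.com/contact.html,10.0.0.4,15",
    "2014-12-02,09:30:00,http://example.com/index.html,10.0.0.3,50",
]

filtered = [pair for line in log for pair in mapper(line)]
```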
Thank You