Introduction to Map Reduce
Map Reduce: Motivation
We realized that most of our computations involved applying a map
operation to each logical record in our input in order to compute a set
of intermediate key/value pairs, and then applying a reduce operation
to all the values that shared the same key in order to combine the
derived data appropriately.
The issues of how to parallelize the computation, distribute the data,
and handle failures conspire to obscure the original simple
computation with large amounts of complex code to deal with these
issues.
Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008).
Problem Scope
Need to scale to 100s or 1000s of computers, each with several
processor cores
How large is the amount of work?
Web-Scale data on the order of 100s of GBs to TBs or PBs
It is likely that the input data set will not fit on a single computer's hard drive
Hence, a distributed file system (e.g., Google File System- GFS) is typically
required
Problem Scope
Scalability to large data volumes:
Scan 100 TB on 1 node @ 50 MB/s ≈ 24 days
Scan on a 1000-node cluster ≈ 35 minutes
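These figures can be sanity-checked directly (a quick back-of-the-envelope calculation, assuming 100 TB of data and a 50 MB/s per-node sequential scan rate):

```python
# Back-of-the-envelope check of the scan times above.
# Assumptions: 100 TB of input, 50 MB/s sequential read rate per node.
TB = 10**12
MB = 10**6

data = 100 * TB          # bytes to scan
rate = 50 * MB           # bytes/second per node

one_node_days = data / rate / 86_400
cluster_minutes = data / (rate * 1000) / 60   # 1000 nodes scan in parallel

print(round(one_node_days), "days")       # 23 days (~ the quoted 24)
print(round(cluster_minutes), "minutes")  # 33 minutes (~ the quoted 35)
```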
Required functions
Automatic parallelization & distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
Functional programming meets
distributed computing
A batch data processing system
Commodity Clusters
Need to efficiently process large volumes of data by connecting many
commodity computers together to work in parallel
A theoretical 1000-CPU machine would cost far more than 1000 single-CPU or
250 quad-core machines
MapReduce & Hadoop - History
2003: Google publishes about its cluster architecture & distributed file
system (GFS)
2004: Google publishes about its MapReduce model used on top of GFS
Both GFS and MapReduce are written in C++ and are closed-source, with Python
and Java APIs available to Google programmers only
2006: Apache & Yahoo! -> Hadoop & HDFS
open-source, Java implementations of Google MapReduce and GFS with a diverse
set of APIs available to the public
Evolved from Apache Lucene/Nutch open-source web search engine
2008: Hadoop becomes an independent Apache project
Yahoo! uses Hadoop in production
Today: Hadoop is used as a general-purpose storage and analysis platform
for big data
Other Hadoop distributions from several vendors including EMC, IBM, Microsoft,
Oracle, Cloudera, etc.
Many users (http://wiki.apache.org/hadoop/PoweredBy)
Research and development actively continues...
Google Cluster Architecture: Key Ideas
Single-thread performance doesn't matter
For large problems, total throughput/$ is more important than peak performance
Stuff breaks
If you have 1 server, it may stay up three years (1,000 days).
If you have 10,000 servers, expect to lose 10 per day.
Ultra-reliable hardware doesn't really help
At large scales, the most reliable hardware still fails, albeit less often
Software still needs to be fault-tolerant
Commodity machines without fancy hardware give better performance/$
Have a reliable computing infrastructure from clusters of unreliable
commodity PCs.
Replicate services across many machines to increase request
throughput and availability.
Favor price/performance over peak performance.
What Makes MapReduce Unique?
Its simplified programming model which allows the user to quickly write
and test distributed systems
Its efficient and automatic distribution of data and workload across
machines
Its flat scalability curve. Specifically, after a MapReduce program is
written and functioning on 10 nodes, very little (if any) work is required
to make that same program run on 1000 nodes.
MapReduce ties smaller and more reasonably priced machines together
into a single cost-effective commodity cluster
Isolated Tasks
MapReduce divides the workload into multiple independent tasks and
schedules them across cluster nodes
The work performed by each task is done in isolation from the others
The amount of communication which can be performed by tasks is
mainly limited for scalability reasons
The communication overhead required to keep the data on the nodes
synchronized at all times would prevent the model from performing
reliably and efficiently at large scale
MapReduce in a Nutshell
Given:
a very large dataset
a well-defined computation task to be performed on elements of this dataset
(preferably, in a parallel fashion on a large cluster)
Map Reduce framework:
Just express what you want to compute (map() & reduce()).
Don't worry about parallelization, fault tolerance, data distribution, load
balancing (MapReduce takes care of these).
What changes from one application to another is the actual computation; the
programming structure stays similar.
In simple terms
Read lots of data.
Map: extract something that you care about from each record.
Shuffle and sort.
Reduce: aggregate, summarize, filter, or transform.
Write the results.
One can use as many Maps and Reduces as needed to model a given
problem.
Functional programming foundations
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
map in MapReduce ≈ map in FP
map :: (a → b) → [a] → [b]
Example: Double all numbers in a list.
> map ((*) 2) [1, 2, 3]
> [2, 4, 6]
In a purely functional setting, an element of a list being computed by
map cannot see the effects of the computations on other elements.
If the result does not depend on the order in which f is applied to the
list elements, then we can reorder or parallelize execution.
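The same analogy in Python: because each application of the function is independent of the others, the calls could run in any order, or on different machines.

```python
# Double all numbers in a list. No call depends on any other call's
# result, so the work could be reordered or parallelized freely.
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```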
Functional programming foundations
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
reduce in MapReduce ≈ fold in FP
foldl :: (b → a → b) → b → [a] → b
Move over the list, apply f to each element and an accumulator; f returns the
next accumulator value, which is combined with the next element.
Example: Sum of all numbers in a list.
> foldl (+) 0 [1, 2, 3]
> 6
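A Python counterpart of the fold (again, only an analogy): `functools.reduce` threads an accumulator through the list just as `foldl` does.

```python
from functools import reduce

# foldl (+) 0 [1, 2, 3]: the accumulator starts at 0 and absorbs one
# element at a time: ((0 + 1) + 2) + 3 = 6.
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  # 6
```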
MapReduce Basic Programming Model
Transform a set of input key-value pairs to a set of output values:
Map: (k1, v1) → list(k2, v2)
MapReduce library groups all
intermediate pairs with same key together.
Reduce: (k2, list(v2)) → list(v2)
Word Count
map(k1, v1) → list(k2, v2)
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, 1);

reduce(k2, list(v2)) → list(v2)
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
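The pseudocode above can be exercised in miniature. A minimal in-memory sketch in Python (not Hadoop code; the grouping step stands in for the framework's shuffle):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    return sum(counts)

docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the"}

# Map phase: one map call per input (key, value) pair.
intermediate = [kv for name, text in docs.items() for kv in map_fn(name, text)]

# Shuffle: the library groups all intermediate values by key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: one reduce call per distinct key.
counts = {word: reduce_fn(word, vals) for word, vals in groups.items()}
print(counts)  # {'the': 3, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```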
Parallel processing model
Execution overview
Read as part of this lecture! Jeffrey Dean and Sanjay Ghemawat. 2008.
MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1
(January 2008), 107-113.
[Figure: the master coordinates many workers; map workers write intermediate
output to local disk, and reduce workers read it remotely (local write /
remote reads).]
MapReduce Scheduling
One master, many workers
Input data split into M map tasks (typically 64 MB (~ chunk size in GFS))
Reduce phase partitioned into R reduce tasks (hash(k) mod R)
Tasks are assigned to workers dynamically
Master assigns each map task to a free worker
Considers locality of data to worker when assigning a task
Worker reads task input (often from local disk)
Worker produces R local files containing intermediate k/v pairs
Master assigns each reduce task to a free worker
Worker reads intermediate k/v pairs from map workers
Worker sorts & applies user's reduce operation to produce the output
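This division of intermediate data can be simulated in a few lines (a toy sketch with made-up keys; `crc32` stands in for the framework's hash function, since Python's built-in string hash is salted per process):

```python
import zlib

M, R = 3, 2  # map tasks and reduce tasks (toy sizes)

def partition(key):
    # Stand-in for hash(k) mod R; crc32 is deterministic across runs.
    return zlib.crc32(key.encode()) % R

# Each map task writes R "local files" of intermediate (k, v) pairs.
map_outputs = []
for m in range(M):
    local_files = [[] for _ in range(R)]
    for key, value in [(f"key{m}", 1), ("shared", m)]:
        local_files[partition(key)].append((key, value))
    map_outputs.append(local_files)

# Reduce task r fetches partition r from every map worker, so all
# pairs with the same key end up at the same reducer.
for r in range(R):
    fetched = [kv for files in map_outputs for kv in files[r]]
    print(r, fetched)
```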
Data Distribution
In a MapReduce cluster, data is distributed to all the nodes of the
cluster as it is being loaded in
An underlying distributed file system (e.g., GFS) splits large data files
into chunks which are managed by different nodes in the cluster
[Figure: a large input file is split into chunks, with one chunk of input
data on each of Node 1, Node 2, and Node 3.]
Even though the file chunks are distributed across several machines,
they form a single namespace
Partitions
In MapReduce, intermediate output values are not usually reduced together
All values with the same key are presented to a single Reducer together
More specifically, a different subset of intermediate key space is assigned to
each Reducer
These subsets are known as partitions
[Figure: different colors represent different keys (potentially) from
different Mappers.]
Partitions are the input to Reducers
Word count again
Choosing M and R
M = number of map tasks, R = number of reduce tasks
Larger M, R: creates smaller tasks, enabling easier load balancing and
faster recovery (many small tasks from failed machine)
Limitation: O(M+R) scheduling decisions and O(M*R) in-memory
state at master
Very small tasks not worth the startup cost
Recommendation:
Choose M so that split size is approximately 64 MB
Choose R a small multiple of the number of workers; alternatively, choose R a
little smaller than the number of workers to finish the reduce phase in one wave
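Applied to concrete (assumed) numbers — 1 TB of input and 200 workers — the recommendation is simple arithmetic:

```python
import math

input_bytes = 10**12        # assumption: 1 TB of input
split_size = 64 * 2**20     # ~64 MB per split, as recommended
workers = 200               # assumption: cluster size

M = math.ceil(input_bytes / split_size)  # one map task per split
R = 2 * workers                          # a small multiple of the workers

print(M, R)  # 14902 400
```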
MapReduce Fault Tolerance
On worker failure:
Master detects failure via periodic heartbeats.
Both completed and in-progress map tasks on that worker should be
re-executed (output stored on local disk).
Only in-progress reduce tasks on that worker should be re-executed
(output stored in the global file system).
All reduce workers will be notified about any map re-executions.
On master failure:
State is check-pointed to GFS: new master recovers & continues.
Robustness:
Example: Lost 1600 of 1800 machines once, but finished fine.
MapReduce Data Locality
Goal: To conserve network bandwidth.
In GFS, data files are divided into 64MB blocks and 3 copies of each
are stored on different machines.
Master program schedules map() tasks based on the location of these
replicas:
Put map() tasks physically on the same machine as one of the input replicas
(or, at least on the same rack / network switch).
This way, thousands of machines can read input at local disk speed.
Otherwise, rack switches would limit read rate.
Stragglers & Backup Tasks
Problem: Stragglers (i.e., slow workers) significantly lengthen the
completion time.
Solution: Close to completion, spawn backup copies of the remaining
in-progress tasks.
Whichever one finishes first, wins.
Additional cost: a few percent more resource usage.
Example: A sort program ran 44% longer without backup tasks.
Other Practical Extensions
User-specified combiner functions for partial combination within a
map task can save network bandwidth (~ mini-reduce)
Example: WordCount
User-specified partitioning functions for mapping intermediate key
values to reduce workers (by default: hash(key) mod R)
Example: hash(Hostname(urlkey)) mod R
Ordering guarantees: Processing intermediate k/v pairs in increasing
order
Example: reduce of WordCount outputs ordered results.
Custom input and output format handlers
Single-machine execution option for testing & debugging
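The combiner's bandwidth saving is easy to see in miniature (a sketch only, not Hadoop's Combiner API): partially summing counts inside the map task shrinks what must cross the network.

```python
from collections import Counter

# One map task's raw WordCount output: a ("word", 1) pair per word.
words = "to be or not to be".split()
raw = [(w, 1) for w in words]

# Combiner: a per-map-task mini-reduce applied before the shuffle.
combined = sorted(Counter(words).items())

print(len(raw), "->", len(combined))  # 6 -> 4 pairs sent over the network
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```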
Basic MapReduce Program Design
Tasks that can be performed independently on a data object, large
number of them: Map
Tasks that require combining of multiple data objects: Reduce
Sometimes it is easier to start program design with Map, sometimes
with Reduce
Select keys and values such that the right objects end up together in
the same Reduce invocation
Might have to partition a complex task into multiple MapReduce sub-tasks
MapReduce vs. Traditional RDBMS
            MapReduce                    Traditional RDBMS
Data size   Petabytes                    Gigabytes
Access      Batch                        Interactive and batch
Updates     Write once, read many times  Read and write many times
Structure   Dynamic schema               Static schema
Integrity   Low                          High (normalized data)
Scaling     Linear                       Non-linear (general SQL)
More Hadoop details
Hadoop
Since its debut on the computing stage, MapReduce has
frequently been associated with Hadoop
Hadoop is an open source implementation of MapReduce and is
currently enjoying wide popularity
Hadoop presents MapReduce as an analytics engine and under
the hood uses a distributed storage layer referred to as Hadoop
Distributed File System (HDFS)
HDFS mimics Google File System (GFS)
Hadoop MapReduce: A Closer Look
[Diagram: on each node, files are loaded from the local HDFS store and passed
to an InputFormat, which breaks them into Splits; RecordReaders (RR) turn each
split into input (K, V) pairs; Map tasks produce intermediate (K, V) pairs; a
Partitioner assigns the pairs to partitions, which are exchanged by all nodes
during shuffling and then sorted; Reduce tasks produce the final (K, V) pairs,
which an OutputFormat writes back to the local HDFS store.]
Input Files
Input files are where the data for a MapReduce task is initially stored
The input files typically reside in a distributed file system (e.g. HDFS)
The format of input files is arbitrary
Line-based log files
Binary files
Multi-line input records
Or something else entirely
InputFormat
How the input files are split up and read is defined by the InputFormat
InputFormat is a class that does the following:
Selects the files that should be used for input
Defines the InputSplits that break a file
Provides a factory for RecordReader objects that read the file
InputFormat Types
Several InputFormats are provided with Hadoop:

InputFormat              Description                       Key                 Value
TextInputFormat          Default format; reads lines       The byte offset     The line contents
                         of text files                     of the line
KeyValueInputFormat      Parses lines into (K, V)          Everything up to    The remainder of
                         pairs                             the first tab       the line
                                                           character
SequenceFileInputFormat  A Hadoop-specific high-           user-defined        user-defined
                         performance binary format
Input Splits
An input split describes a unit of work that comprises a single map task in a
MapReduce program
By default, the InputFormat breaks a file up into 64 MB splits
By dividing the file into splits, we allow several map tasks to operate on a
single file in parallel
If the file is very large, this can improve performance significantly through
parallelism
Each map task corresponds to a single input split
RecordReader
The input split defines a slice of work but does not describe how to access it
The RecordReader class actually loads data from its source and converts it
into (K, V) pairs suitable for reading by Mappers
The RecordReader is invoked repeatedly on the input until the entire split is
consumed
Each invocation of the RecordReader leads to another call of the map function
defined by the programmer
Mapper and Reducer
The Mapper performs the user-defined work of the first phase of the MapReduce
program
A new instance of Mapper is created for each split
The Reducer performs the user-defined work of the second phase of the
MapReduce program
A new instance of Reducer is created for each partition
For each key in the partition assigned to a Reducer, the Reducer is called once
Partitioner
Each mapper may emit (K, V) pairs to any partition
Therefore, the map nodes must all agree on where to send different pieces of
intermediate data
The partitioner class determines which partition a given (K, V) pair will go to
The default partitioner computes a hash value for a given key and assigns it
to a partition based on this result
Sort
Each Reducer is responsible for reducing the values associated with (several)
intermediate keys
The set of intermediate keys on a single node is automatically sorted by
MapReduce before they are presented to the Reducer
OutputFormat
The OutputFormat class defines the way (K, V) pairs produced by Reducers are
written to output files
The instances of OutputFormat provided by Hadoop write to files on the local
disk or in HDFS
Several OutputFormats are provided by Hadoop:

OutputFormat              Description
TextOutputFormat          Default; writes lines in "key \t value" format
SequenceFileOutputFormat  Writes binary files suitable for reading into
                          subsequent MapReduce jobs
NullOutputFormat          Generates no output files
Questions?
Exercise
Exercise
Read the original MapReduce paper
Answer some questions
Implement friends count
Fill word length (why fill, anyway?)
Understand and run inverted indexes
Code available as a Maven or Eclipse project: just run locally
MapReduce Use Case: Word Length
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
Big 37
Medium 148
Small 200
Tiny 9
MapReduce Use Case: Word Length
Split the document into
chunks and process
each chunk
on a different computer
MapReduce Use Case: Word Length
[Diagram: each map task emits one (bucket, 1) pair per word, e.g. Big 1,
Medium 1, Small 1, Tiny 1; the shuffle groups the pairs by bucket, e.g.
Big 1,1,1,1,... Medium 1,1,1,... Small 1,1,1,1,... Tiny 1,1,1,1,...; the
reduce sums each group into the final counts Big 37, Medium 148, Small 200,
Tiny 9.]
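The whole word-length pipeline fits in a few lines (an in-memory sketch; the buckets follow the slide's definitions, and each chunk could be mapped on a different computer):

```python
from collections import defaultdict

def bucket(word):
    # Length buckets as defined on the slide.
    n = len(word)
    if n >= 10: return "Big"
    if n >= 5:  return "Medium"
    if n >= 2:  return "Small"
    return "Tiny"

def word_length_counts(chunks):
    # Map: emit one (bucket, 1) pair per word in each chunk.
    pairs = [(bucket(w), 1) for chunk in chunks for w in chunk.split()]
    # Shuffle + Reduce: group the pairs by bucket and sum each group.
    totals = defaultdict(int)
    for b, one in pairs:
        totals[b] += one
    return dict(totals)

result = word_length_counts(["a tiny word", "considerably lengthier vocabulary"])
print(result)  # {'Tiny': 1, 'Small': 2, 'Big': 2, 'Medium': 1}
```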
MapReduce Use Case: Inverted Indexing
Construction of inverted lists for document search
Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..
Output: (term, [docid, docid, ])
E.g., (apple, [Foo.txt, Bar.txt, Boo.txt, ])
2010, Jamie Callan
Inverted Index: Data flow
Document Foo: "This page contains so much text"
Foo map output: contains: Foo; much: Foo; page: Foo; so: Foo; text: Foo; This: Foo
Document Bar: "My page contains text too"
Bar map output: contains: Bar; My: Bar; page: Bar; text: Bar; too: Bar
Reduced output: contains: Foo, Bar; much: Foo; My: Bar; page: Foo, Bar;
so: Foo; text: Foo, Bar; This: Foo; too: Bar
MapReduce Use Case: Inverted Indexing
A simple approach to creating inverted lists
Each Map task is a document parser
Input: A stream of documents
Output: A stream of (term, docid) tuples
(long, Foo.txt) (ago, Foo.txt) (and, Foo.txt) (once, Bar.txt) (upon, Bar.txt)
We may create internal IDs for words.
Shuffle sorts tuples by key and routes tuples to Reducers
Reducers convert streams of keys into streams of inverted lists
Input: (long, Foo.txt) (long, Bar.txt) (long, Boo.txt) (long, )
The reducer sorts the values for a key and builds an inverted list
Output: (long, [Foo.txt, Bar.txt, ])
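End to end, the approach above looks like this (a toy in-memory sketch; the document contents are invented):

```python
from collections import defaultdict

def invert(docs):
    # Map: each document parser emits a stream of (term, docid) tuples.
    tuples = [(term, docid)
              for docid, text in docs.items()
              for term in text.split()]
    # Shuffle + Reduce: group docids by term; the reducer sorts each
    # group into an inverted (posting) list.
    index = defaultdict(set)
    for term, docid in tuples:
        index[term].add(docid)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {"Foo.txt": "long ago and far", "Bar.txt": "once upon a long time"}
index = invert(docs)
print(index["long"])  # ['Bar.txt', 'Foo.txt']
```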
Questions?
Sources & References
Excellent intro to MapReduce:
https://websci.informatik.uni-freiburg.de/teaching/ws201213/infosys/slides/m3_l1_mapreduce.pdf
http://www.systems.ethz.ch/sites/default/files/file/BigData_Fall2012/BigData-2012-M3.pdf
MapReduce & Functional Programming:
https://courses.cs.washington.edu/courses/cse490h/08au/lectures/mapred.ppt
For the introductory part:
http://www.cs.ucsb.edu/~tyang/class/140s14/slides/CS140TopicMapReduce.pdf
A lot of details about the Hadoop case:
www.qatar.cmu.edu/~msakr/15440-f11/.../Lecture18_15440_MHH_9Nov_2011.ppt