Hadoop MapReduce Summarization
Lecture on Hadoop/MapReduce
Conclusion
What is Apache
Hadoop?
• Large scale, open source software framework
▫ Yahoo! has been the largest contributor to date
• Dedicated to scalable, distributed, data-intensive
computing
• Handles thousands of nodes and petabytes of
data
• Supports applications under a free license
• 3 Hadoop subprojects:
▫ Hadoop Common: common utilities package
▫ HDFS: Hadoop Distributed File System with high
throughput access to application data
▫ MapReduce: A software framework for distributed
processing of large data sets on computer
clusters
Hadoop MapReduce
• MapReduce is a programming model and software
framework first developed by Google (Google’s
MapReduce paper submitted in 2004)
• Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
▫ Petabytes of data
▫ Thousands of nodes
• Computational processing occurs on both:
▫ Unstructured data: filesystem
▫ Structured data: database
Hadoop Distributed File System (HDFS)
• Inspired by Google File System
• Scalable, distributed, portable filesystem written in Java for
Hadoop framework
▫ Primary distributed storage used by Hadoop applications
• HDFS can be part of a Hadoop cluster or can be a stand-alone
general purpose distributed file system
• An HDFS cluster primarily consists of
▫ A NameNode that manages file system metadata
▫ DataNodes that store the actual data
• Stores very large files in blocks across machines in a large
cluster
▫ Reliability and fault tolerance ensured by replicating data across
multiple hosts
• Provides data-location awareness, so work can be
scheduled on the nodes where the data resides
• Designed to be deployed on low-cost hardware
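A minimal sketch (not from the lecture) of writing and reading a file through the HDFS Java client API; the path and contents are illustrative, and the code assumes fs.defaultFS points at a NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // HDFS when fs.defaultFS names a NameNode

    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello HDFS");                  // stored in blocks, replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}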
More on Hadoop file systems
[Diagram: a Master Node manages 1..* Slave Nodes; each slave runs a TaskTracker and holds data]
MapReduce framework
• Per cluster node:
▫ Single JobTracker per master
Responsible for scheduling the jobs’ component
tasks on the slaves
Monitors slave progress
Re-executes failed tasks
▫ Single TaskTracker per slave
Executes the tasks as directed by the master
MapReduce core functionality
• Code is usually written in Java, though it can be written in
other languages with the Hadoop Streaming API
• Two fundamental pieces:
▫ Map step
Master node takes the large problem input and slices it into
smaller subproblems; distributes these to worker nodes.
Worker node may do this again; leads to a multi-level tree
structure
Worker processes smaller problem and hands back to
master
▫ Reduce step
Master node takes the answers to the subproblems and
combines them in a predefined way to get the output/answer
to original problem
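A minimal sketch of the two steps in Java, using the classic word-count example and the org.apache.hadoop.mapreduce API (class names are illustrative, and each public class would live in its own .java file): the map step emits <word, 1> for every word it sees, and the reduce step sums the counts that arrive for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: split each input line into words and emit <word, 1>.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce step: all counts for one word arrive together and are summed.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}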
MapReduce core functionality (II)
• Data flow beyond the two key pieces (map and reduce):
▫ Input reader – divides input into appropriate size splits
which get assigned to a Map function
▫ Map function – maps file data to smaller, intermediate
<key, value> pairs
▫ Partition function – finds the correct reducer: given the key
and number of reducers, returns the desired Reduce node
▫ Compare function – input for Reduce is pulled from the
Map intermediate output and sorted according to the
compare function
▫ Reduce function – takes intermediate values and reduces to
a smaller solution handed back to the framework
▫ Output writer – writes file output
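As an illustration of the partition function, a sketch of a hash-based Partitioner (this mirrors the behavior of Hadoop's default HashPartitioner):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Given a key and the number of reducers, return the index of the desired Reduce node.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the result is a valid index in [0, numReduceTasks).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}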
MapReduce core functionality (III)
• A MapReduce Job controls the execution
▫ Splits the input dataset into independent chunks
▫ Processed by the map tasks in parallel
• The framework sorts the outputs of the maps
• The sorted map outputs become the input to the
reduce tasks, which combine them into the final result
• Both the input and output of the job are stored
in a filesystem
• Framework handles scheduling
▫ Monitors and re-executes failed tasks
MapReduce input and output
• MapReduce operates exclusively on <key, value>
pairs
• Job Input: <key, value> pairs
• Job Output: <key, value> pairs
▫ Conceivably of different types
• Key and value classes have to be serializable by the
framework.
▫ Default serialization requires keys and values to
implement Writable
▫ Key classes must facilitate sorting by the
framework
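A sketch of what such a key class can look like: it implements WritableComparable, so the framework can both serialize it (write/readFields) and sort it (compareTo). The class and field names are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class PageVisitKey implements WritableComparable<PageVisitKey> {
  private String url;        // illustrative fields
  private long timestamp;

  public void write(DataOutput out) throws IOException {    // serialize
    out.writeUTF(url);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException { // deserialize
    url = in.readUTF();
    timestamp = in.readLong();
  }

  public int compareTo(PageVisitKey other) {                 // sort order used by the framework
    int c = url.compareTo(other.url);
    return c != 0 ? c : Long.compare(timestamp, other.timestamp);
  }
}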
Input and Output (II)
(input) <k1, v1> → map → <k2, v2> → combine* → <k2, v2> → reduce → <k3, v3> (output)
(* the combine step is optional)
From
http://code.google.com/edu/parallel/mapreduce-tutorial.html
How many maps?
• The number of maps is driven by the total size of
the inputs
• The right level of parallelism for maps is typically
10-100 maps per node
• For example, 10 TB of input data with a block size of
128 MB yields about 82,000 maps (10 TB / 128 MB ≈ 81,920)
• Number of tasks controlled by number of splits
returned and can be user overridden
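A hedged sketch of overriding split sizes with the new-API FileInputFormat; the sizes are illustrative and property names differ across Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-size-demo");
    // Larger splits mean fewer, bigger map tasks; smaller splits mean more parallelism.
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
  }
}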
How many reduces?
• Increasing the number of reduces increases
framework overhead, but also improves load
balancing and lowers the cost of failures
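In the job driver the reducer count is a single, explicit call (the value here is purely illustrative):

job.setNumReduceTasks(20);   // more reduces: more parallelism, but more overhead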
Task Execution and Environment
• TaskTracker executes Mapper/Reducer task as a
child process in a separate JVM
• Child task inherits the environment of the parent
TaskTracker
• User can specify environmental variables
controlling memory, parallel computation
settings, segment size, and more
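A hedged example of the kind of settings involved, using classic (pre-YARN) property names that may differ in later Hadoop versions:

import org.apache.hadoop.conf.Configuration;

public class TaskEnvDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Heap size for the child JVM that runs each map/reduce task.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // In-memory buffer used when sorting map output, in MB.
    conf.setInt("io.sort.mb", 256);
  }
}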
Scheduling
• By default, Hadoop uses FIFO to schedule jobs.
Alternate scheduler options: capacity and fair
• Capacity scheduler
▫ Developed by Yahoo
▫ Jobs are submitted to queues
▫ Jobs can be prioritized
▫ Queues are allocated a fraction of the total
resource capacity
▫ Free resources are allocated to queues beyond
their total capacity
▫ No preemption once a job is running
• Fair scheduler
▫ Developed by Facebook
▫ Provides fast response times for small jobs
▫ Jobs are grouped into Pools
▫ Each pool assigned a guaranteed minimum share
▫ Excess capacity split between jobs
▫ By default, uncategorized jobs go into a default
pool. Pools can specify a minimum number of
map slots and reduce slots, as well as a limit on
the number of running jobs
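For illustration only, a sketch of a Fair Scheduler allocations file defining one pool; element names follow the classic (Hadoop 1.x) fair scheduler and the values are made up:

<?xml version="1.0"?>
<allocations>
  <pool name="research">
    <minMaps>10</minMaps>               <!-- guaranteed minimum map slots -->
    <minReduces>5</minReduces>          <!-- guaranteed minimum reduce slots -->
    <maxRunningJobs>20</maxRunningJobs> <!-- limit on concurrently running jobs -->
  </pool>
</allocations>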
Requirements of applications using
MapReduce
• Specify the Job configuration
▫ Specify input/output locations
▫ Supply map and reduce functions via
implementations of appropriate interfaces and/or
abstract classes
• Job client then submits the job (jar/executables
etc) and the configuration to the JobTracker
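A minimal driver sketch tying this together for the word-count example shown earlier (new mapreduce API; class names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in newer versions
    job.setJarByClass(WordCountDriver.class);

    // Map and reduce implementations (the classes sketched earlier).
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);   // optional local pre-aggregation
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input/output locations in the distributed file system.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}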
What about bad input?
• Hadoop provides an option to skip bad records:
▫ SkipBadRecords class
• Used when map tasks crash deterministically on
certain input
▫ Usually a result of bugs in the map function
▫ May be in 3rd party libraries
▫ Tasks never complete successfully even after multiple
attempts
• Framework goes into ‘skipping mode’ after a certain
number of map failures
• Number of records skipped depends on how
frequently the processed record counter is
incremented by the application
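A hedged sketch of turning this on from the driver, using the SkipBadRecords helper; the thresholds are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enter skipping mode after 2 failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Accept skipping at most 1 record surrounding each bad record.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
  }
}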
What are Hadoop/MapReduce
limitations?
• Cannot control the order in which the maps or
reductions are run
• For maximum parallelism, you need Maps and
Reduces to not depend on data generated in the
same MapReduce job (i.e. stateless)
• A database with an index will always be faster than a
MapReduce job on unindexed data
• Reduce operations do not take place until all Maps
are complete (or have failed then been skipped)
• General assumption that the output of Reduce
is smaller than the input to Map; large
datasource used to generate smaller final values
Who’s using it?
• Lots of companies!
▫ Yahoo!, AOL, eBay, Facebook, IBM, Last.fm, LinkedIn,
The New York Times, Ning, Twitter, and more
• In 2007 IBM and Google announced an initiative
to use Hadoop to support university courses in
distributed computer programming
• In 2008 this collaboration and the Academic Cloud
Computing Initiative were funded by the NSF and
produced the Cluster Exploratory Program (CLuE)
Summary and Conclusion
• Hadoop MapReduce is a large scale, open source
software framework dedicated to scalable,
distributed, data-intensive computing