Hadoop MapReduce Summarization

Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce, which breaks jobs into parallelized map and reduce tasks. The Hadoop Distributed File System stores data reliably across clusters and can handle petabytes of data and thousands of nodes. Hadoop provides scalable and fault-tolerant solutions for distributed computing problems on large datasets.


Shivajirao Kadam Institute of Technology and Management, Indore (M.P.)
Department of Computer Science and Engineering

Lecture on Hadoop/MapReduce (Conclusion)

What is Apache Hadoop?
• Large scale, open source software framework
▫ Yahoo! has been the largest contributor to date
• Dedicated to scalable, distributed, data-intensive
computing
• Handles thousands of nodes and petabytes of
data
• Supports applications under a free license
• 3 Hadoop subprojects:
▫ Hadoop Common: common utilities package
▫ HDFS: Hadoop Distributed File System with high-throughput access to application data
▫ MapReduce: A software framework for distributed
processing of large data sets on computer
clusters
Hadoop MapReduce
• MapReduce is a programming model and software
framework first developed by Google (Google’s
MapReduce paper submitted in 2004)
• Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
▫ Petabytes of data
▫ Thousands of nodes
• Computational processing occurs on both:
▫ Unstructured data : filesystem
▫ Structured data : database
Hadoop Distributed File System (HDFS)
• Inspired by Google File System
• Scalable, distributed, portable filesystem written in Java for
Hadoop framework
▫ Primary distributed storage used by Hadoop applications
• HDFS can be part of a Hadoop cluster or can be a stand-alone general purpose distributed file system
• An HDFS cluster primarily consists of
▫ NameNode that manages file system metadata
▫ DataNodes that store the actual data
• Stores very large files in blocks across machines in a large
cluster
▫ Reliability and fault tolerance ensured by replicating data across
multiple hosts
• Has data awareness between nodes
• Designed to be deployed on low-cost hardware
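To make the NameNode/DataNode discussion concrete, below is a minimal sketch of reading a file from HDFS through Hadoop's Java FileSystem API. The class name HdfsRead and the path /data/example.txt are hypothetical, and the snippet assumes the cluster address comes from the usual core-site.xml/hdfs-site.xml configuration on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // HDFS when the default filesystem points at the NameNode
    Path path = new Path("/data/example.txt");  // hypothetical file path
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);               // print each line of the HDFS file
      }
    }
  }
}

The client only contacts the NameNode for metadata; the actual block contents are streamed from the DataNodes holding the replicas.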
More on Hadoop file systems
• Hadoop can work directly with any distributed file system that can be mounted by the underlying OS
• Hadoop-specific file systems like HDFS are developed for speed, fault tolerance, integration with Hadoop, and reliability
Typical Hadoop cluster integrates MapReduce and HDFS
• Master/slave architecture
• Master node contains
▫ Job tracker node (MapReduce layer)
▫ Task tracker node (MapReduce layer)
▫ Name node (HDFS layer)
▫ Data node (HDFS layer)
• Multiple slave nodes contain
▫ Task tracker node (MapReduce layer)
▫ Data node (HDFS layer)
• MapReduce layer has job and task tracker nodes
• HDFS layer has name and data nodes
Hadoop simple cluster graphic
[Diagram: the master node runs a JobTracker and TaskTracker (MapReduce layer) plus a NameNode and DataNode (HDFS layer); one or more (1..*) slave nodes each run a TaskTracker and a DataNode.]
MapReduce framework
• Per cluster node:
▫ Single JobTracker per master
- Responsible for scheduling the jobs' component tasks on the slaves
- Monitors slave progress
- Re-executes failed tasks
▫ Single TaskTracker per slave
- Executes the tasks as directed by the master
MapReduce core functionality
• Code usually written in Java, though it can be written in other languages with the Hadoop Streaming API
• Two fundamental pieces (a word-count sketch follows after this list):
▫ Map step
- Master node takes the large problem input and slices it into smaller sub-problems; distributes these to worker nodes
- A worker node may do this again, leading to a multi-level tree structure
- Each worker processes its smaller problem and hands the result back to the master
▫ Reduce step
- Master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem
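As a concrete sketch of the two steps above, here is the classic word-count example written against the org.apache.hadoop.mapreduce API. The class names are placeholders for illustration; this is one simple way to express a problem as map and reduce functions, not the only one.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit <word, 1> for every word in the mapper's input split.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce step: sum the counts emitted for each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}

With the default text input format, each mapper receives a byte offset and one line of its split; the framework then groups every count for the same word and hands the group to a single reduce call.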
MapReduce core functionality (II)
• Data flow beyond the two key pieces (map and reduce):
▫ Input reader – divides input into appropriate size splits
which get assigned to a Map function
▫ Map function – maps file data to smaller, intermediate
<key, value> pairs
▫ Partition function – finds the correct reducer: given the key and the number of reducers, returns the desired Reduce node (a sketch follows after this list)
▫ Compare function – input for Reduce is pulled from the Map intermediate output and sorted according to the compare function
▫ Reduce function – takes intermediate values and reduces to
a smaller solution handed back to the framework
▫ Output writer – writes file output
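The partition function is the piece users most often customize. Below is a hedged sketch of a custom Partitioner for the <word, count> pairs from the word-count example; the first-letter routing scheme and the class name are invented for illustration (by default Hadoop hashes the key).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each intermediate <word, count> pair to a reducer chosen from the
// word's first character, so a given word always reaches the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReducers) {
    if (key.getLength() == 0) {
      return 0;                                        // empty keys go to reducer 0
    }
    // Mask the sign bit so the modulo result is a valid partition index.
    return (key.charAt(0) & Integer.MAX_VALUE) % numReducers;
  }
}

It would be wired in with job.setPartitionerClass(FirstLetterPartitioner.class) in the job driver.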
MapReduce core functionality (III)
• A MapReduce Job controls the execution
▫ Splits the input dataset into independent chunks
▫ Processed by the map tasks in parallel
• The framework sorts the outputs of the maps
• The sorted map outputs are then sent to the reduce tasks, which combine them
• Both the input and output of the job are stored
in a filesystem
• Framework handles scheduling
▫ Monitors and re-executes failed tasks
MapReduce input and output
• MapReduce operates exclusively on <key, value>
pairs
• Job Input: <key, value> pairs
• Job Output: <key, value> pairs
▫ Conceivably of different types
• Key and value classes have to be serializable by the
framework.
▫ Default serialization requires keys and values to implement Writable
▫ Key classes must also facilitate sorting by the framework, i.e. implement WritableComparable (a sketch follows after this list)
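To illustrate the serialization and sorting requirements, here is a sketch of a hypothetical composite key implementing WritableComparable (which combines Writable with Java's Comparable); the year/month fields are invented for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
  private int year;
  private int month;

  public YearMonthKey() {}                       // no-arg constructor required by the framework

  public YearMonthKey(int year, int month) {
    this.year = year;
    this.month = month;
  }

  @Override
  public void write(DataOutput out) throws IOException {     // serialize
    out.writeInt(year);
    out.writeInt(month);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialize
    year = in.readInt();
    month = in.readInt();
  }

  @Override
  public int compareTo(YearMonthKey other) {     // sort order used by the framework
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(month, other.month);
  }

  @Override
  public int hashCode() {                        // used by the default hash partitioner
    return 31 * year + month;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearMonthKey)) return false;
    YearMonthKey k = (YearMonthKey) o;
    return year == k.year && month == k.month;
  }
}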
Input and Output (II)
(input) <k1, v1> → map → <k2, v2> → combine* → <k2, v2> → reduce → <k3, v3> (output)

From http://code.google.com/edu/parallel/mapreduce-tutorial.html
How many maps?
• The number of maps is driven by the total size of
the inputs
• The right level of parallelism for maps is around 10-100 maps per node
• If you expect 10TB of input data and have a block size of 128MB, you will have roughly 82,000 maps (see the estimate below)
• Number of tasks controlled by number of splits
returned and can be user overridden
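The 82,000 figure is just input size divided by split size, where the split size defaults to the HDFS block size. A throwaway calculation (class name invented):

public class MapCountEstimate {
  public static void main(String[] args) {
    long inputBytes = 10L * 1024 * 1024 * 1024 * 1024;  // 10 TB of input
    long splitBytes = 128L * 1024 * 1024;               // 128 MB block/split size
    // One map task per input split:
    System.out.println(inputBytes / splitBytes);        // prints 81920, i.e. roughly 82,000 maps
  }
}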
How many reduces?
• Increasing the number of reduces increases framework overhead, but improves load balancing and lowers the cost of failures
Task Execution and Environment
• TaskTracker executes each Mapper/Reducer task as a child process in a separate JVM
• Child task inherits the environment of the parent
TaskTracker
• User can specify environment variables and configuration properties controlling memory, parallel computation settings, segment size, and more (a sketch follows below)
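As a hedged sketch of the kind of per-task settings the slide refers to: the property names below are the classic pre-YARN ones (newer releases rename them), and the values are arbitrary examples, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class ChildTaskSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // JVM options (e.g. heap size) passed to each child task process.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // Map-side sort buffer size in MB.
    conf.setInt("io.sort.mb", 256);
    System.out.println(conf.get("mapred.child.java.opts"));
  }
}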
Scheduling
• By default, Hadoop uses FIFO to schedule jobs.
Alternate scheduler options: capacity and fair
• Capacity scheduler
▫ Developed by Yahoo
▫ Jobs are submitted to queues
▫ Jobs can be prioritized
▫ Queues are allocated a fraction of the total
resource capacity
▫ Free resources are allocated to queues beyond
their total capacity
▫ No preemption once a job is running
• Fair scheduler
▫ Developed by Facebook
▫ Provides fast response times for small jobs
▫ Jobs are grouped into Pools
▫ Each pool assigned a guaranteed minimum share
▫ Excess capacity split between jobs
▫ By default, uncategorized jobs go into a default pool
▫ Pools can specify a minimum number of map slots and reduce slots, and a limit on the number of running jobs (a queue/pool submission sketch follows below)
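A hedged sketch of pointing a job at a particular queue or pool: the property names below are the classic Hadoop 1.x ones and vary with the Hadoop version and scheduler configuration, and the queue/pool names are invented.

import org.apache.hadoop.conf.Configuration;

public class SchedulerPlacementExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Capacity scheduler: submit the job to a named queue.
    conf.set("mapred.job.queue.name", "research");
    // Fair scheduler: place the job in a named pool.
    conf.set("mapred.fairscheduler.pool", "adhoc");
    System.out.println(conf.get("mapred.job.queue.name"));
  }
}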
Requirements of applications using
MapReduce
• Specify the Job configuration
▫ Specify input/output locations
▫ Supply map and reduce functions via
implementations of appropriate interfaces and/or
abstract classes
• The job client then submits the job (jar/executables etc.) and the configuration to the JobTracker (a driver sketch follows below)
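A sketch of such a driver, reusing the hypothetical WordCountMapper/WordCountReducer classes from the earlier slide; the input and output locations come from the command line and the reduce count is an arbitrary example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: specify the job configuration, then submit it to the cluster.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);      // jar shipped to the cluster

    job.setMapperClass(WordCountMapper.class);     // map implementation
    job.setCombinerClass(WordCountReducer.class);  // optional combiner
    job.setReducerClass(WordCountReducer.class);   // reduce implementation
    job.setNumReduceTasks(4);                      // hypothetical choice

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input/output locations passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and configuration, then wait for completion.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}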
What about bad input?
• Hadoop provides an option to skip bad records:
▫ SkipBadRecords class (a configuration sketch follows below)
• Used when map tasks crash deterministically on
certain input
▫ Usually a result of bugs in the map function
▫ May be in 3rd party libraries
▫ Tasks never complete successfully even after multiple
attempts
• Framework goes into ‘skipping mode’ after a certain
number of map failures
• Number of records skipped depends on how
frequently the processed record counter is
incremented by the application
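A hedged sketch of enabling skipping mode through the SkipBadRecords class (part of the older org.apache.hadoop.mapred API); the thresholds are arbitrary examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsSetup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enter 'skipping mode' after this many failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Tolerate at most this many skipped records per map task (0 disables skipping).
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
  }
}

How tightly the framework can isolate the bad record still depends on how often the application increments the processed-record counter, as noted above.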
What are Hadoop/MapReduce
limitations?
• Cannot control the order in which the maps or
reductions are run
• For maximum parallelism, you need Maps and
Reduces to not depend on data generated in the
same MapReduce job (i.e. stateless)
• A database with an index will always be faster than a
MapReduce job on unindexed data
• Reduce operations do not take place until all Maps are complete (or have failed and been skipped)
• General assumption that the output of Reduce
is smaller than the input to Map; large
datasource used to generate smaller final values
Who’s using it?
• Lots of companies!
▫ Yahoo!, AOL, eBay, Facebook, IBM, Last.fm, LinkedIn,
The New York Times, Ning, Twitter, and more
• In 2007 IBM and Google announced an initiative
to use Hadoop to support university courses in
distributed computer programming
• In 2008 this collaboration and the Academic Cloud
Computing Initiative were funded by the NSF and
produced the Cluster Exploratory Program (CLuE)
Summary and Conclusion
• Hadoop MapReduce is a large scale, open source
software framework dedicated to scalable,
distributed, data-intensive computing

• The framework breaks up large data into smaller parallelizable chunks and handles scheduling
▫ Maps each piece to an intermediate value
▫ Reduces intermediate values to a solution
▫ User-specified partition and combiner options
Summary and Conclusion
• Fault tolerant, reliable, and supports thousands of
nodes and petabytes of data
• If you can rewrite algorithms into Maps and
Reduces, and your problem can be broken up into
small pieces solvable in parallel, then Hadoop’s
MapReduce is the way to go for a distributed
problem solving approach to large datasets
