Hadoop MapReduce Summarization
Lecture on Hadoop/MapReduce
Conclusion
What is Apache
Hadoop?
• Large scale, open source software framework
▫ Yahoo! has been the largest contributor to date
• Dedicated to scalable, distributed, data-intensive
computing
• Handles thousands of nodes and petabytes of
data
• Supports applications under a free license
• 3 Hadoop subprojects:
▫ Hadoop Common: common utilities package
▫ HDFS: Hadoop Distributed File System with high
throughput access to application data
▫ MapReduce: A software framework for distributed
processing of large data sets on computer
clusters
Hadoop MapReduce
• MapReduce is a programming model and software
framework first developed by Google (Google’s
MapReduce paper submitted in 2004)
• Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
▫ Petabytes of data
▫ Thousands of nodes
• Computational processing occurs on both:
▫ Unstructured data: filesystem
▫ Structured data: database
Hadoop Distributed File System (HDFS)
• Inspired by Google File System
• Scalable, distributed, portable filesystem written in Java for
Hadoop framework
▫ Primary distributed storage used by Hadoop applications
• HDFS can be part of a Hadoop cluster or can be a stand-alone
general purpose distributed file system
• An HDFS cluster primarily consists of
▫ A NameNode that manages file system metadata
▫ DataNodes that store the actual data
• Stores very large files in blocks across machines in a large
cluster
▫ Reliability and fault tolerance ensured by replicating data across
multiple hosts
• Provides data-location awareness, so work can be
scheduled on the nodes where the data resides
• Designed to be deployed on low-cost hardware
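A minimal sketch (not from the lecture) of writing and reading a file through the HDFS Java client API; the path and contents are illustrative, and the code assumes fs.defaultFS points at a NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // HDFS when fs.defaultFS names a NameNode

    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello HDFS");                  // stored in blocks, replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}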
More on Hadoop file systems
[Diagram: a Master Node manages 1..* Slave Nodes; each slave runs a TaskTracker and holds data]
MapReduce framework
• Per cluster node:
▫ Single JobTracker per master
Responsible for scheduling the jobs’ component
tasks on the slaves
Monitors slave progress
Re-executes failed tasks
▫ Single TaskTracker per slave
Executes the tasks as directed by the master
MapReduce core functionality
• Code is usually written in Java, though it can be written in
other languages with the Hadoop Streaming API
• Two fundamental pieces:
▫ Map step
Master node takes the large problem input and slices it into
smaller subproblems; distributes these to worker nodes.
Worker node may do this again; leads to a multi-level tree
structure
Worker processes smaller problem and hands back to
master
▫ Reduce step
Master node takes the answers to the subproblems and
combines them in a predefined way to get the output/answer
to original problem
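A minimal sketch of the two steps in Java, using the classic word-count example and the org.apache.hadoop.mapreduce API (class names are illustrative, and each public class would live in its own .java file): the map step emits <word, 1> for every word it sees, and the reduce step sums the counts that arrive for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: split each input line into words and emit <word, 1>.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce step: all counts for one word arrive together and are summed.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}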
MapReduce core functionality (II)
• Data flow beyond the two key pieces (map and reduce):
▫ Input reader – divides input into appropriate size splits
which get assigned to a Map function
▫ Map function – maps file data to smaller, intermediate
<key, value> pairs
▫ Partition function – finds the correct reducer: given the key
and number of reducers, returns the desired Reduce node
▫ Compare function – input for Reduce is pulled from the
Map intermediate output and sorted according to the
compare function
▫ Reduce function – takes intermediate values and reduces to
a smaller solution handed back to the framework
▫ Output writer – writes file output
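As an illustration of the partition function, a sketch of a hash-based Partitioner (this mirrors the behavior of Hadoop's default HashPartitioner):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Given a key and the number of reducers, return the index of the desired Reduce node.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the result is a valid index in [0, numReduceTasks).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}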
MapReduce core functionality (III)
• A MapReduce Job controls the execution
▫ Splits the input dataset into independent chunks
▫ Processed by the map tasks in parallel
• The framework sorts the outputs of the maps
• The sorted map outputs become the input to the
reduce tasks, which combine them into the final result
• Both the input and output of the job are stored
in a filesystem
• Framework handles scheduling
▫ Monitors and re-executes failed tasks
MapReduce input and output
• MapReduce operates exclusively on <key, value>
pairs
• Job Input: <key, value> pairs
• Job Output: <key, value> pairs
▫ Conceivably of different types
• Key and value classes have to be serializable by the
framework.
▫ Default serialization requires keys and values to
implement Writable
▫ Key classes must facilitate sorting by the
framework
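A sketch of what such a key class can look like: it implements WritableComparable, so the framework can both serialize it (write/readFields) and sort it (compareTo). The class and field names are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class PageVisitKey implements WritableComparable<PageVisitKey> {
  private String url;        // illustrative fields
  private long timestamp;

  public void write(DataOutput out) throws IOException {    // serialize
    out.writeUTF(url);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException { // deserialize
    url = in.readUTF();
    timestamp = in.readLong();
  }

  public int compareTo(PageVisitKey other) {                 // sort order used by the framework
    int c = url.compareTo(other.url);
    return c != 0 ? c : Long.compare(timestamp, other.timestamp);
  }
}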
Input and Output (II)
(input) <k1, v1> → map → <k2, v2> → combine* → <k2, v2> → reduce → <k3, v3> (output)
(* the combine step is optional)
From
http://code.google.com/edu/parallel/mapreduce-tutorial.html
How many maps?
• The number of maps is driven by the total size of
the inputs
• The right level of parallelism for maps is typically
10-100 maps per node
• For example, 10 TB of input data with a block size of
128 MB yields about 82,000 maps (10 TB / 128 MB ≈ 81,920)
• Number of tasks controlled by number of splits
returned and can be user overridden
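A hedged sketch of overriding split sizes with the new-API FileInputFormat; the sizes are illustrative and property names differ across Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-size-demo");
    // Larger splits mean fewer, bigger map tasks; smaller splits mean more parallelism.
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
  }
}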
How many reduces?
• Increasing the number of reduces increases
framework overhead, but also improves load
balancing and lowers the cost of failures
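In the job driver the reducer count is a single, explicit call (the value here is purely illustrative):

job.setNumReduceTasks(20);   // more reduces: more parallelism, but more overhead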
Task Execution and Environment
• TaskTracker executes Mapper/Reducer task as a
child process in a separate JVM
• Child task inherits the environment of the parent
TaskTracker
• User can specify environmental variables
controlling memory, parallel computation
settings, segment size, and more
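A hedged example of the kind of settings involved, using classic (pre-YARN) property names that may differ in later Hadoop versions:

import org.apache.hadoop.conf.Configuration;

public class TaskEnvDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Heap size for the child JVM that runs each map/reduce task.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // In-memory buffer used when sorting map output, in MB.
    conf.setInt("io.sort.mb", 256);
  }
}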
Scheduling
• By default, Hadoop uses FIFO to schedule jobs.
Alternate scheduler options: capacity and fair
• Capacity scheduler
▫ Developed by Yahoo
▫ Jobs are submitted to queues
▫ Jobs can be prioritized
▫ Queues are allocated a fraction of the total
resource capacity
▫ Free resources are allocated to queues beyond
their total capacity
▫ No preemption once a job is running
• Fair scheduler
▫ Developed by Facebook
▫ Provides fast response times for small jobs
▫ Jobs are grouped into Pools
▫ Each pool assigned a guaranteed minimum share
▫ Excess capacity split between jobs
▫ By default, uncategorized jobs go into a default
pool. Pools can specify a minimum number of
map slots and reduce slots, as well as a limit on
the number of running jobs
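For illustration only, a sketch of a Fair Scheduler allocations file defining one pool; element names follow the classic (Hadoop 1.x) fair scheduler and the values are made up:

<?xml version="1.0"?>
<allocations>
  <pool name="research">
    <minMaps>10</minMaps>               <!-- guaranteed minimum map slots -->
    <minReduces>5</minReduces>          <!-- guaranteed minimum reduce slots -->
    <maxRunningJobs>20</maxRunningJobs> <!-- limit on concurrently running jobs -->
  </pool>
</allocations>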
Requirements of applications using
MapReduce
• Specify the Job configuration
▫ Specify input/output locations
▫ Supply map and reduce functions via
implementations of appropriate interfaces and/or
abstract classes
• Job client then submits the job (jar/executables
etc) and the configuration to the JobTracker
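A minimal driver sketch tying this together for the word-count example shown earlier (new mapreduce API; class names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in newer versions
    job.setJarByClass(WordCountDriver.class);

    // Map and reduce implementations (the classes sketched earlier).
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);   // optional local pre-aggregation
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input/output locations in the distributed file system.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}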
What about bad input?
• Hadoop provides an option to skip bad records:
▫ SkipBadRecords class
• Used when map tasks crash deterministically on
certain input
▫ Usually a result of bugs in the map function
▫ May be in 3rd party libraries
▫ Tasks never complete successfully even after multiple
attempts
• Framework goes into ‘skipping mode’ after a certain
number of map failures
• Number of records skipped depends on how
frequently the processed record counter is
incremented by the application
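A hedged sketch of turning this on from the driver, using the SkipBadRecords helper; the thresholds are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enter skipping mode after 2 failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Accept skipping at most 1 record surrounding each bad record.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
  }
}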
What are Hadoop/MapReduce
limitations?
• Cannot control the order in which the maps or
reductions are run
• For maximum parallelism, you need Maps and
Reduces to not depend on data generated in the
same MapReduce job (i.e. stateless)
• A database with an index will always be faster than a
MapReduce job on unindexed data
• Reduce operations do not take place until all Maps
are complete (or have failed then been skipped)
• General assumption that the output of Reduce
is smaller than the input to Map; large
datasource used to generate smaller final values
Who’s using it?
• Lots of companies!
▫ Yahoo!, AOL, eBay, Facebook, IBM, Last.fm, LinkedIn,
The New York Times, Ning, Twitter, and more
• In 2007 IBM and Google announced an initiative
to use Hadoop to support university courses in
distributed computer programming
• In 2008 this collaboration and the Academic Cloud
Computing Initiative were funded by the NSF and
produced the Cluster Exploratory Program (CLuE)
Summary and Conclusion
• Hadoop MapReduce is a large scale, open source
software framework dedicated to scalable,
distributed, data-intensive computing