Introduction
MapReduce
LARGE DATA SETS
Sample Data Set
•  10 billion web pages
•  Average size - 20 KB
•  Total size = 10 billion × 20 KB = 200 TB
•  Disk read bandwidth = 50 MB/sec
•  Time to read = 4 million sec ≈ 46+ days
•  Doing anything useful with the data takes even longer
Huge Computations and Large Data Sets
•  Hundreds of special-purpose computations
•  The computations process large amounts of raw data
–  Crawled documents
–  Web request logs
Huge Computations and Large Data Sets
•  Compute various types of derived data
–  Inverted indices
–  Various representations of the graph structure of web documents
–  Summaries of the number of pages crawled per host
–  Set of most frequent queries in a given day
Huge Computations and Large Data Sets
¨ Computations are straightforward, but
¤ The input data is huge
¤ The computations have to be distributed across hundreds of thousands of machines in order to finish in a reasonable amount of time
CHALLENGES
Node failures
•  A single server can stay up for 3 years
•  1000 servers in a cluster => 1 failure per day
•  100K servers => 100 failures per day
•  How do we store data persistently and keep it available when nodes fail?
•  How do we deal with node failures during a long-running computation?
Network Bottleneck
•  Network bandwidth = 1 Gbps
•  Moving 10 TB takes approximately 1 day
•  Distributed Programming is hard
MOTIVATION
Motivation: Large scale data processing
¨ The framework should
¤ Run on a large cluster of machines
¤ Be highly scalable
¤ Handle terabytes of data on thousands of machines
¤ Be easy for programmers to use
¤ (Hundreds of MapReduce programs have already been implemented)
¨ Many real-world tasks are expressible in this model
Motivation: Large scale data processing
¨ The run-time system takes care of
¤ Handling large amounts of input data
¤ Scheduling the program's execution across a set of machines
¤ Handling machine failures
¤ Managing the required inter-machine communication
¨ Big benefit
¤ Programmers with no experience in parallel and distributed systems can easily utilize the resources of a large distributed system
APPLICATIONS
Motivation - Distributed Task Execution
•  Problem Statement: There is a large computational problem that can be divided into multiple parts, and the results from all parts can be combined to obtain a final result.
•  Applications:
– Physical and Engineering Simulations, Numerical
Analysis, Performance Testing
– Large scale indexing
Summary and Aggregation
•  There are a number of documents, where each document is a set of terms. It is required to calculate the total number of occurrences of each term across all documents. Alternatively, it can be an arbitrary function of the terms: for instance, given a log file where each record contains a response time, calculate the average response time.
•  Applications:
– Log Analysis, Data Querying, ETL, Data Validation
Filtering, Parsing and Validation
•  There is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently of the other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and conversion from one format to another.
•  Applications
– Log Analysis, Data Querying, ETL, Data Validation
Iterative Message Passing (Graph Processing)
•  There is a network of entities and relationships between them. It is required to calculate a state for each entity on the basis of the properties of the other entities in its neighborhood. This state can represent a distance to other nodes, an indication that there is a neighbor with certain properties, a characteristic of neighborhood density, and so on.
•  Applications
– Social Network Analysis
– Supply Chain
Cross-Correlation
•  There is a set of tuples of items. For each possible pair of items, calculate the number of tuples in which the two items co-occur. If the total number of items is N, then N*N values should be reported.
•  Applications:
– Text Analysis, Market Analysis
Relational MapReduce Patterns
•  Selection
•  Projection
•  Union
•  Intersection
•  Difference
•  GroupBy and Aggregation
•  Joining
SOLUTION
Solution - MapReduce
•  Addresses the challenges of cluster computing
•  Store data redundantly on multiple nodes for
persistence and availability
•  Move computation close to the data – to
minimize data movement
•  Simple Programming Model – to hide the
complexity of all this magic
Solution - MapReduce
¨ The MapReduce abstraction allows users to express simple computations while hiding, in a library, the messy details of
¤ Parallelization
¤ Fault tolerance
¤ Data distribution
¤ Load balancing
History of MapReduce
•  The speed at which MapReduce has been adopted is remarkable.
•  It went from an interesting paper from Google in 2004 to a widely adopted industry standard in distributed
data processing in 2012.
•  The actual origins of MapReduce are arguable, but the paper that most cite as the one that started us down
this journey is MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat
in 2004.
•  This paper described how Google split, processed, and aggregated their data set of mind-boggling size.
•  Shortly after the release of the paper, a free and open source software pioneer by the name of Doug Cutting
started working on a MapReduce implementation to solve scalability in another project he was working on
called Nutch, an effort to build an open source search engine.
•  Over time, and with some investment by Yahoo!, Hadoop split out as its own project and eventually became a top-level Apache Foundation project.
•  Today, numerous independent people and organizations contribute to Hadoop. Every new release adds
functionality and boosts performance.
•  Several other open source projects have been built with Hadoop at their core, and this list is continually
growing.
•  Some of the more popular ones include Pig, Hive, HBase, Mahout, and ZooKeeper.
•  Doug Cutting and other Hadoop experts have mentioned several times that Hadoop is becoming the kernel of
a distributed operating system in which distributed applications can be built.
Programming Model
•  The programming model addresses the
following
– The user should not have to worry about the framework
– The user should be able to focus on writing business logic
– The user should have good productivity
– The model should be simple
Distribute the work
•  Spread the work on more than one machine
•  Challenges
– Communication and co-ordination
– Recovering from machine failure
– Status and progress reporting
– Debugging
– Optimization
– Locality
•  These challenges are largely the same for every problem
Compute Machines
•  CPUs – typically 2 or 4
–  Typically hyper-threaded or dual-core
•  Multiple locally attached hard disks
•  4 GB to 16 GB of RAM
•  Challenges
–  Single-thread performance doesn't matter
–  Unreliable machines
•  If one server fails roughly once per 1000 days,
•  then with 10,000 servers you lose roughly 10 per day
–  Ultra-reliable hardware does not help much
•  It may fail less often,
•  but the software still needs to be fault tolerant
•  Commodity machines give more performance per dollar
What is MapReduce
•  “A simple and powerful interface that enables
automatic parallelization and distribution of large-scale
computations, combined with an implementation of this
interface that achieves high performance on large
clusters of commodity PCs.”
–  Dean and Ghemawat, “MapReduce: Simplified Data Processing on
Large Clusters”, Google Inc.
•  In simple terms
–  A distributed/parallel programming model and associated
implementation
PROGRAMMING MODEL
MapReduce – Programming model
•  Process Data using Special Map() and Reduce()
functions
– The map function is called on every item in the input
and emits a series of intermediate key/value pairs
– All values associated with a given key are grouped
together
– The Reduce function is called on every unique key,
and its value list, and emits a value that is added to
the output
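To make the model concrete, the following is a minimal, framework-free sketch of word count in plain Java (all names here are illustrative, not part of the slides): map() turns each input line into (word, 1) pairs, the pairs are grouped by key, and reduce() collapses each key's value list into a count.

import java.util.*;

// Framework-free illustration of the map -> group-by-key -> reduce flow.
public class WordCountModel {

    // map: (offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");

        // Stand-in for the shuffle phase: group intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
}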
MapReduce
•  The messy details are transparent
– Automatic parallelization
– Load balancing
– Network and disk transfer optimization
– Handling machine failures
– Robustness
MapReduce
•  Map(k1,v1) --> list(k2,v2)
•  Reduce(k2, list(v2)) --> list(v2)
•  Runtime System
– Partitions input data
– Schedules execution across a set of machines
– Handles machine failure
– Manages interprocess communication
Benefits
•  Greatly Reduces parallel Programming
complexity
– Reduces synchronization complexity
– Automatically partitions data
– Provides failure transparency
– Handles load balancing
MapReduce - Execution Overview
Step 1:
•  The user program, via the MapReduce library, splits the input data.
[Diagram: the User Program splits the Input Data into Shards 0–6]
* Shards/splits are typically 16–64 MB in size
Execution overview – step 2
•  The library creates copies of the program across a cluster of machines.
•  One copy becomes the ‘Master’; the others become workers.
[Diagram: the User Program forks one Master and many Workers]
Execution overview – Step 3
•  Master distributes M map and R Reduce tasks to
idle workers
– M = the number of input splits
– R = the number of parts the intermediate key space is divided into
[Diagram: the Master sends a do_map_task message to an idle Worker]
Execution Overview – Step 4
Each map-task worker reads assigned input shard
and outputs intermediate key/value pairs.
–  Output buffered in RAM
[Diagram: a Map worker reads Input Split 0 and emits key/value pairs]
Execution overview – Step 5
•  Each worker flushes intermediate values,
partitioned into R regions, to disk and notifies
the Master process.
[Diagram: the Map worker writes R partitioned regions to local storage and reports the disk locations to the Master]
Execution overview – Step 6
•  Master process gives disk locations to an
available reduce-task worker who reads all
associated intermediate data.
[Diagram: the Master passes the disk locations to a Reduce worker, which reads the intermediate data from remote storage]
Execution overview – Step 7
•  Each reduce-task worker shuffles and sorts its intermediate data, then calls the reduce function once for each unique key, passing in the key and its associated list of values.
•  The reduce function's output is appended to the reduce task's partition output file.
[Diagram: the Reduce worker sorts its data and appends results to its partition output file]
Execution overview – Step 8
•  Master process wakes up user process when all
tasks have completed. Output contained in R
output files.
[Diagram: the Master wakes up the User Program; results are in the output files]
SAMPLE USE CASE
•  0067011990999991950051507004+68750+023550FM-12+03
8299999V0203301N00671220001CN9999999N9+00001+99
999999999
•  0043011990999991950051512004+68750+023550FM-12+03
8299999V0203201N00671220001CN9999999N9+00221+99
999999999
•  0043011990999991950051518004+68750+023550FM-12+03
8299999V0203201N00261220001CN9999999N9-00111+999
99999999
•  0043012650999991949032412004+62300+010750FM-12+04
8599999V0202701N00461220001CN0500001N9+01111+99
999999999
•  0043012650999991949032418004+62300+010750FM-12+04
8599999V0202701N00461220001CN0500001N9+00781+99
999999999
Weather Data
Data Set
¡  0043
¡  012650 - Weather station identifier
¡  99999 – other identifier
¡  19490324 – observation date
¡  1800 – observation time
¡  4
¡  +62300 – latitude
¡  +010750 – longitude
¡  ..
¡  ..
¡  Quality code
¡  ..
¡  Air temperature
¡  ..
¡  ,..Atmospheric pressure
¡  etc
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Data → Map output → Reduce output
Maximum global temperature recorded each year:
1949  111
1950  22
Map Reduce Logical Flow
Map function
Reduce function
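The figure with the actual Map and Reduce functions is not reproduced here, so below is a hedged sketch of what such classes might look like for the NewMaxTemperature example, using the old org.apache.hadoop.mapred API that the rest of these slides assume; the field offsets are taken from the NCDC record layout sketched in the "Data Set" slide and should be treated as assumptions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical mapper: extracts (year, air temperature) from each weather record.
class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);                  // observation year
    int airTemp = (line.charAt(87) == '+')
        ? Integer.parseInt(line.substring(88, 92))          // drop the explicit '+' sign
        : Integer.parseInt(line.substring(87, 92));
    String quality = line.substring(92, 93);
    if (airTemp != MISSING && quality.matches("[01459]")) { // skip missing/suspect readings
      output.collect(new Text(year), new IntWritable(airTemp));
    }
  }
}

// Hypothetical reducer: keeps the maximum temperature seen for each year.
class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));         // e.g., (1949, 111)
  }
}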
Map inputs and outputs
•  Map input comes from HDFS
•  Map output is written to local disk
•  Reduce reads its input from all the map nodes' disks via an RPC mechanism
•  Reduce output is stored in HDFS
Single Reduce task
Multiple Reduce Tasks
No Reduce Task
Combiner Function
•  Allows the output of the Map to be combined (pre-aggregated) locally before it is sent to the reducers
•  Its form is almost identical to that of the Reduce task (see the sketch below)
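A minimal driver fragment wiring a combiner, assuming the MaxTemperatureReducer sketched earlier: because max() is commutative and associative, the reducer class itself can double as the combiner.

// Driver fragment (class names are the hypothetical ones from the earlier sketch).
JobConf conf = new JobConf(NewMaxTemperature.class);
conf.setCombinerClass(MaxTemperatureReducer.class);   // pre-aggregate map output on each map node
conf.setReducerClass(MaxTemperatureReducer.class);    // same logic runs as the final reduce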
Installing ssh
•  sudo apt-get install ssh
•  rpm -i ssh.rpm
•  Passwordless login
– ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
– cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
•  Test with ssh localhost
Installing
•  tar xzf hadoop-0.20.0.tar.gz
•  export JAVA_HOME=/jdk1.6path
•  export HADOOP_INSTALL=/home/reddyraja/hadoop-0.20.0
•  export PATH=$PATH:$HADOOP_INSTALL/bin
•  Check hadoop
– hadoop version
Standalone
¡ Everything runs in a single JVM
¡ Suitable for development of MapReduce programs
¡ Easy to test and debug
¡ No daemons to run
¡ Commands
§  Compile
▪  javac -classpath $HADOOP_HOME/hadoop-0.20.0.jar -d bin NewMaxTemperature*.java
§  Create the jar
▪  jar -cvf maxtemp.jar -C bin .
§  Run the example
▪  hadoop jar maxtemp.jar NewMaxTemperature input output
Pseudo distributed
•  Hadoop daemons run on local machine
•  Simulates the cluster
Hadoop and others… to avoid confusion
Google technology → Hadoop equivalent
•  MapReduce → Hadoop MapReduce
•  GFS → HDFS
•  BigTable → HBase
•  Chubby → ZooKeeper
MapReduce Terms
¨ PayLoad – Applications implement the Map and Reduce functions; these form the core of the job
¨ Mapper – Maps input key/value pairs to a set of intermediate key/value pairs
¨ NameNode – Node that manages the HDFS file system
¨ DataNode – Node where the data is present
¨ MasterNode – Node where the JobTracker runs
¨ SlaveNode – Node where the Map and Reduce tasks run
MapReduce Terms continued…
¨ JobTracker – Tracks and assigns jobs to TaskTrackers; schedules jobs
¨ TaskTracker – Tracks its tasks and reports status to the JobTracker
¨ Job – A “full program”: an execution of a Mapper and Reducer across a data set
¨ Task – An execution of a Mapper or a Reducer on a slice of data
¨ Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
•  Running “Word Count” across 20 files is one
job
•  20 files to be mapped imply 20 map tasks +
some number of reduce tasks
•  At least 20 map task attempts will be
performed… more if a machine crashes, etc.
•  Map and Reduce functions inside WordCount
Task Attempts
•  A particular task will be attempted at least once,
possibly more times if it crashes
–  If the same input causes crashes over and over, that input will
eventually be abandoned
•  Multiple attempts at one task may occur in parallel with
speculative execution turned on
–  Task ID from TaskInProgress is not a unique identifier; don’t
use it that way
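Speculative execution is what launches those parallel attempts; a small, hedged sketch of toggling it per job through the JobConf (property names as used in the 0.20-era releases these slides are based on):

// Driver fragment: allow duplicate (speculative) map attempts, but not duplicate reduce attempts.
JobConf conf = new JobConf();
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);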
MapReduce: High Level
[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes run the task instances]
Hadoop Deployment
Hadoop Stack
HDFS Concepts
¡ Distributed FileSystem
¡ Block storage
¡ NameNodes and DataNodes
¡ Command-line interface
¡ Basic FileSystem operations
¡ Java interfaces
HDFS Architecture
Node-to-Node Communication
•  Hadoop uses its own RPC protocol
•  All communication begins in slave nodes
– Prevents circular-wait deadlock
– Slaves periodically poll for “status” message
•  Classes must provide explicit serialization
Nodes, Trackers, Tasks
•  Master node runs JobTracker instance, which
accepts Job requests from clients
•  TaskTracker instances run on slave nodes
•  TaskTracker forks separate Java process for task
instances
Job Distribution
•  MapReduce programs are contained in a Java “jar” file +
an XML file containing serialized program configuration
options
•  Running a MapReduce job places these files into the
HDFS and notifies TaskTrackers where to retrieve the
relevant program code
•  … Where’s the data distribution?
Data Distribution
•  Implicit in design of MapReduce!
– All mappers are equivalent; so map whatever data is
local to a particular node in HDFS
•  If lots of data does happen to pile up on the
same node, nearby nodes will map instead
– Data transfer is handled implicitly by HDFS
Configuring With JobConf
•  MR Programs have many configurable options
•  JobConf objects hold (key, value) components mapping String → value
–  e.g., “mapred.map.tasks” → 20
–  JobConf is serialized and distributed before running the job
•  Objects implementing JobConfigurable can retrieve
elements from a JobConf
Job Launch Process: Client
•  Client program creates a JobConf
– Identify classes implementing Mapper and Reducer
interfaces
•  JobConf.setMapperClass(), setReducerClass()
– Specify inputs, outputs
•  JobConf.setInputPath(), setOutputPath()
– Optionally, other options too:
•  JobConf.setNumReduceTasks(),
JobConf.setOutputFormat()…
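Putting these calls together, here is a hedged sketch of a client driver for the max-temperature job; the class names reuse the hypothetical Mapper/Reducer sketched earlier, and the FileInputFormat.setInputPaths()/FileOutputFormat.setOutputPath() helpers stand in for the setInputPath()/setOutputPath() calls named on the slide.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver wiring up the JobConf as described in the slide above.
public class NewMaxTemperature {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NewMaxTemperature.class);
    conf.setJobName("max temperature");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory (must not already exist)

    conf.setMapperClass(MaxTemperatureMapper.class);     // hypothetical mapper sketched earlier
    conf.setReducerClass(MaxTemperatureReducer.class);   // hypothetical reducer sketched earlier
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);   // blocks until the job completes
  }
}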
Job Launch Process: JobClient
•  Pass JobConf to JobClient.runJob() or
submitJob()
– runJob() blocks, submitJob() does not
•  JobClient:
– Determines proper division of input into InputSplits
– Sends job data to master JobTracker server
Job Launch Process: JobTracker
•  JobTracker:
– Inserts jar and JobConf (serialized to XML) in shared
location
– Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
•  TaskTrackers running on slave nodes periodically
query JobTracker for work
•  Retrieve job-specific jar and config
•  Launch task in separate instance of Java
– main() is provided by Hadoop
Job Launch Process: Task
•  TaskTracker.Child.main():
– Sets up the child TaskInProgress attempt
– Reads XML configuration
– Connects back to necessary MapReduce
components via RPC
– Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
•  TaskRunner, MapTaskRunner, MapRunner work in
a daisy-chain to launch your Mapper
– Task knows ahead of time which InputSplits it should
be mapping
– Calls Mapper once for each record retrieved from
the InputSplit
•  Running the Reducer is much the same
Creating the Mapper
•  You provide the instance of Mapper
– Should extend MapReduceBase
•  One instance of your Mapper is initialized by the
MapTaskRunner for a TaskInProgress
– Exists in separate process from all other instances of
Mapper – no data sharing!
What is Writable?
•  Hadoop defines its own “box” classes for strings
(Text), integers (IntWritable), etc.
•  All values are instances of Writable
•  All keys are instances of WritableComparable
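Beyond the built-in box classes, keys can be custom types as long as they honor the Writable/WritableComparable contract. The following is an illustrative sketch (not from the slides) of a composite key built from two box classes:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a station id paired with a year.
public class StationYearWritable implements WritableComparable<StationYearWritable> {
  private final Text station = new Text();
  private final IntWritable year = new IntWritable();

  public void set(String s, int y) { station.set(s); year.set(y); }

  @Override public void write(DataOutput out) throws IOException {
    station.write(out);        // serialize field by field
    year.write(out);
  }

  @Override public void readFields(DataInput in) throws IOException {
    station.readFields(in);    // deserialize in the same order
    year.readFields(in);
  }

  @Override public int compareTo(StationYearWritable other) {
    int cmp = station.compareTo(other.station);
    return cmp != 0 ? cmp : Integer.compare(year.get(), other.year.get());
  }

  // Needed if this key is routed by HashPartitioner.
  @Override public int hashCode() { return station.hashCode() * 31 + year.get(); }
}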
Data to the mapper
[Diagram: the InputFormat divides each input file into InputSplits; a RecordReader turns each split into records that feed a Mapper, which emits intermediates]
Reading Data
•  Data sets are specified by InputFormats
– Defines input data (e.g., a directory)
– Identifies partitions of the data that form an
InputSplit
– Factory for RecordReader objects to extract (k, v)
records from the input source
InputFormat
¨ Describes the input specification for a MapReduce job
¨ MapReduce relies on the InputFormat to
¤ Validate the input specification of the job
¤ Split the input files into logical InputSplits, each of which is then assigned to an individual Mapper
¤ Provide the RecordReader implementation used to glean input records from the logical split for processing by the Mapper
¨ FileInputFormat – splits files based on file size
¤ The FileSystem block size is the upper limit
¤ The lower limit can be set via mapred.min.split.size
FileInputFormat
•  TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
•  KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
•  SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata
•  SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs
•  FileInputFormat will read all files out of a
specified directory and send them to the
mapper
•  Delegates filtering this file list to a method
subclasses may override
– e.g., Create your own “xyzFileInputFormat” to read
*.xyz from directory list
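One way to express the “xyzFileInputFormat” idea without subclassing is the input-path filter hook of the old API; a hedged sketch (assumed usage, illustrative names):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: only feed *.xyz files from the input directory to the mappers.
public class XyzPathFilter implements PathFilter {
  @Override public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}

// In the driver (assumed usage):
//   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);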
InputSplit
¨ One MapTask for each split
¨ For each split, the framework calls
¤ Setup (once)
¤ Map (once per record in the InputSplit)
¤ Cleanup (once)
¨ Intermediate values
¤ Are grouped by the framework
¤ Are passed to the Reducer
¤ Control sorting using a RawComparator class
¤ Use a Combiner class to perform local aggregation
¨ Use a Partitioner to control which outputs go to which Reducer
¨ Use CompressionCodecs to compress the intermediate output
Input Split Size
•  FileInputFormat will divide large files into chunks
– Exact size controlled by mapred.min.split.size
•  RecordReaders receive file, offset, and length of
chunk
•  Custom InputFormat implementations may
override split size – e.g., “NeverChunkFile” (see the sketch below)
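A hedged sketch of such a “NeverChunkFile”-style format, built by overriding isSplitable() on the old-API TextInputFormat (class name is illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Each file becomes exactly one split, so a single mapper sees the whole file
// regardless of mapred.min.split.size or the HDFS block size.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // never split: one InputSplit per file
  }
}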
Record Readers
•  Each InputFormat provides its own RecordReader
implementation
– Provides (unused?) capability multiplexing
•  LineRecordReader – Reads a line from a text file
•  KeyValueRecordReader – Used by
KeyValueTextInputFormat
Partitioner
•  Controls the partitioning of the intermediate map-output keys
•  The key is used to derive the partition, typically via a hash function
•  The total number of partitions is the same as the number of reduce tasks
Reduce
¨ Reduces a set of intermediate values which share a
key to a smaller set of values
¨ Has 3 phases
¤ Shuffle
n Copies the sorted output from each mapper using HTTP
across the network
¤ Sort
n Sorts reduce inputs by keys
n Shuffle and sort phases occur simultaneously
n Secondary Sort using custom functions
¤ Reduce
n Framework class Reduce for each key and collection of values
Sending Data To Reducers
•  Map function receives OutputCollector object
– OutputCollector.collect() takes (k, v) elements
•  Any (WritableComparable,Writable) can be used
WritableComparator
•  Compares WritableComparable data
– Will call WritableComparable.compare()
– Can provide fast path for serialized data
•  JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
•  Reporter object sent to Mapper allows simple
asynchronous feedback
– incrCounter(Enum key, long amount)
– setStatus(String msg)
•  Allows self-identification of input
– InputSplit getInputSplit()
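A short, hypothetical fragment of a map() method using the Reporter hooks above (the enum, counter names, and length threshold are illustrative only):

// Inside a Mapper implementation (old API); RecordQuality is a made-up counter group.
enum RecordQuality { GOOD, MALFORMED }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  if (value.toString().length() < 92) {                 // assumed minimum record length
    reporter.incrCounter(RecordQuality.MALFORMED, 1);   // asynchronous counter update
    reporter.setStatus("skipping malformed record");    // human-readable task status
    return;
  }
  reporter.incrCounter(RecordQuality.GOOD, 1);
  // ... normal map logic ...
}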
Partition and shuffle
[Diagram: each Mapper's intermediates pass through a Partitioner; shuffling routes each partition's intermediates to the corresponding Reducer]
Partitioner
•  int getPartition(key, val, numPartitions)
– Outputs the partition number for a given key
– One partition == values sent to one Reduce task
•  HashPartitioner used by default
– Uses key.hashCode() to return partition num
•  JobConf sets Partitioner implementation
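For cases where hashCode() is not the right routing rule, a custom Partitioner can be plugged in via the JobConf; a hedged sketch using the old API (the first-letter scheme is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: route keys by their first character, so all keys sharing
// an initial letter land in the same reduce task (mirrors HashPartitioner's approach).
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask with MAX_VALUE to keep the result non-negative, as HashPartitioner does.
    return (Character.toLowerCase(key.toString().charAt(0)) & Integer.MAX_VALUE) % numPartitions;
  }

  @Override
  public void configure(JobConf job) { /* no per-job configuration needed */ }
}

// Driver wiring (assumed): conf.setPartitionerClass(FirstLetterPartitioner.class);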
Reduction
•  reduce( WritableComparable key,
Iterator values,
OutputCollector output,
Reporter reporter)
•  Keys & values sent to one partition all go to the
same reduce task
•  Calls are sorted by key – “earlier” keys are
reduced and output before “later” keys
Finally: Writing The Output
[Diagram: each Reducer writes its output file through a RecordWriter supplied by the OutputFormat]
OutputFormat
•  Analogous to InputFormat
•  TextOutputFormat – Writes “key \t value \n” lines to the output file
•  SequenceFileOutputFormat – Uses a binary format
to pack (k, v) pairs
•  NullOutputFormat – Discards output
Installing Hadoop
•  http://juliensimon.blogspot.in/2011/01/installing-
hadoop-on-windows-cygwin.html
•  http://blog.benhall.me.uk/2011/01/installing-
hadoop-0210-on-windows_18.html
