Introduction
MapReduce
LARGE DATA SETS
Sample Data Set
•  10 billion web pages
•  Average size - 20 KB
•  Total size = 10 billion × 20 KB = 200 TB
•  Disk read bandwidth = 50 MB/sec
•  Time to read = 4 million sec ≈ 46+ days
•  Doing anything useful with the data takes even longer
Huge Computations and Large Data Sets
•  Hundreds of special-purpose computations
•  The computations process large amounts of raw data
–  Crawled documents
–  Web request logs
Huge Computations and Large Data Sets
•  Compute various types of derived data
–  Inverted indices
–  Various representations of the graph structure of web documents
–  Summaries of the number of pages crawled per host
–  Set of most frequent queries in a given day
Huge Computations and Large Data Sets
¨ Computations are straightforward, but
¤ The input data is huge
¤ The computations have to be distributed across hundreds of thousands of machines in order to finish in a reasonable amount of time
CHALLENGES
Node failures
•  A single server can stay up for 3 years
•  1000 servers in a cluster => 1 failure per day
•  100K servers => 100 failures per day
•  How do we store data persistently and keep it available when nodes fail?
•  How do we deal with node failures during a long-running computation?
Network Bottleneck
•  Network bandwidth = 1 Gbps
•  Moving 10 TB takes approximately 1 day
•  Distributed Programming is hard
MOTIVATION
Motivation: Large scale data processing
¨ The framework should
¤ Run on a large cluster of machines
¤ Be highly scalable
¤ Handle terabytes of data on thousands of machines
¤ Be easy for programmers to use
¤ (Hundreds of MapReduce programs have already been implemented)
¨ Many real-world tasks are expressible in this model
Motivation: Large scale data processing
¨ The run-time system takes care of
¤ Handling large amounts of input data
¤ Scheduling the program's execution across a set of machines
¤ Handling machine failures
¤ Managing the required inter-machine communication
¨ Big benefit
¤ Programmers with no experience in parallel and distributed systems can easily utilize the resources of a large distributed system
APPLICATIONS
Motivation - Distributed Task Execution
•  Problem Statement: There is a large computational problem that can be divided into multiple parts, and the results from all parts can be combined to obtain a final result.
•  Applications:
– Physical and Engineering Simulations, Numerical
Analysis, Performance Testing
– Large scale indexing
Summary and Aggregation
•  There are a number of documents, where each document is a set of terms. It is required to calculate the total number of occurrences of each term across all documents. Alternatively, it can be an arbitrary function of the terms: for instance, given a log file where each record contains a response time, calculate the average response time.
•  Applications:
– Log Analysis, Data Querying, ETL, Data Validation
Filtering, Parsing and Validation
•  There is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently of the other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and conversion from one format to another.
•  Applications
– Log Analysis, Data Querying, ETL, Data Validation
Iterative Message Passing (Graph Processing)
•  There is a network of entities and relationships between them. It is required to calculate a state for each entity on the basis of the properties of the other entities in its neighborhood. This state can represent a distance to other nodes, an indication that there is a neighbor with certain properties, a characteristic of neighborhood density, and so on.
•  Applications
– Social Network Analysis
– Supply Chain
Cross-Correlation
•  There is a set of tuples of items. For each possible pair of items, calculate the number of tuples in which the two items co-occur. If the total number of items is N, then N*N values should be reported.
•  Applications:
– Text Analysis, Market Analysis
Relational MapReduce Patterns
•  Selection
•  Projection
•  Union
•  Intersection
•  Difference
•  GroupBy and Aggregation
•  Joining
SOLUTION
Solution - MapReduce
•  Addresses the challenges of cluster computing
•  Store data redundantly on multiple nodes for
persistence and availability
•  Move computation close to the data – to
minimize data movement
•  Simple Programming Model – to hide the
complexity of all this magic
Solution - MapReduce
¨ The MapReduce abstraction allows users to express simple computations while hiding, in a library, the messy details of
¤ Parallelization
¤ Fault tolerance
¤ Data distribution
¤ Load balancing
History of MapReduce
•  The speed at which MapReduce has been adopted is remarkable.
•  It went from an interesting paper from Google in 2004 to a widely adopted industry standard in distributed
data processing in 2012.
•  The actual origins of MapReduce are arguable, but the paper that most cite as the one that started us down
this journey is MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat
in 2004.
•  This paper described how Google split, processed, and aggregated their data set of mind-boggling size.
•  Shortly after the release of the paper, a free and open source software pioneer by the name of Doug Cutting
started working on a MapReduce implementation to solve scalability in another project he was working on
called Nutch, an effort to build an open source search engine.
•  Over time, and with some investment by Yahoo!, Hadoop split out as its own project and eventually became a top-level Apache Foundation project.
•  Today, numerous independent people and organizations contribute to Hadoop. Every new release adds
functionality and boosts performance.
•  Several other open source projects have been built with Hadoop at their core, and this list is continually
growing.
•  Some of the more popular ones include Pig, Hive, HBase, Mahout, and ZooKeeper.
•  Doug Cutting and other Hadoop experts have mentioned several times that Hadoop is becoming the kernel of
a distributed operating system in which distributed applications can be built.
Programming Model
•  The programming model addresses the
following
– The user should not have to worry about the framework
– The user should be able to focus on writing business logic
– The user should have good productivity
– The model should be simple
Distribute the work
•  Spread the work on more than one machine
•  Challenges
– Communication and co-ordination
– Recovering from machine failure
– Status and progress reporting
– Debugging
– Optimization
– Locality
•  These challenges are largely the same for every problem
Compute Machines
•  CPUs – typically 2 or 4
–  Typically hyper-threaded or dual-core
•  Multiple locally attached hard disks
•  4 GB to 16 GB of RAM
•  Challenges
–  Single-thread performance doesn't matter
–  Unreliable machines
•  If one server fails roughly once per 1000 days,
•  then with 10,000 servers you lose roughly 10 per day
–  Ultra-reliable hardware does not help much
•  It may fail less often,
•  but the software still needs to be fault tolerant
•  Commodity machines give more performance per dollar
What is MapReduce
•  “A simple and powerful interface that enables
automatic parallelization and distribution of large-scale
computations, combined with an implementation of this
interface that achieves high performance on large
clusters of commodity PCs.”
–  Dean and Ghemawat, “MapReduce: Simplified Data Processing on
Large Clusters”, Google Inc.
•  In simple terms
–  A distributed/parallel programming model and associated
implementation
PROGRAMMING MODEL
MapReduce – Programming model
•  Process Data using Special Map() and Reduce()
functions
– The map function is called on every item in the input
and emits a series of intermediate key/value pairs
– All values associated with a given key are grouped
together
– The Reduce function is called on every unique key,
and its value list, and emits a value that is added to
the output
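To make the model concrete, the following is a minimal, framework-free sketch of word count in plain Java (all names here are illustrative, not part of the slides): map() turns each input line into (word, 1) pairs, the pairs are grouped by key, and reduce() collapses each key's value list into a count.

import java.util.*;

// Framework-free illustration of the map -> group-by-key -> reduce flow.
public class WordCountModel {

    // map: (offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");

        // Stand-in for the shuffle phase: group intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
}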
MapReduce
•  The messy details are transparent
– Automatic parallelization
– Load balancing
– Network and disk transfer optimization
– Handling machine failures
– Robustness
MapReduce
•  Map(k1,v1) --> list(k2,v2)
•  Reduce(k2, list(v2)) --> list(v2)
•  Runtime System
– Partitions input data
– Schedules execution across a set of machines
– Handles machine failure
– Manages interprocess communication
Benefits
•  Greatly Reduces parallel Programming
complexity
– Reduces synchronization complexity
– Automatically partitions data
– Provides failure transparency
– Handles load balancing
MapReduce - Execution Overview
Step 1:
•  The user program, via the MapReduce library, splits the input data.
[Diagram: the User Program splits the Input Data into Shards 0–6]
* Shards/splits are typically 16–64 MB in size
Execution overview – step 2
•  The library creates copies of the program across a cluster of machines.
•  One copy becomes the ‘Master’; the others become workers.
[Diagram: the User Program forks one Master and many Workers]
Execution overview – Step 3
•  Master distributes M map and R Reduce tasks to
idle workers
– M = the number of input splits
– R = the number of parts the intermediate key space is divided into
[Diagram: the Master sends a do_map_task message to an idle Worker]
Execution Overview – Step 4
Each map-task worker reads assigned input shard
and outputs intermediate key/value pairs.
–  Output buffered in RAM
[Diagram: a Map worker reads Input Split 0 and emits key/value pairs]
Execution overview – Step 5
•  Each worker flushes intermediate values,
partitioned into R regions, to disk and notifies
the Master process.
[Diagram: the Map worker writes R partitioned regions to local storage and reports the disk locations to the Master]
Execution overview – Step 6
•  Master process gives disk locations to an
available reduce-task worker who reads all
associated intermediate data.
[Diagram: the Master passes the disk locations to a Reduce worker, which reads the intermediate data from remote storage]
Execution overview – Step 7
•  Each reduce-task worker shuffles and sorts its intermediate data, then calls the reduce function once for each unique key, passing in the key and its associated list of values.
•  The reduce function's output is appended to the reduce task's partition output file.
[Diagram: the Reduce worker sorts its data and appends results to its partition output file]
Execution overview – Step 8
•  Master process wakes up user process when all
tasks have completed. Output contained in R
output files.
[Diagram: the Master wakes up the User Program; results are in the output files]
SAMPLE USE CASE
•  0067011990999991950051507004+68750+023550FM-12+03
8299999V0203301N00671220001CN9999999N9+00001+99
999999999
•  0043011990999991950051512004+68750+023550FM-12+03
8299999V0203201N00671220001CN9999999N9+00221+99
999999999
•  0043011990999991950051518004+68750+023550FM-12+03
8299999V0203201N00261220001CN9999999N9-00111+999
99999999
•  0043012650999991949032412004+62300+010750FM-12+04
8599999V0202701N00461220001CN0500001N9+01111+99
999999999
•  0043012650999991949032418004+62300+010750FM-12+04
8599999V0202701N00461220001CN0500001N9+00781+99
999999999
Weather Data
Data Set
¡  0043
¡  012650 - Weather station identifier
¡  99999 – other identifier
¡  19490324 – observation date
¡  1800 – observation time
¡  4
¡  +62300 – latitude
¡  +010750 – longitude
¡  ..
¡  ..
¡  Quality code
¡  ..
¡  Air temperature
¡  ..
¡  ,..Atmospheric pressure
¡  etc
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Data → Map output → Reduce output
Maximum global temperature recorded each year:
1949  111
1950  22
Map Reduce Logical Flow
Map function
Reduce function
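The figure with the actual Map and Reduce functions is not reproduced here, so below is a hedged sketch of what such classes might look like for the NewMaxTemperature example, using the old org.apache.hadoop.mapred API that the rest of these slides assume; the field offsets are taken from the NCDC record layout sketched in the "Data Set" slide and should be treated as assumptions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical mapper: extracts (year, air temperature) from each weather record.
class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);                  // observation year
    int airTemp = (line.charAt(87) == '+')
        ? Integer.parseInt(line.substring(88, 92))          // drop the explicit '+' sign
        : Integer.parseInt(line.substring(87, 92));
    String quality = line.substring(92, 93);
    if (airTemp != MISSING && quality.matches("[01459]")) { // skip missing/suspect readings
      output.collect(new Text(year), new IntWritable(airTemp));
    }
  }
}

// Hypothetical reducer: keeps the maximum temperature seen for each year.
class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));         // e.g., (1949, 111)
  }
}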
Map inputs and outputs
•  Map input comes from HDFS
•  Map output is written to local disk
•  Reduce reads its input from all the map nodes' disks via an RPC mechanism
•  Reduce output is stored in HDFS
Single Reduce task
Multiple Reduce Tasks
No Reduce Task
Combiner Function
•  Allows the output of the Map to be combined (pre-aggregated) locally before it is sent to the reducers
•  Its form is almost identical to that of the Reduce task (see the sketch below)
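A minimal driver fragment wiring a combiner, assuming the MaxTemperatureReducer sketched earlier: because max() is commutative and associative, the reducer class itself can double as the combiner.

// Driver fragment (class names are the hypothetical ones from the earlier sketch).
JobConf conf = new JobConf(NewMaxTemperature.class);
conf.setCombinerClass(MaxTemperatureReducer.class);   // pre-aggregate map output on each map node
conf.setReducerClass(MaxTemperatureReducer.class);    // same logic runs as the final reduce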
Installing ssh
•  sudo apt-get install ssh
•  rpm -i ssh.rpm
•  Passwordless login
– ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
– cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
•  Test with ssh localhost
Installing
•  tar xzf hadoop-0.20.0.tar.gz
•  export JAVA_HOME=/jdk1.6path
•  export HADOOP_INSTALL=/home/reddyraja/hadoop-0.20.0
•  export PATH=$PATH:$HADOOP_INSTALL/bin
•  Check hadoop
– hadoop version
Standalone
¡ Everything runs in a single JVM
¡ Suitable for development of MapReduce programs
¡ Easy to test and debug
¡ No daemons to run
¡ Commands
§  Compile
▪  javac -classpath $HADOOP_HOME/hadoop-0.20.0.jar -d bin NewMaxTemperature*.java
§  Create the jar
▪  jar -cvf maxtemp.jar -C bin .
§  Run the example
▪  hadoop jar maxtemp.jar NewMaxTemperature input output
Pseudo distributed
•  Hadoop daemons run on local machine
•  Simulates the cluster
Hadoop and others… to avoid confusion
Google technology → Hadoop equivalent
•  MapReduce → Hadoop MapReduce
•  GFS → HDFS
•  BigTable → HBase
•  Chubby → ZooKeeper
MapReduce Terms
¨ PayLoad – Applications implement the Map and Reduce functions; these form the core of the job
¨ Mapper – Maps input key/value pairs to a set of intermediate key/value pairs
¨ NameNode – Node that manages the HDFS file system
¨ DataNode – Node where the data is present
¨ MasterNode – Node where the JobTracker runs
¨ SlaveNode – Node where the Map and Reduce tasks run
MapReduce Terms continued…
¨ JobTracker – Tracks and assigns jobs to TaskTrackers; schedules jobs
¨ TaskTracker – Tracks its tasks and reports status to the JobTracker
¨ Job – A “full program”: an execution of a Mapper and Reducer across a data set
¨ Task – An execution of a Mapper or a Reducer on a slice of data
¨ Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
•  Running “Word Count” across 20 files is one
job
•  20 files to be mapped imply 20 map tasks +
some number of reduce tasks
•  At least 20 map task attempts will be
performed… more if a machine crashes, etc.
•  Map and Reduce functions inside WordCount
Task Attempts
•  A particular task will be attempted at least once,
possibly more times if it crashes
–  If the same input causes crashes over and over, that input will
eventually be abandoned
•  Multiple attempts at one task may occur in parallel with
speculative execution turned on
–  Task ID from TaskInProgress is not a unique identifier; don’t
use it that way
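Speculative execution is what launches those parallel attempts; a small, hedged sketch of toggling it per job through the JobConf (property names as used in the 0.20-era releases these slides are based on):

// Driver fragment: allow duplicate (speculative) map attempts, but not duplicate reduce attempts.
JobConf conf = new JobConf();
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);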
MapReduce: High Level
[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes run the task instances]
Hadoop Deployment
Hadoop Stack
HDFS Concepts
¡ Distributed FileSystem
¡ Block storage
¡ NameNodes and DataNodes
¡ Command-line interface
¡ Basic FileSystem operations
¡ Java interfaces
HDFS Architecture
Node-to-Node Communication
•  Hadoop uses its own RPC protocol
•  All communication begins in slave nodes
– Prevents circular-wait deadlock
– Slaves periodically poll for “status” message
•  Classes must provide explicit serialization
Nodes, Trackers, Tasks
•  Master node runs JobTracker instance, which
accepts Job requests from clients
•  TaskTracker instances run on slave nodes
•  TaskTracker forks separate Java process for task
instances
Job Distribution
•  MapReduce programs are contained in a Java “jar” file +
an XML file containing serialized program configuration
options
•  Running a MapReduce job places these files into the
HDFS and notifies TaskTrackers where to retrieve the
relevant program code
•  … Where’s the data distribution?
Data Distribution
•  Implicit in design of MapReduce!
– All mappers are equivalent; so map whatever data is
local to a particular node in HDFS
•  If lots of data does happen to pile up on the
same node, nearby nodes will map instead
– Data transfer is handled implicitly by HDFS
Configuring With JobConf
•  MR Programs have many configurable options
•  JobConf objects hold (key, value) components mapping String → value
–  e.g., “mapred.map.tasks” → 20
–  JobConf is serialized and distributed before running the job
•  Objects implementing JobConfigurable can retrieve
elements from a JobConf
Job Launch Process: Client
•  Client program creates a JobConf
– Identify classes implementing Mapper and Reducer
interfaces
•  JobConf.setMapperClass(), setReducerClass()
– Specify inputs, outputs
•  JobConf.setInputPath(), setOutputPath()
– Optionally, other options too:
•  JobConf.setNumReduceTasks(),
JobConf.setOutputFormat()…
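Putting these calls together, here is a hedged sketch of a client driver for the max-temperature job; the class names reuse the hypothetical Mapper/Reducer sketched earlier, and the FileInputFormat.setInputPaths()/FileOutputFormat.setOutputPath() helpers stand in for the setInputPath()/setOutputPath() calls named on the slide.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver wiring up the JobConf as described in the slide above.
public class NewMaxTemperature {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NewMaxTemperature.class);
    conf.setJobName("max temperature");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory (must not already exist)

    conf.setMapperClass(MaxTemperatureMapper.class);     // hypothetical mapper sketched earlier
    conf.setReducerClass(MaxTemperatureReducer.class);   // hypothetical reducer sketched earlier
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);   // blocks until the job completes
  }
}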
Job Launch Process: JobClient
•  Pass JobConf to JobClient.runJob() or
submitJob()
– runJob() blocks, submitJob() does not
•  JobClient:
– Determines proper division of input into InputSplits
– Sends job data to master JobTracker server
Job Launch Process: JobTracker
•  JobTracker:
– Inserts jar and JobConf (serialized to XML) in shared
location
– Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
•  TaskTrackers running on slave nodes periodically
query JobTracker for work
•  Retrieve job-specific jar and config
•  Launch task in separate instance of Java
– main() is provided by Hadoop
Job Launch Process: Task
•  TaskTracker.Child.main():
– Sets up the child TaskInProgress attempt
– Reads XML configuration
– Connects back to necessary MapReduce
components via RPC
– Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
•  TaskRunner, MapTaskRunner, MapRunner work in
a daisy-chain to launch your Mapper
– Task knows ahead of time which InputSplits it should
be mapping
– Calls Mapper once for each record retrieved from
the InputSplit
•  Running the Reducer is much the same
Creating the Mapper
•  You provide the instance of Mapper
– Should extend MapReduceBase
•  One instance of your Mapper is initialized by the
MapTaskRunner for a TaskInProgress
– Exists in separate process from all other instances of
Mapper – no data sharing!
What is Writable?
•  Hadoop defines its own “box” classes for strings
(Text), integers (IntWritable), etc.
•  All values are instances of Writable
•  All keys are instances of WritableComparable
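Beyond the built-in box classes, keys can be custom types as long as they honor the Writable/WritableComparable contract. The following is an illustrative sketch (not from the slides) of a composite key built from two box classes:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a station id paired with a year.
public class StationYearWritable implements WritableComparable<StationYearWritable> {
  private final Text station = new Text();
  private final IntWritable year = new IntWritable();

  public void set(String s, int y) { station.set(s); year.set(y); }

  @Override public void write(DataOutput out) throws IOException {
    station.write(out);        // serialize field by field
    year.write(out);
  }

  @Override public void readFields(DataInput in) throws IOException {
    station.readFields(in);    // deserialize in the same order
    year.readFields(in);
  }

  @Override public int compareTo(StationYearWritable other) {
    int cmp = station.compareTo(other.station);
    return cmp != 0 ? cmp : Integer.compare(year.get(), other.year.get());
  }

  // Needed if this key is routed by HashPartitioner.
  @Override public int hashCode() { return station.hashCode() * 31 + year.get(); }
}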
Data to the mapper
[Diagram: the InputFormat divides each input file into InputSplits; a RecordReader turns each split into records that feed a Mapper, which emits intermediates]
Reading Data
•  Data sets are specified by InputFormats
– Defines input data (e.g., a directory)
– Identifies partitions of the data that form an
InputSplit
– Factory for RecordReader objects to extract (k, v)
records from the input source
InputFormat
¨ Describes the input specification for a MapReduce job
¨ MapReduce relies on the InputFormat to
¤ Validate the input specification of the job
¤ Split the input files into logical InputSplits, each of which is then assigned to an individual Mapper
¤ Provide the RecordReader implementation used to glean input records from the logical split for processing by the Mapper
¨ FileInputFormat – splits files based on file size
¤ The FileSystem block size is the upper limit
¤ The lower limit can be set via mapred.min.split.size
FileInputFormat
•  TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
•  KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
•  SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata
•  SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs
•  FileInputFormat will read all files out of a
specified directory and send them to the
mapper
•  Delegates filtering this file list to a method
subclasses may override
– e.g., Create your own “xyzFileInputFormat” to read
*.xyz from directory list
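One way to express the “xyzFileInputFormat” idea without subclassing is the input-path filter hook of the old API; a hedged sketch (assumed usage, illustrative names):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: only feed *.xyz files from the input directory to the mappers.
public class XyzPathFilter implements PathFilter {
  @Override public boolean accept(Path path) {
    return path.getName().endsWith(".xyz");
  }
}

// In the driver (assumed usage):
//   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);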
InputSplit
¨ One MapTask for each split
¨ For each split, the framework calls
¤ Setup (once)
¤ Map (once per record in the InputSplit)
¤ Cleanup (once)
¨ Intermediate values
¤ Are grouped by the framework
¤ Are passed to the Reducer
¤ Control sorting using a RawComparator class
¤ Use a Combiner class to perform local aggregation
¨ Use a Partitioner to control which outputs go to which Reducer
¨ Use CompressionCodecs to compress the intermediate output
Input Split Size
•  FileInputFormat will divide large files into chunks
– Exact size controlled by mapred.min.split.size
•  RecordReaders receive file, offset, and length of
chunk
•  Custom InputFormat implementations may
override split size – e.g., “NeverChunkFile” (see the sketch below)
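A hedged sketch of such a “NeverChunkFile”-style format, built by overriding isSplitable() on the old-API TextInputFormat (class name is illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Each file becomes exactly one split, so a single mapper sees the whole file
// regardless of mapred.min.split.size or the HDFS block size.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // never split: one InputSplit per file
  }
}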
Record Readers
•  Each InputFormat provides its own RecordReader
implementation
– Provides (unused?) capability multiplexing
•  LineRecordReader – Reads a line from a text file
•  KeyValueRecordReader – Used by
KeyValueTextInputFormat
Partitioner
•  Controls the partitioning of the intermediate map-output keys
•  The key is used to derive the partition, typically via a hash function
•  The total number of partitions is the same as the number of reduce tasks
Reduce
¨ Reduces a set of intermediate values which share a
key to a smaller set of values
¨ Has 3 phases
¤ Shuffle
n Copies the sorted output from each mapper using HTTP
across the network
¤ Sort
n Sorts reduce inputs by keys
n Shuffle and sort phases occur simultaneously
n Secondary Sort using custom functions
¤ Reduce
n Framework class Reduce for each key and collection of values
Sending Data To Reducers
•  Map function receives OutputCollector object
– OutputCollector.collect() takes (k, v) elements
•  Any (WritableComparable,Writable) can be used
WritableComparator
•  Compares WritableComparable data
– Will call WritableComparable.compare()
– Can provide fast path for serialized data
•  JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
•  Reporter object sent to Mapper allows simple
asynchronous feedback
– incrCounter(Enum key, long amount)
– setStatus(String msg)
•  Allows self-identification of input
– InputSplit getInputSplit()
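A short, hypothetical fragment of a map() method using the Reporter hooks above (the enum, counter names, and length threshold are illustrative only):

// Inside a Mapper implementation (old API); RecordQuality is a made-up counter group.
enum RecordQuality { GOOD, MALFORMED }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  if (value.toString().length() < 92) {                 // assumed minimum record length
    reporter.incrCounter(RecordQuality.MALFORMED, 1);   // asynchronous counter update
    reporter.setStatus("skipping malformed record");    // human-readable task status
    return;
  }
  reporter.incrCounter(RecordQuality.GOOD, 1);
  // ... normal map logic ...
}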
Partition and shuffle
[Diagram: each Mapper's intermediates pass through a Partitioner; shuffling routes each partition's intermediates to the corresponding Reducer]
Partitioner
•  int getPartition(key, val, numPartitions)
– Outputs the partition number for a given key
– One partition == values sent to one Reduce task
•  HashPartitioner used by default
– Uses key.hashCode() to return partition num
•  JobConf sets Partitioner implementation
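For cases where hashCode() is not the right routing rule, a custom Partitioner can be plugged in via the JobConf; a hedged sketch using the old API (the first-letter scheme is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: route keys by their first character, so all keys sharing
// an initial letter land in the same reduce task (mirrors HashPartitioner's approach).
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask with MAX_VALUE to keep the result non-negative, as HashPartitioner does.
    return (Character.toLowerCase(key.toString().charAt(0)) & Integer.MAX_VALUE) % numPartitions;
  }

  @Override
  public void configure(JobConf job) { /* no per-job configuration needed */ }
}

// Driver wiring (assumed): conf.setPartitionerClass(FirstLetterPartitioner.class);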
Reduction
•  reduce( WritableComparable key,
Iterator values,
OutputCollector output,
Reporter reporter)
•  Keys & values sent to one partition all go to the
same reduce task
•  Calls are sorted by key – “earlier” keys are
reduced and output before “later” keys
Finally: Writing The Output
[Diagram: each Reducer writes its output file through a RecordWriter supplied by the OutputFormat]
OutputFormat
•  Analogous to InputFormat
•  TextOutputFormat – Writes “key \t value \n” lines to the output file
•  SequenceFileOutputFormat – Uses a binary format
to pack (k, v) pairs
•  NullOutputFormat – Discards output
Installing Hadoop
•  http://juliensimon.blogspot.in/2011/01/installing-
hadoop-on-windows-cygwin.html
•  http://blog.benhall.me.uk/2011/01/installing-
hadoop-0210-on-windows_18.html
