Module-2
Why Hadoop?
• The key consideration (the rationale behind its huge popularity) is:
Its capability to handle massive amounts of data, of different categories, fairly quickly.
• The other considerations are:
i. Low cost: Hadoop is an open-source framework and uses commodity hardware
(commodity hardware is relatively inexpensive and easy to obtain hardware) to
store enormous quantities of data.
ii. Computing power: Hadoop is based on a distributed computing model which
processes very large volumes of data fairly quickly. The more computing nodes
there are, the more processing power is at hand.
iii. Scalability: This boils down to simply adding nodes as the system grows and
requires much less administration.
iv. Storage flexibility: Unlike the traditional relational databases, in Hadoop data
need not be pre-processed before storing it. Hadoop provides the convenience of
storing as much data as one needs and also the added flexibility of deciding later
as to how to use the stored data. In Hadoop, one can store unstructured data like
images, videos, and free-form text.
v. Inherent data protection: Hadoop protects data and executing applications
against hardware failure. If a node fails, it automatically redirects the jobs that had
been assigned to this node to the other functional and available nodes and ensures
that distributed computing does not fail. It goes a step further to store multiple
copies (replicas) of the data on various nodes across the cluster.
• Hadoop makes use of commodity hardware, distributed file system, and distributed
computing as shown in Figure.
Hadoop framework (distributed file system, commodity hardware)
• In this new design, a group of machines is networked together; this group is known as a Cluster.
• With this new paradigm, the data can be managed with Hadoop as follows:
i. Distributes the data and duplicates chunks of each data file across several nodes,
for example, 25-30 is one chunk of data as shown in Figure.
ii. Locally available compute resource is used to process each chunk of data in
parallel.
iii. Hadoop Framework handles failover smartly and automatically.
In a distributed system, several servers are networked together. This implies that hardware
failures are to be expected. A regular hard disk may fail once in 3 years; when you have 1000
such hard disks, there is a possibility of at least a few being down every day.
Hadoop has an answer to this problem in Replication Factor (RF). Replication Factor
connotes the number of data copies of a given data item/data block stored across the network.
Replication factor
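As a hedged illustration (the path and values below are assumed, not taken from this module), the replication factor can be set cluster-wide through the dfs.replication property or changed for an individual file through the FileSystem API:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // replication factor for files created with this configuration
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (placeholder path)
        fs.setReplication(new Path("/sample/sample.txt"), (short) 2);
    }
}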
Hadoop was created by Doug Cutting, the creator of Apache Lucene (a commonly used text
search library). Hadoop has its origins in Apache Nutch (an open-source web search engine),
which is itself a part of the Lucene project.
Hadoop history
Subprojects and "contrib" modules in Hadoop also tend to have names that are unrelated to
their function, often with an elephant or other animal theme ("Pig", for example).
Hadoop Overview
Hadoop is an open-source software framework to store and process massive amounts of data in a
distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two
tasks: massive data storage and faster data processing.
Hadoop components
i. HDFS:
(a) Storage component
(b) Distributes data across several nodes
(c) Natively redundant
ii. MapReduce:
(a) Computational framework
Hadoop Ecosystem: The Hadoop Ecosystem consists of support projects that enhance the
functionality of Hadoop's core components. The ecosystem projects are as follows:
i. HIVE
ii. PIG
iii. SQOOP
iv. HBASE
v. FLUME
vi. OOZIE
vii. MAHOUT
It is conceptually divided into Data Storage Layer which stores huge volumes of data and
Data Processing Layer which processes data in parallel to extract richer and meaningful
insights from data.
Hadoop conceptual layer
ClickStream data (mouse clicks) helps you to understand the purchasing behavior of
customers. ClickStream analysis helps online marketers to optimize their product web pages,
promotional content, etc. to improve their business.
ClickStream data analysis
i. Hadoop helps to join ClickStream data with other data sources such as Customer
Relationship Management Data (Customer Demographics Data, Sales Data, and
Information on Advertising Campaigns). This additional data often provides the
much needed information to understand customer behavior.
ii. Hadoop's scalability helps you to store years of data without much
incremental cost. This helps you to perform temporal or year-over-year analysis on
ClickStream data, which your competitors may miss.
iii. Business analysts can use Apache Pig or Apache Hive for website analysis. With
these tools, you can organize ClickStream data by user session, refine it, and feed
it to visualization or analytics tools.
Client Application interacts with NameNode for metadata related activities and
communicates with DataNodes to read and write files. DataNodes converse with each other
for pipeline reads and writes.
Let us assume that the file "Sample.txt" is of size 192 MB. As per the default data block size
(64 MB), it will be split into three blocks and replicated across the nodes of the cluster based on
the default replication factor.
HDFS Daemons
NameNode
HDFS breaks a large file into smaller pieces called blocks. NameNode uses a rack ID to identify DataNodes in
the rack. A rack is a collection of DataNodes within the cluster. NameNode keeps track of the blocks of a file as
placed on various DataNodes. NameNode manages file-related operations such as read, write, create, and delete.
Its main job is managing the File System Namespace. A file system namespace is the collection of files in the
cluster. NameNode stores the HDFS namespace. The file system namespace includes the mapping of blocks to
files and file properties, and is stored in a file called FsImage. NameNode uses an EditLog (transaction log) to record every
transaction that happens to the file system metadata.
NameNode
DataNode
There are multiple DataNodes per cluster. During pipeline reads and writes, DataNodes communicate with each
other. A DataNode also continuously sends a "heartbeat" message to the NameNode to ensure the connectivity
between the NameNode and the DataNode. In case there is no heartbeat from a DataNode, the NameNode re-replicates
the blocks that were stored on that DataNode to other DataNodes in the cluster and keeps on running as if nothing had happened.
The Secondary NameNode takes a snapshot of HDFS metadata at intervals specified in the Hadoop
configuration. Since the memory requirements of Secondary NameNode are the same as NameNode, it is better
to run NameNode and Secondary NameNode on different machines. In case of failure of the NameNode, the
Secondary NameNode can be configured manually to bring up the cluster. However, the Secondary NameNode
does not record any real-time changes that happen to the HDFS metadata.
File Read
The steps involved in the File Read are as follows:
i. The client opens the file that it wishes to read from by calling open() on the
DistributedFileSystem.
ii. DistributedFileSystem communicates with the NameNode to get the locations of the
data blocks. NameNode returns the addresses of the DataNodes that the data
blocks are stored on. Subsequent to this, the DistributedFileSystem returns an
FSDataInputStream to the client to read from the file.
iii. The client then calls read() on the stream DFSInputStream, which has the addresses of the
DataNodes for the first few blocks of the file, and connects to the closest DataNode
for the first block in the file.
iv. Client calls read() repeatedly to stream the data from the DataNode.
v. When end of the block is reached, DFSInputStream closes the connection with the
DataNode. It repeats the steps to find the best DataNode for the next block and
subsequent blocks.
vi. When the client completes the reading of the file, it calls close() on the
FSDataInputStream to close the connection.
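The steps above can be seen in a short client-side sketch (a minimal sketch, assuming the file /Sample.txt from the earlier example and a Configuration whose fs.defaultFS points at HDFS):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class HdfsFileRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                      // resolves to DistributedFileSystem for an HDFS URI
        FSDataInputStream in = fs.open(new Path("/Sample.txt"));   // steps (i)-(ii): open() consults the NameNode for block locations
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);        // steps (iii)-(v): data is streamed block by block from the DataNodes
        } finally {
            IOUtils.closeStream(in);                               // step (vi): close the stream
        }
    }
}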
Anatomy of File Write
File Write
As per the Hadoop Replica Placement Strategy, the first replica is placed on the same node as the client. The
second replica is placed on a node on a different rack. The third replica is placed on the same rack as the
second, but on a different node in the rack. Once the replica locations have been set, a pipeline is built. This
strategy provides good reliability. Figure describes the typical replica pipeline.
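The write path can be sketched in the same way (a minimal sketch; the path and the payload are assumed, and the block allocation, pipelining, and replication happen transparently beneath create() and close()):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsFileWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to allocate blocks; the bytes then flow through the DataNode pipeline
        FSDataOutputStream out = fs.create(new Path("/Sample.txt"));
        try {
            out.writeUTF("Hello HDFS");    // replicated according to the replica placement strategy described above
        } finally {
            out.close();                   // close() flushes the remaining packets and completes the file
        }
    }
}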
Objective: To get the list of directories and files at the root of HDFS.
Objective: To copy a file from local file system to HDFS via copyFromLocal command.
Objective: To copy a file from Hadoop file system to local file system via copyToLocal
command.
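These three objectives are normally met from the command line with hadoop fs -ls /, hadoop fs -copyFromLocal and hadoop fs -copyToLocal. The sketch below shows the programmatic equivalents through the FileSystem API; the local and HDFS paths are placeholders, not paths from this module.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // List the directories and files at the root of HDFS (hadoop fs -ls /)
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        // Copy a file from the local file system to HDFS (hadoop fs -copyFromLocal <src> <dst>)
        fs.copyFromLocalFile(new Path("/home/user/sample.txt"), new Path("/sample.txt"));
        // Copy a file from HDFS back to the local file system (hadoop fs -copyToLocal <src> <dst>)
        fs.copyToLocalFile(new Path("/sample.txt"), new Path("/home/user/sample_copy.txt"));
    }
}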
i. Data Replication: There is absolutely no need for a client application to track all
blocks. The NameNode directs the client to the nearest replica to ensure high performance.
ii. Data Pipeline: A client application writes a block to the first DataNode in the
pipeline. Then this DataNode takes over and forwards the data to the next node in
the pipeline. This process continues for all the data blocks, and subsequently all
the replicas are written to the disk.
In MapReduce Programming, the input dataset is split into independent chunks. Map tasks
process these independent chunks completely in a parallel manner. The output produced by
the map tasks serves as intermediate data and is stored on the local disk of that server. The
output of the mappers is automatically shuffled and sorted by the framework. The MapReduce
Framework sorts the output based on keys. This sorted output becomes the input to the reduce
tasks. The reduce task produces the reduced output by combining the outputs of the various mappers.
Job inputs and outputs are stored in a file system. MapReduce framework also takes care of
the other tasks such as scheduling, monitoring, re-executing failed tasks, etc.
Hadoop Distributed File System and MapReduce Framework run on the same set of nodes.
This configuration allows effective scheduling of tasks on the nodes where data is present
(Data Locality). This in turn results in very high throughput.
There are two daemons associated with MapReduce Programming: a single master
JobTracker per cluster and one slave TaskTracker per cluster node. The JobTracker is
responsible for scheduling tasks on the TaskTrackers, monitoring the tasks, and re-executing
a task in case its TaskTracker fails. The TaskTracker executes the tasks.
• MapReduce divides a data analysis task into two parts - map and reduce.
• Figure depicts how the MapReduce Programming works.
MapReduce Example
The famous example for MapReduce Programming is Word Count. For example, consider
you need to count the occurrences of similar words across 50 files. You can achieve this using
MapReduce Programming.
Wordcount example
package com.app;
import com.infosys.WordCounterRed;   // the reducer class is declared in the com.infosys package below
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCounter {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Driver: wires the mapper and reducer together and declares the output key/value types
        Job job = Job.getInstance();
        job.setJarByClass(WordCounter.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // input/output paths passed as command-line arguments (illustrative)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
package com.app;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Emits (word, 1) for every word in the input line
    protected void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
    }
}
package com.infosys;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the counts emitted for each word by the mappers
    protected void reduce(Text word, Iterable<IntWritable> values, Context context) throws
            IOException, InterruptedException {
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count));
    }
}
SQL versus MapReduce
In Hadoop 1.0, HDFS and MapReduce are Core Components, while other components are
built around the core.
In this architecture, the map slots might be "full" while the reduce slots are empty, and vice
versa. This causes resource utilization issues, which needed to be addressed for proper resource
utilization.
HDFS Limitation
NameNode saves all its file metadata in main memory. Although main memory today is not as
small and as expensive as it used to be two decades ago, there is still a limit on the number of
objects that one can hold in the memory of a single NameNode. The NameNode can quickly
become overwhelmed as the load on the system increases.
HDFS 2 consists of two major components: (a) namespace, (b) blocks storage service. The
namespace service takes care of file-related operations, such as creating and modifying files
and directories. The block storage service handles DataNode cluster management and
replication.
HDFS 2 Features
i. Horizontal scalability.
ii. High availability.
High availability of NameNode is obtained with the help of Passive Standby NameNode. In
Hadoop 2.x, Active-Passive NameNode handles failover automatically. All namespace edits
are recorded to a shared NFS storage and there is a single writer at any point of time. Passive
NameNode reads edits from shared storage and keeps updated metadata information. In case
of Active NameNode failure, Passive NameNode becomes an Active NameNode
automatically. Then it starts writing to the shared storage.
YARN helps us to store all data in one place and interact with it in multiple ways, with
predictable performance and quality of service. It was originally architected at Yahoo.
Hadoop YARN
Fundamental Idea
The fundamental idea behind this architecture is splitting the JobTracker responsibility of resource management
and Job Scheduling/Monitoring into separate daemons. Daemons that are part of YARN Architecture are
described below.
Basic Concepts
Application: An application is a job submitted to the framework, for example, a MapReduce job.
YARN architecture
Map task takes care of loading, parsing, transforming, and filtering. The responsibility of
reduce task is grouping and aggregating data that is produced by map tasks to generate final
output. Each map task is broken into the following phases:
i. RecordReader
ii. Mapper
iii. Combiner
iv. Partitioner.
The output produced by a map task is known as intermediate keys and values. These
intermediate keys and values are sent to the reducer. The reduce tasks are broken into the
following phases:
i. Shuffle
ii. Sort
iii. Reducer
iv. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; only the computational code is moved to process the data, which saves network bandwidth.
Mapper
A mapper maps the input key-value pairs into a set of intermediate key-value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key-value pairs.
Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that
share a common key) to a smaller set of values. The Reducer has three primary phases:
Shuffle and Sort, Reduce, and Output Format.
i. Shuffle and Sort: This phase takes the output of all the partitioners and
downloads it to the local machine where the reducer is running. These
individual data pipes are then sorted by key, which produces a larger data list. The main
purpose of this sort is grouping similar words so that their values can be easily
iterated over by the reduce task.
ii. Reduce: The reducer takes the grouped data produced by the shuffle and sort
phase, applies reduce function, and processes one group at a time. The reduce
function iterates all the values associated with that key. Reducer function provides
various operations such as aggregation, filtering, and combining data. Once it is
done, the output (zero or more key-value pairs) of reducer is sent to the output
format.
iii. Output Format: The output format separates each key-value pair with a tab (by default)
and writes it out to a file using a record writer.
Figure describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word
count problem.
Input Data: What is the input that has been given to us to act upon?
Objective: Write a MapReduce program to count the occurrence of similar words in a file.
Use combiner for optimization.
Input Data:
Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session
Act: In the driver program, set the combiner class as shown below.
job.setCombinerClass(WordCounterRed.class);
Here driver class name, input path, and output path are optional arguments.
Output:
The reducer output will be stored in part-r-00000 file by default.
Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually the number
of partitions is equal to the number of reducers. The default partitioner is the hash partitioner.
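For reference, Hadoop's default hash partitioner essentially computes the partition as in the sketch below (written here for Text keys and IntWritable values):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same formula as the default HashPartitioner: hash of the key modulo the number of reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}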
Objective: Write a MapReduce program to count the occurrence of similar words in a file.
Use partitioner to partition key based on alphabets.
Input Data:
Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session
Act:
WordCountPartitioner.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Partition on the first letter of the word (the mapping below is illustrative)
        char alphabet = key.toString().toUpperCase().charAt(0);
        int partitionNumber = 0;
        switch (alphabet) {
            case 'H': partitionNumber = 1; break;
            case 'I': partitionNumber = 2; break;
            case 'P': partitionNumber = 3; break;
        }
        return partitionNumber;
    }
}
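For the custom partitioner to take effect, it has to be registered in the driver program along with a matching number of reduce tasks, as sketched below (the count of four reducers is assumed, to match the illustrative mapping above):
job.setPartitionerClass(WordCountPartitioner.class);
job.setNumReduceTasks(4);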
Output:
Searching
Objective: To write a MapReduce program to search for a specific keyword in a file.
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
Act:
WordSearcher.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordSearcher {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Driver: the keyword to search for is passed to the mappers through the job configuration
        Job job = Job.getInstance(new Configuration(), "word searcher");
        job.setJarByClass(WordSearcher.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(WordSearchMapper.class);
        job.setReducerClass(WordSearchReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(1);
        job.getConfiguration().set("keyword", "Jack");
        FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/search"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
WordSearchMapper.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String keyword;
    private int pos = 0;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration configuration = context.getConfiguration();
        keyword = configuration.get("keyword");
    }
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {
        InputSplit i = context.getInputSplit();      // to get the name of the file being searched
        FileSplit f = (FileSplit) i;
        pos++;                                       // line number within this split
        if (value.toString().contains(keyword)) {
            Integer wordPos = value.find(keyword);   // character position of the keyword in the line
            // Emit an informational key (message format is illustrative) and the position as the value
            context.write(new Text(keyword + " found at line " + pos + " in file "
                    + f.getPath().getName()), new Text(wordPos.toString()));
        }
    }
}
WordSearchReducer.java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {
    // Identity reducer: simply forwards each (key, value) pair produced by the mapper
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
            InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
Output:
Sorting
Objective: To write a MapReduce program to sort data by student name (value).
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
Act:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SortStudNames {
    public static class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Input record: id,name,marks - emit the name as the key so the framework sorts by it
            String[] fields = value.toString().split(",");
            context.write(new Text(fields[1]), value);
        }
    }
    public static class SortReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text name, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Records arrive grouped and sorted by name; write them out without a key
            for (Text details : values) {
                context.write(NullWritable.get(), details);
            }
        }
    }
    public static void main(String[] args) throws IOException, InterruptedException,
            ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sort student names");
        job.setJarByClass(SortStudNames.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/search"));   // output directory as reported in the Output section below
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Output:
File: /mapreduce/output/search/part-r-00000
Compression
In MapReduce programming, you can compress the MapReduce output file. Compression
provides two benefits as follows:
i. Reduces the space to store files.
ii. Speeds up data transfer across the network.
You can specify compression format in the Driver Program as shown below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
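Equivalently, with the new (org.apache.hadoop.mapreduce) API, compression can be requested through the FileOutputFormat helpers; a brief sketch, assuming job is the Job object configured in the driver:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);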