Unit 3 Part 1
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over
several machines and replicated to ensure durability in the face of failures and high availability
for parallel applications.
It is cost effective because it uses commodity hardware. HDFS is built around the concepts of
blocks, data nodes and the name node. However, HDFS is not a good fit in the following cases:
o Low-latency data access: Applications that require very fast access to the first
record should not use HDFS, because HDFS is optimized for high throughput over the
whole data set rather than the time taken to fetch the first record.
o Lots of small files: The name node holds the metadata of all files in memory, so a
very large number of small files consumes a disproportionate amount of the name
node's memory, which is not feasible.
o Multiple writes: HDFS should not be used when files need to be written by multiple
writers or modified repeatedly at arbitrary points; HDFS files are written once and
only appended to.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike in an ordinary file system, a file in
HDFS that is smaller than the block size does not occupy a full block's worth of storage;
for example, a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space
(see the sketch after this list). The HDFS block size is kept large to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the
master. The name node is the controller and manager of HDFS, as it knows the status and
the metadata of all the files in HDFS; this metadata includes file permissions, names and
the location of each block. The metadata is small, so it is stored in the memory of the
name node, allowing faster access to it. Moreover, since the HDFS cluster is accessed by
multiple clients concurrently, all this information is handled by a single machine. File
system operations such as opening, closing and renaming are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by clients or the
name node. They report back to the name node periodically with the list of blocks that
they are storing. The data nodes, which run on commodity hardware, also perform block
creation, deletion and replication as instructed by the name node.
4. Secondary Name Node: Since all the metadata is stored in the name node, the name node
is very important. If it fails, the file system cannot be used, as there would be no way of
knowing how to reconstruct the files from the blocks present on the data nodes. To
overcome this, the concept of the secondary name node was introduced. It is a separate
physical machine which acts as a helper to the name node. It performs periodic
checkpoints: it communicates with the name node and takes snapshots of the metadata,
which helps minimize downtime and loss of data.
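To make the point about blocks in item 1 concrete, here is a minimal Java sketch (the file path is a hypothetical example) that prints a file's actual length alongside the block size recorded for it; a 5 MB file reports a length of about 5 MB even though its block size is 128 MB.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the file system configured for this client (HDFS in a cluster setup)
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical small file assumed to already exist in HDFS
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/small.txt"));

        System.out.println("actual length (bytes): " + status.getLen());       // ~5 MB for a 5 MB file
        System.out.println("block size    (bytes): " + status.getBlockSize()); // 134217728 = 128 MB by default
    }
}
```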
HDFS Read Operation: A data read request is served by HDFS, the NameNode and the DataNodes. Let us
call the reader a ‘client’. The steps below describe the file read operation in Hadoop.
1. A client initiates a read request by calling the ‘open()’ method of the FileSystem object;
in HDFS this is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as
the locations of the blocks of the file. Note that these addresses are for the first few
blocks of the file. In response to this metadata request, the addresses of the DataNodes
holding a copy of each block are returned.
3. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream wraps a DFSInputStream, which takes care of
the interactions with the DataNodes and the NameNode. The client then invokes the
‘read()’ method, which causes DFSInputStream to establish a connection with the first
DataNode holding the first block of the file.
4. Data is read in the form of streams: the client invokes the ‘read()’ method repeatedly,
and this read() process continues until the end of the block is reached.
5. Once the end of a block is reached, DFSInputStream closes the connection and moves
on to locate the next DataNode for the next block.
6. Once the client has finished reading, it calls the close() method. A minimal Java sketch of this read path follows.
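The sketch below illustrates the client-side read path described above under simple assumptions: the file path is hypothetical, and FileSystem.get() returns a DistributedFileSystem instance when fs.defaultFS points at an HDFS cluster. It is an illustration, not the internal Hadoop code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // a DistributedFileSystem for an HDFS URI

        Path file = new Path("/user/hadoop/temp.txt"); // hypothetical file
        try (FSDataInputStream in = fs.open(file)) {   // steps 1-3: open() returns an FSDataInputStream
            // steps 4-5: read() is invoked repeatedly (here via copyBytes) until all blocks are consumed
            IOUtils.copyBytes(in, System.out, 4096, false);
        }                                              // step 6: close() via try-with-resources
    }
}
```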
We can run ‘$HADOOP_HOME/bin/hdfs dfs -help’ to get detailed help on every command.
Here, ‘dfs’ is a shell command of HDFS which supports multiple subcommands.
Some of the widely used commands are listed below along with some details of each one.
For example, ‘$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /’ copies the file temp.txt from the
local filesystem to HDFS. Listing the root directory with ‘$HADOOP_HOME/bin/hdfs dfs -ls /’ then shows
the file ‘temp.txt’ (copied earlier) under the ‘/’ directory.
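The same two operations can also be performed through the Java FileSystem API. This is a minimal sketch and the paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyAndList {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // roughly equivalent to: hdfs dfs -copyFromLocal temp.txt /
        fs.copyFromLocalFile(new Path("temp.txt"), new Path("/temp.txt"));

        // roughly equivalent to: hdfs dfs -ls /
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```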
Below are some of Hadoop's drawbacks, along with solutions showing how each problem can be overcome.
Reading through these limitations and their solutions will help you work smoothly with Hadoop.
• Issues with Small Files
• Slow Processing Speed
• No Real-Time Data Processing
• No Iterative Processing
• Ease of Use
• Security Problem
Let’s discuss these Hadoop limitations in detail -
1. Issues with Small Files
The idea behind Hadoop was to have a small number of large files; if you have many small files,
Hadoop cannot manage them well. Small files are those whose size is much smaller than the block
size in Hadoop. Each file, directory, and block occupies a memory element inside the NameNode’s
memory, and as a rule of thumb this memory element is about 150 bytes. So 10 million files, each
using a block, would occupy roughly 10,000,000 x 150 bytes ≈ 1.39 GB of memory, and scaling far
beyond this level is not possible with current hardware. Retrieving small files is also very
inefficient in Hadoop: at the back end it causes many disk seeks and constant hopping from one
data node to another to fetch each small file.
Solution
Hadoop Archives, or HAR files, are one solution to the small files problem. Hadoop archives act as
another layer of file system over Hadoop, and with the Hadoop archive command we can build HAR files.
This command runs a MapReduce job in the background to pack the archived files into a small number
of HDFS files. But reading through HAR files is not much more efficient than reading through HDFS,
because each file access requires reading two index files and then, finally, the data file itself.
Sequence files are another solution to the small files problem. In this approach, we write a program
to merge a number of small files into one sequence file (a minimal sketch is given below). We then
process this sequence file in a streaming fashion; MapReduce can break the sequence file into chunks
and process the chunks in parallel.
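The Java sketch below packs each small local file into one SequenceFile entry, keyed by the original file name. The local directory and HDFS output path are hypothetical, and this is only one possible way to build such a file.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/user/hadoop/smallfiles.seq");   // hypothetical HDFS output

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // hypothetical local directory holding the many small files
            for (File f : new File("small-files").listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                // key = original file name, value = raw file contents
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        }
    }
}
```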
2. Slow Processing Speed
In Hadoop, MapReduce reads and writes data to and from disk. For every stage of processing, the
data gets read from disk and written back to disk. These disk seeks take time, making the whole
process very slow. If Hadoop processes data in small volumes, it is comparatively very slow; it is
ideal for large data sets. Because Hadoop has a batch processing engine at its core, its speed for
real-time processing is low, and it is slow in comparison with newer frameworks such as Spark and Flink.
Solution
Spark is the solution to the slow processing speed of MapReduce. It does in-memory computation,
which makes it up to a hundred times faster than Hadoop. While processing, Spark reads data from
RAM and writes data to RAM, thereby making it a fast processing tool.
Flink is one more technology that is faster than Hadoop MapReduce, as it also does in-memory
computation. Flink is even faster than Spark, due to the stream processing engine at its core.
3. No Real-Time Data Processing
Hadoop, with its core MapReduce framework, is unable to process real-time data. Hadoop processes
data in batches: first the user loads a file into HDFS, then runs a MapReduce job with the file as
input. It follows the ETL cycle of processing: the user extracts the data from the source, the data
gets transformed to meet the business requirements, and finally it is loaded into the data warehouse.
The users can then generate insights from this data, but this batch-oriented cycle takes far too long
for real-time requirements.
Solution
Spark has come up as a solution to the above problem. Spark supports real-time processing: it
processes the incoming streams of data by forming micro-batches and then applying transformations
on these micro-batches.
Flink is one more solution for the slow processing speed. It is even faster than Spark, as it has a
stream processing engine at its core. Flink is a true streaming engine with adjustable latency and
throughput, and it has a rich set of APIs that exploit the streaming runtime.
4. No Iterative Processing
Core Hadoop does not support iterative processing. Iterative processing requires a cyclic data flow,
in which the output of a previous stage serves as the input to the next stage. Hadoop MapReduce does
not support this: the data gets written to disk once and is then read multiple times to get insights.
Hadoop's MapReduce has a batch processing engine at its core and is not able to iterate through
data.
Solution
Spark supports iterative processing. In Spark, each iteration gets scheduled and executed separately;
it accomplishes iterative processing through a DAG, i.e. a Directed Acyclic Graph. Spark has RDDs,
or Resilient Distributed Datasets, which are collections of elements partitioned across the nodes of
the cluster. Spark creates RDDs from HDFS files, and we can also cache them, allowing the RDDs to be
reused. Iterative algorithms apply operations repeatedly over this cached data, as in the sketch below.
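A minimal Java sketch of this idea follows; the HDFS path and the filter condition are hypothetical. The RDD is built once from an HDFS file, cached, and then reused across several iterations instead of being re-read from disk each time.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build an RDD from an HDFS file and keep it in memory for reuse
            JavaRDD<String> lines = sc.textFile("hdfs:///user/hadoop/data.txt").cache();

            long total = 0;
            for (int i = 0; i < 10; i++) {
                // each iteration reuses the cached partitions instead of re-reading HDFS
                total += lines.filter(line -> line.contains("error")).count();
            }
            System.out.println("total matches over 10 iterations: " + total);
        }
    }
}
```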
Flink also supports iterative processing. Flink iterates over data using its streaming architecture,
and we can instruct Flink to process only the data that has changed, thereby improving performance.
Flink implements iterative algorithms by defining a step function and embedding it into a special
iteration operator. The two variants of this operator are iterate and delta iterate; both apply the
step function over and over again until a termination condition is met.
5. Ease of Use
In Hadoop, we have to hand-code each and every operation. This has two drawbacks: first, it is
difficult to use, and second, it increases the number of lines of code. There is no interactive mode
available with Hadoop MapReduce, which also makes it difficult to debug, because it runs in batch
mode. In this mode we have to specify the jar file, the input, and the location of the output file,
and if the program fails somewhere in between, it is difficult to find the culprit code.
Solution
Spark is easy for the user as compared to Hadoop. This is because it has many APIs for Java,
Scala, Python, and Spark SQL. Spark performs batch processing, stream processing and
machine learning on the same cluster. This makes life easy for users. They can use the same
infrastructure for various workloads.
In Flink, a number of high-level operators are available, which reduces the number of lines of code
needed to achieve the same result.
6. Security Problem
Hadoop does not implement encryption and decryption at the storage or network levels, so it is not
very secure. For security, Hadoop relies on Kerberos authentication, which is difficult to maintain.
Solution
Spark encrypts temporary data written to local disk, but it does not support encryption of application
output data. Spark implements AES-based encryption for RPC connections; RPC authentication should
also be enabled for this to take effect.
Benefits of HDFS
1. Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-
shelf hardware, which cuts storage costs. Also, because HDFS is open source,
there's no licensing fee.
2. Large data set storage. HDFS stores a variety of data of any size -- from megabytes
to petabytes -- and in any format, including structured and unstructured data.
3. Fast recovery from hardware failure. HDFS is designed to detect faults and
automatically recover on its own.
4. Streaming data access. HDFS is built for high data throughput, which is best suited
to access to streaming data.
HDFS is designed to support applications that deal with large data sets. These applications write
their data only once, but they read it one or more times and require these reads to be satisfied at
streaming speeds; HDFS therefore supports write-once-read-many semantics on files.
A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks,
and if possible, each chunk resides on a different DataNode.
HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data
to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in
a set of DataNodes. The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These
machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language;
any machine that supports Java can run the NameNode or the DataNode software. Usage of the
highly portable Java language means that HDFS can be deployed on a wide range of machines. A
typical deployment has a dedicated machine that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of the DataNode software. The architecture does not
preclude running multiple DataNodes on the same machine but in a real deployment that is rarely
the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in
such a way that user data never flows through the NameNode.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar to
most other existing file systems; one can create and remove files, move a file from one directory to
another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not
support hard links or soft links. However, the HDFS architecture does not preclude implementing
these features.
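A minimal Java sketch of these namespace operations (the paths are hypothetical): creating a directory, moving a file into it with rename, and deleting it again.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/hadoop/reports"));                  // create a directory
        fs.rename(new Path("/user/hadoop/temp.txt"),
                  new Path("/user/hadoop/reports/temp.txt"));         // move/rename a file
        fs.delete(new Path("/user/hadoop/reports/temp.txt"), false);  // remove a file (non-recursive)
    }
}
```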
The NameNode maintains the file system namespace. Any change to the file system namespace or
its properties is recorded by the NameNode. An application can specify the number of replicas of a
file that should be maintained by HDFS. The number of copies of a file is called the replication factor
of that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and
replication factor are configurable per file.
All blocks in a file except the last block are the same size, while users can start a new block without
filling out the last block to the configured block size after the support for variable length block was
added to append and hsync.
An application can specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. Files in HDFS are write-once (except for appends and
truncates) and have strictly one writer at any time.
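As a minimal Java illustration (the path and contents are hypothetical), the replication factor can be specified when a file is created and changed later through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/report.csv");   // hypothetical file

        // specify a replication factor of 3 at creation time
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeBytes("hello,hdfs\n");
        }

        // change the replication factor of the existing file later
        fs.setReplication(file, (short) 2);
    }
}
```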
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.
Replica Placement
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica
placement distinguishes HDFS from most other distributed file systems. This is a feature that needs
lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve
data reliability, availability, and network bandwidth utilization. The current implementation for the
replica placement policy is a first effort in this direction. The short-term goals of implementing this
policy are to validate it on production systems, learn more about its behavior, and build a foundation
to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks.
Communication between two nodes in different racks has to go through switches. In most cases,
network bandwidth between machines in the same rack is greater than network bandwidth between
machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop
Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents
losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading
data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on
component failure. However, this policy increases the cost of writes because a write needs to transfer
blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put one
replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another
replica on a node in a different (remote) rack, and the last on a different node in the same remote
rack. This policy cuts the inter-rack write traffic which generally improves write performance. The
chance of rack failure is far less than that of node failure; this policy does not impact data reliability
and availability guarantees. However, it does reduce the aggregate network bandwidth used when
reading data since a block is placed in only two unique racks rather than three. With this policy, the
replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two
thirds of replicas are on one rack, and the other third are evenly distributed across the remaining
racks. This policy improves write performance without compromising data reliability or read
performance.
If the replication factor is greater than 3, the placement of the 4th and following replicas is
determined randomly while keeping the number of replicas per rack below the upper limit (which is
basically (replicas - 1) / racks + 2).
Because the NameNode does not allow DataNodes to have multiple replicas of the same block, the
maximum number of replicas created is the total number of DataNodes at that time.
After the support for Storage Types and Storage Policies was added to HDFS, the NameNode takes
the policy into account for replica placement in addition to the rack awareness described above. The
NameNode chooses nodes based on rack awareness at first, then checks that the candidate node
has the storage required by the policy associated with the file. If the candidate node does not have
the storage type, the NameNode looks for another node. If enough nodes to place replicas cannot be
found in the first path, the NameNode looks for nodes having fallback storage types in the second
path.
The current, default replica placement policy described here is a work in progress.
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request
from a replica that is closest to the reader. If there exists a replica on the same rack as the reader
node, then that replica is preferred to satisfy the read request. If HDFS cluster spans multiple data
centers, then a replica that is resident in the local data center is preferred over any remote replica.
Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does
not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and
Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a
DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered
safely replicated when the minimum number of replicas of that data block has checked in with the
NameNode. After a configurable percentage of safely replicated data blocks checks in with the
NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then
determines the list of data blocks (if any) that still have fewer than the specified number of replicas.
The NameNode then replicates these blocks to other DataNodes.
The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the
EditLog to persistently record every change that occurs to file system metadata. For example,
creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this.
Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog.
The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system
namespace, including the mapping of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the
FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can
then truncate the old EditLog because its transactions have been applied to the persistent FsImage.
This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a
consistent view of the file system metadata by taking a snapshot of the file system metadata and
saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make
incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the
edits in the Editlog. During the checkpoint the changes from Editlog are applied to the FsImage. A
checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period)
expressed in seconds, or after a given number of filesystem transactions have accumulated
(dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be
reached triggers a checkpoint.
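As a small illustrative Java sketch, these two checkpoint thresholds can be inspected from the loaded Hadoop configuration. The fallback values shown are the commonly cited defaults and are an assumption here, not values read from any particular cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        // Loads the Hadoop configuration resources available on the classpath
        Configuration conf = new Configuration();

        // Fallbacks below are commonly cited defaults (assumption; verify against your hdfs-default.xml)
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        long txnThreshold  = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);

        System.out.println("checkpoint every " + periodSeconds + " s or " + txnThreshold + " transactions");
    }
}
```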
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge
about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The
DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine
the optimal number of files per directory and creates subdirectories appropriately. It is not optimal
to create all local files in the same directory because the local file system might not be able to
efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans
through its local file system, generates a list of all the HDFS data blocks that correspond to each of
these local files, and sends this report to the NameNode.