Big Data Analytics (15CS82)
Venugopala Rao A S
Dept. of CSE, SMVITM, Bantakal
Module 1
• During any ongoing data transfer, the NameNode monitors the
DataNodes by listening for heartbeats sent from them.
• If the NameNode does not receive a heartbeat from a specific
DataNode, this indicates a potential node failure.
• In such a case, the NameNode will start re-replicating the now-
missing blocks.
• Because the file system is redundant, DataNodes can be taken
offline (decommissioned) for maintenance by informing the
NameNode which DataNodes to exclude from the HDFS pool, as
sketched below.
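• A minimal decommissioning sketch, assuming the NameNode's
dfs.hosts.exclude property points at /etc/hadoop/conf/dfs.exclude
(the path and hostname below are illustrative):

    # Add the DataNode to the exclude file read by the NameNode
    echo "dn07.cluster.local" >> /etc/hadoop/conf/dfs.exclude

    # Tell the NameNode to re-read its host lists; decommissioning begins
    hdfs dfsadmin -refreshNodes

    # Watch the node move from "Decommission in progress" to "Decommissioned"
    hdfs dfsadmin -report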
• The mappings between data blocks and the physical
DataNodes are not kept in persistent storage on the NameNode.
• When the NameNode starts up, each DataNode provides a
block report (built from the block data each DataNode keeps in
its own persistent storage) to the NameNode.
• The block reports are sent every 10 heartbeats (this interval is
configurable).
• The reports enable the NameNode to keep an up-to-date
account of all data blocks in the cluster.
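• The intervals involved can be inspected with the hdfs getconf
tool; a small sketch, assuming the standard Hadoop 2.x property
names:

    # Seconds between DataNode heartbeats
    hdfs getconf -confKey dfs.heartbeat.interval

    # Milliseconds between full block reports
    hdfs getconf -confKey dfs.blockreport.intervalMsec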
• SecondaryNameNode
• Available in almost all Hadoop deployments.
• Not explicitly required by the NameNode, but recommended.
• The name “SecondaryNameNode” (CheckPointNode) is
somewhat misleading.
• It is not an active failover node and cannot replace the primary
NameNode in case of its failure.
• Its purpose is to perform periodic checkpoints that evaluate
the status of the NameNode.
• Note that the NameNode keeps all system metadata in memory
for fast access.
• It also has two disk files that track changes to the metadata:
• The first is an image of the file system state when the NameNode
was started.
• This file begins with fsimage_* and is used only at startup by
the NameNode.
• The second is a series of modifications made to the file system
after the NameNode started.
• These files begin with edits_* and reflect the changes made
after the fsimage_* file was read, as shown in the listing below.
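• Both kinds of files can be seen by listing the NameNode's
metadata directory; a sketch, assuming a typical
dfs.namenode.name.dir location (path and transaction IDs are
illustrative):

    ls /var/lib/hadoop-hdfs/namenode/current
    # fsimage_0000000000000000042                    <- state at last checkpoint
    # fsimage_0000000000000000042.md5
    # edits_0000000000000000043-0000000000000000051  <- finalized edit segments
    # edits_inprogress_0000000000000000052           <- changes since the last roll
    # seen_txid
    # VERSION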
• The SecondaryNameNode periodically downloads the fsimage and
edits files, merges them into a new fsimage, and uploads the new
fsimage file to the NameNode.
• Thus, when the NameNode restarts, the fsimage file is
reasonably up to date and requires only the edit logs recorded
since the last checkpoint to be applied.
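• A checkpoint can also be forced by hand with dfsadmin; a
minimal sketch (entering safe mode pauses namespace changes, so
this is best tried on a quiet or test cluster):

    hdfs dfsadmin -safemode enter    # stop accepting namespace changes
    hdfs dfsadmin -saveNamespace     # merge the edits into a new on-disk fsimage
    hdfs dfsadmin -safemode leave    # resume normal operation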
• Thus, in the absence of a SecondaryNameNode, a restart of the
NameNode could take a long time due to the number of changes
made to the file system since the last checkpoint.
• To summarize the various roles in HDFS:
• HDFS uses a master/slave model designed for large file
reading/streaming.
• The NameNode is a metadata server or “data traffic cop.”
• HDFS provides a single namespace that is managed by the
NameNode.
• Data is redundantly stored on DataNodes; there is no data on
the NameNode.
• The SecondaryNameNode performs checkpoints of the
NameNode's file system state but is not a failover node.
• HDFS Block Replication
• We saw that when HDFS writes a file, the file is replicated across
the cluster.
• The amount of replication is based on the value of
dfs.replication in the hdfs-site.xml file.
• This default value can be overridden with the hdfs dfs -setrep
command (see the sketch after this list).
• For a Hadoop cluster
  • containing more than eight DataNodes, the replication value is usually
  set to 3
  • containing eight or fewer DataNodes (but more than one), the
  replication factor may be set to 2
• For a single machine, the replication factor is set to 1
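• For example, the replication factor of a single file can be
changed and verified as follows (the path /user/demo/data.txt is
illustrative):

    # Set replication to 2 and wait (-w) for re-replication to finish
    hdfs dfs -setrep -w 2 /user/demo/data.txt

    # Confirm the change; %r prints the file's replication factor
    hdfs dfs -stat %r /user/demo/data.txt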
• If several machines are involved in serving a file and any one
of those machines goes down, the file could be rendered
unavailable.
• HDFS overcomes this problem by replicating each block
across a number of machines (three is the default).
• The HDFS default block size is often 64MB.
• Note that the HDFS default block size is not the minimum
block size.
• If a 20KB file is written to HDFS, it will create a block that is
approximately 20KB in size.
• On the other hand, if a file of size 80MB is written to HDFS, a
64MB block and a 16MB block will be created
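• The blocks behind a file can be inspected with fsck; for an
80MB file and a 64MB block size, the report shows two blocks,
roughly as sketched here (file name and block IDs are
illustrative):

    hdfs fsck /user/demo/80mb.dat -files -blocks
    # /user/demo/80mb.dat 83886080 bytes, 2 block(s):  OK
    # 0. BP-...:blk_..._... len=67108864
    # 1. BP-...:blk_..._... len=16777216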
• HDFS blocks are not exactly the same as the data splits used
by the MapReduce process.
• The HDFS blocks are based on size, while the splits are based
on a logical partitioning of the data.
• That is, if a file contains discrete records, the logical split
ensures that a record is not split across two separate servers
during processing.
• Each HDFS block may consist of one or more splits (see the
job-level sketch below).
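• Because splits belong to the MapReduce job rather than to
HDFS, their size can be tuned per job; a sketch using the stock
wordcount example (the jar and HDFS paths are assumptions):

    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount \
        -D mapreduce.input.fileinputformat.split.minsize=134217728 \
        /user/demo/input /user/demo/output
    # With a 64MB block size, a 128MB minimum split size makes each map
    # task process two HDFS blocks as a single logical split.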
• An example of HDFS block replication is shown in the
accompanying figure.