
Big Data Analytics (15CS82) : Venugopala Rao A S Dept. of CSE, SMVITM, Bantakal

The document discusses the architecture and functionality of the Hadoop Distributed File System (HDFS), highlighting the roles of the NameNode and SecondaryNameNode in managing data blocks and metadata. It explains the replication process for data blocks, the significance of safe mode during NameNode startup, and how HDFS ensures data availability through redundancy. Additionally, it covers the configuration of replication factors based on the number of DataNodes in a cluster.

Big Data Analytics

(15CS82)
Venugopala Rao A S
Dept. of CSE, SMVITM, Bantakal
Module 1
• During any ongoing data transfer, the NameNode monitors the
DataNodes by listening for heartbeats sent from them.
• If the NameNode stops receiving heartbeats from a specific
DataNode, this indicates a potential node failure.
• In such a case, the NameNode starts re-replicating the
now-missing blocks.
• Because the file system is redundant, DataNodes can be taken
offline (decommissioned) for maintenance by informing the
NameNode of the DataNodes to exclude from the HDFS pool.
• The mappings between data blocks and the physical
DataNodes are not kept in persistent storage on the NameNode;
they are held in memory and rebuilt from DataNode block reports.
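
The heartbeat monitoring described above can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the class name `NameNodeSketch`, the 30-second `HEARTBEAT_TIMEOUT`, and the node and block names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical timeout for this sketch; real clusters configure their own value.
HEARTBEAT_TIMEOUT = 30.0  # seconds


@dataclass
class NameNodeSketch:
    """Toy model of heartbeat tracking; names are illustrative only."""
    last_heartbeat: dict = field(default_factory=dict)   # DataNode -> last timestamp
    block_locations: dict = field(default_factory=dict)  # block ID -> set of DataNodes

    def heartbeat(self, node, now):
        # Record the time at which a DataNode was last heard from.
        self.last_heartbeat[node] = now

    def dead_nodes(self, now):
        # A node is presumed failed once its last heartbeat is too old.
        return {n for n, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT}

    def blocks_to_rereplicate(self, now, replication=3):
        # Map each under-replicated block to the number of copies it is missing.
        dead = self.dead_nodes(now)
        missing = {}
        for block, nodes in self.block_locations.items():
            live = nodes - dead
            if len(live) < replication:
                missing[block] = replication - len(live)
        return missing
```

If three DataNodes hold a block and one of them stops sending heartbeats, `blocks_to_rereplicate` reports that the block needs one more copy.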

BDA-15CS82
• Once the NameNode starts up, each DataNode provides a
block report (which it keeps in persistent storage) to the
NameNode.
• Block reports are sent every 10 heartbeats (this interval is
configurable).
• The reports enable the NameNode to keep an up-to-date
account of all data blocks in the cluster.
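
As a rough sketch of how block reports let the NameNode rebuild its in-memory view, the function below inverts a set of reports into a block-to-DataNode map. The function name `build_block_map` and the input shape are assumptions for this example, not part of any Hadoop API.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Invert DataNode block reports into a block -> DataNodes mapping.

    block_reports: dict mapping a DataNode name to the block IDs it holds
    (a hypothetical, simplified stand-in for real block reports).
    """
    block_map = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            block_map[block_id].add(datanode)
    return dict(block_map)
```

Because the map is derived entirely from the reports, it can be rebuilt after every NameNode restart without being stored on disk.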

• SecondaryNameNode
• Present in almost all Hadoop deployments.
• Not strictly required by the NameNode, but recommended.
• The name “SecondaryNameNode” (also called CheckPointNode) is
somewhat misleading.
• It is not an active failover node and cannot replace the primary
NameNode in case of its failure.
• Its purpose is to perform periodic checkpoints that
evaluate the status of the NameNode.
• Note that the NameNode keeps all system metadata in memory
for fast access.
• It also has two kinds of disk files that track changes to the
metadata:

• The first is an image of the file system state when the
NameNode was started.
• This file begins with fsimage_* and is used only at startup by
the NameNode.
• The second is a series of modifications made to the file system
after the NameNode started.
• These files begin with edits_* and reflect the changes made
after the fsimage_* file was read.
• The SecondaryNameNode periodically downloads the fsimage and
edits files, merges them into a new fsimage, and uploads the new
fsimage file to the NameNode.
• Thus, when the NameNode restarts, the fsimage file is
reasonably up-to-date and only the edits logged since the last
checkpoint need to be applied.
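
The checkpoint step can be illustrated with a minimal sketch. Here the fsimage is modelled as a plain dictionary and the edit log as a list of tuples; both representations, the operation names, and the function names are assumptions for this example and do not reflect the real on-disk formats.

```python
def apply_edits(fsimage, edits):
    """Replay an edit log, in order, over a copy of the namespace image.

    fsimage: dict mapping path -> replication factor (toy representation).
    edits:   list of ("create", path, replication), ("delete", path)
             or ("rename", old_path, new_path) tuples (hypothetical ops).
    """
    namespace = dict(fsimage)  # copy, so the old image stays untouched
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return namespace


def checkpoint(fsimage, edits):
    """Merge image and edits into a new image and start with an empty edit log."""
    return apply_edits(fsimage, edits), []
```

After a checkpoint, a restart only has to replay the edits recorded since that point, which is why regular checkpoints keep NameNode startup short.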
• Thus, in the absence of a SecondaryNameNode, a restart of the
NameNode could take a long time because of the number of
changes that must be replayed against the file system.
• To summarize the various roles in HDFS:
• HDFS uses a master/slave model designed for large file
reading/streaming.
• The NameNode is a metadata server or “data traffic cop.”
• HDFS provides a single namespace that is managed by the
NameNode.
• Data is redundantly stored on DataNodes; there is no data on
the NameNode.
• The SecondaryNameNode performs checkpoints of the
NameNode file system’s state but is not a failover node.
• HDFS Block Replication
• We saw that, when HDFS writes a file, it is replicated across
the cluster.
• The amount of replication is based on the value of
dfs.replication in the hdfs-site.xml file.
• This default value can be overridden with the hdfs dfs -setrep
command.
• For a Hadoop cluster
• containing more than eight DataNodes, the replication value is usually
set to 3.
• of eight or fewer DataNodes but more than one DataNode, a
replication factor of 2 may be used.
• For a single machine, the replication factor is set to 1.
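
The rule of thumb above can be written down as a small helper. The function names are hypothetical, and the thresholds simply restate the guidance in these notes; the second helper only builds the argument list for `hdfs dfs -setrep` and does not run it.

```python
def suggested_replication(num_datanodes):
    """Replication factor suggested by the rule of thumb in these notes."""
    if num_datanodes < 1:
        raise ValueError("a cluster needs at least one DataNode")
    if num_datanodes == 1:
        return 1      # single machine
    if num_datanodes <= 8:
        return 2      # small cluster: two to eight DataNodes
    return 3          # more than eight DataNodes


def setrep_command(path, factor):
    """Build the argument list for overriding replication on one path."""
    return ["hdfs", "dfs", "-setrep", str(factor), path]
```

On a machine with the Hadoop client installed, the list returned by `setrep_command` could be passed to `subprocess.run`; the path used would be whatever file or directory needs a non-default factor.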

• If several machines are involved in serving a file and any
one of them goes down, the file could be rendered
unavailable.
• HDFS overcomes this problem by replicating each block
across a number of machines (three is the default).
• The HDFS default block size is often 64MB (the Hadoop 1.x
default; Hadoop 2.x and later default to 128MB).
• Note that the HDFS default block size is not the minimum
block size.
• If a 20KB file is written to HDFS, it will create a block that is
approximately 20KB in size.
• On the other hand, if an 80MB file is written to HDFS with a
64MB block size, a 64MB block and a 16MB block will be created.
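
The block arithmetic above can be checked with a short sketch. The function name `block_sizes` is an assumption for this example, and the 64MB default simply follows the figure used in these notes.

```python
KB = 1024
MB = 1024 * KB


def block_sizes(file_size, block_size=64 * MB):
    """Return the sizes of the blocks a file of file_size bytes occupies.

    Every block is full except possibly the last, which holds the remainder.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])
```

Calling `block_sizes(80 * MB)` yields one 64MB block and one 16MB block, while `block_sizes(20 * KB)` yields a single 20KB block, matching the two cases described in the text.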

• HDFS blocks are not exactly the same as the data splits used
by the MapReduce process.
• The HDFS blocks are based on size, while the splits are based
on a logical partitioning of the data.
• For example, if a file contains discrete records, the logical
split ensures that a record is not split physically across two
separate servers during processing.
• Each HDFS block may consist of one or more splits.
• An HDFS block replication example is shown in the figure.
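
To make the difference between size-based blocks and record-based splits concrete, the sketch below works on a list of record lengths. It assumes a simplified rule in which a record belongs to the split where its first byte falls, and it assumes the split size equals the block size; the function names are hypothetical and this is not the MapReduce implementation.

```python
def records_crossing_blocks(record_lengths, block_size):
    """Indices of records whose bytes straddle a fixed-size block boundary."""
    crossing, offset = [], 0
    for index, length in enumerate(record_lengths):
        last_byte = offset + length - 1
        if length > 0 and offset // block_size != last_byte // block_size:
            crossing.append(index)
        offset += length
    return crossing


def assign_records_to_splits(record_lengths, split_size):
    """Group whole records into logical splits by their starting offset."""
    splits, offset = {}, 0
    for index, length in enumerate(record_lengths):
        splits.setdefault(offset // split_size, []).append(index)
        offset += length
    return splits
```

With four 40-byte records and a 100-byte block size, the third record is stored across two blocks, yet the logical splits keep it whole inside the first split.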


• A file is broken into blocks and replicated across the cluster.

• In this case, a replication factor of 3 ensures that any one
DataNode can fail and the replicated blocks will still be available
on other nodes, after which they are re-replicated on other
DataNodes.
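
The availability claim can be illustrated with a toy placement model. Round-robin placement is used here purely as a stand-in (HDFS's real placement policy is rack-aware), and the function, node, and block names are assumptions for this sketch.

```python
def place_blocks(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    This is an illustrative stand-in, not the HDFS placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)}
    return placement


def available_blocks(placement, failed):
    """Blocks that still have at least one replica on a non-failed node."""
    failed = set(failed)
    return {block for block, nodes in placement.items() if nodes - failed}
```

With three replicas on distinct nodes, every block stays readable when one DataNode fails, which gives the NameNode time to re-replicate the lost copies elsewhere.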
• HDFS Safe Mode
• When the NameNode starts, it enters a read-only safe mode
where blocks cannot be replicated or deleted.
• Safe Mode enables the NameNode to perform two important
processes:
• 1. The previous file system state is reconstructed by loading
the fsimage file into memory and replaying the edit log.
• 2. The mapping between blocks and DataNodes is created by
waiting for enough of the DataNodes to register so that at least
one copy of the data is available.
• Not all DataNodes are required to register before HDFS exits
from Safe Mode. The registration process may continue for
some time.
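
The exit condition for Safe Mode can be sketched as a simple check: count the blocks that have at least one reported replica and compare that fraction with a threshold. The function name, the input shapes, and the 0.999 default threshold are assumptions for this example.

```python
def can_leave_safe_mode(expected_blocks, reported, threshold=0.999):
    """Decide whether enough blocks have at least one reported replica.

    expected_blocks: block IDs known from the reconstructed namespace.
    reported:        dict block ID -> set of DataNodes that reported it.
    threshold:       required fraction of blocks with a replica (assumed value).
    """
    if not expected_blocks:
        return True  # nothing to wait for in an empty file system
    safe = sum(1 for block in expected_blocks if reported.get(block))
    return safe / len(expected_blocks) >= threshold
```

The check passes as soon as enough blocks are covered, even if some DataNodes have not registered yet, which matches the behaviour described above.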
• HDFS may also enter Safe Mode for maintenance, using the
hdfs dfsadmin -safemode command, or when there is a
file system issue that must be addressed by the administrator.
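
A small helper can make the administrative command less error-prone by validating the requested action before building the argument list. The helper name is hypothetical; it only constructs the `hdfs dfsadmin -safemode` invocation for the `enter`, `leave`, `get`, and `wait` actions and does not execute it.

```python
SAFEMODE_ACTIONS = ("enter", "leave", "get", "wait")


def safemode_command(action):
    """Build the argument list for an hdfs dfsadmin -safemode call."""
    if action not in SAFEMODE_ACTIONS:
        raise ValueError(f"unsupported safemode action: {action!r}")
    return ["hdfs", "dfsadmin", "-safemode", action]
```

On a host with the Hadoop client and administrator rights, the returned list could be handed to `subprocess.run`, for example with `get` to query the current state before starting maintenance.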
