
Hadoop Distributed File System

1. The origin of HDFS
2. What can we do using HDFS
3. HDFS follows master-slave architecture
4. Hadoop cluster
5. Hadoop server roles
6. Daemon
7. Block
8. Name node
9. Data node
10. Secondary name node
11. A general use case
12. Write files into HDFS
13. Read from HDFS
14. Heartbeat mechanism
15. HDFS goals
1. The origin of HDFS

 The full form of HDFS is Hadoop Distributed File System.


 It was created based on the Google File System (GFS).
 HDFS is written in Java programming language.
 Google provided only a white paper, without any implementation.
 So, the GFS architecture was applied in the HDFS implementation.

2. What can we do using HDFS?

 Let’s take a basic formula,

Hadoop = Store + Process

Hadoop 1.0 = HDFS + MapReduce

Hadoop 2.0 = HDFS + YARN

HDFS = Hadoop Distributed File System

YARN = Yet Another Resource Negotiator

Resource allocation + processing

Make a note:

 So,
o Hadoop 1.0 = HDFS + MapReduce
o But please don't rearrange this like a maths formula, i.e.
 HDFS = Hadoop - MapReduce

Purpose of HDFS

 HDFS is used only for storing the data.


 MapReduce is used to process the data.
 HDFS is a file system specially designed to store large data sets.
 HDFS has been designed to run on commodity hardware (i.e., less expensive machines).

Features

 Highly scalable
 Distributed
 Load-balanced
 Portable
 Fault-tolerant storage system
3. HDFS follows master-slave architecture

 HDFS follows master-slave architecture.


 Master node gives instructions to slave nodes.
 Slave nodes receive work as per the master node's instructions.
 An HDFS cluster consists of a Name node and Data nodes.
 The Name node is the master node and the Data nodes are the slave nodes.
4. Hadoop cluster

Node

 Any individual computer (hardware) that is functioning properly.

Rack

 A group of interconnected nodes.

Switch

 A switch enables communication between the nodes in a rack (and between racks).

Cluster

 A group of interconnected Racks.

Info

 A Hadoop cluster contains a group of racks.


 The Name node, Job Tracker, Secondary name node, Data nodes, Task trackers and clients are all present across these racks.
 There is only one Name node per cluster.
 There can be many Data nodes and Task trackers in a cluster, to store the actual data and to process it.
5. Roles and responsibilities of each node

There are mainly three types of machines in a Hadoop deployment.

1. Master nodes.
2. Slave nodes.
3. Client nodes.
1. Master node:

 The master node mainly takes care of two things:


o How and where the data is stored.
o How to process the stored data in a parallel way.

 The master node runs the master daemons.
 The master daemons are the Name node and the Job Tracker.
 The Name node coordinates the data storage (HDFS).
 The Job Tracker coordinates the parallel processing of the data (MapReduce).

2. Slave nodes

 Slave nodes store the actual data (raw data) and run the computations over that data.
 Slave nodes run the slave daemons (background processes).
 The slave daemons are the Data node and the Task tracker.
 The Data node is slave to the Name node.
o The Data node communicates with the Name node to receive instructions.
 The Task tracker is slave to the Job tracker.
o The Task tracker communicates with the Job tracker to receive instructions.
 So, the slave daemons work as per the master daemons' instructions.

3. Client nodes

 The main role of a client node is to load the data into the cluster.
 It also submits the MapReduce jobs.
 A MapReduce job describes how that data should be processed.
 The client node receives the final results from the finished jobs.
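
To make the client node's role concrete, here is a minimal sketch (not part of the original material) that loads a local file into the cluster using the standard Hadoop FileSystem API; the host name, port, and paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadFileToCluster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        // Load a local file into the cluster; HDFS splits it into blocks
        // and stores those blocks on the data nodes.
        fs.copyFromLocalFile(new Path("/local/data/File.txt"),
                             new Path("/user/hadoop/File.txt"));
        fs.close();
    }
}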
6. Daemon

 The word daemon comes from the UNIX world.


 Another name for a daemon is a process or a service.
 A daemon runs in the background.
 On the Windows platform, a daemon is called a service.
 In Hadoop, the Name node and Data node are daemons.

7. Block

 Internally, a file is split into one or more blocks.


 In Hadoop 1.0 the default block size is 64MB.
 In Hadoop 2.0 the default block size is 128MB
 We can customize this size at the configuration level.
 These split blocks are stored in the data nodes.

 Each block is stored on different data nodes to achieve fault tolerance.
 Hadoop maintains a replication factor; by default the replication factor is 3.
 We can customize this value (a sketch follows this list)
o At the cluster level
o At file creation
o At a later stage, for an already stored file
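
As a rough illustration of those three customization points, the sketch below uses the Hadoop FileSystem API; the property names dfs.replication and dfs.blocksize are standard HDFS settings, while the paths and values are made-up examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationAndBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Cluster/client level: override the defaults (normally set in hdfs-site.xml).
        conf.set("dfs.replication", "3");
        conf.set("dfs.blocksize", "134217728");          // 128 MB

        FileSystem fs = FileSystem.get(conf);

        // At file creation: pass the replication factor and block size explicitly.
        fs.create(new Path("/user/hadoop/sample.txt"),   // hypothetical path
                  true,                                  // overwrite if it exists
                  4096,                                  // client buffer size
                  (short) 2,                             // replication for this file only
                  64L * 1024 * 1024)                     // 64 MB block size for this file only
          .close();

        // Later stage: change the replication factor of an already stored file.
        fs.setReplication(new Path("/user/hadoop/sample.txt"), (short) 3);

        fs.close();
    }
}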
8. Name node

Name node,

 What does it store?
 What does it not store?
 What is it responsible for?
 How many name nodes?
 Production environment?
 If the heart fails, then what is the result?
 Why does the Name node need an expensive machine?

What does the Name node store?

 The Name node is a master daemon; it is the heart of the HDFS file system.


 The Name node stores information about all the files present in HDFS.
 This file information is also called metadata, meaning the Name node stores the metadata about the files.
 The Name node is the point of contact for any Hadoop file operation.

What the Name node does not store

 The Name node does not store the raw data or actual data.

Name node responsibilities

 We know the file blocks are stored in data nodes; these data nodes are maintained and managed by the Name node.
 A client application communicates with the Name node to do file operations like add, copy, move, and delete.
 The Name node provides the required metadata to the client.
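
The sketch below shows what "providing metadata to the client" can look like through the FileSystem API: listing file status (size, replication, block size) and issuing metadata operations such as rename and delete. The directory and file names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // These calls are answered by the Name node from its metadata;
        // no file data travels through the Name node.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.printf("%s size=%d replication=%d blockSize=%d%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getBlockSize());
        }

        // Rename and delete are also driven by the Name node's metadata.
        fs.rename(new Path("/user/hadoop/old.txt"), new Path("/user/hadoop/new.txt"));
        fs.delete(new Path("/user/hadoop/new.txt"), false);

        fs.close();
    }
}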

How many

 Only one name node per cluster.

Production

 In a production environment, the name node runs on a separate machine.

If Name node fails

 As discussed, the Name node is the heart of HDFS, so if the heart fails we know what happens next.
 The Name node is a single point of failure, meaning if the name node fails then accessing the file system is not possible at all.

Why the Name node needs an expensive machine

 The Name node should run on a high-end machine.


 Reasons why the Name node is expensive:
 The Name node is a Single Point of Failure (SPOF).
 The Name node holds metadata in memory for quick response, so more memory is required.
 The Name node organizes hundreds or thousands of data nodes and responds to client requests.
 The Name node has to maintain load balance; considering all these reasons, the Name node requires an expensive machine.
 Based on requirements, we can scale up the machine for a name node.
9. Data node

Data node,

 What does it store?
 What is it responsible for?
 Heartbeats
 Block report
 How many
 Commission
 Decommission
 Communication
 If data node fails

Store

 A data node stores the actual data or raw data.

Responsibilities

 The data node stores and manages the data blocks.


 The data node responds to the Name node for any kind of file operation.
 Data nodes are responsible for serving read and write requests from clients.

Heartbeat

 Using the heartbeat mechanism, the data node updates the Name node with its current status, including:
o Stored blocks.
o Idle blocks.
o Working status.

 The heartbeat interval is every 3 seconds.


 If the heartbeat is not received, the name node recognizes that the particular data node is down.
 When a data node goes down, the name node immediately takes responsibility for re-replicating its data.
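
The 3-second interval comes from the standard HDFS property dfs.heartbeat.interval. A small sketch, assuming the cluster's hdfs-site.xml is on the classpath, that simply reads the configured value:

import org.apache.hadoop.conf.Configuration;

public class HeartbeatInterval {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.heartbeat.interval defaults to 3 (seconds) if not overridden.
        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("DataNode heartbeat interval: " + heartbeatSeconds + " s");
    }
}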

Block report

 Every 10th heartbeat is a block report.

How many

 There can be any number of data nodes per cluster; if the data set grows, we can add more data nodes.

Commission

 Adding data nodes to the cluster is called commissioning.


Decommission

 Removing data nodes from the cluster is called decommissioning.

Communication

 One data node can communicate with another data node during replication.

If Data node fails

 When a data node goes down or fails, the name node immediately takes responsibility for re-replicating its data.

10. Secondary name node

 A secondary name node is another daemon.


 The secondary name node is not a standby name node, so it is not meant as a
backup in case of name node failure.
 The primary purpose of the secondary name node is to periodically download the fsimage and edit log file from the name node, create a new fsimage by merging the older fsimage with the edit log, and upload the new fsimage back to the name node.
 By periodically merging the namespace fsimage with the edit log, the secondary
name node prevents the edit log from becoming too large.
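
The checkpoint schedule is configurable. As a rough sketch, assuming the cluster's hdfs-site.xml is on the classpath, the standard properties dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns control how often this merge happens:

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Merge the edit log into a new fsimage every hour by default...
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        // ...or earlier, once this many uncheckpointed transactions accumulate.
        long txnLimit = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);
        System.out.println("Checkpoint every " + periodSeconds
                + " s or every " + txnLimit + " transactions");
    }
}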
11. A general use case

How is a huge file stored in HDFS?

1. If the data is small in size, then it is very easy to store and process.
2. But if the data keeps growing and reaches the BIG DATA definition, then it becomes a bit difficult to store and process.
3. So, to handle this situation, a special mechanism or technique is required.
4. When we speak about BIG DATA problems, Hadoop is the best solution.
5. Basically, Hadoop stores and processes large data and gives the results fast.
6. Hadoop follows the divide-and-conquer rule.
 Hadoop cuts the large data into pieces and spreads them out over many machines.
 Hadoop processes these pieces of data across the machines in a parallel way.
 So Hadoop gives the results extremely fast.
Example

 Assuming that we have a huge data file (100GB) containing emails sent to the
customer service department.
 So, the requirement is to find out how many times the word "Refund" was typed by customers.
 This exercise will help the business understand and improve on customer needs.
 It's a simple word count exercise.

Work flow

 The client loads the data file (File.txt) into the cluster.
 It submits the job describing how to analyze that data (word count).
 Finally, the cluster stores the result in a new file (Results.txt).
 The client reads the results file.
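
Below is a minimal MapReduce sketch of this word-count style job, assuming the standard Hadoop MapReduce API; the class names, the input/output paths taken from the command line, and the choice to count only the word "Refund" are illustrative, not from the original material.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RefundCount {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text("Refund");

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit ("Refund", 1) for every occurrence of the word in the line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                if (itr.nextToken().equalsIgnoreCase("Refund")) {
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts coming from all mappers for the word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "refund count");
        job.setJarByClass(RefundCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/hadoop/File.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/hadoop/results
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}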
12. Writing to HDFS

 When a client or application wants to write a file to HDFS, it reaches out to the name
node with details of the file.
 The name node responds with details based on the actual size of the file, block, and
replication configuration.
 These details from the name node contain the number of blocks of the file, the replication factor, and the data nodes where each block will be stored.
 In the diagram above, the giant file is divided into blocks (A, B, C, D…).

Client splits the files into blocks

 Based on information received from the name node, the client or application splits
the files into multiple blocks and starts sending them to data nodes.

The name node is never involved in the actual data transfer

 The client or application directly transfers the data to data nodes based on the
replication factor.
 The name node is not involved in the actual data transfer (data blocks don’t pass
through the name node).

How is a block stored in the cluster?

 As per the diagram Block A is transferred to data node 1 along with details of the
two other data nodes where this block needs to be stored.
 When it receives Block A from the client (assuming a replication factor of 3), data
node 1 copies the same block to the data node 2 (in this case, data node 2 of the
same rack).
 This involves a block transfer via the rack switch because both of these data nodes
are in the same rack.
 When it receives Block A from data node 1, data node 2 copies the same block to the
data node 3 (in this case, data node 3 of another rack).
 This involves a block transfer via an out-of-rack switch along with a rack switch
because both of these data nodes are in separate racks.
Data Flow Pipeline

 In fact, the data transfer from the client to data node 1 for a given block (128 MB)
will be in smaller chunks of 4KB.
 For better performance, data nodes maintain a pipeline for data transfer.
 When data node 1 receives the first 4KB chunk from the client, it stores this chunk in
its local repository and immediately starts transferring it to data node 2 in the flow.
 Likewise, when data node 2 receives the first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3.

Make a note

 Kindly read the above pipeline process twice for better understanding.

Data node confirms to the Name node

 Whenever all the data nodes have received a block, they inform the name node.
Data node confirms to client as well

 So, data node 1 sends an acknowledgment back to the client.

Make a note

 For simplicity, we explained how one block from the client is written to different data
nodes.
 But the whole process is actually repeated for each block of the file, and data
transfer happens in parallel for faster write of blocks.
All data blocks in corresponding data nodes

 In the diagram above, we can see all the blocks (A, B, C, …) in their corresponding data nodes in the cluster.
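
From the client's point of view, all of this pipelining happens behind a single output stream. Here is a minimal write sketch using the FileSystem API; the path and content are hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() asks the name node for target data nodes; the stream then
        // sends the data to those data nodes directly, chunk by chunk.
        try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/Results.txt"))) {
            out.write("Refund 42\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}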
13. Reading from HDFS

 To read a file from the HDFS, the client or application reaches out to the name node
with the name of the file.
 The name node responds with the number of blocks of the file and the data nodes where each block has been stored.
Data blocks don’t pass through name node

 Now the client or application reaches out to the data nodes directly (data blocks
don’t pass through the name node) to read the blocks of the files in parallel, based
on information received from the name node.
 When the client or application receives all the blocks of the file, it combines these blocks into the form of the original file.
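
A matching read sketch using the same FileSystem API (the path is hypothetical): open() obtains the block locations from the name node, and the stream then reads the blocks from the data nodes directly.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() returns a stream that fetches each block from a data node
        // holding a replica; the name node only supplies the block locations.
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/Results.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}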
14. Heartbeat mechanism

 On cluster startup, the name node enters a special state called safe mode.
 During this time, the name node receives a heartbeat signal (indicating which data nodes are active and functioning properly) and a block report from each data node (containing a list of all blocks on that specific data node) in the cluster.
15. Goals of HDFS

1. Horizontal scalability
2. Fault tolerance
3. Capability to run on commodity hardware
4. Write once, read many times
5. Capacity to handle large data sets
6. Data locality

1. Horizontal scalability

 HDFS is based on a scale-out model.


 We can scale out to thousands of nodes to store terabytes or petabytes of data.
 As the data increases, we can increase the data nodes.
 Increasing the data nodes will give additional storage and more processing power.

2. Fault tolerance

 HDFS assumes that failures (Hardware and software) are very common.
 Even when failures occur, HDFS by default provides data replication.
 Rule
o By default, Hadoop creates three copies of the data.
o Two copies on the same rack and one copy on a different rack.
o Even if a rack fails, we will not lose the data.
o If one copy of the data is not accessible or gets corrupted, there is no need to worry.
o The framework itself takes care of providing high availability of the data.

If you still don't understand fault tolerance, here is a short definition for you:

 Hardware failures are very common.


 So, instead of relying on hardware to deliver high availability of the data, it is better to rely on a framework that is designed to handle failures and still deliver the expected services with built-in recovery.

3. Capability to run on commodity hardware

 HDFS runs on commodity hardware, meaning we can use low-cost hardware to store large data.
 An RDBMS is more expensive for storing and processing the data.
4. Write once, read many times

 HDFS is based on the concept of write once, read many times, meaning once data is written it will not be modified.
 HDFS focuses on retrieving the data in fastest possible way.
 HDFS was originally designed for batch processing.

5. Capable to handle large data set

 HDFS is best suited to storing large data sets of sizes in GB, TB, PB, and beyond.

6. Data locality

 The Data node and Task tracker are present on the slave nodes in a Hadoop cluster.
 The Data node is used to store the data and the Task tracker is used to process the data.
 When you run a query or MapReduce job, the Task tracker processes the data at the node where the data exists.
o This minimizes the need for data transfer across nodes and improves job performance; this is called data locality.
 If the size of the data is HUGE, then
o It is highly recommended to move the computation logic near to the data.
o It is not recommended to move the data near to the computation logic.
o Advantage: minimizes network traffic and improves job performance.

Thanks!
