HDFS Material
Make a note:
o Hadoop 1.0 = HDFS + MapReduce
o But please don't apply this like a math formula, i.e. HDFS = Hadoop - MapReduce.
Purpose of HDFS
Features
Highly scalable
Distributed
Load-balanced
Portable
Fault-tolerant storage system
3. HDFS Follows
Node: a single machine (a commodity server) in the cluster.
Rack: a group of nodes housed together and connected to the same rack switch.
Switch: the network switch; a rack switch connects the nodes within a rack, and an out-of-rack (core) switch connects the racks to each other.
Cluster: the collection of all racks, i.e. all the machines running Hadoop.
Info
A Hadoop cluster contains three types of nodes:
1. Master nodes.
2. Slave nodes.
3. Client nodes.
1. Master node:
The master node runs the master daemons (background jobs): the Name node and the Job tracker.
2. Slave nodes
Slave nodes store the actual data (raw data) and run the computations over that data.
Slave nodes run the slave daemons (background jobs).
The slave daemons are the Data node and the Task tracker.
The Data node is a slave to the Name node.
o The Data node communicates with the Name node to receive its instructions.
The Task tracker is a slave to the Job tracker.
o The Task tracker communicates with the Job tracker to receive its instructions.
So, the slave daemons work as per the master daemons' instructions.
3. Client nodes
The main role of the client node is to load the data into the cluster.
The client node also submits the MapReduce jobs.
A MapReduce job describes how that data should be processed.
The client node receives the final results from the finished jobs.
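To make this concrete, here is a minimal sketch of a client program loading a local file into the cluster through the Hadoop FileSystem API. The paths and the name node address below are made-up placeholders, not values from this material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadToCluster {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy the local file (File.txt) into the cluster, as the client node does.
        fs.copyFromLocalFile(new Path("/local/data/File.txt"),
                             new Path("/user/demo/File.txt"));
        fs.close();
    }
}

The same step is commonly done from the command line with hdfs dfs -put File.txt /user/demo/.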
6. Daemon
A daemon is a background job (process) running on a node. The master daemons are the Name node and the Job tracker; the slave daemons are the Data node and the Task tracker.
7. Block
A file in HDFS is split into fixed-size blocks (e.g., 128 MB, as used later in this material).
Each block is stored on different Data nodes to achieve fault tolerance.
Hadoop maintains a replication factor; by default the replication factor is 3.
We can customize this value (a short sketch follows this list)
o At the cluster level
o At file creation time
o At a later stage, for an already stored file
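Here is a small sketch (placeholder paths and values) showing the three customization points with the Hadoop API: the cluster-level dfs.replication property, a per-file replication factor at creation time, and a change for an already stored file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-level default (normally set once in hdfs-site.xml).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // At file creation: request 2 replicas and a 128 MB block size for this file only.
        FSDataOutputStream out = fs.create(new Path("/user/demo/two-copies.txt"),
                true, 4096, (short) 2, 128L * 1024 * 1024);
        out.writeBytes("sample content\n");
        out.close();

        // Later stage: change the replication factor of an already stored file.
        fs.setReplication(new Path("/user/demo/File.txt"), (short) 2);
        fs.close();
    }
}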
8. Name node
Name node:
What does it store?
What does it not store?
What is it responsible for?
How many Name nodes?
What about the production environment?
If the heart fails, then what is the result?
Why is the Name node so expensive?
We know file blocks are stored in Data nodes; these Data nodes are maintained and managed by the Name node.
The client application communicates with the Name node to perform file operations like add, copy, move, and delete.
The Name node provides the required metadata to the client.
How many
o In Hadoop 1.0 there is only one Name node per cluster.
Production
o In a production environment, the Name node runs on a dedicated, reliable machine with plenty of RAM, because it keeps the file system metadata in memory; this is also why the Name node is expensive.
As per the discussion, the Name node is the heart of HDFS, so if the heart fails we already know what the result is.
The Name node is a single point of failure, which means if the Name node fails then accessing the file system is not possible at all.
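The file operations mentioned above are metadata calls answered by the Name node. Below is a minimal sketch (placeholder paths) using the Hadoop FileSystem API to read a file's metadata and then move and delete files; no file data travels through the Name node for any of these calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Metadata lookup: block size and replication factor come from the Name node.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/File.txt"));
        System.out.println("block size = " + status.getBlockSize()
                + ", replication = " + status.getReplication());

        // Move (rename) and delete are also handled through the Name node's metadata.
        fs.rename(new Path("/user/demo/File.txt"), new Path("/user/demo/File-moved.txt"));
        fs.delete(new Path("/user/demo/old-data"), true);  // true = delete recursively
        fs.close();
    }
}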
Data node:
What does it store?
What is it responsible for?
Heartbeats
Block report
How many
Commission
Decommission
Communication
If data node fails
Store
o Data nodes store the actual data blocks of the files.
Responsible
o Data nodes are responsible for serving read and write requests from clients, and for block creation, deletion, and replication as instructed by the Name node.
Heartbeat
Using the heartbeat mechanism, the Data node regularly updates the Name node with its current status, including:
o Stored blocks.
o Idle blocks.
o Working status
Block report
o Along with heartbeats, each Data node periodically sends a block report: a list of all the blocks stored on that Data node.
How many
There can be any number of Data nodes per cluster, which means if the data set keeps growing we can simply add more Data nodes.
Commission
o Commissioning means adding new Data nodes to the cluster; decommissioning means gracefully removing Data nodes from the cluster.
Communication
One Data node can communicate with another Data node during replication.
If a Data node fails
When a Data node goes down or fails, the Name node immediately takes responsibility for re-replicating that node's data to other Data nodes.
1. If the data is small in size, then it is very easy to store and process.
2. But if the data keeps growing and reaches the BIG DATA scale, then it becomes difficult to store and process.
3. So, to handle this situation, a special mechanism or technique is required.
4. When we are speaking about BIG DATA problems, Hadoop is the best solution.
5. Basically, Hadoop stores and processes large data and gives the results fast.
6. Hadoop follows the divide-and-conquer rule.
Hadoop cuts the large data into pieces and spreads them out over many machines.
Hadoop processes these pieces of data on the machines in parallel.
That is how Hadoop gives the results extremely fast.
Example
Assume that we have a huge data file (100 GB) containing emails sent to the customer service department.
The requirement is to find out how many times the word "Refund" was typed by customers.
This exercise will help the business understand and respond to customer needs.
It's a simple word count exercise.
Work flow
The client loads the data file (File.txt) into the cluster.
The client submits a job describing how to analyze that data (word count).
The cluster stores the result in a new file (Results.txt).
Finally, the client reads the results file.
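Below is a rough sketch of how such a job could look with the Hadoop MapReduce API. The class names and input/output paths are illustrative, and the mapper only emits counts for the word "Refund" to match the requirement above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RefundCount {

    // Mapper: emit ("Refund", 1) for every occurrence of the word in a line.
    public static class RefundMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text WORD = new Text("Refund");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.equalsIgnoreCase("refund")) {
                    context.write(WORD, ONE);
                }
            }
        }
    }

    // Reducer: sum all the 1s emitted for "Refund".
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "refund count");
        job.setJarByClass(RefundCount.class);
        job.setMapperClass(RefundMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/results"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The summed count ends up in the job's output directory, which plays the role of Results.txt in the workflow above.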
12. Writing to HDFS
When a client or application wants to write a file to HDFS, it reaches out to the name
node with details of the file.
The name node responds with details based on the actual size of the file, the block size, and the replication configuration.
These details from the name node contain the number of blocks of the file, the replication factor, and the data nodes where each block will be stored.
In the above diagram, the giant file is divided into blocks (A, B, C, D, …).
Based on information received from the name node, the client or application splits the file into multiple blocks and starts sending them to data nodes.
The client or application directly transfers the data to data nodes based on the
replication factor.
The name node is not involved in the actual data transfer (data blocks don’t pass
through the name node).
As per the diagram, Block A is transferred to data node 1 along with the details of the two other data nodes where this block needs to be stored.
When it receives Block A from the client (assuming a replication factor of 3), data node 1 copies the same block to data node 2 (in this case, data node 2 of the same rack).
This involves a block transfer via the rack switch because both of these data nodes are in the same rack.
When it receives Block A from data node 1, data node 2 copies the same block to data node 3 (in this case, data node 3 of another rack).
This involves a block transfer via an out-of-rack switch along with a rack switch because these two data nodes are in separate racks.
Data Flow Pipeline
In fact, the data transfer from the client to data node 1 for a given block (128 MB)
will be in smaller chunks of 4KB.
For better performance, data nodes maintain a pipeline for data transfer.
When data node 1 receives the first 4KB chunk from the client, it stores this chunk in
its local repository and immediately starts transferring it to data node 2 in the flow.
Likewise, when data node 2 receives the first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3.
Make a note
When all the data nodes have received a block, they inform the name node.
The data node confirms to the client as well.
Make a note
For simplicity, we explained how one block from the client is written to different data
nodes.
But the whole process is actually repeated for each block of the file, and the data transfers happen in parallel, for a faster write of the blocks.
All data blocks in corresponding data nodes
In the above diagram we can see all blocks (A, B, C, …) in their corresponding data nodes in the cluster.
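From the client's point of view, all of the above is hidden behind a single create() call; the HDFS client library takes care of the block splitting, the name node interaction, and the data node pipeline described in this section. A minimal sketch (placeholder path and content) follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() asks the name node for metadata (blocks, replication, target data nodes);
        // the bytes written below stream directly to the data node pipeline.
        FSDataOutputStream out = fs.create(new Path("/user/demo/emails/File.txt"));
        out.writeBytes("customer asked for a refund on order 1234\n");  // sample content
        out.close();  // close() waits for the pipeline acknowledgements
        fs.close();
    }
}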
13. Reading from HDFS
To read a file from HDFS, the client or application reaches out to the name node with the name of the file.
The name node responds with the number of blocks of the file and the data nodes where each block has been stored.
Data blocks don’t pass through name node
Now the client or application reaches out to the data nodes directly (data blocks
don’t pass through the name node) to read the blocks of the files in parallel, based
on information received from the name node.
When the client or application receives all the blocks of the file, it combines these blocks into the form of the original file.
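A minimal read sketch (placeholder file path) is shown below. It first asks the name node where each block of the file lives and then streams the content, which the client library fetches directly from those data nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/Results.txt");

        // Ask the name node which data nodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(",", block.getHosts()));
        }

        // open() streams the block data directly from the data nodes.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}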
14. Heart beat mechanism
On cluster startup, the name node enters a special state called safe mode.
During this time, the name node receives a heartbeat signal from each data node in the cluster (indicating which data nodes are active and functioning properly) and a block report (containing a list of all the blocks on that specific data node).
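The timing of these heartbeats and block reports is controlled by configuration properties. The sketch below uses the commonly documented defaults (a 3-second heartbeat and a 6-hour block report interval); treat the exact values as assumptions for any particular cluster, since they are normally set in hdfs-site.xml rather than in client code.

import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // How often each data node sends a heartbeat to the name node (seconds).
        conf.setLong("dfs.heartbeat.interval", 3);

        // How often each data node sends a full block report (milliseconds).
        conf.setLong("dfs.blockreport.intervalMsec", 6 * 60 * 60 * 1000L);

        System.out.println("heartbeat interval (s)     = " + conf.get("dfs.heartbeat.interval"));
        System.out.println("block report interval (ms) = " + conf.get("dfs.blockreport.intervalMsec"));
    }
}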
15. Goals of HDFS
1. Horizontal scalability
2. Fault tolerance
3. Capability to run on commodity hardware
4. Write once, read many times
5. Capacity to handle large data sets
6. Data locality
1. Horizontal scalability
HDFS scales horizontally: as the data set grows, we can simply add more Data nodes (commodity machines) to the cluster.
2. Fault tolerance
HDFS assumes that failures (hardware and software) are very common.
Even when failures occur, HDFS keeps the data available because it provides data replication by default.
Rule
o By default, Hadoop creates three copies of the data.
o Two copies go on the same rack and one copy on a different rack.
o Even if a whole rack fails, we will not lose the data.
o If one copy of the data is not accessible or gets corrupted, there is no need to worry.
o The framework itself takes care of keeping the data highly available.
Still not clear about fault tolerance? Then here is a short definition for you: even if some nodes or racks fail, the data is still available and the system keeps working.
3. Capability to run on commodity hardware
HDFS runs on commodity hardware, which means we can use low-cost hardware to store the large data.
An RDBMS is more expensive for storing and processing the data.
4. Write once, read many times
HDFS is based on the concept of write once, read many times, which means once data is written it will not be modified.
HDFS focuses on retrieving the data in the fastest possible way.
HDFS was originally designed for batch processing.
5. Capacity to handle large data sets
HDFS is best suited to storing large data sets, in sizes of GBs, TBs, and beyond.
6. Data locality
The Data node and the Task tracker are present on the slave nodes in the Hadoop cluster.
The Data node is used to store the data, and the Task tracker is used to process the data.
When you run a query or a MapReduce job, the Task tracker processes the data at the node where that data exists.
o This minimizes the need for data transfer across nodes and improves job performance; this is called data locality.
If the size of the data is HUGE, then
o It is highly recommended to move the computation logic near to the data.
o It is not recommended to move the data near to the computation logic.
o Advantage: minimizes network traffic and improves job performance.
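As a rough illustration of how the framework knows where the data lives, the sketch below (placeholder input path) prints the hosts attached to each input split of a job. The scheduler uses exactly these location hints to run each map task on, or close to, a node that stores the corresponding block.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ShowSplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "show split locations");
        FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));

        // Each input split carries the hosts that store its data; these are the
        // locality hints the scheduler follows when placing map tasks.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> hosts: " + String.join(",", split.getLocations()));
        }
    }
}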
Thanks