HDFS Material
Make a note:
o Hadoop 1.0 = HDFS + MapReduce
o But please don't apply this like a math formula, i.e. HDFS = Hadoop - MapReduce.
Purpose of HDFS
Features
Highly scalable
Distributed
Load-balanced
Portable
Fault-tolerant storage system
3. HDFS Follows
Node: a single machine (a commodity server) in the cluster.
Rack: a group of nodes housed together and connected to the same rack switch.
Switch: the network switch; a rack switch connects the nodes within a rack, and an out-of-rack (core) switch connects the racks to each other.
Cluster: the collection of all racks, i.e. all the machines running Hadoop.
Info
A Hadoop cluster contains three types of nodes:
1. Master nodes.
2. Slave nodes.
3. Client nodes.
1. Master node:
The master node runs the master daemons (background jobs): the Name node and the Job tracker.
2. Slave nodes
Slave nodes store the actual data (raw data) and run the computations over that data.
Slave nodes run the slave daemons (background jobs).
The slave daemons are the Data node and the Task tracker.
The Data node is a slave to the Name node.
o The Data node communicates with the Name node to receive its instructions.
The Task tracker is a slave to the Job tracker.
o The Task tracker communicates with the Job tracker to receive its instructions.
So, the slave daemons work as per the master daemons' instructions.
3. Client nodes
The main role of the client node is to load the data into the cluster.
The client node also submits the MapReduce jobs.
A MapReduce job describes how that data should be processed.
The client node receives the final results from the finished jobs.
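To make this concrete, here is a minimal sketch of a client program loading a local file into the cluster through the Hadoop FileSystem API. The paths and the name node address below are made-up placeholders, not values from this material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadToCluster {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy the local file (File.txt) into the cluster, as the client node does.
        fs.copyFromLocalFile(new Path("/local/data/File.txt"),
                             new Path("/user/demo/File.txt"));
        fs.close();
    }
}

The same step is commonly done from the command line with hdfs dfs -put File.txt /user/demo/.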
6. Daemon
A daemon is a background job (process) running on a node. The master daemons are the Name node and the Job tracker; the slave daemons are the Data node and the Task tracker.
7. Block
A file in HDFS is split into fixed-size blocks (e.g., 128 MB, as used later in this material).
Each block is stored on different Data nodes to achieve fault tolerance.
Hadoop maintains a replication factor; by default the replication factor is 3.
We can customize this value (a short sketch follows this list)
o At the cluster level
o At file creation time
o At a later stage, for an already stored file
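Here is a small sketch (placeholder paths and values) showing the three customization points with the Hadoop API: the cluster-level dfs.replication property, a per-file replication factor at creation time, and a change for an already stored file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-level default (normally set once in hdfs-site.xml).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // At file creation: request 2 replicas and a 128 MB block size for this file only.
        FSDataOutputStream out = fs.create(new Path("/user/demo/two-copies.txt"),
                true, 4096, (short) 2, 128L * 1024 * 1024);
        out.writeBytes("sample content\n");
        out.close();

        // Later stage: change the replication factor of an already stored file.
        fs.setReplication(new Path("/user/demo/File.txt"), (short) 2);
        fs.close();
    }
}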
8. Name node
Name node:
What does it store?
What does it not store?
What is it responsible for?
How many Name nodes?
What about the production environment?
If the heart fails, then what is the result?
Why is the Name node so expensive?
We know file blocks are stored in Data nodes; these Data nodes are maintained and managed by the Name node.
The client application communicates with the Name node to perform file operations like add, copy, move, and delete.
The Name node provides the required metadata to the client.
How many
o In Hadoop 1.0 there is only one Name node per cluster.
Production
o In a production environment, the Name node runs on a dedicated, reliable machine with plenty of RAM, because it keeps the file system metadata in memory; this is also why the Name node is expensive.
As per the discussion, the Name node is the heart of HDFS, so if the heart fails we already know what the result is.
The Name node is a single point of failure, which means if the Name node fails then accessing the file system is not possible at all.
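The file operations mentioned above are metadata calls answered by the Name node. Below is a minimal sketch (placeholder paths) using the Hadoop FileSystem API to read a file's metadata and then move and delete files; no file data travels through the Name node for any of these calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Metadata lookup: block size and replication factor come from the Name node.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/File.txt"));
        System.out.println("block size = " + status.getBlockSize()
                + ", replication = " + status.getReplication());

        // Move (rename) and delete are also handled through the Name node's metadata.
        fs.rename(new Path("/user/demo/File.txt"), new Path("/user/demo/File-moved.txt"));
        fs.delete(new Path("/user/demo/old-data"), true);  // true = delete recursively
        fs.close();
    }
}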
Data node:
What does it store?
What is it responsible for?
Heartbeats
Block report
How many
Commission
Decommission
Communication
If data node fails
Store
o Data nodes store the actual data blocks of the files.
Responsible
o Data nodes are responsible for serving read and write requests from clients, and for block creation, deletion, and replication as instructed by the Name node.
Heartbeat
Using the heartbeat mechanism, the Data node regularly updates the Name node with its current status, including:
o Stored blocks.
o Idle blocks.
o Working status
Block report
o Along with heartbeats, each Data node periodically sends a block report: a list of all the blocks stored on that Data node.
How many
There can be any number of Data nodes per cluster, which means if the data set keeps growing we can simply add more Data nodes.
Commission
o Commissioning means adding new Data nodes to the cluster; decommissioning means gracefully removing Data nodes from the cluster.
Communication
One Data node can communicate with another Data node during replication.
If a Data node fails
When a Data node goes down or fails, the Name node immediately takes responsibility for re-replicating that node's data to other Data nodes.
1. If the data is small in size, then it is very easy to store and process.
2. But if the data keeps growing and reaches the BIG DATA scale, then it becomes difficult to store and process.
3. So, to handle this situation, a special mechanism or technique is required.
4. When we are speaking about BIG DATA problems, Hadoop is the best solution.
5. Basically, Hadoop stores and processes large data and gives the results fast.
6. Hadoop follows the divide-and-conquer rule.
Hadoop cuts the large data into pieces and spreads them out over many machines.
Hadoop processes these pieces of data on the machines in parallel.
That is how Hadoop gives the results extremely fast.
Example
Assume that we have a huge data file (100 GB) containing emails sent to the customer service department.
The requirement is to find out how many times the word "Refund" was typed by customers.
This exercise will help the business understand and respond to customer needs.
It's a simple word count exercise.
Work flow
The client loads the data file (File.txt) into the cluster.
The client submits a job describing how to analyze that data (word count).
The cluster stores the result in a new file (Results.txt).
Finally, the client reads the results file.
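Below is a rough sketch of how such a job could look with the Hadoop MapReduce API. The class names and input/output paths are illustrative, and the mapper only emits counts for the word "Refund" to match the requirement above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RefundCount {

    // Mapper: emit ("Refund", 1) for every occurrence of the word in a line.
    public static class RefundMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text WORD = new Text("Refund");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.equalsIgnoreCase("refund")) {
                    context.write(WORD, ONE);
                }
            }
        }
    }

    // Reducer: sum all the 1s emitted for "Refund".
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "refund count");
        job.setJarByClass(RefundCount.class);
        job.setMapperClass(RefundMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/results"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The summed count ends up in the job's output directory, which plays the role of Results.txt in the workflow above.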
12. Writing to HDFS
When a client or application wants to write a file to HDFS, it reaches out to the name
node with details of the file.
The name node responds with details based on the actual size of the file, the block size, and the replication configuration.
These details from the name node contain the number of blocks of the file, the replication factor, and the data nodes where each block will be stored.
In the above diagram, the giant file is divided into blocks (A, B, C, D, …).
Based on information received from the name node, the client or application splits the file into multiple blocks and starts sending them to data nodes.
The client or application directly transfers the data to data nodes based on the
replication factor.
The name node is not involved in the actual data transfer (data blocks don’t pass
through the name node).
As per the diagram, Block A is transferred to data node 1 along with the details of the two other data nodes where this block needs to be stored.
When it receives Block A from the client (assuming a replication factor of 3), data node 1 copies the same block to data node 2 (in this case, data node 2 of the same rack).
This involves a block transfer via the rack switch because both of these data nodes are in the same rack.
When it receives Block A from data node 1, data node 2 copies the same block to data node 3 (in this case, data node 3 of another rack).
This involves a block transfer via an out-of-rack switch along with a rack switch because these two data nodes are in separate racks.
Data Flow Pipeline
In fact, the data transfer from the client to data node 1 for a given block (128 MB)
will be in smaller chunks of 4KB.
For better performance, data nodes maintain a pipeline for data transfer.
When data node 1 receives the first 4KB chunk from the client, it stores this chunk in
its local repository and immediately starts transferring it to data node 2 in the flow.
Likewise, when data node 2 receives the first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3.
Make a note
When all the data nodes have received a block, they inform the name node.
The data node confirms to the client as well.
Make a note
For simplicity, we explained how one block from the client is written to different data
nodes.
But the whole process is actually repeated for each block of the file, and the data transfers happen in parallel, for a faster write of the blocks.
All data blocks in corresponding data nodes
In the above diagram we can see all blocks (A, B, C, …) in their corresponding data nodes in the cluster.
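From the client's point of view, all of the above is hidden behind a single create() call; the HDFS client library takes care of the block splitting, the name node interaction, and the data node pipeline described in this section. A minimal sketch (placeholder path and content) follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() asks the name node for metadata (blocks, replication, target data nodes);
        // the bytes written below stream directly to the data node pipeline.
        FSDataOutputStream out = fs.create(new Path("/user/demo/emails/File.txt"));
        out.writeBytes("customer asked for a refund on order 1234\n");  // sample content
        out.close();  // close() waits for the pipeline acknowledgements
        fs.close();
    }
}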
13. Reading from HDFS
To read a file from HDFS, the client or application reaches out to the name node with the name of the file.
The name node responds with the number of blocks of the file and the data nodes where each block has been stored.
Data blocks don’t pass through name node
Now the client or application reaches out to the data nodes directly (data blocks
don’t pass through the name node) to read the blocks of the files in parallel, based
on information received from the name node.
When the client or application receives all the blocks of the file, it combines these blocks into the form of the original file.
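A minimal read sketch (placeholder file path) is shown below. It first asks the name node where each block of the file lives and then streams the content, which the client library fetches directly from those data nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/Results.txt");

        // Ask the name node which data nodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(",", block.getHosts()));
        }

        // open() streams the block data directly from the data nodes.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}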
14. Heart beat mechanism
On cluster startup, the name node enters a special state called safe mode.
During this time, the name node receives a heartbeat signal from each data node in the cluster (indicating which data nodes are active and functioning properly) and a block report (containing a list of all the blocks on that specific data node).
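The timing of these heartbeats and block reports is controlled by configuration properties. The sketch below uses the commonly documented defaults (a 3-second heartbeat and a 6-hour block report interval); treat the exact values as assumptions for any particular cluster, since they are normally set in hdfs-site.xml rather than in client code.

import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // How often each data node sends a heartbeat to the name node (seconds).
        conf.setLong("dfs.heartbeat.interval", 3);

        // How often each data node sends a full block report (milliseconds).
        conf.setLong("dfs.blockreport.intervalMsec", 6 * 60 * 60 * 1000L);

        System.out.println("heartbeat interval (s)     = " + conf.get("dfs.heartbeat.interval"));
        System.out.println("block report interval (ms) = " + conf.get("dfs.blockreport.intervalMsec"));
    }
}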
15. Goals of HDFS
1. Horizontal scalability
2. Fault tolerance
3. Capability to run on commodity hardware
4. Write once, read many times
5. Capacity to handle large data sets
6. Data locality
1. Horizontal scalability
HDFS scales horizontally: as the data set grows, we can simply add more Data nodes (commodity machines) to the cluster.
2. Fault tolerance
HDFS assumes that failures (hardware and software) are very common.
Even when failures occur, HDFS keeps the data available because it provides data replication by default.
Rule
o By default, Hadoop creates three copies of the data.
o Two copies go on the same rack and one copy on a different rack.
o Even if a whole rack fails, we will not lose the data.
o If one copy of the data is not accessible or gets corrupted, there is no need to worry.
o The framework itself takes care of keeping the data highly available.
Still not clear about fault tolerance? Then here is a short definition for you: even if some nodes or racks fail, the data is still available and the system keeps working.
3. Capability to run on commodity hardware
HDFS runs on commodity hardware, which means we can use low-cost hardware to store the large data.
An RDBMS is more expensive for storing and processing the data.
4. Write once, read many times
HDFS is based on the concept of write once, read many times, which means once data is written it will not be modified.
HDFS focuses on retrieving the data in the fastest possible way.
HDFS was originally designed for batch processing.
5. Capacity to handle large data sets
HDFS is best suited to storing large data sets, in sizes of GBs, TBs, and beyond.
6. Data locality
The Data node and the Task tracker are present on the slave nodes in the Hadoop cluster.
The Data node is used to store the data, and the Task tracker is used to process the data.
When you run a query or a MapReduce job, the Task tracker processes the data at the node where that data exists.
o This minimizes the need for data transfer across nodes and improves job performance; this is called data locality.
If the size of the data is HUGE, then
o It is highly recommended to move the computation logic near to the data.
o It is not recommended to move the data near to the computation logic.
o Advantage: minimizes network traffic and improves job performance.
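As a rough illustration of how the framework knows where the data lives, the sketch below (placeholder input path) prints the hosts attached to each input split of a job. The scheduler uses exactly these location hints to run each map task on, or close to, a node that stores the corresponding block.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ShowSplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "show split locations");
        FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));

        // Each input split carries the hosts that store its data; these are the
        // locality hints the scheduler follows when placing map tasks.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> hosts: " + String.join(",", split.getLocations()));
        }
    }
}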
Thanks