
Big Data Processing

Jiaul Paik
Lecture 7
Storing Big Data in a Cluster

Hadoop Distributed Filesystem

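HDFS (Hadoop) Architecture
namenode = master node

[Figure: HDFS architecture]
• Application → HDFS client
• HDFS client → HDFS namenode: (file name, block id)
• HDFS namenode → HDFS client: (block id, block location)
• HDFS namenode holds the file namespace, e.g. /foo/bar → block 3df2
• HDFS namenode → HDFS datanodes: instructions to datanode
• HDFS datanodes → HDFS namenode: datanode state
• HDFS client → HDFS datanode: (block id, byte range)
• HDFS datanode → HDFS client: block data
• Each HDFS datanode stores its blocks on a Linux file system

(Ghemawat et al., SOSP 2003)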
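
The request flow in the architecture above can be sketched as a toy model. This is only an illustration under stated assumptions, not HDFS code: the class names `Namenode` and `Datanode`, the helper `read_file`, and the block id `9a41` are hypothetical; `/foo/bar` and `3df2` come from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Namenode:
    """Toy master node: holds only metadata, never block data."""
    namespace: dict = field(default_factory=dict)        # file name -> list of block ids
    block_locations: dict = field(default_factory=dict)  # block id -> list of datanode names

    def get_block_locations(self, file_name):
        # Answer a client request with (block id, block locations) pairs.
        return [(b, self.block_locations[b]) for b in self.namespace[file_name]]


@dataclass
class Datanode:
    """Toy worker node: stores block bytes (on a local file system in real HDFS)."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes

    def read(self, block_id, start, end):
        # Serve a (block id, byte range) request with block data.
        return self.blocks[block_id][start:end]


def read_file(namenode, datanodes, file_name):
    """Client side: ask the namenode where blocks live, then read from datanodes."""
    data = b""
    for block_id, locations in namenode.get_block_locations(file_name):
        node = datanodes[locations[0]]  # pick the first replica for simplicity
        data += node.read(block_id, 0, len(node.blocks[block_id]))
    return data


# Hypothetical two-block file spread over two datanodes.
dn1 = Datanode("dn1", {"3df2": b"hello "})
dn2 = Datanode("dn2", {"3df2": b"hello ", "9a41": b"world"})
nn = Namenode({"/foo/bar": ["3df2", "9a41"]}, {"3df2": ["dn1", "dn2"], "9a41": ["dn2"]})
content = read_file(nn, {"dn1": dn1, "dn2": dn2}, "/foo/bar")
```

The point of the sketch is the separation of roles: the namenode answers only metadata questions, while all file bytes move directly between the client and the datanodes.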

HDFS

[Figure: daemons on a Hadoop cluster]
• namenode: runs the namenode daemon
• job submission node: runs the jobtracker
• each slave node: runs a tasktracker and a datanode daemon on top of its Linux file system
HDFS
Reading and Writing
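Dataflow: Reading data from HDFS

[Figure: read path. The HDFS client works through a DistributedFileSystem object and an FSDataInputStream. In step 2 (get block locations), DistributedFileSystem asks the NameNode on the namenode for block locations; the data is then read from the DataNodes on the datanodes.]

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White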
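Writing data to HDFS

[Figure: write path]
1. Create: HDFS client → DistributedFileSystem
2. Create: DistributedFileSystem → NameNode (on the namenode)
3. Write: HDFS client → FSDataOutputStream
4. Write packet: FSDataOutputStream → first DataNode, then forwarded along the pipeline of datanodes
5. Ack packet: acknowledgements travel back along the pipeline to FSDataOutputStream
6. Close: HDFS client → FSDataOutputStream
7. Complete: the NameNode is told that the file is complete

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White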
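
Steps 4 and 5 of the write path (packets forwarded down the datanode pipeline, acknowledgements flowing back) can be illustrated with a small simulation. This is a sketch under assumptions: `write_packet`, `write_file`, the packet size, and the datanode names are hypothetical and do not correspond to the Hadoop API.

```python
def write_packet(packet, pipeline, stores):
    """Forward one packet down the datanode pipeline, then collect acks back up.

    pipeline: ordered list of datanode names; stores: name -> list of packets.
    Returns the datanodes in the order they acknowledge (last in pipeline first).
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    stores[head].append(packet)                           # step 4: store the packet
    downstream_acks = write_packet(packet, rest, stores)  # ...and forward it
    return downstream_acks + [head]                       # step 5: acks flow back


def write_file(data, packet_size, pipeline):
    """Split data into packets and push each one through the whole pipeline."""
    stores = {name: [] for name in pipeline}
    for i in range(0, len(data), packet_size):
        packet = data[i:i + packet_size]
        acks = write_packet(packet, pipeline, stores)
        # A packet counts as written only once every datanode has acknowledged it.
        if set(acks) != set(pipeline):
            raise IOError("packet was not acknowledged by all datanodes")
    return stores


stores = write_file(b"abcdefgh", packet_size=3, pipeline=["dn1", "dn2", "dn3"])
```

After the call, every datanode in the pipeline holds the same sequence of packets, which is what gives each block its replicas.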
Managing Hadoop: Other Key Issues
• Node failure

• HDFS federation (to address namenode memory limits)

• Cluster balancing

• Data caching
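Node failures
• Namenode failure
• All the files in the filesystem are effectively lost
• The namenode holds the only mapping from files to blocks, so the files cannot be reconstructed from the blocks on the datanodes

• Datanode failure
• Usually not a problem
• Each data block is replicated on several machines (three by default)
• A lost block can be recovered from a replica on another machine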
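
Recovery from a datanode failure can be sketched as follows. This is a simplified illustration, not the actual HDFS re-replication logic: the function `handle_datanode_failure`, the block ids, and the node names are hypothetical, and the target replication of 3 reflects the HDFS default.

```python
def handle_datanode_failure(block_locations, failed, live_nodes, replication=3):
    """Drop a failed datanode and re-replicate its blocks from surviving copies.

    block_locations: block id -> set of datanode names holding a replica.
    Returns the list of (block id, source node, target node) copies scheduled.
    """
    copies = []
    for block_id, holders in block_locations.items():
        holders.discard(failed)
        if not holders:
            raise RuntimeError(f"block {block_id} lost: no surviving replica")
        # Candidate targets: live nodes that do not already hold the block.
        candidates = sorted(n for n in live_nodes if n not in holders and n != failed)
        while len(holders) < replication and candidates:
            target = candidates.pop(0)
            source = sorted(holders)[0]  # any surviving replica can be the source
            copies.append((block_id, source, target))
            holders.add(target)
    return copies


locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
scheduled = handle_datanode_failure(locations, failed="dn2", live_nodes=["dn1", "dn3", "dn4"])
```

Because every block still has at least one surviving replica, the failure costs only some copying; no data is lost.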
Tackling Namenode failure
• If the namenode fails, all the metadata is lost
• The files can no longer be reconstructed from the blocks

• How to handle this?

• Maintain a replica of the metadata on another, passive machine

• If the active namenode fails, start the passive namenode

• The passive namenode needs to load the namespace into memory before it can start serving requests
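
The failover idea above can be sketched in a few lines. This is a toy model under assumptions: `ToyNamenode` and `fail_over` are hypothetical names, and the metadata is reduced to a single dictionary.

```python
import copy


class ToyNamenode:
    """Toy namenode that can serve lookups only after loading the namespace."""

    def __init__(self):
        self.namespace = None   # in-memory metadata; None until loaded
        self.active = False

    def load_namespace(self, persisted_metadata):
        # Loading into memory must finish before the node can serve requests.
        self.namespace = copy.deepcopy(persisted_metadata)

    def lookup(self, file_name):
        if not self.active or self.namespace is None:
            raise RuntimeError("namenode is not serving")
        return self.namespace[file_name]


def fail_over(replica_metadata):
    """Start a passive namenode from the replicated metadata."""
    standby = ToyNamenode()
    standby.load_namespace(replica_metadata)  # first: load the namespace
    standby.active = True                     # only then: become active
    return standby


metadata = {"/foo/bar": ["3df2"]}
replica = copy.deepcopy(metadata)   # replica kept on the passive machine
metadata = None                     # the active namenode fails; its copy is gone
new_active = fail_over(replica)
```

The order inside `fail_over` mirrors the last bullet: the namespace is loaded first, and the node is marked active only afterwards.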

HDFS Federation

• The namenode keeps a reference to every file and block in the filesystem in memory

• For a very large cluster, the namenode may run out of memory to hold the metadata

• Solution: add more namenodes to the cluster, each managing a portion of the filesystem namespace
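
One way to picture federation is a table that maps parts of the namespace to namenodes, so that each namenode holds metadata only for its own subtree. The sketch below is an assumption for illustration: the function `pick_namenode`, the mount table, and the namenode names are hypothetical.

```python
def pick_namenode(path, mount_table):
    """Return the namenode responsible for `path` using longest-prefix match."""
    best = None
    for prefix, namenode in mount_table.items():
        # Match the prefix itself or anything below it, but not "/userdata" for "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no namenode manages {path}")
    return best[1]


# Hypothetical split of the namespace across three namenodes.
mount_table = {
    "/user": "namenode-1",
    "/share": "namenode-2",
    "/user/logs": "namenode-3",
}
owner = pick_namenode("/user/alice/data.txt", mount_table)
```

Each namenode then needs memory only for the files and blocks under its own prefixes, which is what relieves the single-namenode memory limit.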

HDFS Cluster Balancing
• When copying data into HDFS, it is important to keep data storage balanced across the cluster

• Why?
• HDFS works best when blocks are spread evenly across the datanodes

• Example:
• In distcp, the -m option sets the maximum number of map tasks that do the copying
• With -m 1, a single task does all the copying
• The copy will be slow
• Cluster resources are poorly utilized

• The default value of m is 20 in Hadoop.
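
The effect of m on a copy job can be illustrated with a small load calculation. This is only a sketch: the greedy assignment in `assign_to_tasks` is an assumption chosen for illustration and is not claimed to be how distcp divides work, and the file sizes are made up.

```python
import heapq


def assign_to_tasks(file_sizes, m):
    """Greedily assign files (largest first) to the least-loaded of m tasks."""
    heap = [(0, task_id) for task_id in range(m)]  # (bytes assigned, task id)
    loads = [0] * m
    for size in sorted(file_sizes, reverse=True):
        load, task_id = heapq.heappop(heap)
        loads[task_id] = load + size
        heapq.heappush(heap, (loads[task_id], task_id))
    return loads


file_sizes = [40, 30, 30, 20, 20, 20, 10, 10, 10, 10]  # hypothetical sizes in GB
single = assign_to_tasks(file_sizes, m=1)    # one task copies everything
parallel = assign_to_tasks(file_sizes, m=4)  # work is shared by four tasks
```

With m = 1 the single task carries the full 200 GB, while with m = 4 the heaviest task carries 50 GB, so the copy can finish sooner and uses more of the cluster.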

Block Caching
• Generally, datanodes read blocks from disk

• Frequently accessed blocks can be cached in the datanode's RAM

• By default, a block is cached in only one datanode's memory

• Job schedulers try to run a task on the datanode where the block it reads is cached
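
The scheduling preference described above can be sketched as a simple tiered choice. This is an illustrative assumption, not the logic of any particular Hadoop scheduler: `choose_node`, the block id, and the node names are hypothetical.

```python
def choose_node(block_id, cached_on, replicas_on, free_nodes):
    """Pick a node for a task that reads `block_id`.

    Preference: a free node caching the block in RAM, then a free node holding
    it on disk, then any free node (which must read the block remotely).
    """
    for tier, holders in (("memory", cached_on), ("disk", replicas_on)):
        for node in holders.get(block_id, []):
            if node in free_nodes:
                return node, tier
    if not free_nodes:
        raise RuntimeError("no free node available")
    return sorted(free_nodes)[0], "remote"


cached_on = {"blk_1": ["dn2"]}                   # block cached on one datanode only
replicas_on = {"blk_1": ["dn1", "dn2", "dn3"]}   # on-disk replicas
best = choose_node("blk_1", cached_on, replicas_on, free_nodes={"dn1", "dn2"})
fallback = choose_node("blk_1", cached_on, replicas_on, free_nodes={"dn1", "dn4"})
```

Here `best` selects the node with the in-memory copy, and `fallback` shows what happens when that node is busy: the task still runs next to an on-disk replica.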
Filesystem Operations
• Major filesystem operations:
• reading files, creating directories, moving files, deleting data, and listing directories

• Hadoop filesystem commands can be run from the command line

• To see detailed help for every command:

hadoop fs -help
Filesystem Operations
• Copying a file from the local filesystem to HDFS

hadoop fs -copyFromLocal source-file dest-file

• Copying a file from HDFS to the local filesystem

hadoop fs -copyToLocal source-file dest-file
Filesystem Operations
• Creating a directory

hadoop fs -mkdir mydir

• Listing files (with no path, this lists your HDFS home directory)

hadoop fs -ls
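
The commands above can also be assembled from a script. The sketch below is a hedged illustration: the helpers `hadoop_fs` and `run` and the file names are hypothetical, and nothing is executed unless a `hadoop` executable is found on the PATH.

```python
import shutil
import subprocess


def hadoop_fs(*args):
    """Build the argument list for a `hadoop fs` command."""
    return ["hadoop", "fs", *args]


def run(argv):
    """Run the command only if its executable is on PATH; otherwise skip it."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)


# The same operations as on the slides, in a typical order.
commands = [
    hadoop_fs("-mkdir", "mydir"),
    hadoop_fs("-copyFromLocal", "source-file", "mydir/dest-file"),
    hadoop_fs("-ls", "mydir"),
    hadoop_fs("-copyToLocal", "mydir/dest-file", "local-copy"),
]
```

Calling `run(cmd)` for each entry of `commands` would execute them one by one on a machine with Hadoop installed; passing the arguments as a list avoids shell quoting issues with file names.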
