Big Data Processing

Jiaul Paik
Lecture 7
Storing Big Data in a Cluster

Hadoop Distributed Filesystem


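HDFS (Hadoop) Architecture
• namenode = master node

[Figure: HDFS architecture (Ghemawat et al., SOSP 2003)]
• Application → HDFS client, e.g. for the file /foo/bar
• HDFS client → HDFS namenode: (file name, block id); the namenode holds the file namespace, e.g. /foo/bar → block 3df2
• HDFS namenode → HDFS client: (block id, block location)
• HDFS namenode → HDFS datanodes: instructions to datanode
• HDFS datanodes → HDFS namenode: datanode state
• HDFS client → HDFS datanode: (block id, byte range)
• HDFS datanode → HDFS client: block data
• Each HDFS datanode stores its blocks on the local Linux file system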
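
The two lookups labeled in the figure can be sketched as two in-memory maps. This is a toy model, not Hadoop code: the class name `ToyNamenode`, its methods, and the datanode names are hypothetical, and addressing a block by its index within the file is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNamenode:
    """Hypothetical stand-in for the namenode's in-memory metadata."""
    # file name -> ordered list of block ids (the file namespace)
    namespace: Dict[str, List[str]] = field(default_factory=dict)
    # block id -> names of the datanodes that hold a replica
    block_locations: Dict[str, List[str]] = field(default_factory=dict)

    def get_block_id(self, path: str, block_index: int) -> str:
        # First lookup: which block holds this part of the file?
        return self.namespace[path][block_index]

    def get_locations(self, block_id: str) -> List[str]:
        # Second lookup: which datanodes store that block?
        return list(self.block_locations[block_id])


nn = ToyNamenode()
nn.namespace["/foo/bar"] = ["3df2"]
nn.block_locations["3df2"] = ["datanode-1", "datanode-2"]

block_id = nn.get_block_id("/foo/bar", 0)
locations = nn.get_locations(block_id)
```

Note that the namenode only answers metadata questions here; the block data itself never passes through it.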


HDFS

[Figure: daemons running on each node of a Hadoop cluster]
• namenode: runs the namenode daemon
• job submission node: runs the jobtracker
• slave nodes: each runs a tasktracker and a datanode daemon on top of the Linux file system
HDFS: Reading and Writing
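Dataflow: Reading data from HDFS

[Figure: read dataflow between the HDFS client, the namenode, and three datanodes]
• The HDFS client works through a DistributedFileSystem object and an FSDataInputStream
• Step 2: get block locations (DistributedFileSystem → NameNode)
• The block data is read from the DataNodes

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White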
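
A minimal sketch of the read path, assuming a toy cluster held in plain dictionaries. The function `read_file` and all the names in the example data are hypothetical; real clients go through the Hadoop `FileSystem` API, which is not modeled here.

```python
def read_file(path, namespace, block_locations, datanodes):
    """Toy read: look up the blocks of a file, then fetch each one from a datanode.

    namespace:       file name -> list of block ids      (held by the namenode)
    block_locations: block id  -> list of datanode names (held by the namenode)
    datanodes:       datanode name -> {block id: bytes}  (only live datanodes are listed)
    """
    data = b""
    for block_id in namespace[path]:
        for node in block_locations[block_id]:
            store = datanodes.get(node)
            if store is not None and block_id in store:
                data += store[block_id]  # read the block from the first live replica
                break
        else:
            raise IOError("no live replica for block " + block_id)
    return data


namespace = {"/foo/bar": ["b1", "b2"]}
block_locations = {"b1": ["dn1", "dn2"], "b2": ["dn3", "dn2"]}
# dn3 is missing from this dict, i.e. it is treated as unreachable.
datanodes = {
    "dn1": {"b1": b"hello "},
    "dn2": {"b1": b"hello ", "b2": b"world"},
}

content = read_file("/foo/bar", namespace, block_locations, datanodes)
```

The second block is served by `dn2` because `dn3` is unreachable, which is the reason each block has more than one location.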
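Writing data to HDFS

[Figure: write dataflow between the HDFS client, the namenode, and a pipeline of three datanodes]
1. Create: HDFS client → DistributedFileSystem
2. Create: DistributedFileSystem → NameNode
3. Write: HDFS client → FSDataOutputStream
4. Write packet: FSDataOutputStream → first datanode, then forwarded along the pipeline of datanodes
5. Ack packet: acknowledgements travel back along the pipeline to FSDataOutputStream
6. Close: HDFS client → FSDataOutputStream
7. Complete: DistributedFileSystem → NameNode

Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White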
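
Steps 4 and 5 of the write path can be illustrated with a small simulation. This is only a sketch: `write_through_pipeline`, the packet size of 4 bytes, and the datanode names are assumptions chosen for the example, and failures in the pipeline are not modeled.

```python
def write_through_pipeline(data, pipeline, packet_size=4):
    """Toy write: split data into packets and push each one through every datanode."""
    stores = {node: bytearray() for node in pipeline}
    acked_packets = 0
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        # Step 4: the packet is forwarded from datanode to datanode.
        for node in pipeline:
            stores[node].extend(packet)
        # Step 5: acknowledgements come back in reverse pipeline order.
        acks = [node for node in reversed(pipeline)]
        if len(acks) == len(pipeline):
            acked_packets += 1  # the packet counts as written once every node has acked
    return stores, acked_packets


stores, acked = write_through_pipeline(b"hello hdfs", ["dn1", "dn2", "dn3"])
```

After the call, every datanode in the pipeline holds a full copy of the data, so the block is replicated as a side effect of the write itself.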
Managing Hadoop: Other Key Issues
• Node failure

• HDFS federation (for memory issue)

• Cluster Balancing

• Data Caching
Node failures
• Namenode failure
• All the files in the filesystem are effectively lost
• Without the namenode's metadata, files cannot be reconstructed from their blocks

• Datanode failure
• Usually not a problem
• Each data block is replicated on several machines
• A lost block can be recovered from another machine that holds a replica
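
Recovery from a datanode failure can be sketched as a re-replication pass over the block map. The function name `handle_datanode_failure` and the node names are hypothetical, and the target of 3 replicas is a parameter assumed for the example.

```python
def handle_datanode_failure(block_locations, failed, live_nodes, replication=3):
    """Toy recovery: drop the failed node, then copy under-replicated blocks elsewhere."""
    for block_id, nodes in block_locations.items():
        if failed in nodes:
            nodes.remove(failed)
        # Any live node that does not yet hold this block can receive a new replica.
        candidates = [n for n in live_nodes if n != failed and n not in nodes]
        while len(nodes) < replication and candidates:
            nodes.append(candidates.pop(0))
    return block_locations


block_locations = {
    "b1": ["dn1", "dn2", "dn3"],
    "b2": ["dn2", "dn3", "dn4"],
}
live = ["dn1", "dn3", "dn4", "dn5"]  # dn2 has failed
recovered = handle_datanode_failure(block_locations, "dn2", live)
```

Both blocks end up with three replicas again, none of them on the failed node.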
Tackling Namenode failure
• If the namenode fails, all metadata is lost
• Files can no longer be reconstructed from their blocks

• How to handle?

• Maintain a replica of the metadata on another, passive machine

• If the active namenode fails, start the passive namenode

• The passive namenode needs to load the namespace into memory before it can serve requests
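
One way to picture "loading the namespace" is replaying a log of metadata changes on the passive machine. The sketch below assumes such a log exists as a simple list of tuples; the edit format and the names `apply_edit` and `load_namespace` are hypothetical and do not reflect the on-disk formats Hadoop uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged metadata change to an in-memory namespace."""
    op, path, blocks = edit
    if op == "create":
        namespace[path] = list(blocks)
    elif op == "delete":
        namespace.pop(path, None)


def load_namespace(edit_log):
    """Rebuild the namespace on the passive namenode by replaying every edit in order."""
    namespace = {}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


# Edits recorded while the active namenode was running (assumed replicated to the standby).
edit_log = [
    ("create", "/foo/bar", ["3df2"]),
    ("create", "/tmp/x", ["a1"]),
    ("delete", "/tmp/x", []),
]
standby_namespace = load_namespace(edit_log)
```

The replay takes time proportional to the size of the log, which is why the passive namenode cannot take over instantly.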


HDFS Federation

• The namenode keeps a reference to every file and block in the filesystem in memory

• For a very large cluster, the namenode may run out of memory to hold the metadata

• Solution: add more namenodes to the cluster, each managing a portion of the filesystem namespace
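
A back-of-the-envelope estimate shows why memory becomes the limit. The figure of roughly 150 bytes per file or block object is a commonly quoted rule of thumb and is treated here as an assumption; directories are ignored, and the function names are hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object


def namenode_memory_bytes(num_files, blocks_per_file, bytes_per_object=BYTES_PER_OBJECT):
    """Rough namenode memory needed for num_files files of blocks_per_file blocks each."""
    objects = num_files + num_files * blocks_per_file  # one object per file plus one per block
    return objects * bytes_per_object


def per_namenode_bytes(total_bytes, num_namenodes):
    """With federation, assume the metadata is split evenly across the namenodes."""
    return total_bytes // num_namenodes


one_million = namenode_memory_bytes(1_000_000, 1)
one_billion = namenode_memory_bytes(1_000_000_000, 1)
share = per_namenode_bytes(one_billion, 4)
```

Under these assumptions one million single-block files need about 300 MB, while one billion need about 300 GB, or about 75 GB per namenode if four namenodes share the namespace evenly.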


HDFS Cluster Balancing
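• When copying data into HDFS, it is important to keep data storage balanced

• Why?
• HDFS works best when blocks are spread evenly across the cluster

• Example:
• In distcp, m is the number of map tasks that do the copying (set with -m)
• If m = 1, a single task does all the copying
• It will be slow
• Resources are poorly utilized
• The first replica of each block is written to the node running that task, so storage becomes unbalanced

• By default, distcp uses up to 20 map tasks (m = 20).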
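
The effect of m on balance can be seen in a small simulation. It assumes that the first replica of a block lands on the node running the copying task, that blocks are dealt to tasks round-robin, and that task i runs on node i modulo the cluster size; all three are simplifications, and `first_replica_counts` is a hypothetical helper.

```python
from collections import Counter


def first_replica_counts(num_blocks, m, num_nodes):
    """Count how many first replicas each node receives when m tasks copy num_blocks blocks."""
    counts = Counter()
    for block in range(num_blocks):
        task = block % m         # assumption: blocks are dealt round-robin to the m tasks
        node = task % num_nodes  # assumption: task i runs on node i mod num_nodes
        counts["node-" + str(node)] += 1
    return counts


single_task = first_replica_counts(1000, 1, 20)
twenty_tasks = first_replica_counts(1000, 20, 20)
```

With m = 1 all 1000 first replicas pile up on one node, while with m = 20 each of the 20 nodes receives 50.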


Block Caching
• Normally, a datanode reads blocks from disk

• Frequently accessed blocks can be cached in the datanode's RAM

• By default, a block is cached in only one datanode's memory

• Job schedulers try to run tasks on the datanode where the block is cached
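
A cache-aware placement decision might look like the sketch below. The function `pick_node` and both lookup tables are hypothetical; real schedulers weigh many more factors, such as free task slots and rack locality.

```python
def pick_node(block_id, cached_on, replicas_on):
    """Prefer the datanode that has the block in memory, else fall back to a disk replica.

    cached_on:   block id -> the single datanode caching the block in RAM
    replicas_on: block id -> datanodes holding the block on disk
    """
    if block_id in cached_on:
        return cached_on[block_id], "memory"
    return replicas_on[block_id][0], "disk"


cached_on = {"b1": "dn2"}
replicas_on = {"b1": ["dn1", "dn2", "dn3"], "b2": ["dn3", "dn4", "dn5"]}

hot = pick_node("b1", cached_on, replicas_on)
cold = pick_node("b2", cached_on, replicas_on)
```

Block `b1` is read from memory on `dn2`, while `b2`, which is not cached anywhere, is read from disk on one of its replicas.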
Filesystem Operations
Filesystem Operations
• Major filesystem operations:
• Reading files, creating directories, moving files, deleting data, and listing directories

• Hadoop filesystem commands can be run from the command line

• To see detailed help for every command:

hadoop fs -help
Filesystem Operations
• Copying a file from the local filesystem to HDFS
hadoop fs -copyFromLocal local-source-file hdfs-dest-file

• Copying a file from HDFS to the local filesystem
hadoop fs -copyToLocal source-file dest-file
Filesystem Operations
• Creating a directory
hadoop fs -mkdir mydir

• Listing the files (with no path given, the user's HDFS home directory is listed)
hadoop fs -ls
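
The same commands can be driven from a script. The sketch below only builds the argument list by default, so it can be inspected without a Hadoop installation; the helper `hadoop_fs` is hypothetical, and running it with `run=True` assumes that the `hadoop` executable is on the PATH.

```python
import subprocess


def hadoop_fs(*args, run=False):
    """Build (and optionally run) a `hadoop fs` command line."""
    cmd = ["hadoop", "fs", *args]
    if run:
        # check=True raises CalledProcessError if the command exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


mkdir_cmd = hadoop_fs("-mkdir", "mydir")
put_cmd = hadoop_fs("-copyFromLocal", "data.txt", "mydir/data.txt")
ls_cmd = hadoop_fs("-ls", "mydir")
```

The file name `data.txt` is a placeholder; on a machine with Hadoop configured, passing `run=True` would execute each step in turn.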
