Big Data Processing
Jiaul Paik
Lecture 7
Storing Big Data in Cluster
Hadoop Distributed Filesystem
HDFS (Hadoop) Architecture
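[Figure: HDFS architecture. The namenode is the master node.]
• Application → HDFS client → HDFS namenode: (file name, block id)
• HDFS namenode → HDFS client: (block id, block location)
• HDFS namenode holds the file namespace, e.g. /foo/bar → block 3df2
• HDFS namenode → HDFS datanode: instructions to datanode
• HDFS datanode → HDFS namenode: datanode state
• HDFS client → HDFS datanode: (block id, byte range)
• HDFS datanode → HDFS client: block data
• Each HDFS datanode stores its blocks on a local Linux file system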
(Ghemawat et al., SOSP 2003)
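The two lookups held by the namenode in the figure can be sketched as a pair of in-memory maps. This is a minimal illustration, not HDFS code: the class name `ToyNamenode`, the datanode names, and the second block id `a91c` are made up for the example; only `/foo/bar` and block `3df2` come from the figure.

```python
class ToyNamenode:
    """Hypothetical stand-in for the namenode's in-memory metadata."""

    def __init__(self):
        # File namespace: file name -> ordered list of block ids.
        self.namespace = {}
        # Block map: block id -> datanodes holding a replica.
        self.block_locations = {}

    def add_file(self, name, blocks):
        """blocks: list of (block_id, [datanode, ...]) pairs."""
        self.namespace[name] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = list(nodes)

    def get_block_locations(self, name):
        """Answer a client request with (block id, locations) per block."""
        return [(b, self.block_locations[b]) for b in self.namespace[name]]


namenode = ToyNamenode()
namenode.add_file("/foo/bar", [("3df2", ["dn1", "dn2", "dn3"]),
                               ("a91c", ["dn2", "dn3", "dn4"])])
locations = namenode.get_block_locations("/foo/bar")
print(locations)
```

Note that the namenode only answers metadata questions here; block data itself never passes through it.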
HDFS
[Figure: Hadoop cluster layout]
• namenode machine: runs the namenode daemon
• job submission node: runs the jobtracker
• each slave node: runs a tasktracker and a datanode daemon on top of the Linux file system
HDFS
Reading and Writing
Dataflow: Reading data from HDFS
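[Figure: read dataflow. A client uses the HDFS client library (DistributedFileSystem and FSDataInputStream). DistributedFileSystem contacts the NameNode (step 2: get block locations), and FSDataInputStream then reads the blocks from the DataNodes.]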
Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White
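The read path can be sketched as follows: ask the namenode for block locations (step 2 in the figure), then fetch each block directly from a datanode. This is a toy model under stated assumptions: the block ids, datanode names, and contents are invented, and picking the first live datanode is a simplification of how a real client chooses a replica.

```python
# Toy cluster state (all names and contents are made up for illustration).
BLOCK_LOCATIONS = {            # metadata held by the namenode
    "/foo/bar": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}
DATANODES = {                  # block data stored on each datanode
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}


def read_file(path, dead=()):
    """Read a file block by block, skipping datanodes listed in `dead`."""
    data = b""
    # Ask the namenode for the block locations of the file.
    for block_id, nodes in BLOCK_LOCATIONS[path]:
        # Read from the first reachable datanode that holds the block.
        for node in nodes:
            if node not in dead:
                data += DATANODES[node][block_id]
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return data


print(read_file("/foo/bar"))
print(read_file("/foo/bar", dead={"dn1"}))
```

Because every block has more than one replica, the second call still succeeds even though one datanode is treated as unreachable.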
Writing data to HDFS
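[Figure: write dataflow between the client (DistributedFileSystem, FSDataOutputStream), the NameNode, and a pipeline of three DataNodes]
1. Create: the client calls create on DistributedFileSystem
2. Create: DistributedFileSystem asks the NameNode to create the file
3. Write: the client writes to FSDataOutputStream
4. Write packet: packets are forwarded along the pipeline of datanodes
5. Ack packet: acknowledgements travel back along the pipeline
6. Close: the client closes the stream
7. Complete: the NameNode is told that the file is complete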
Adapted from: Hadoop: The Definitive Guide, 4th ed., Tom White
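Steps 4 and 5 of the write path (write packet, ack packet) can be sketched with a small simulation. It is an illustration only: the function names, the datanode names, and the 4-byte packet size are assumptions chosen to keep the example readable, and failure handling is left out.

```python
def write_packet(packet, pipeline, storage):
    """Send one packet down the pipeline and return the ack order.

    pipeline: ordered list of datanode names
    storage:  dict mapping datanode name -> list of packets stored
    """
    # Step 4: each datanode stores the packet and forwards it to the next.
    for node in pipeline:
        storage[node].append(packet)
    # Step 5: acks travel back from the last datanode to the first.
    return list(reversed(pipeline))


def write_file(data, pipeline, packet_size=4):
    """Split `data` into packets and push each one through the pipeline."""
    storage = {node: [] for node in pipeline}
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        acks = write_packet(packet, pipeline, storage)
        # A packet counts as written only once every datanode has acked it.
        assert set(acks) == set(pipeline)
    return storage


stored = write_file(b"abcdefghij", ["dn1", "dn2", "dn3"])
print(stored["dn3"])
```

Every datanode in the pipeline ends up with the same sequence of packets, which is what gives each block its replicas.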
Managing Hadoop: Other Key Issues
• Node failure
• HDFS federation (for namenode memory limits)
• Cluster Balancing
• Data Caching
Node failures
• Namenode failure: all the files in the filesystem are lost, since they cannot be reconstructed without the metadata
• Datanode failure: usually not a problem, since each data block is stored on several machines and can be recovered from another machine
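Recovery from a datanode failure can be sketched as re-copying every affected block from a surviving replica. This is a toy policy under stated assumptions: the function name and node names are hypothetical, the target of 3 replicas matches the usual HDFS default, and new holders are picked alphabetically rather than by the real placement rules.

```python
def handle_datanode_failure(block_map, failed, live_nodes, replication=3):
    """Drop the failed node and re-replicate blocks that fall below target.

    block_map: dict mapping block id -> set of datanodes holding a replica
    """
    for block_id, holders in block_map.items():
        holders.discard(failed)
        if not holders:
            raise RuntimeError(f"all replicas of {block_id} lost")
        # Copy from a surviving replica to live nodes that lack the block.
        candidates = sorted(n for n in live_nodes if n not in holders)
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return block_map


blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
live = ["dn1", "dn3", "dn4", "dn5"]          # dn2 has failed
recovered = handle_datanode_failure(blocks, "dn2", live)
print(recovered)
```

The error branch shows the one case where a datanode failure does matter: when the failed node held the last remaining replica of a block.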
Tackling Namenode failure
• If the namenode fails, all metadata is lost
• The files cannot be reconstructed from the blocks alone
• How to handle this?
• Maintain a replica of the metadata on another, passive machine
• If the active namenode fails, start the passive namenode
• The passive namenode needs to load the namespace into memory before it can take over
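The active/passive idea can be sketched in a few lines. This is a simplified model, not the HDFS high-availability mechanism: the class `FailoverNamenode`, the dictionary standing in for the metadata replica, and the JSON encoding are all assumptions made for the example.

```python
import json


class FailoverNamenode:
    """Hypothetical namenode that keeps its namespace in memory."""

    def __init__(self):
        self.namespace = {}      # in-memory metadata: file -> block ids
        self.active = False

    def add_file(self, name, block_ids, replica_store):
        self.namespace[name] = block_ids
        # Persist every change so a passive node can take over later.
        replica_store["image"] = json.dumps(self.namespace)

    def start_from_replica(self, replica_store):
        # The namespace must be loaded into memory before serving requests.
        self.namespace = json.loads(replica_store["image"])
        self.active = True


replica_store = {}               # stands in for the metadata replica
primary = FailoverNamenode()
primary.active = True
primary.add_file("/foo/bar", ["blk_1", "blk_2"], replica_store)

standby = FailoverNamenode()     # passive until the primary fails
standby.start_from_replica(replica_store)
print(standby.namespace)
```

The load step in `start_from_replica` is where the startup delay comes from: the larger the namespace, the longer the passive node takes to become usable.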
HDFS Federation
• The namenode keeps a reference to every file and block in the filesystem in memory
• For a very large cluster, the namenode may run out of memory to hold the metadata
• Solution: add more namenodes to the cluster, each managing a portion of the filesystem namespace
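One way to picture several namenodes sharing the work is a table that maps path prefixes to the namenode responsible for them. The sketch below is illustrative only: the prefixes `/user` and `/share`, the namenode names, and the function `namenode_for` are hypothetical, and it does not reproduce how a real federated cluster is configured.

```python
# Hypothetical mount table: path prefix -> namenode responsible for it.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/share": "namenode-2",
}


def namenode_for(path):
    """Return the namenode whose prefix is the longest match for `path`."""
    matches = [p for p in MOUNT_TABLE
               if path == p or path.startswith(p + "/")]
    if not matches:
        raise KeyError(f"no namenode manages {path}")
    return MOUNT_TABLE[max(matches, key=len)]


print(namenode_for("/user/alice/data.txt"))
print(namenode_for("/share/logs"))
```

Each namenode then only needs enough memory for the files and blocks under its own prefixes.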
HDFS Cluster Balancing
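• When copying data into HDFS, it is important to keep data storage balanced
• Why?
• HDFS works best when blocks are spread evenly across the cluster
• Example: distcp, where m is the number of map tasks (-m option)
• If m = 1, a single task does all the copying
• It will be slow
• It makes poor use of cluster resources
• The first replica of each block lands on the node running that task, so storage becomes unbalanced
• The default value of m is 20 in Hadoop.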
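The effect of m on balance can be illustrated with a small simulation. It rests on two stated assumptions: map task i runs on node i modulo the number of nodes, and the first replica of a block is stored on the node running the task that writes it. The function name, the four-node cluster, and the block count are made up for the example.

```python
def first_replica_counts(num_blocks, num_maps, nodes):
    """Count first replicas per node when `num_maps` tasks share the copy."""
    counts = {node: 0 for node in nodes}
    for block in range(num_blocks):
        task = block % num_maps              # blocks split evenly over tasks
        counts[nodes[task % len(nodes)]] += 1
    return counts


nodes = [f"dn{i}" for i in range(1, 5)]
single = first_replica_counts(1000, 1, nodes)
default = first_replica_counts(1000, 20, nodes)
print(single)
print(default)
```

With one map task, one node receives every first replica; with 20 map tasks spread over the four nodes, the first replicas are divided evenly.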
Block Caching
• Normally, a datanode reads blocks from disk
• Frequently accessed blocks can be cached in the datanode's RAM
• By default, a block is cached in only one datanode's memory
• Job schedulers try to run tasks on the datanode where the block is cached
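The scheduling preference can be sketched as a simple ordering: a node with the block cached in memory first, then a node with the block on disk, then any free node. This is an illustration rather than the logic of any real Hadoop scheduler; `pick_node`, the node names, and the fallback rule are assumptions.

```python
def pick_node(block_id, replicas, cached_on, free_nodes):
    """Pick a node for a task that reads `block_id`.

    replicas:   block id -> datanodes holding the block on disk
    cached_on:  block id -> the single datanode caching it in RAM
    free_nodes: set of nodes that can currently run a task
    """
    cached = cached_on.get(block_id)
    if cached in free_nodes:
        return cached, "memory"
    for node in replicas[block_id]:
        if node in free_nodes:
            return node, "disk"
    # No local copy available: fall back to a remote read.
    return sorted(free_nodes)[0], "remote"


replicas = {"blk_1": ["dn1", "dn2", "dn3"]}
cached_on = {"blk_1": "dn2"}     # a block is cached on only one datanode
print(pick_node("blk_1", replicas, cached_on, {"dn1", "dn2", "dn4"}))
print(pick_node("blk_1", replicas, cached_on, {"dn1", "dn4"}))
print(pick_node("blk_1", replicas, cached_on, {"dn4"}))
```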
Filesystem Operations
• Major Filesystem operations:
• reading files, creating directories, moving files, deleting data, and
listing directories.
• Hadoop commands can be run from the command line
• To get detailed help on every command:
hadoop fs -help
• Copying a file from the local filesystem to HDFS
hadoop fs -copyFromLocal source-file dest-file
• Copying a file to the local filesystem from HDFS
hadoop fs -copyToLocal source-file dest-file
• Creating a directory
hadoop fs -mkdir mydir
• Listing the files
hadoop fs -ls
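The same commands can be assembled from a script. The sketch below only builds and prints the command lines; it does not run them. The helper `hadoop_fs` and the target path `mydir/file-1` are hypothetical, and actually executing the commands assumes a machine where the `hadoop` command is installed and configured.

```python
import shlex


def hadoop_fs(option, *args):
    """Build the argument list for a `hadoop fs` command (does not run it)."""
    return ["hadoop", "fs", f"-{option}", *args]


commands = [
    hadoop_fs("mkdir", "mydir"),
    hadoop_fs("copyFromLocal", "file-1", "mydir/file-1"),
    hadoop_fs("ls", "mydir"),
]
for cmd in commands:
    print(shlex.join(cmd))
    # To execute on a machine with Hadoop installed (assumption):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.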