Unit 3 Part 1
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over
several machines and replicated to ensure durability in the face of failures and high availability
for parallel applications.
It is cost effective because it uses commodity hardware. HDFS is built around the concepts of
blocks, data nodes and the name node. However, HDFS is not a good fit in the following cases:
o Low-latency data access: Applications that require very fast access to the first
record should not use HDFS, because HDFS is optimized for high throughput over the
whole data set rather than the time taken to fetch the first record.
o Lots of small files: The name node holds the metadata of all files in memory, so a
very large number of small files consumes a disproportionate amount of the name
node's memory, which is not feasible.
o Multiple writes: HDFS should not be used when files need to be written by multiple
writers or modified repeatedly at arbitrary points; HDFS files are written once and
only appended to.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike in an ordinary file system, a file in
HDFS that is smaller than the block size does not occupy a full block's worth of storage;
for example, a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space
(see the sketch after this list). The HDFS block size is kept large to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the
master. The name node is the controller and manager of HDFS, as it knows the status and
the metadata of all the files in HDFS; this metadata includes file permissions, names and
the location of each block. The metadata is small, so it is stored in the memory of the
name node, allowing faster access to it. Moreover, since the HDFS cluster is accessed by
multiple clients concurrently, all this information is handled by a single machine. File
system operations such as opening, closing and renaming are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by clients or the
name node. They report back to the name node periodically with the list of blocks that
they are storing. The data nodes, which run on commodity hardware, also perform block
creation, deletion and replication as instructed by the name node.
4. Secondary Name Node: Since all the metadata is stored in the name node, the name node
is very important. If it fails, the file system cannot be used, as there would be no way of
knowing how to reconstruct the files from the blocks present on the data nodes. To
overcome this, the concept of the secondary name node was introduced. It is a separate
physical machine which acts as a helper to the name node. It performs periodic
checkpoints: it communicates with the name node and takes snapshots of the metadata,
which helps minimize downtime and loss of data.
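To make the point about blocks in item 1 concrete, here is a minimal Java sketch (the file path is a hypothetical example) that prints a file's actual length alongside the block size recorded for it; a 5 MB file reports a length of about 5 MB even though its block size is 128 MB.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the file system configured for this client (HDFS in a cluster setup)
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical small file assumed to already exist in HDFS
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/small.txt"));

        System.out.println("actual length (bytes): " + status.getLen());       // ~5 MB for a 5 MB file
        System.out.println("block size    (bytes): " + status.getBlockSize()); // 134217728 = 128 MB by default
    }
}
```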
HDFS Read Operation: A data read request is served by HDFS, the NameNode and the DataNodes. Let us
call the reader a ‘client’. The steps below describe the file read operation in Hadoop.
1. A client initiates a read request by calling the ‘open()’ method of the FileSystem object;
in HDFS this is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as
the locations of the blocks of the file. Note that these addresses are for the first few
blocks of the file. In response to this metadata request, the addresses of the DataNodes
holding a copy of each block are returned.
3. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream wraps a DFSInputStream, which takes care of
the interactions with the DataNodes and the NameNode. The client then invokes the
‘read()’ method, which causes DFSInputStream to establish a connection with the first
DataNode holding the first block of the file.
4. Data is read in the form of streams: the client invokes the ‘read()’ method repeatedly,
and this read() process continues until the end of the block is reached.
5. Once the end of a block is reached, DFSInputStream closes the connection and moves
on to locate the next DataNode for the next block.
6. Once the client has finished reading, it calls the close() method. A minimal Java sketch of this read path follows.
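The sketch below illustrates the client-side read path described above under simple assumptions: the file path is hypothetical, and FileSystem.get() returns a DistributedFileSystem instance when fs.defaultFS points at an HDFS cluster. It is an illustration, not the internal Hadoop code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // a DistributedFileSystem for an HDFS URI

        Path file = new Path("/user/hadoop/temp.txt"); // hypothetical file
        try (FSDataInputStream in = fs.open(file)) {   // steps 1-3: open() returns an FSDataInputStream
            // steps 4-5: read() is invoked repeatedly (here via copyBytes) until all blocks are consumed
            IOUtils.copyBytes(in, System.out, 4096, false);
        }                                              // step 6: close() via try-with-resources
    }
}
```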
We can run ‘$HADOOP_HOME/bin/hdfs dfs -help’ to get detailed help on every command.
Here, ‘dfs’ is a shell command of HDFS which supports multiple subcommands.
Some of the widely used commands are listed below along with some details of each one.
For example, ‘$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /’ copies the file temp.txt from the
local filesystem to HDFS. Listing the root directory with ‘$HADOOP_HOME/bin/hdfs dfs -ls /’ then shows
the file ‘temp.txt’ (copied earlier) under the ‘/’ directory.
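The same two operations can also be performed through the Java FileSystem API. This is a minimal sketch and the paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyAndList {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // roughly equivalent to: hdfs dfs -copyFromLocal temp.txt /
        fs.copyFromLocalFile(new Path("temp.txt"), new Path("/temp.txt"));

        // roughly equivalent to: hdfs dfs -ls /
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```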
Below are some of Hadoop's drawbacks, along with solutions showing how each problem can be overcome.
Reading through these limitations and their solutions will help you work smoothly with Hadoop.
• Issues with Small Files
• Slow Processing Speed
• No Real-Time Data Processing
• No Iterative Processing
• Ease of Use
• Security Problem
Let’s discuss these Hadoop limitations in detail -
1. Issues with Small Files
The idea behind Hadoop was to have a small number of large files; if you have many small files,
Hadoop cannot manage them well. Small files are those whose size is much smaller than the block
size in Hadoop. Each file, directory, and block occupies a memory element inside the NameNode’s
memory, and as a rule of thumb this memory element is about 150 bytes. So 10 million files, each
using a block, would occupy roughly 10,000,000 x 150 bytes ≈ 1.39 GB of memory, and scaling far
beyond this level is not possible with current hardware. Retrieving small files is also very
inefficient in Hadoop: at the back end it causes many disk seeks and constant hopping from one
data node to another to fetch each small file.
Solution
Hadoop Archives, or HAR files, are one solution to the small files problem. Hadoop archives act as
another layer of file system over Hadoop, and with the Hadoop archive command we can build HAR files.
This command runs a MapReduce job in the background to pack the archived files into a small number
of HDFS files. But reading through HAR files is not much more efficient than reading through HDFS,
because each file access requires reading two index files and then, finally, the data file itself.
Sequence files are another solution to the small files problem. In this approach, we write a program
to merge a number of small files into one sequence file (a minimal sketch is given below). We then
process this sequence file in a streaming fashion; MapReduce can break the sequence file into chunks
and process the chunks in parallel.
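The Java sketch below packs each small local file into one SequenceFile entry, keyed by the original file name. The local directory and HDFS output path are hypothetical, and this is only one possible way to build such a file.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/user/hadoop/smallfiles.seq");   // hypothetical HDFS output

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // hypothetical local directory holding the many small files
            for (File f : new File("small-files").listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                // key = original file name, value = raw file contents
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        }
    }
}
```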
2. Slow Processing Speed
In Hadoop, MapReduce reads and writes data to and from disk. For every stage of processing, the
data gets read from disk and written back to disk. These disk seeks take time, making the whole
process very slow. If Hadoop processes data in small volumes, it is comparatively very slow; it is
ideal for large data sets. Because Hadoop has a batch processing engine at its core, its speed for
real-time processing is low, and it is slow in comparison with newer frameworks such as Spark and Flink.
Solution
Spark is the solution to the slow processing speed of MapReduce. It does in-memory computation,
which makes it up to a hundred times faster than Hadoop. While processing, Spark reads data from
RAM and writes data to RAM, thereby making it a fast processing tool.
Flink is one more technology that is faster than Hadoop MapReduce, as it also does in-memory
computation. Flink is even faster than Spark, due to the stream processing engine at its core.
3. No Real-Time Data Processing
Hadoop, with its core MapReduce framework, is unable to process real-time data. Hadoop processes
data in batches: first the user loads a file into HDFS, then runs a MapReduce job with the file as
input. It follows the ETL cycle of processing: the user extracts the data from the source, the data
gets transformed to meet the business requirements, and finally it is loaded into the data warehouse.
The users can then generate insights from this data, but this batch-oriented cycle takes far too long
for real-time requirements.
Solution
Spark has come up as a solution to the above problem. Spark supports real-time processing: it
processes the incoming streams of data by forming micro-batches and then applying transformations
on these micro-batches.
Flink is one more solution for the slow processing speed. It is even faster than Spark, as it has a
stream processing engine at its core. Flink is a true streaming engine with adjustable latency and
throughput, and it has a rich set of APIs that exploit the streaming runtime.
4. No Iterative Processing
Core Hadoop does not support iterative processing. Iterative processing requires a cyclic data flow,
in which the output of a previous stage serves as the input to the next stage. Hadoop MapReduce does
not support this: the data gets written to disk once and is then read multiple times to get insights.
Hadoop's MapReduce has a batch processing engine at its core and is not able to iterate through
data.
Solution
Spark supports iterative processing. In Spark, each iteration gets scheduled and executed separately;
it accomplishes iterative processing through a DAG, i.e. a Directed Acyclic Graph. Spark has RDDs,
or Resilient Distributed Datasets, which are collections of elements partitioned across the nodes of
the cluster. Spark creates RDDs from HDFS files, and we can also cache them, allowing the RDDs to be
reused. Iterative algorithms apply operations repeatedly over this cached data, as in the sketch below.
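A minimal Java sketch of this idea follows; the HDFS path and the filter condition are hypothetical. The RDD is built once from an HDFS file, cached, and then reused across several iterations instead of being re-read from disk each time.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build an RDD from an HDFS file and keep it in memory for reuse
            JavaRDD<String> lines = sc.textFile("hdfs:///user/hadoop/data.txt").cache();

            long total = 0;
            for (int i = 0; i < 10; i++) {
                // each iteration reuses the cached partitions instead of re-reading HDFS
                total += lines.filter(line -> line.contains("error")).count();
            }
            System.out.println("total matches over 10 iterations: " + total);
        }
    }
}
```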
Flink also supports iterative processing. Flink iterates over data using its streaming architecture,
and we can instruct Flink to process only the data that has changed, thereby improving performance.
Flink implements iterative algorithms by defining a step function and embedding it into a special
iteration operator. The two variants of this operator are iterate and delta iterate; both apply the
step function over and over again until a termination condition is met.
5. Ease of Use
In Hadoop, we have to hand-code each and every operation. This has two drawbacks: first, it is
difficult to use, and second, it increases the number of lines of code. There is no interactive mode
available with Hadoop MapReduce, which also makes it difficult to debug, because it runs in batch
mode. In this mode we have to specify the jar file, the input, and the location of the output file,
and if the program fails somewhere in between, it is difficult to find the culprit code.
Solution
Spark is easy for the user as compared to Hadoop. This is because it has many APIs for Java,
Scala, Python, and Spark SQL. Spark performs batch processing, stream processing and
machine learning on the same cluster. This makes life easy for users. They can use the same
infrastructure for various workloads.
In Flink, a number of high-level operators are available, which reduces the number of lines of code
needed to achieve the same result.
6. Security Problem
Hadoop does not implement encryption and decryption at the storage or network levels, so it is not
very secure. For security, Hadoop relies on Kerberos authentication, which is difficult to maintain.
Solution
Spark encrypts temporary data written to local disk, but it does not support encryption of application
output data. Spark implements AES-based encryption for RPC connections; RPC authentication should
also be enabled for this to take effect.
Benefits of HDFS
1. Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-
shelf hardware, which cuts storage costs. Also, because HDFS is open source,
there's no licensing fee.
2. Large data set storage. HDFS stores a variety of data of any size -- from megabytes
to petabytes -- and in any format, including structured and unstructured data.
3. Fast recovery from hardware failure. HDFS is designed to detect faults and
automatically recover on its own.
4. Streaming data access. HDFS is built for high data throughput, which is best suited
to access to streaming data.
HDFS is designed to support applications that deal with large data sets. These applications write
their data only once, but they read it one or more times and require these reads to be satisfied at
streaming speeds; HDFS therefore supports write-once-read-many semantics on files.
A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks,
and if possible, each chunk resides on a different DataNode.
HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data
to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in
a set of DataNodes. The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These
machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language;
any machine that supports Java can run the NameNode or the DataNode software. Usage of the
highly portable Java language means that HDFS can be deployed on a wide range of machines. A
typical deployment has a dedicated machine that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of the DataNode software. The architecture does not
preclude running multiple DataNodes on the same machine but in a real deployment that is rarely
the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in
such a way that user data never flows through the NameNode.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar to
most other existing file systems; one can create and remove files, move a file from one directory to
another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not
support hard links or soft links. However, the HDFS architecture does not preclude implementing
these features.
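A minimal Java sketch of these namespace operations (the paths are hypothetical): creating a directory, moving a file into it with rename, and deleting it again.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/hadoop/reports"));                  // create a directory
        fs.rename(new Path("/user/hadoop/temp.txt"),
                  new Path("/user/hadoop/reports/temp.txt"));         // move/rename a file
        fs.delete(new Path("/user/hadoop/reports/temp.txt"), false);  // remove a file (non-recursive)
    }
}
```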
The NameNode maintains the file system namespace. Any change to the file system namespace or
its properties is recorded by the NameNode. An application can specify the number of replicas of a
file that should be maintained by HDFS. The number of copies of a file is called the replication factor
of that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and
replication factor are configurable per file.
All blocks in a file except the last block are the same size, while users can start a new block without
filling out the last block to the configured block size after the support for variable length block was
added to append and hsync.
An application can specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. Files in HDFS are write-once (except for appends and
truncates) and have strictly one writer at any time.
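As a minimal Java illustration (the path and contents are hypothetical), the replication factor can be specified when a file is created and changed later through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/report.csv");   // hypothetical file

        // specify a replication factor of 3 at creation time
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            out.writeBytes("hello,hdfs\n");
        }

        // change the replication factor of the existing file later
        fs.setReplication(file, (short) 2);
    }
}
```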
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.
Replica Placement
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica
placement distinguishes HDFS from most other distributed file systems. This is a feature that needs
lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve
data reliability, availability, and network bandwidth utilization. The current implementation for the
replica placement policy is a first effort in this direction. The short-term goals of implementing this
policy are to validate it on production systems, learn more about its behavior, and build a foundation
to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks.
Communication between two nodes in different racks has to go through switches. In most cases,
network bandwidth between machines in the same rack is greater than network bandwidth between
machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop
Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents
losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading
data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on
component failure. However, this policy increases the cost of writes because a write needs to transfer
blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put one
replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another
replica on a node in a different (remote) rack, and the last on a different node in the same remote
rack. This policy cuts the inter-rack write traffic which generally improves write performance. The
chance of rack failure is far less than that of node failure; this policy does not impact data reliability
and availability guarantees. However, it does reduce the aggregate network bandwidth used when
reading data since a block is placed in only two unique racks rather than three. With this policy, the
replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two
thirds of replicas are on one rack, and the other third are evenly distributed across the remaining
racks. This policy improves write performance without compromising data reliability or read
performance.
If the replication factor is greater than 3, the placement of the 4th and following replicas is
determined randomly while keeping the number of replicas per rack below the upper limit (which is
basically (replicas - 1) / racks + 2).
Because the NameNode does not allow DataNodes to have multiple replicas of the same block, the
maximum number of replicas created is the total number of DataNodes at that time.
After the support for Storage Types and Storage Policies was added to HDFS, the NameNode takes
the policy into account for replica placement in addition to the rack awareness described above. The
NameNode chooses nodes based on rack awareness at first, then checks that the candidate node
has the storage required by the policy associated with the file. If the candidate node does not have
the storage type, the NameNode looks for another node. If enough nodes to place replicas cannot be
found in the first path, the NameNode looks for nodes having fallback storage types in the second
path.
The current, default replica placement policy described here is a work in progress.
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request
from a replica that is closest to the reader. If there exists a replica on the same rack as the reader
node, then that replica is preferred to satisfy the read request. If HDFS cluster spans multiple data
centers, then a replica that is resident in the local data center is preferred over any remote replica.
Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does
not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and
Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a
DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered
safely replicated when the minimum number of replicas of that data block has checked in with the
NameNode. After a configurable percentage of safely replicated data blocks checks in with the
NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then
determines the list of data blocks (if any) that still have fewer than the specified number of replicas.
The NameNode then replicates these blocks to other DataNodes.
The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the
EditLog to persistently record every change that occurs to file system metadata. For example,
creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this.
Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog.
The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system
namespace, including the mapping of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the
FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can
then truncate the old EditLog because its transactions have been applied to the persistent FsImage.
This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a
consistent view of the file system metadata by taking a snapshot of the file system metadata and
saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make
incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the
edits in the Editlog. During the checkpoint the changes from Editlog are applied to the FsImage. A
checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period)
expressed in seconds, or after a given number of filesystem transactions have accumulated
(dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be
reached triggers a checkpoint.
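As a small illustrative Java sketch, these two checkpoint thresholds can be inspected from the loaded Hadoop configuration. The fallback values shown are the commonly cited defaults and are an assumption here, not values read from any particular cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        // Loads the Hadoop configuration resources available on the classpath
        Configuration conf = new Configuration();

        // Fallbacks below are commonly cited defaults (assumption; verify against your hdfs-default.xml)
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        long txnThreshold  = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);

        System.out.println("checkpoint every " + periodSeconds + " s or " + txnThreshold + " transactions");
    }
}
```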
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge
about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The
DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine
the optimal number of files per directory and creates subdirectories appropriately. It is not optimal
to create all local files in the same directory because the local file system might not be able to
efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans
through its local file system, generates a list of all the HDFS data blocks that correspond to each of
these local files, and sends this report to the NameNode.