Unit_3_Big Data

The document explains the concept of racks in Hadoop, detailing how data nodes are organized and the importance of rack awareness for efficient data communication and fault tolerance. It covers HDFS data read/write operations, the process of data replication, and the MapReduce framework, including job scheduling and the anatomy of a MapReduce job. Additionally, it discusses the advantages and challenges of Hadoop, including its limitations in real-time processing and security issues.

Unit 3
What is a Rack?
 A rack is a collection of around 40-50 DataNodes connected to the same network switch.
 If the switch goes down, the whole rack becomes unavailable.
 A large Hadoop cluster is deployed across multiple racks.
Rack Awareness in Hadoop HDFS
 In a large Hadoop cluster, there are multiple racks.
 Each rack consists of DataNodes.
 Communication between DataNodes on the same rack is more efficient than communication between DataNodes residing on different racks.
 To reduce network traffic during file read/write, the NameNode chooses the DataNode closest to the client for serving the read/write request.
 The NameNode maintains the rack ID of each DataNode to obtain this rack information.
 This concept of choosing the closest DataNode based on rack information is known as Rack Awareness.
Rack awareness policies
 The policies state that:
 No more than one replica is placed on any one node.
 No more than two replicas are placed on the same rack.
 Also, the number of racks used for block replication should always be smaller than the number of replicas.
Why Rack Awareness?
 The reasons for Rack Awareness in Hadoop are:
 To reduce network traffic during file read/write, which improves cluster performance.
 To achieve fault tolerance even when a rack goes down.
 To achieve high availability of data, so that data remains available even in unfavorable conditions.
 To reduce latency, that is, to complete file read/write operations with lower delay.
HDFS Data Read and Write Operations
HDFS Data Read and Write Operations
 HDFS follows the write-once-read-many (WORM) model.
 We cannot edit files already stored in HDFS, but we can append data by reopening the file.
 In a read/write operation, the client first interacts with the NameNode.
 The NameNode grants the required privileges, so the client can then read and write data blocks directly from/to the respective DataNodes.
Write Operation in HDFS
Write Operation in HDFS
 A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file - Step 1 in the previous diagram.
 The DistributedFileSystem object connects to the NameNode using RPC and initiates new file creation.
 It is the responsibility of the NameNode to verify that the file being created does not already exist and that the client has the correct permissions to create a new file.
 If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
 Once the new record is created in the NameNode, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (Step 3 in the diagram).
Write Operation in HDFS
 FSDataOutputStream contains a DFSOutputStream object which looks after communication with the DataNodes and the NameNode.
 While the client continues writing data, DFSOutputStream keeps creating packets from this data. These packets are enqueued into a queue called the DataQueue.
 Another component, the DataStreamer, consumes this DataQueue.
 The DataStreamer also asks the NameNode to allocate new blocks, thereby picking desirable DataNodes to be used for replication.
 Now the process of replication starts by creating a pipeline of DataNodes. In our case, we have chosen a replication level of 3, and hence there are 3 DataNodes in the pipeline.
 The DataStreamer pours packets into the first DataNode in the pipeline.
 Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.
Write Operation in HDFS
 Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets that are waiting for acknowledgment from the DataNodes.
 Once acknowledgment for a packet is received from all DataNodes in the pipeline, the packet is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
 After the client is done writing data, it calls the close() method (Step 9 in the diagram). The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgment.
 Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
Read Operation in HDFS
Java Interfaces to HDFS
 Java code for writing a file in HDFS (this fragment assumes a configured Configuration object named conf and a local file path in a String variable named source):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fileSystem = FileSystem.get(conf);

// Check if the file already exists
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}

// Create a new file and write data to it.
FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
}

// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
Java Interfaces to HDFS
 Java code for reading a file from HDFS (again assuming a configured Configuration object named conf):

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}

FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // code to manipulate the data which is read
    System.out.println(new String(b, 0, numBytes));
}
in.close();
fileSystem.close();
Read Operation in HDFS
 A client initiates a read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
 This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file.
 In response to this metadata request, the addresses of the DataNodes holding a copy of each block are returned.
 Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client.
 FSDataInputStream contains a DFSInputStream which takes care of interactions with the DataNodes and the NameNode.
Read Operation in HDFS
 In Step 4 shown in the previous diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
 Data is read in the form of streams, with the client invoking 'read()' repeatedly. This read() process continues until it reaches the end of the block.
 Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
 Once the client is done with the reading, it calls the close() method.
Benefits and Challenges
Challenges for HDFS
 No real-time processing: Hadoop, with its core MapReduce framework, is unable to process real-time data. Hadoop processes data in batches: first the user loads the file into HDFS, then the user runs a MapReduce job with the file as input.
 Security issues: Hadoop does not implement encryption/decryption at the storage or network levels.
 No caching: MapReduce cannot cache intermediate data in memory for further use.
 Not easy to use, and requires lengthy code.
Data Replication
 Replication ensures the availability of the data.
 Replication means making copies of something; the number of copies we make of a particular thing is its Replication Factor.
Data Replication
 As HDFS stores data in the form of blocks, Hadoop is also configured to make copies of those file blocks.
 By default, the Replication Factor in Hadoop is set to 3, and this can be configured.
 We need this replication for our file blocks because Hadoop runs on commodity hardware (inexpensive system hardware), which can crash at any time.
 For large organizations, the data is far more important than the storage cost, so the extra storage is an acceptable trade-off.
 You can configure the Replication Factor in the hdfs-site.xml file, or per file from the Java API, as in the sketch below.
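A minimal sketch of adjusting the replication factor programmatically, assuming an existing HDFS file at an illustrative path; the dfs.replication property set here mirrors the hdfs-site.xml setting mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Equivalent to setting dfs.replication in hdfs-site.xml, but it only
    // affects files created through this client.
    conf.setInt("dfs.replication", 2);

    FileSystem fileSystem = FileSystem.get(conf);
    // Change the replication factor of an existing file (path is illustrative).
    fileSystem.setReplication(new Path("/path/to/file.ext"), (short) 2);
    fileSystem.close();
  }
}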
Data Replication
 We are not using supercomputers for our Hadoop setup.
 That is why we need a feature in HDFS that makes copies of file blocks for backup purposes; this is known as fault tolerance.
How MapReduce Works?
Data Flow
 MapReduce is used to process huge amounts of data.
 To handle the incoming data in a parallel and distributed manner, the data has to flow through several phases:
Data Flow
Input Reader:
 The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB).
 Once the input reader has read the data, it generates the corresponding key-value pairs.
 The input files reside in HDFS.
Map Function:
 The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs.
 The map input and output types may be different from each other.
Partition Function:
 The partition function assigns the output of each map function to the appropriate reducer.
 It is given the key and value produced by the map function.
 It returns the index of the reducer that should receive that pair (see the sketch after this list).
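As a rough illustration of the partition function, here is a minimal custom Partitioner sketch (the class name and key/value types are assumptions, not from the slides); it mimics the usual hash-based behaviour by mapping each key to a reducer index.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: maps each key to one of the numReduceTasks reducers.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the index is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}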
Data Flow
Shuffling and Sorting:
 The data is shuffled between nodes so that it moves out of the map phase and is ready for the reduce function.
 A sorting operation is performed on the input data for the reduce function.
Reduce Function:
 The reduce function is invoked for each unique key.
 These keys are already arranged in sorted order.
 The values associated with each key are iterated by the reduce function, which generates the corresponding output.
Output Writer:
 Once the data has flowed through all the above phases, the output writer executes.
 The role of the output writer is to write the reduce output to stable storage.
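To make the map and reduce phases above concrete, here is a minimal word-count sketch using the Hadoop MapReduce API (the class names are illustrative, not part of the original slides): the map function emits (word, 1) pairs and the reduce function sums the counts for each unique key.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map function: emits (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce function: receives each unique word with all its counts and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}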
Job Scheduling
Job Scheduling
 The scheduler performs scheduling based on the resource requirements of the applications.
 It has pluggable policies that are responsible for partitioning the cluster resources among the various queues, applications, etc.
 There are mainly 3 types of schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler.
2. Capacity Scheduler.
3. Fair Scheduler.
Schedulers
FIFO Scheduler
 First In First Out is the default scheduling policy used in Hadoop.
 The FIFO Scheduler gives more preference to applications submitted earlier than to those submitted later.
 It places the applications in a queue and executes them in the order of their submission (first in, first out).
FIFO Scheduler
Advantages:
 It is simple to understand and doesn't need any configuration.
 Jobs are executed in the order of their submission.
Disadvantages:
 It is not suitable for shared clusters.
 It does not take into account the balance of resource allocation between long applications and short applications.
 This can lead to starvation.
Capacity Scheduler
 The Capacity Scheduler allows multiple tenants to securely share a large Hadoop cluster.
 It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
 The Capacity Scheduler allows sharing of a large cluster while giving capacity guarantees to each organization by allocating a fraction of the cluster resources to each queue.
Capacity Scheduler
Advantages:
 It maximizes the utilization of resources and throughput in the Hadoop cluster.
 It provides elasticity for groups or organizations in a cost-effective manner.
 It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
 It is the most complex of the three schedulers.
Fair Scheduler
 FairScheduler allows YARN applications to fairly share resources in large
Hadoop clusters.
 With FairScheduler, there is no need for reserving a set amount of capacity
because it will dynamically balance resources between all running
applications.
 It assigns resources to applications in such a way that all applications get,
on average, an equal amount of resources over time.
 FairScheduler enables short apps to finish in a reasonable time without
starving.
Fair Scheduler
Advantages:
 It provides a reasonable way to share the Hadoop cluster among a number of users.
 The Fair Scheduler can also work with app priorities, where the priorities are used as weights in determining the fraction of the total resources that each application should get.
Disadvantage:
 It requires configuration.
Anatomy of MapReduce Job
MapReduce Job
 You can run a MapReduce job with a single method call: submit() on a Job
object (note that you can also call waitForCompletion(), which will submit
the job if it hasn’t been submitted already, then wait for it to finish).
Entities in MapReduce Job
 The client, which submits the MapReduce job.
 The jobtracker, which coordinates the job run. The jobtracker is a Java
application whose main class is JobTracker.
 The tasktrackers, which run the tasks that the job has been split into.
Tasktrackers are Java applications whose main class is TaskTracker.
 The distributed filesystem, which is used for sharing job files between the other entities.
Anatomy of MapReduce Job
Job Submission
 The job submission process implemented by the submit() method does the following:
 Asks the jobtracker for a new job ID (by calling getNewJobId() on
JobTracker) (step 2).
 Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.
 Computes the input splits for the job. If the splits cannot be computed,
because the input paths don’t exist, for example, then the job is not
submitted and an error is thrown to the MapReduce program.
Job Submission
 Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the jobtracker’s
filesystem in a directory named after the job ID. The job JAR is copied with
a high replication factor (controlled by the mapred.submit.replication
property) so that there are lots of copies across the cluster for the
tasktrackers to access when they run tasks for the job (step 3).
 Tells the jobtracker that the job is ready for execution (by calling
submitJob() on JobTracker) (step 4).
Job Initialization
 When the JobTracker receives a call to its submitJob() method, it puts it into
an internal queue from where the job scheduler will pick it up and initialize
it. Initialization involves creating an object to represent the job being run,
which encapsulates its tasks, and bookkeeping information to keep track of
the tasks’ status and progress (step 5).
 To create the list of tasks to run, the job scheduler first retrieves the input
splits computed by the JobClient from the shared filesystem (step 6). It
then creates one map task for each split. The number of reduce tasks to
create is determined by the mapred.reduce.tasks property in the JobConf,
which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given IDs
at this point.
Task Assignment
 Tasktrackers run a simple loop that periodically sends heartbeat method
calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is
alive, but they also double as a channel for messages. As a part of the
heartbeat, a tasktracker will indicate whether it is ready to run a new task,
and if it is, the jobtracker will allocate it a task, which it communicates to
the tasktracker using the heartbeat return value (step 7).
 Before it can choose a task for the tasktracker, the jobtracker must choose
a job to select the task from. There are various scheduling algorithms as
explained later in this chapter (see “Job Scheduling”), but the default one
simply maintains a priority list of jobs. Having chosen a job, the jobtracker
now chooses a task for the job.
Task Assignment
 Tasktrackers have a fixed number of slots for map tasks and for reduce tasks:
for example, a tasktracker may be able to run two map tasks and two reduce
tasks simultaneously. (The precise number depends on the number of cores and
the amount of memory on the tasktracker; see “Memory” ) The default
scheduler fills empty map task slots before reduce task slots, so if the
tasktracker has at least one empty map task slot, the jobtracker will select a
map task; otherwise, it will select a reduce task.
 To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-
be-run reduce tasks, since there are no data locality considerations. For a map
task, however, it takes account of the tasktracker’s network location and picks
a task whose input split is as close as possible to the tasktracker. In the optimal
case, the task is data-local, that is, running on the same node that the split
resides on. Alternatively, the task may be rack-local: on the same rack, but not
the same node, as the split. Some tasks are neither data-local nor rack-local
and retrieve their data from a different rack than the one they are running on.
You can tell the proportion of each type of task by looking at a job’s counters .
Data Ingestion
 Hadoop data ingestion is the beginning of your data pipeline in a data lake.
 It means taking data from various databases and files and putting it into Hadoop.
 For many companies, this turns out to be an intricate task.
 That is why it can take them more than a year to ingest all their data into the Hadoop data lake.
 The reason is that, since Hadoop is open source, there are a variety of ways to ingest data into Hadoop.
 It gives every developer the choice of using her/his favorite tool or language to ingest data into Hadoop.
 Developers choosing a tool/technology tend to stress performance, but this makes governance very complicated.
Data Ingestion
Sqoop :
 Tool used to transfer bulk data between HDFS & Relational Database
Servers
Data Ingestion
Sqoop :
 Apache Sqoop (SQL-to-Hadoop) is a lifesaver for anyone who is experiencing
difficulties in moving data from the data warehouse into the Hadoop environment.
 Apache Sqoop is an effective Hadoop tool used for importing data from RDBMSs like MySQL, Oracle, etc. into HBase, Hive or HDFS.
 Sqoop Hadoop can also be used for exporting data from HDFS into RDBMS.
 Apache Sqoop is a command-line interpreter i.e. the Sqoop commands are
executed one at a time by the interpreter.
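For illustration only, a typical Sqoop import invocation might look like the following (the JDBC connection string, credentials, table name, and target directory are all hypothetical):

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser \
  -P \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4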
Data Ingestion
Flume:
 Flume is an open-source distributed data collection service used for transferring data from a source to a destination.
 It is a reliable and highly available service for collecting, aggregating, and transferring huge amounts of logs into HDFS.
 Apache Flume is a service designed for streaming logs into the Hadoop environment.
 Flume is a distributed and reliable service for collecting and aggregating huge amounts of log data.
 With a simple and easy-to-use architecture based on streaming data flows, it also has tunable reliability mechanisms and several recovery and failover mechanisms.
Data Ingestion
Flume Architecture:
Hadoop Archives
 Hadoop is designed to deal with large files, so huge numbers of small files are problematic and need to be handled efficiently.
 A large input file is split into many blocks stored across the DataNodes, and the metadata for all of these files and blocks must be stored in the NameNode, which makes the NameNode inefficient when there are many small files.
 To handle this problem, Hadoop Archives were created, which pack HDFS files into archives that we can use directly as input to MapReduce jobs.
 An archive always has the *.har extension.
Hadoop Archives
 HAR syntax:
 hadoop archive -archiveName <name>.har -p <parent path> <source>* <destination>
 Example:
 hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
Hadoop Archives
 Hadoop Archive is a facility that packs small files into compact HDFS blocks to avoid wasting NameNode memory.
 The NameNode stores the metadata information for the HDFS data.
 If a 1 GB file is broken into 1000 pieces, then the NameNode has to store metadata about all those 1000 small files.
 In that manner, NameNode memory is wasted in storing and managing a lot of metadata.
 A HAR is created from a collection of files, and the archiving tool runs a MapReduce job.
 This MapReduce job processes the input files in parallel to create the archive file.
I/O Compression
 In the Hadoop framework, where large data sets are stored and processed, you will need storage for large files.
 These files are divided into blocks and those blocks are stored in different nodes across the cluster so lots of I/O
and network data transfer is also involved.
 In order to reduce the storage requirements and to reduce the time spent in-network transfer, you can have a
look at data compression in the Hadoop framework.
 Using data compression in Hadoop you can compress files at various steps, at all of these steps it will help to
reduce storage and quantity of data transferred.
 You can compress the input file itself.
 That will help you reduce storage space in HDFS.
 You can also configure that the output of a MapReduce job is compressed in Hadoop.
 That helps in reducing storage space if you are archiving the output or sending it to some other application for further processing.
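A minimal sketch of enabling compressed MapReduce output from a job driver, assuming an already-configured Job object (such as the word-count driver sketched earlier); gzip is used here purely as an example codec.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
  // Enable compression for a MapReduce job's output.
  public static void enableOutputCompression(Job job) {
    // Compress the final output files written by the reducers.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    // Also compress intermediate map output to cut shuffle network traffic.
    job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
  }
}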
I/O Serialization
 Serialization refers to the conversion of structured objects into byte streams for transmission over the network or permanent storage on disk.
 Deserialization refers to the conversion of byte streams back into structured objects.
 Serialization is mainly used in two areas of distributed data processing:
 Interprocess communication
 Permanent storage
 We require I/O serialization because:
 It lets records be processed faster (time-bound).
 A proper data format needs to be maintained and transmitted even when the receiving end has no schema support.
 If unstructured or unformatted data has to be processed later, complex errors may occur.
 Serialization offers data validation over transmission.
I/O Serialization
 To maintain the proper format of data serialization, the system must have the following four
properties -
1. Compact - helps in the best use of network bandwidth
2. Fast - reduces the performance overhead
3. Extensible - can match new requirements
4. Inter-operable - not language-specific
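For illustration, here is a minimal sketch of Hadoop's own serialization mechanism, a custom Writable type (the class name and fields are assumptions, not from the slides); write() turns the fields into a compact byte stream and readFields() reconstructs them.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageView implements Writable {
  private long timestamp;
  private int count;

  // Serialize the fields into a compact binary byte stream.
  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
    out.writeInt(count);
  }

  // Deserialize the fields back from the byte stream.
  @Override
  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();
    count = in.readInt();
  }
}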
Avro
 Avro is an open source project that provides data serialization and data exchange services for Apache
Hadoop.
 Avro facilitates the exchange of big data between programs written in any language.
 Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data
formats that can be processed by multiple languages.
 Avro is a preferred tool to serialize data in Hadoop.
 Avro has a schema-based system.
 A language-independent schema is associated with its read and write operations.
 Avro serializes the data which has a built-in schema.
 Avro serializes the data into a compact binary format, which can be deserialized by any application.
 Avro uses JSON format to declare the data structures.
 Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
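As a rough sketch of Avro's JSON-declared, schema-based approach (the record name and fields are illustrative assumptions), the following parses a schema with the Avro Java API:

import org.apache.avro.Schema;

public class AvroSchemaExample {
  public static void main(String[] args) {
    // Avro declares data structures in JSON; this record schema is hypothetical.
    String schemaJson =
        "{\"type\": \"record\", \"name\": \"Employee\", \"fields\": ["
            + "{\"name\": \"name\", \"type\": \"string\"},"
            + "{\"name\": \"id\", \"type\": \"int\"}]}";

    // Parse the JSON definition into an Avro Schema object and print it.
    Schema schema = new Schema.Parser().parse(schemaJson);
    System.out.println(schema.toString(true));
  }
}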