Unit_3_Big Data
What is a Rack?
A rack is a collection of around 40-50 DataNodes connected to the same network switch.
If the switch or its network goes down, the whole rack becomes unavailable.
A large Hadoop cluster is deployed across multiple racks.
Rack Awareness in Hadoop HDFS
In a large Hadoop cluster there are multiple racks. Rack awareness is the NameNode's knowledge of which rack each DataNode belongs to; HDFS uses it to place block replicas on different racks so that data remains available even if an entire rack fails.
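Rack awareness is usually configured by pointing Hadoop at a topology script that maps a node's address to a rack name. The snippet below is a minimal sketch; the script path is hypothetical, and recent releases use net.topology.script.file.name (older ones use topology.script.file.name):

  <!-- core-site.xml: script that resolves a node address to a rack name (illustrative path) -->
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/topology.sh</value>
  </property>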
FIFO Scheduler
Advantage:
It is simple to understand and doesn’t need any configuration.
Jobs are executed in the order of their submission (first in, first out).
Disadvantage:
It is not suitable for shared clusters.
It does not take into account the balance of resource allocation between long-running applications and short applications.
This can lead to starvation of short jobs.
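The scheduler YARN uses is selected in yarn-site.xml. The snippet below is an illustrative sketch choosing the FIFO scheduler; the Capacity and Fair schedulers are selected with their corresponding CapacityScheduler and FairScheduler classes:

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
  </property>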
Capacity Scheduler
The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster.
It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
The Capacity Scheduler allows sharing of a large cluster while giving capacity guarantees to each organization, by allocating a fraction of the cluster resources to each queue.
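For illustration, queue capacities are declared in capacity-scheduler.xml. The sketch below assumes two hypothetical queues, prod and dev, sharing the cluster 70/30, with dev allowed to grow to 50% when prod is idle:

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>50</value>
  </property>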
Capacity Scheduler
Advantages:
It maximizes the utilization of resources and throughput in the Hadoop
cluster.
Provides elasticity for groups or organizations in a cost-effective manner.
It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
It is the most complex of the three schedulers.
Fair Scheduler
FairScheduler allows YARN applications to fairly share resources in large
Hadoop clusters.
With FairScheduler, there is no need for reserving a set amount of capacity
because it will dynamically balance resources between all running
applications.
It assigns resources to applications in such a way that all applications get,
on average, an equal amount of resources over time.
FairScheduler enables short apps to finish in a reasonable time without
starving.
Fair Scheduler
Advantages:
It provides a reasonable way to share the Hadoop cluster among a number of users.
Also, the Fair Scheduler can work with app priorities where the priorities
are used as weights in determining the fraction of the total resources that
each application should get.
Disadvantage:
It requires configuration.
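For illustration, Fair Scheduler queues and weights are defined in an allocation file (commonly fair-scheduler.xml). The sketch below assumes two hypothetical queues, where analytics gets twice the share of adhoc:

  <allocations>
    <queue name="analytics">
      <weight>2.0</weight>
      <schedulingPolicy>fair</schedulingPolicy>
    </queue>
    <queue name="adhoc">
      <weight>1.0</weight>
    </queue>
  </allocations>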
Anatomy of MapReduce Job
MapReduce Job
You can run a MapReduce job with a single method call: submit() on a Job
object (note that you can also call waitForCompletion(), which will submit
the job if it hasn’t been submitted already, then wait for it to finish).
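As a concrete sketch, a minimal driver using the org.apache.hadoop.mapreduce API could look like the code below; MyMapper, MyReducer, and the input/output paths are hypothetical placeholders, not part of the Hadoop API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MyJobDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "my job");      // the Job object
      job.setJarByClass(MyJobDriver.class);
      job.setMapperClass(MyMapper.class);             // hypothetical mapper class
      job.setReducerClass(MyReducer.class);           // hypothetical reducer class
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      // waitForCompletion() submits the job if needed, then waits for it to finish
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }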
Entities in MapReduce Job
The client, which submits the MapReduce job.
The jobtracker, which coordinates the job run. The jobtracker is a Java
application whose main class is JobTracker.
The tasktrackers, which run the tasks that the job has been split into.
Tasktrackers are Java applications whose main class is TaskTracker.
The distributed filesystem, which is used for sharing job files between the other entities.
Anatomy of MapReduce Job
Job Submission
The job submission process implemented by the submit() method does the following:
Asks the jobtracker for a new job ID (by calling getNewJobId() on
JobTracker) (step 2).
Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.
Computes the input splits for the job. If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
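Because the output-specification check fails the job when the output directory already exists, a driver sometimes deletes a stale output directory before submission. A small sketch, continuing the hypothetical driver shown earlier (use with care, since it removes data):

  // imports assumed: org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path
  FileSystem fs = FileSystem.get(conf);
  Path out = new Path(args[1]);          // hypothetical output path
  if (fs.exists(out)) {
    fs.delete(out, true);                // recursive delete of the old output directory
  }
  FileOutputFormat.setOutputPath(job, out);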
Job Submission
Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the jobtracker’s
filesystem in a directory named after the job ID. The job JAR is copied with
a high replication factor (controlled by the mapred.submit.replication
property) so that there are lots of copies across the cluster for the
tasktrackers to access when they run tasks for the job (step 3).
Tells the jobtracker that the job is ready for execution (by calling
submitJob() on JobTracker) (step 4).
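For illustration, the replication factor used for these submitted job resources is controlled by a single property in the old-API configuration (10 is the usual default):

  <property>
    <name>mapred.submit.replication</name>
    <value>10</value>
  </property>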
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into
an internal queue from where the job scheduler will pick it up and initialize
it. Initialization involves creating an object to represent the job being run,
which encapsulates its tasks, and bookkeeping information to keep track of
the tasks’ status and progress (step 5).
To create the list of tasks to run, the job scheduler first retrieves the input
splits computed by the JobClient from the shared filesystem (step 6). It
then creates one map task for each split. The number of reduce tasks to
create is determined by the mapred.reduce.tasks property in the JobConf,
which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given IDs
at this point.
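For example, with the old JobConf API mentioned above, the number of reduce tasks could be set as in this small sketch (the value 4 and the driver class name are arbitrary):

  // org.apache.hadoop.mapred.JobConf (old API)
  JobConf conf = new JobConf(MyJobDriver.class);
  conf.setNumReduceTasks(4);   // equivalent to setting mapred.reduce.tasks=4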
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method
calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is
alive, but they also double as a channel for messages. As a part of the
heartbeat, a tasktracker will indicate whether it is ready to run a new task,
and if it is, the jobtracker will allocate it a task, which it communicates to
the tasktracker using the heartbeat return value (step 7).
Before it can choose a task for the tasktracker, the jobtracker must choose
a job to select the task from. There are various scheduling algorithms as
explained later in this chapter (see “Job Scheduling”), but the default one
simply maintains a priority list of jobs. Having chosen a job, the jobtracker
now chooses a task for the job.
Task Assignment
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks:
for example, a tasktracker may be able to run two map tasks and two reduce
tasks simultaneously. (The precise number depends on the number of cores and
the amount of memory on the tasktracker; see “Memory”.) The default
scheduler fills empty map task slots before reduce task slots, so if the
tasktracker has at least one empty map task slot, the jobtracker will select a
map task; otherwise, it will select a reduce task.
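In classic MapReduce these slot counts are per-tasktracker configuration properties; a sketch with illustrative values:

  <!-- mapred-site.xml on each tasktracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>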
To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-
be-run reduce tasks, since there are no data locality considerations. For a map
task, however, it takes account of the tasktracker’s network location and picks
a task whose input split is as close as possible to the tasktracker. In the optimal
case, the task is data-local, that is, running on the same node that the split
resides on. Alternatively, the task may be rack-local: on the same rack, but not
the same node, as the split. Some tasks are neither data-local nor rack-local
and retrieve their data from a different rack from the one they are running on.
You can tell the proportion of each type of task by looking at a job’s counters.
Data Ingestion
Hadoop Data ingestion is the beginning of your data pipeline in a data lake.
It means taking data from various databases and files and putting it into Hadoop.
For many companies, it turns out to be an intricate task; that is why it can take them more than a year to ingest all their data into the Hadoop data lake.
Because Hadoop is open source, there are a variety of ways you can ingest data into Hadoop, which gives every developer the choice of using her/his favorite tool or language.
While choosing a tool or technology, developers tend to stress performance, but this makes governance very complicated.
Data Ingestion
Sqoop :
A tool used to transfer bulk data between HDFS and relational database servers.
Data Ingestion
Sqoop :
Apache Sqoop (SQL-to-Hadoop) is a lifesaver for anyone who is experiencing
difficulties in moving data from the data warehouse into the Hadoop environment.
Apache Sqoop is an effective Hadoop tool used for importing data from RDBMSs such as MySQL, Oracle, etc. into HBase, Hive, or HDFS.
Sqoop can also be used for exporting data from HDFS into an RDBMS.
Apache Sqoop is a command-line interpreter, i.e. Sqoop commands are executed one at a time by the interpreter.
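For illustration, a typical Sqoop import and export could look like the commands below; the JDBC URL, credentials, table names, and HDFS paths are hypothetical:

  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hadoop/customers

  sqoop export \
    --connect jdbc:mysql://dbhost/sales \
    --username dbuser -P \
    --table customer_summary \
    --export-dir /user/hadoop/customer_summary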
Data Ingestion
Flume :
Flume is an open-source, distributed data collection service used for transferring data from a source to a destination.
It is a reliable and highly available service for collecting, aggregating, and transferring huge amounts of log data into HDFS.
Apache Flume is designed for streaming logs into the Hadoop environment.
With a simple and easy-to-use architecture based on streaming data flows, it also has tunable reliability mechanisms and several recovery and failover mechanisms.
Data Ingestion
Flume Architecture: a Flume agent moves events from a source, through a channel, to a sink (for example, an HDFS sink).
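A minimal Flume agent configuration, for illustration, wires an exec source, a memory channel, and an HDFS sink together; the agent name, log path, and HDFS path are hypothetical:

  agent1.sources  = src1
  agent1.channels = ch1
  agent1.sinks    = sink1

  agent1.sources.src1.type = exec
  agent1.sources.src1.command = tail -F /var/log/app/app.log
  agent1.sources.src1.channels = ch1

  agent1.channels.ch1.type = memory

  agent1.sinks.sink1.type = hdfs
  agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/app-logs
  agent1.sinks.sink1.channel = ch1

Such an agent would typically be started with flume-ng agent -n agent1 -f <config file>.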
Hadoop Archives
Hadoop is designed to deal with large files, so huge numbers of small files are problematic and have to be handled efficiently.
Because the metadata for every file and block is kept in the NameNode, storing huge numbers of small files (each split into blocks across the DataNodes) creates huge numbers of records in the NameNode and makes it inefficient.
To handle this problem, Hadoop Archives were created; they pack HDFS files into archives, and the archive files can be used directly as input to MapReduce jobs.
An archive always comes with the *.har extension.
Hadoop Archives
HAR Syntax:
hadoop archive -archiveName <name>.har -p <parent path> <src>* <dest>
Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
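Once created, the archive can be addressed through the har:// filesystem scheme; for example, the contents of the archive created above can be listed as shown below, and the same har URI can be given as an input path to a MapReduce job:

  hadoop fs -ls har:///user/zoo/foo.har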
Hadoop Archives