Big Data Unit 4
In releases of Hadoop up to and including the 0.20 release series, mapred.job.tracker determines the
means of execution.
In Hadoop 0.23.0 a new MapReduce implementation was introduced. The new implementation (called
MapReduce 2) is built on a system called YARN, described in “YARN (MapReduce 2)”.
For now, the framework that is used for execution is set by the mapreduce.framework.name property,
which takes the values local (for the local job runner), classic (for the “classic” MapReduce framework,
also called MapReduce 1, which uses a jobtracker and tasktrackers), and yarn (for the new framework).
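As a minimal sketch of setting this property programmatically (in practice it is usually set in mapred-site.xml rather than in code):

    // Select the execution framework: "local", "classic", or "yarn"
    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "yarn");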
1. Classic MapReduce (MapReduce 1)
A job run in classic MapReduce is illustrated in Figure 5-1. At the highest level, there are four independent entities:
1. The client, which submits the MapReduce job.
2. The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
4. The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
Figure 5-1. How Hadoop runs a MapReduce job using the classic framework
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 5-1). Having submitted the job, waitForCompletion() polls the job's progress once a
second and reports the progress to the console if it has changed since the last report. When the job is
complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to
fail is logged to the console.
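As a hedged sketch of such a driver (the class name, paths, and key/value types here are illustrative assumptions, not taken from the original text):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative driver: configures a job, then submits it and waits.
    public class MaxTemperatureDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Max temperature"); // 0.20-era constructor
        job.setJarByClass(MaxTemperatureDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // job.setMapperClass(...) and job.setReducerClass(...) go here
        job.setNumReduceTasks(1); // sets the mapred.reduce.tasks property
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // waitForCompletion() submits the job if necessary, then polls
        // its progress once per second until it completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }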
The job submission process implemented by JobSubmitter does the following:
- Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
- Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
- Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
- Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
Job Initialization
- When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from
where the job scheduler will pick it up and initialize it. Initialization involves creating an object to
represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of
the tasks’ status and progress (step 5).
- To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client
from the shared filesystem (step 6).
- It then creates one map task for each split. The number of reduce tasks to create is determined by the
mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks() method, and the
scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
- In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task. These are run by tasktrackers and are used to run code to set up the job before any map tasks run, and to clean up after all the reduce tasks are complete.
- The OutputCommitter that is configured for the job determines the code to be run, and by default this is
a FileOutputCommitter. For the job setup task it will create the final output directory for the job and the
temporary working space for the task output, and for the job cleanup task it will delete the temporary
working space for the task output.
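To make the OutputCommitter contract concrete, here is a skeletal (non-functional) committer showing the lifecycle hooks just described; the default FileOutputCommitter provides real implementations of these methods:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Skeleton only: illustrates the hooks, does no real work.
    public class SketchOutputCommitter extends OutputCommitter {
      @Override public void setupJob(JobContext context) throws IOException {
        // Run by the job setup task, e.g. create the final output directory
      }
      @Override public void cleanupJob(JobContext context) throws IOException {
        // Run by the job cleanup task, e.g. delete the temporary working space
      }
      @Override public void setupTask(TaskAttemptContext context) throws IOException { }
      @Override public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        return true; // ask the framework to run commitTask() for this attempt
      }
      @Override public void commitTask(TaskAttemptContext context) throws IOException {
        // Move the task's output from its temporary location to the final one
      }
      @Override public void abortTask(TaskAttemptContext context) throws IOException { }
    }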
Task Assignment
- Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages (a simplified sketch of this loop appears after this list).
- As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is,
the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return
value (step 7).
- Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from; the default job scheduler simply maintains a priority list of jobs. Having chosen a job, the jobtracker then chooses a task for that job.
- Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a
tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
- (The precise number depends on the number of cores and the amount of memory on the tasktracker.)
- The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.
- To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks,
since there are no data locality considerations.
- For a map task, however, it takes account of the tasktracker’s network location and picks a task whose
input split is as close as possible to the tasktracker.
- In the optimal case, the task is data-local, that is, running on the same node that the split resides on.
Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some
tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they
are running on. You can tell the proportion of each type of task by looking at a job's counters.
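To make the heartbeat loop concrete, here is a purely hypothetical, simplified sketch; none of these names are real Hadoop classes or methods, and the real TaskTracker logic is far more involved:

    import java.util.List;

    // Hypothetical sketch of the tasktracker heartbeat loop described above.
    public class HeartbeatLoopSketch {

      // Stand-in for the jobtracker's RPC interface (not a real Hadoop type)
      interface JobTrackerStub {
        List<String> heartbeat(String trackerName, boolean readyForNewTask);
      }

      public static void run(JobTrackerStub jobTracker) throws InterruptedException {
        while (true) {
          boolean slotFree = true; // pretend a map or reduce slot is free
          // The heartbeat reports liveness and, via its return value,
          // carries back any task the jobtracker has allocated (step 7)
          List<String> assignedTasks =
              jobTracker.heartbeat("tracker_host:50060", slotFree);
          for (String task : assignedTasks) {
            System.out.println("launching task " + task);
          }
          Thread.sleep(5000); // heartbeat interval; longer on large clusters
        }
      }
    }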
Task Execution
- Now that the tasktracker has been assigned a task, the next step is for it to run the task.
- First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed by the application from the distributed cache to the local disk (step 8).
- Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this
directory.
- Third, it creates an instance of TaskRunner to run the task.
- TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any
bugs in the user-defined map and reduce functions don’t affect the tasktracker (by causing it to crash or
hang, for example).
- The child process communicates with its parent through the umbilical interface. This way it informs the
parent of the task’s progress every few seconds until the task is complete.
- Each task can perform setup and cleanup actions, which are run in the same JVM as the task itself, and
are determined by the OutputCommitter for the job.
- The cleanup action is used to commit the task, which in the case of file-based jobs means that its output
is written to the final location for that task.
- The commit protocol ensures that when speculative execution is enabled, only one of the duplicate
tasks is committed and the other is aborted.
Figure 5-2. The relationship of the Streaming and Pipes executable to the tasktracker and its child
- Both Streaming and Pipes run special map and reduce tasks for the purpose of launching the user-
supplied executable and communicating with it (Figure 5-2).
- In the case of Streaming, the Streaming task communicates with the process (which may be written in
any language) using standard input and output streams.
- The Pipes task, on the other hand, listens on a socket and passes the C++ process a port number in its
environment, so that on startup, the C++ process can establish a persistent socket connection back to
the parent Java Pipes task.
Progress and Status Updates
- MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
- A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g.,
running, successfully completed, failed), the progress of maps and reduces, the values of the job’s
counters, and a status message or description (which may be set by user code).
- These statuses change over the course of the job, so how do they get communicated back to the client?
- When a task is running, it keeps track of its progress, that is, the proportion of the task completed.
- If a task reports progress, it sets a flag to indicate that the status change should be sent to the
tasktracker.
- The flag is checked in a separate thread every three seconds, and if set, the thread notifies the tasktracker of the current task status. Meanwhile, the tasktracker sends heartbeats to the jobtracker every few seconds (the interval depends on the size of the cluster: the larger the cluster, the longer the interval), and the status of all the tasks being run by the tasktracker is sent in the call.
- The jobtracker combines these updates to produce a global view of the status of all the jobs being run
and their constituent tasks.
- Finally, as mentioned earlier, the Job receives the latest status by polling the jobtracker every second.
Clients can also use Job’s getStatus() method to obtain a JobStatus instance, which contains all of the
status information for the job.
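For example, assuming job is a submitted Job (a sketch; getStatus() can throw IOException and InterruptedException, which the surrounding code must handle):

    // Poll the job's status; JobStatus is org.apache.hadoop.mapreduce.JobStatus
    JobStatus status = job.getStatus();
    System.out.printf("state: %s, map %.0f%%, reduce %.0f%%%n",
        status.getState(),
        status.getMapProgress() * 100,
        status.getReduceProgress() * 100);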
- The method calls are illustrated in Figure 5-3
Figure 5-3. How status updates are propagated through the MapReduce 1 system
Job Completion
- When the jobtracker receives a notification that the last task for a job is complete, it changes the status
for the job to “successful.”
- Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a
message to tell the user and then returns from the waitForCompletion() method.
- The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property (see the sketch after this list).
- Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).
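As a sketch of configuring such a callback (the URL is a made-up example, and conf is the job's Configuration; to the best of my knowledge Hadoop substitutes the $jobId and $jobStatus placeholders when it makes the request, but verify this against your release):

    // Hypothetical callback endpoint for job-completion notifications
    conf.set("job.end.notification.url",
        "http://myserver:8080/jobnotify?id=$jobId&status=$jobStatus");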
2. YARN (MapReduce 2)
For very large clusters in the region of 4000 nodes and higher, the MapReduce system described in the
previous section begins to hit scalability bottlenecks, so in 2010 a group at Yahoo! began to design the next
generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator (or, if you prefer recursive acronyms, YARN Application Resource Negotiator).
You can run a MapReduce job with a single method call: submit() on a Job object (you can also call
waitForCompletion(), which submits the job if it hasn’t been submitted already, then waits for it to finish).
This method call conceals a great deal of processing behind the scenes.
The whole process is illustrated in Figure 5-4.
At the highest level, there are five independent entities:
1. The client, which submits the MapReduce job.
2. The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
3. The YARN node managers, which launch and monitor the compute containers on machines in the
cluster.
4. The MapReduce application master, which coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the resource
manager and managed by the node managers.
5. The distributed filesystem (normally HDFS), which is used for sharing job files between the other
entities.
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on
it (step 1 in Figure 5-4). Having submitted the job, waitForCompletion() polls the job’s progress once
per second and reports the progress to the console if it has changed since the last report.
When the job completes successfully, the job counters are displayed. Otherwise, the error that caused
the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
- Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2).
- Checks the output specification of the job. For example, if the output directory has not been
specified or it already exists, the job is not submitted and an error is thrown to the MapReduce
program.
- Computes the input splits for the job. If the splits cannot be computed (because the input paths
don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce
program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file,
and the computed input splits, to the shared filesystem in a directory named after the job ID
(step 3).
- The job JAR is copied with a high replication factor (controlled by the
mapreduce.client.submit.file.replication property, which defaults to 10) so that there are lots of
copies across the cluster for the node managers to access when they run tasks for the job.
- Submits the job by calling submitApplication() on the resource manager (step 4).
Job Initialization
- When the resource manager receives a call to its submitApplication() method, it hands off the request
to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches
the application master’s process there, under the node manager’s management (steps 5a and 5b).
- The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It
initializes the job by creating a number of bookkeeping objects to keep track of the job’s progress, as it
will receive progress and completion reports from the tasks (step 6).
- Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It then
creates a map task object for each split, as well as a number of reduce task objects determined by the
mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given
IDs at this point.
- Next, the application master decides how to run the tasks that make up the job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
- What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block (the thresholds are configurable; see the sketch after this list).
- Finally, before any tasks can be run, the application master calls the setupJob() method on the
OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output
directory for the job and the temporary working space for the task output.
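These thresholds correspond to configuration properties; the following sketch (where conf is the job's Configuration) shows them with what I believe are the defaults, which you should verify against your release:

    // Allow small jobs to run in the application master's JVM
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    // Believed defaults defining "small": at most 9 maps and 1 reduce
    // (the input-size threshold defaults to one HDFS block)
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);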
Task Assignment
- If the job does not qualify for running as an uber task, then the application master requests containers
for all the map and reduce tasks in the job from the resource manager (step 8).
- Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all
the map tasks must complete before the sort phase of the reduce can start.
- Requests for reduce tasks are not made until 5% of map tasks have completed.
- Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints
that the scheduler tries to honor. In the optimal case, the task is data local—that is, running on the
same node that the split resides on. Alternatively, the task may be rack local: on the same rack, but not
the same node, as the split. Some tasks are neither data local nor rack local and retrieve their data from
a different rack than the one they are running on. For a particular job run, you can determine the
number of tasks that ran at each locality level by looking at the job’s counters.
- Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task
is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis
via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb,
mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.
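For example (a sketch, where conf is the job's Configuration; the values shown are just the defaults stated above, set explicitly):

    // Per-task container resource requests (defaults shown)
    conf.setInt("mapreduce.map.memory.mb", 1024);
    conf.setInt("mapreduce.reduce.memory.mb", 1024);
    conf.setInt("mapreduce.map.cpu.vcores", 1);
    conf.setInt("mapreduce.reduce.cpu.vcores", 1);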
Task Execution
- Once a task has been assigned resources for a container on a particular node by the resource manager’s
scheduler, the application master starts the container by contacting the node manager (steps 9a and 9b).
- The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it
localizes the resources that the task needs, including the job configuration and JAR file, and any files
from the distributed cache (step 10).
- The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and reduce functions
(or even in YarnChild) don’t affect the node manager—by causing it to crash or hang, for example.
- Each task can perform setup and commit actions, which are run in the same JVM as the task itself and
are determined by the OutputCommitter for the job.
- For file-based jobs, the commit action moves the task output from a temporary location to its final
location. The commit protocol ensures that when speculative execution is enabled, only one of the
duplicate tasks is committed and the other is aborted.
Streaming
Streaming works in the same way as in MapReduce 1 (see Figure 5-2): a special Streaming task launches the user-supplied executable and communicates with it over standard input and output streams.
Progress and Status Updates
A job and each of its tasks have a status, which includes such things as the state of the job or
task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the
job’s counters, and a status message or description (which may be set by user code).
These statuses change over the course of the job, so how do they get communicated back to
the client? When a task is running, it keeps track of its progress (i.e., the proportion of the task completed). For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it is an estimate of the proportion of the reduce input processed.
Figure 5-6. How status updates are propagated through the MapReduce system
Tasks also have a set of counters that count various events as the task runs, which are either
built into the framework, such as the number of map output records written, or defined by users. As the
map or reduce task runs, the child process communicates with its parent application master through the
umbilical interface. The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the umbilical interface.
The resource manager web UI displays all the running applications with links to the web UIs
of their respective application masters, each of which displays further details on the MapReduce job,
including its progress.
During the course of the job, the client receives the latest status by polling the application master every second (the interval is set via mapreduce.client.progressmonitor.pollinterval). Clients can
also use Job’s getStatus() method to obtain a JobStatus instance, which contains all of the status
information for the job.
Job Completion
When the application master receives a notification that the last task for a job is complete, it
changes the status for the job to “successful.” Then, when the Job polls for status, it learns that the job
has completed successfully, so it prints a message to tell the user and then returns from the
waitForCompletion() method. Job statistics and counters are printed to the console at this point.
The application master also sends an HTTP job notification if it is configured to do so. This
can be configured by clients wishing to receive callbacks, via the mapreduce.job.end-notification.url
property.
Finally, on job completion, the application master and the task containers clean up their
working state (so intermediate output is deleted), and the OutputCommitter’s commitJob() method is
called. Job information is archived by the job history server to enable later interrogation by users if
desired.
Tuning a Job
After a job is working, the question many developers ask is, “Can I make it run faster?” There are a few
Hadoop-specific “usual suspects” that are worth checking to see whether they are responsible for a performance
problem. You should run through the checklist in Table-1 before you start trying to profile or optimize at the
task level.
Profiling Tasks
Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls
down the profile information to your machine for later analysis with standard profiling tools.
Of course, it’s possible, and somewhat easier, to profile a job running in the local job runner.
And provided you can run with enough input data to exercise the map and reduce tasks, this can be a
valuable way of improving the performance of your mappers and reducers. There are a couple of
caveats, however. The local job runner is a very different environment from a cluster, and the data flow
patterns are very different. Optimizing the CPU performance of your code may be pointless if your
MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should
compare the new execution time with the old one running on a real cluster. Even this is easier said than
done, since job execution times can vary due to resource contention with other jobs and the decisions
the scheduler makes regarding task placement. To get a good idea of job execution time under these
circumstances, perform a series of runs (with and without the change) and check whether any
improvement is statistically significant.
There are a number of configuration properties to control profiling, which are also exposed via convenience
methods on JobConf. Enabling profiling is as simple as setting the property mapreduce.task.profile to true:
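For example (a sketch, where conf is the job's Configuration; the two range properties, which restrict profiling to the given task IDs, are stated from memory and should be verified for your release):

    conf.setBoolean("mapreduce.task.profile", true);
    // Optionally profile only a few tasks, by task ID range:
    conf.set("mapreduce.task.profile.maps", "0-2");
    conf.set("mapreduce.task.profile.reduces", "0-2");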
This runs the job as normal, but adds an -agentlib parameter to the Java command used to launch the task containers on the node managers. You can control the precise parameter that is added by setting the mapreduce.task.profile.params property. The default uses HPROF, a profiling tool that comes with the JDK that, although basic, can give valuable information about a program's CPU and heap usage.
The profile output for each task is saved with the task logs in the userlogs subdirectory of the node
manager’s local log directory (alongside the syslog, stdout, and stderr files), and can be retrieved in the way
described in “Hadoop Logs”, according to whether log aggregation is enabled or not.
Hadoop produces logs in various places, and for various audiences. These are summarized in Table-2.
MapReduce logs support various levels. You can configure the log levels for the MapReduce service
and tasks.
You can set log levels to any of the following values:
Level   Description
DEBUG   Logs all debug-level and informational messages.
INFO    Logs all informational messages and more serious messages. This is the default log level.
WARN    Logs only those messages that are warnings or more serious.
ERROR   Logs only those messages that indicate error conditions or more serious messages.
FATAL   Logs only those messages in which the system is unusable.
To modify the level of the log printed to the console, change the value of the log4j.rootLogger property in the log configuration file.
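For example, in log4j.properties (the console appender name is illustrative; your file may define different appenders):

    log4j.rootLogger=DEBUG, console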
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default. This can be
changed in hadoop-env.sh.
Each daemon produces two logfiles. The first is the log output written via log4j; this file's name ends in .log. Old logfiles are never deleted, so you should arrange for them to be periodically deleted or archived, so as to not run out of disk space on the local node.
The second logfile is the combined standard output and standard error log. This logfile, which ends in
.out, usually contains little or no output, since Hadoop uses log4j for logging. It is only rotated when
the daemon is restarted, and only the last five logs are retained. Old logfiles are suffixed with a number
between 1 and 5, with 5 being the oldest file.
Audit Logging
HDFS has the ability to log all filesystem access requests, a feature that some organizations require for
auditing purposes. Audit logging is implemented using log4j logging at the INFO level, and in the default configuration it is disabled. You can enable audit logging by changing the audit logger's level from WARN to INFO in the log configuration file, and the result will be a log line written to the namenode's log for every HDFS event.
It is a good idea to configure log4j so that the audit log is written to a separate file and isn’t mixed up
with the namenode’s other log entries.
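In the stock log4j.properties shipped with Hadoop, the relevant logger looks like the following once changed from WARN to INFO (the exact line may vary by release):

    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO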
Job History
Job history refers to the events and configuration for a completed job. It is retained whether the job was
successful or not, in an attempt to provide interesting information for the user running a job.
Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs
directory.
The jobtracker’s history files are kept for 30 days before being deleted by the system.
The history log includes job, task, and attempt events, all of which are stored in a plaintext file. The
history for a particular job may be viewed through the web UI, or via the command line, using hadoop
job -history (which you point at the job’s output directory).
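For example (the output directory path is illustrative):

    % hadoop job -history output/max-temp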
Task logs (the syslog, stdout, and stderr files produced by each task attempt) are accessible through the web UI, which is the most convenient way to view them. You can also find the logfiles on the local filesystem of the tasktracker that ran the task attempt, in a directory named by the task attempt. If task JVM reuse is enabled, each logfile accumulates the logs of multiple task attempts. It is straightforward to write to these logfiles: anything written to standard output or standard error is directed to the relevant logfile.
The default log level is INFO, so DEBUG level messages do not appear in the syslog task log
file. However, sometimes you want to see these messages. To do this, set mapred.map.child.log.level or mapred.reduce.child.log.level, as appropriate (available from release 0.22). For example, we could set it for the mapper to see the map values in the log as follows:
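For example, on the command line when launching the job (the JAR name, driver class, and paths are illustrative; this assumes the driver uses ToolRunner so that -D options are honored):

    % hadoop jar job.jar MaxTemperatureDriver \
        -D mapred.map.child.log.level=DEBUG input/ncdc output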
There are some controls for managing retention and size of task logs. By default, logs are deleted after
a minimum of 24 hours (set using the mapred.userlog.retain.hours property). You can also set a cap on
the maximum size of each logfile using the mapred.userlog.limit.kb property, which is 0 by default,
meaning there is no cap.
Debugging a Job
Suppose we suspect that a few corrupt records with implausibly high temperatures are skewing our maximum-temperature job. For this particular case, where we are looking for (what we think is) an unusual case, we can use a
debug statement to log to standard error, in conjunction with a message to update the task’s status message to
prompt us to look in the error log. The web UI makes this easy, as we will see.
We also create a custom counter to count the total number of records with implausible temperatures in
the whole dataset. This gives us valuable information about how to deal with the condition—if it turns out to be
a common occurrence, then we might need to learn more about the condition and how to extract the temperature
in these cases, rather than simply dropping the record. In fact, when trying to debug a job, you should always
ask yourself if you can use a counter to get the information you need to find out what’s happening. Even if you
need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem.
If the amount of log data you produce in the course of debugging is large, then you’ve got a couple of
options. The first is to write the information to the map’s output, rather than to standard error, for analysis and
aggregation by the reduce. This approach usually necessitates structural changes to your program, so start with
the other techniques first.
The second option is to write a program (in MapReduce, of course) to analyze the logs produced by your job. We add our debugging to the mapper, as opposed to the reducer, because we want to find out what the source data causing the anomalous output looks like:
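A sketch of such a mapper follows; the field offsets used to parse the year and temperature are illustrative of the NCDC weather format rather than authoritative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      // Counter field: one enum value per condition we want to count
      enum Temperature { OVER_100 }

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);      // illustrative offsets
        String temp = line.substring(87, 92);
        if (temp.startsWith("+")) {
          temp = temp.substring(1);                // strip leading sign
        }
        int airTemperature = Integer.parseInt(temp);
        if (airTemperature > 1000) {               // over 100°C, in tenths
          System.err.println("Temperature over 100 degrees for input: " + value);
          context.setStatus("Detected possibly corrupt record: see logs.");
          context.getCounter(Temperature.OVER_100).increment(1);
        }
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }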
If the temperature is over 100°C (represented by 1000, since temperatures are in tenths of a degree), we print a line to standard error containing the suspect record, and we update the map's status message using the setStatus() method on Context to direct us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program we have defined a single field, OVER_100, as a way to count the number of records with a temperature of over 100°C.
With this modification, we recompile the code, re-create the JAR file, rerun the job, and while it's running go to the tasks page.
The tasks page
The job page has a number of links for looking at the tasks in a job in more detail. For example, by clicking on the “map” link, you are brought to a page that lists information for all of the
map tasks on one page. You can also see just the completed tasks. The screenshot in Figure 5-7 shows a portion
of this page for the job run with our debugging statements.
Each row in the table is a task, and it provides such information as the start and end times for each task, any
errors reported back from the tasktracker, and a link to view the counters for an individual task.
The “Status” column can be helpful for debugging, since it shows a task’s latest status message. Before a task
starts, it shows its status as “initializing,” then once it starts reading records it shows the split information for the
split it is reading as a filename with a byte offset and length. You can see the status we set for debugging for
task task_200904110811_0003_m_000044, so let’s click through to the logs page to find the associated debug
message. (Notice, too, that there is an extra counter for this task, since our user counter has a nonzero count for
this task.)
The task details page
From the tasks page, you can click on any task to get more information about it. The task
details page, shown in Figure 5-8, shows each task attempt. In this case, there was one task attempt, which
completed successfully. The table provides further useful data, such as the node the task attempt ran on, and
links to task logfiles and counters.
The “Actions” column contains links for killing a task attempt. By default, this is disabled, making the web UI a
read-only interface. Set webinterface.private.actions to true to enable the actions links.
For map tasks, there is also a section showing which nodes the input split was located on. By following
one of the links to the logfiles for the successful task attempt (you can see the last 4 KB or 8 KB of each logfile,
or the entire file), we can find the suspect input record that we logged (the line is wrapped and truncated to fit on
the page):
0335999999433181957042302005+37950+139117SAO+0004RJSNV020113590031500703569999994332019
57010100005+35317+139650SAO +000899999V02002359002650076249N004000599+0067...
This record seems to be in a different format to the others. For one thing, there are spaces in the line, which are
not described in the specification.
When the job has finished, we can look at the value of the counter we defined to see how many records over
100°C there are in the whole dataset. Counters are accessible via the web UI or the command line:
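For example (the job ID matches the task shown earlier; the group name assumes the enum is nested in a mapper class called MaxTemperatureMapper in a package v3, which is illustrative):

    % hadoop job -counter job_200904110811_0003 \
        'v3.MaxTemperatureMapper$Temperature' OVER_100
    3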
The -counter option takes the job ID, counter group name (which is the fully qualified classname here), and the
counter name (the enum name). There are only three malformed records in the entire dataset of over a billion
records.
Throwing out bad records is standard for many big data problems, although we need to be careful in this case,
since we are looking for an extreme value—the maximum temperature rather than an aggregate measure. Still,
throwing away three records is probably not going to change the result.