Big Data Unit 4
In releases of Hadoop up to and including the 0.20 release series, mapred.job.tracker determines the
means of execution.
In Hadoop 0.23.0 a new MapReduce implementation was introduced. The new implementation (called
MapReduce 2) is built on a system called YARN, described in “YARN (MapReduce 2)”.
For now, the framework that is used for execution is set by the mapreduce.framework.name property,
which takes the values local (for the local job runner), classic (for the “classic” MapReduce framework,
also called MapReduce 1, which uses a jobtracker and tasktrackers), and yarn (for the new framework).
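As a minimal sketch of setting this property programmatically (in practice it is usually set in mapred-site.xml rather than in code):

    // Select the execution framework: "local", "classic", or "yarn"
    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "yarn");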
1. Classic MapReduce (MapReduce 1)
A job run in classic MapReduce is illustrated in Figure 5-1. At the highest level, there are four independent entities:
1. The client, which submits the MapReduce job.
2. The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
4. The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
Figure 5-1. How Hadoop runs a MapReduce job using the classic framework
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 5-1). Having submitted the job, waitForCompletion() polls the job's progress once a
second and reports the progress to the console if it has changed since the last report. When the job is
complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to
fail is logged to the console.
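As a hedged sketch of such a driver (the class name, paths, and key/value types here are illustrative assumptions, not taken from the original text):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative driver: configures a job, then submits it and waits.
    public class MaxTemperatureDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Max temperature"); // 0.20-era constructor
        job.setJarByClass(MaxTemperatureDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // job.setMapperClass(...) and job.setReducerClass(...) go here
        job.setNumReduceTasks(1); // sets the mapred.reduce.tasks property
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // waitForCompletion() submits the job if necessary, then polls
        // its progress once per second until it completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }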
The job submission process implemented by JobSubmitter does the following:
- Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
- Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
- Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
- Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
Job Initialization
- When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from
where the job scheduler will pick it up and initialize it. Initialization involves creating an object to
represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of
the tasks’ status and progress (step 5).
- To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client
from the shared filesystem (step 6).
- It then creates one map task for each split. The number of reduce tasks to create is determined by the
mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks() method, and the
scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
- In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task. These are run by tasktrackers and are used to run code to set up the job before any map tasks run, and to clean up after all the reduce tasks are complete.
- The OutputCommitter that is configured for the job determines the code to be run, and by default this is
a FileOutputCommitter. For the job setup task it will create the final output directory for the job and the
temporary working space for the task output, and for the job cleanup task it will delete the temporary
working space for the task output.
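To make the OutputCommitter contract concrete, here is a skeletal (non-functional) committer showing the lifecycle hooks just described; the default FileOutputCommitter provides real implementations of these methods:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Skeleton only: illustrates the hooks, does no real work.
    public class SketchOutputCommitter extends OutputCommitter {
      @Override public void setupJob(JobContext context) throws IOException {
        // Run by the job setup task, e.g. create the final output directory
      }
      @Override public void cleanupJob(JobContext context) throws IOException {
        // Run by the job cleanup task, e.g. delete the temporary working space
      }
      @Override public void setupTask(TaskAttemptContext context) throws IOException { }
      @Override public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        return true; // ask the framework to run commitTask() for this attempt
      }
      @Override public void commitTask(TaskAttemptContext context) throws IOException {
        // Move the task's output from its temporary location to the final one
      }
      @Override public void abortTask(TaskAttemptContext context) throws IOException { }
    }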
Task Assignment
- Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages (a simplified sketch of this loop appears after this list).
- As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is,
the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return
value (step 7).
- Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from; the default job scheduler simply maintains a priority list of jobs. Having chosen a job, the jobtracker then chooses a task for that job.
- Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a
tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
- (The precise number depends on the number of cores and the amount of memory on the tasktracker.)
- The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.
- To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks,
since there are no data locality considerations.
- For a map task, however, it takes account of the tasktracker’s network location and picks a task whose
input split is as close as possible to the tasktracker.
- In the optimal case, the task is data-local, that is, running on the same node that the split resides on.
Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some
tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they
are running on. You can tell the proportion of each type of task by looking at a job's counters.
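To make the heartbeat loop concrete, here is a purely hypothetical, simplified sketch; none of these names are real Hadoop classes or methods, and the real TaskTracker logic is far more involved:

    import java.util.List;

    // Hypothetical sketch of the tasktracker heartbeat loop described above.
    public class HeartbeatLoopSketch {

      // Stand-in for the jobtracker's RPC interface (not a real Hadoop type)
      interface JobTrackerStub {
        List<String> heartbeat(String trackerName, boolean readyForNewTask);
      }

      public static void run(JobTrackerStub jobTracker) throws InterruptedException {
        while (true) {
          boolean slotFree = true; // pretend a map or reduce slot is free
          // The heartbeat reports liveness and, via its return value,
          // carries back any task the jobtracker has allocated (step 7)
          List<String> assignedTasks =
              jobTracker.heartbeat("tracker_host:50060", slotFree);
          for (String task : assignedTasks) {
            System.out.println("launching task " + task);
          }
          Thread.sleep(5000); // heartbeat interval; longer on large clusters
        }
      }
    }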
Task Execution
- Now that the tasktracker has been assigned a task, the next step is for it to run the task.
- First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed by the application from the distributed cache to the local disk (step 8).
- Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this
directory.
- Third, it creates an instance of TaskRunner to run the task.
- TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any
bugs in the user-defined map and reduce functions don’t affect the tasktracker (by causing it to crash or
hang, for example).
- The child process communicates with its parent through the umbilical interface. This way it informs the
parent of the task’s progress every few seconds until the task is complete.
- Each task can perform setup and cleanup actions, which are run in the same JVM as the task itself, and
are determined by the OutputCommitter for the job.
- The cleanup action is used to commit the task, which in the case of file-based jobs means that its output
is written to the final location for that task.
- The commit protocol ensures that when speculative execution is enabled, only one of the duplicate
tasks is committed and the other is aborted.
Figure 5-2. The relationship of the Streaming and Pipes executable to the tasktracker and its child
- Both Streaming and Pipes run special map and reduce tasks for the purpose of launching the user-
supplied executable and communicating with it (Figure 5-2).
- In the case of Streaming, the Streaming task communicates with the process (which may be written in
any language) using standard input and output streams.
- The Pipes task, on the other hand, listens on a socket and passes the C++ process a port number in its
environment, so that on startup, the C++ process can establish a persistent socket connection back to
the parent Java Pipes task.
Progress and Status Updates
- MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
- A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g.,
running, successfully completed, failed), the progress of maps and reduces, the values of the job’s
counters, and a status message or description (which may be set by user code).
- These statuses change over the course of the job, so how do they get communicated back to the client?
- When a task is running, it keeps track of its progress, that is, the proportion of the task completed.
- If a task reports progress, it sets a flag to indicate that the status change should be sent to the
tasktracker.
- The flag is checked in a separate thread every three seconds, and if set, the thread notifies the tasktracker of the current task status. Meanwhile, the tasktracker sends heartbeats to the jobtracker every few seconds (the interval depends on the size of the cluster: the larger the cluster, the longer the interval), and the status of all the tasks being run by the tasktracker is sent in the call.
- The jobtracker combines these updates to produce a global view of the status of all the jobs being run
and their constituent tasks.
- Finally, as mentioned earlier, the Job receives the latest status by polling the jobtracker every second.
Clients can also use Job’s getStatus() method to obtain a JobStatus instance, which contains all of the
status information for the job.
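For example, assuming job is a submitted Job (a sketch; getStatus() can throw IOException and InterruptedException, which the surrounding code must handle):

    // Poll the job's status; JobStatus is org.apache.hadoop.mapreduce.JobStatus
    JobStatus status = job.getStatus();
    System.out.printf("state: %s, map %.0f%%, reduce %.0f%%%n",
        status.getState(),
        status.getMapProgress() * 100,
        status.getReduceProgress() * 100);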
- The method calls are illustrated in Figure 5-3
Figure 5-3. How status updates are propagated through the MapReduce 1 system
Job Completion
- When the jobtracker receives a notification that the last task for a job is complete, it changes the status
for the job to “successful.”
- Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a
message to tell the user and then returns from the waitForCompletion() method.
- The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property (see the sketch after this list).
- Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).
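As a sketch of configuring such a callback (the URL is a made-up example, and conf is the job's Configuration; to the best of my knowledge Hadoop substitutes the $jobId and $jobStatus placeholders when it makes the request, but verify this against your release):

    // Hypothetical callback endpoint for job-completion notifications
    conf.set("job.end.notification.url",
        "http://myserver:8080/jobnotify?id=$jobId&status=$jobStatus");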
2. YARN (MapReduce 2)
For very large clusters in the region of 4000 nodes and higher, the MapReduce system described in the
previous section begins to hit scalability bottlenecks, so in 2010 a group at Yahoo! began to design the next
generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator (or, if you prefer recursive acronyms, YARN Application Resource Negotiator).
You can run a MapReduce job with a single method call: submit() on a Job object (you can also call
waitForCompletion(), which submits the job if it hasn’t been submitted already, then waits for it to finish).
This method call conceals a great deal of processing behind the scenes.
The whole process is illustrated in Figure 5-4.
At the highest level, there are five independent entities:
1. The client, which submits the MapReduce job.
2. The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
3. The YARN node managers, which launch and monitor the compute containers on machines in the
cluster.
4. The MapReduce application master, which coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the resource
manager and managed by the node managers.
5. The distributed filesystem (normally HDFS), which is used for sharing job files between the other
entities.
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on
it (step 1 in Figure 5-4). Having submitted the job, waitForCompletion() polls the job’s progress once
per second and reports the progress to the console if it has changed since the last report.
When the job completes successfully, the job counters are displayed. Otherwise, the error that caused
the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
- Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2).
- Checks the output specification of the job. For example, if the output directory has not been
specified or it already exists, the job is not submitted and an error is thrown to the MapReduce
program.
- Computes the input splits for the job. If the splits cannot be computed (because the input paths
don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce
program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file,
and the computed input splits, to the shared filesystem in a directory named after the job ID
(step 3).
- The job JAR is copied with a high replication factor (controlled by the
mapreduce.client.submit.file.replication property, which defaults to 10) so that there are lots of
copies across the cluster for the node managers to access when they run tasks for the job.
- Submits the job by calling submitApplication() on the resource manager (step 4).
Job Initialization
- When the resource manager receives a call to its submitApplication() method, it hands off the request
to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches
the application master’s process there, under the node manager’s management (steps 5a and 5b).
- The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It
initializes the job by creating a number of bookkeeping objects to keep track of the job’s progress, as it
will receive progress and completion reports from the tasks (step 6).
- Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It then
creates a map task object for each split, as well as a number of reduce task objects determined by the
mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given
IDs at this point.
- Next, the application master decides how to run the tasks that make up the job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
- What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block (the thresholds are configurable; see the sketch after this list).
- Finally, before any tasks can be run, the application master calls the setupJob() method on the
OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output
directory for the job and the temporary working space for the task output.
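These thresholds correspond to configuration properties; the following sketch (where conf is the job's Configuration) shows them with what I believe are the defaults, which you should verify against your release:

    // Allow small jobs to run in the application master's JVM
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    // Believed defaults defining "small": at most 9 maps and 1 reduce
    // (the input-size threshold defaults to one HDFS block)
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);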
Task Assignment
- If the job does not qualify for running as an uber task, then the application master requests containers
for all the map and reduce tasks in the job from the resource manager (step 8).
- Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all
the map tasks must complete before the sort phase of the reduce can start.
- Requests for reduce tasks are not made until 5% of map tasks have completed.
- Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints
that the scheduler tries to honor. In the optimal case, the task is data local—that is, running on the
same node that the split resides on. Alternatively, the task may be rack local: on the same rack, but not
the same node, as the split. Some tasks are neither data local nor rack local and retrieve their data from
a different rack than the one they are running on. For a particular job run, you can determine the
number of tasks that ran at each locality level by looking at the job’s counters.
- Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task
is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis
via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb,
mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.
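For example (a sketch, where conf is the job's Configuration; the values shown are just the defaults stated above, set explicitly):

    // Per-task container resource requests (defaults shown)
    conf.setInt("mapreduce.map.memory.mb", 1024);
    conf.setInt("mapreduce.reduce.memory.mb", 1024);
    conf.setInt("mapreduce.map.cpu.vcores", 1);
    conf.setInt("mapreduce.reduce.cpu.vcores", 1);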
Task Execution
- Once a task has been assigned resources for a container on a particular node by the resource manager’s
scheduler, the application master starts the container by contacting the node manager (steps 9a and 9b).
- The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it
localizes the resources that the task needs, including the job configuration and JAR file, and any files
from the distributed cache (step 10).
- The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and reduce functions
(or even in YarnChild) don’t affect the node manager—by causing it to crash or hang, for example.
- Each task can perform setup and commit actions, which are run in the same JVM as the task itself and
are determined by the OutputCommitter for the job.
- For file-based jobs, the commit action moves the task output from a temporary location to its final
location. The commit protocol ensures that when speculative execution is enabled, only one of the
duplicate tasks is committed and the other is aborted.
Streaming
Streaming works in the same way as in MapReduce 1 (see Figure 5-2): a special Streaming task launches the user-supplied executable and communicates with it over standard input and output streams.
Progress and Status Updates
A job and each of its tasks have a status, which includes such things as the state of the job or
task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the
job’s counters, and a status message or description (which may be set by user code).
These statuses change over the course of the job, so how do they get communicated back to
the client? When a task is running, it keeps track of its progress (i.e., the proportion of the task completed). For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it is an estimate of the proportion of the reduce input processed.
Figure 5-6. How status updates are propagated through the MapReduce system
Tasks also have a set of counters that count various events as the task runs, which are either
built into the framework, such as the number of map output records written, or defined by users. As the
map or reduce task runs, the child process communicates with its parent application master through the
umbilical interface. The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the umbilical interface.
The resource manager web UI displays all the running applications with links to the web UIs
of their respective application masters, each of which displays further details on the MapReduce job,
including its progress.
During the course of the job, the client receives the latest status by polling the application master every second (the interval is set via mapreduce.client.progressmonitor.pollinterval). Clients can
also use Job’s getStatus() method to obtain a JobStatus instance, which contains all of the status
information for the job.
Job Completion
When the application master receives a notification that the last task for a job is complete, it
changes the status for the job to “successful.” Then, when the Job polls for status, it learns that the job
has completed successfully, so it prints a message to tell the user and then returns from the
waitForCompletion() method. Job statistics and counters are printed to the console at this point.
The application master also sends an HTTP job notification if it is configured to do so. This
can be configured by clients wishing to receive callbacks, via the mapreduce.job.end-notification.url
property.
Finally, on job completion, the application master and the task containers clean up their
working state (so intermediate output is deleted), and the OutputCommitter’s commitJob() method is
called. Job information is archived by the job history server to enable later interrogation by users if
desired.
Tuning a Job
After a job is working, the question many developers ask is, “Can I make it run faster?” There are a few
Hadoop-specific “usual suspects” that are worth checking to see whether they are responsible for a performance
problem. You should run through the checklist in Table-1 before you start trying to profile or optimize at the
task level.
Profiling Tasks
Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls
down the profile information to your machine for later analysis with standard profiling tools.
Of course, it’s possible, and somewhat easier, to profile a job running in the local job runner.
And provided you can run with enough input data to exercise the map and reduce tasks, this can be a
valuable way of improving the performance of your mappers and reducers. There are a couple of
caveats, however. The local job runner is a very different environment from a cluster, and the data flow
patterns are very different. Optimizing the CPU performance of your code may be pointless if your
MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should
compare the new execution time with the old one running on a real cluster. Even this is easier said than
done, since job execution times can vary due to resource contention with other jobs and the decisions
the scheduler makes regarding task placement. To get a good idea of job execution time under these
circumstances, perform a series of runs (with and without the change) and check whether any
improvement is statistically significant.
There are a number of configuration properties to control profiling, which are also exposed via convenience
methods on JobConf. Enabling profiling is as simple as setting the property mapreduce.task.profile to true:
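For example (a sketch, where conf is the job's Configuration; the two range properties, which restrict profiling to the given task IDs, are stated from memory and should be verified for your release):

    conf.setBoolean("mapreduce.task.profile", true);
    // Optionally profile only a few tasks, by task ID range:
    conf.set("mapreduce.task.profile.maps", "0-2");
    conf.set("mapreduce.task.profile.reduces", "0-2");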
This runs the job as normal, but adds an -agentlib parameter to the Java command used to launch the task containers on the node managers. You can control the precise parameter that is added by setting the mapreduce.task.profile.params property. The default uses HPROF, a profiling tool that comes with the JDK that, although basic, can give valuable information about a program's CPU and heap usage.
The profile output for each task is saved with the task logs in the userlogs subdirectory of the node
manager’s local log directory (alongside the syslog, stdout, and stderr files), and can be retrieved in the way
described in “Hadoop Logs”, according to whether log aggregation is enabled or not.
Hadoop produces logs in various places, and for various audiences. These are summarized in Table-2.
MapReduce logs support various levels. You can configure the log levels for the MapReduce service
and tasks.
You can set log levels to any of the following values:
Level   Description
DEBUG   Logs all debug-level and informational messages.
INFO    Logs all informational messages and more serious messages. This is the default log level.
WARN    Logs only those messages that are warnings or more serious.
ERROR   Logs only those messages that indicate error conditions or more serious messages.
FATAL   Logs only those messages in which the system is unusable.
To modify the level of the log printed to the console, change the value of the log4j.rootLogger property in the log configuration file.
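For example, in log4j.properties (the console appender name is illustrative; your file may define different appenders):

    log4j.rootLogger=DEBUG, console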
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default. This can be
changed in hadoop-env.sh.
Each daemon produces two logfiles. The first is the log output written via log4j; this file's name ends in .log. Old logfiles are never deleted, so you should arrange for them to be periodically deleted or archived, so as to not run out of disk space on the local node.
The second logfile is the combined standard output and standard error log. This logfile, which ends in
.out, usually contains little or no output, since Hadoop uses log4j for logging. It is only rotated when
the daemon is restarted, and only the last five logs are retained. Old logfiles are suffixed with a number
between 1 and 5, with 5 being the oldest file.
Audit Logging
HDFS has the ability to log all filesystem access requests, a feature that some organizations require for
auditing purposes. Audit logging is implemented using log4j logging at the INFO level, and in the default configuration it is disabled. You can enable audit logging by changing the audit logger's level from WARN to INFO in the log configuration file, and the result will be a log line written to the namenode's log for every HDFS event.
It is a good idea to configure log4j so that the audit log is written to a separate file and isn’t mixed up
with the namenode’s other log entries.
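In the stock log4j.properties shipped with Hadoop, the relevant logger looks like the following once changed from WARN to INFO (the exact line may vary by release):

    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO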
Job History
Job history refers to the events and configuration for a completed job. It is retained whether the job was
successful or not, in an attempt to provide interesting information for the user running a job.
Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs
directory.
The jobtracker’s history files are kept for 30 days before being deleted by the system.
The history log includes job, task, and attempt events, all of which are stored in a plaintext file. The
history for a particular job may be viewed through the web UI, or via the command line, using hadoop
job -history (which you point at the job’s output directory).
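For example (the output directory path is illustrative):

    % hadoop job -history output/max-temp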
Task logs (the syslog, stdout, and stderr files produced by each task attempt) are accessible through the web UI, which is the most convenient way to view them. You can also find the logfiles on the local filesystem of the tasktracker that ran the task attempt, in a directory named by the task attempt. If task JVM reuse is enabled, each logfile accumulates the logs of multiple task attempts. It is straightforward to write to these logfiles: anything written to standard output or standard error is directed to the relevant logfile.
The default log level is INFO, so DEBUG level messages do not appear in the syslog task log
file. However, sometimes you want to see these messages. To do this, set mapred.map.child.log.level or mapred.reduce.child.log.level, as appropriate (available from release 0.22). For example, we could set it for the mapper to see the map values in the log as follows:
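For example, on the command line when launching the job (the JAR name, driver class, and paths are illustrative; this assumes the driver uses ToolRunner so that -D options are honored):

    % hadoop jar job.jar MaxTemperatureDriver \
        -D mapred.map.child.log.level=DEBUG input/ncdc output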
There are some controls for managing retention and size of task logs. By default, logs are deleted after
a minimum of 24 hours (set using the mapred.userlog.retain.hours property). You can also set a cap on
the maximum size of each logfile using the mapred.userlog.limit.kb property, which is 0 by default,
meaning there is no cap.
Debugging a Job
Suppose we suspect that a few corrupt records with implausibly high temperatures are skewing our maximum-temperature job. For this particular case, where we are looking for (what we think is) an unusual case, we can use a
debug statement to log to standard error, in conjunction with a message to update the task’s status message to
prompt us to look in the error log. The web UI makes this easy, as we will see.
We also create a custom counter to count the total number of records with implausible temperatures in
the whole dataset. This gives us valuable information about how to deal with the condition—if it turns out to be
a common occurrence, then we might need to learn more about the condition and how to extract the temperature
in these cases, rather than simply dropping the record. In fact, when trying to debug a job, you should always
ask yourself if you can use a counter to get the information you need to find out what’s happening. Even if you
need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem.
If the amount of log data you produce in the course of debugging is large, then you’ve got a couple of
options. The first is to write the information to the map’s output, rather than to standard error, for analysis and
aggregation by the reduce. This approach usually necessitates structural changes to your program, so start with
the other techniques first.
The second option is to write a program (in MapReduce, of course) to analyze the logs produced by your job. We add our debugging to the mapper, as opposed to the reducer, because we want to find out what the source data causing the anomalous output looks like:
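A sketch of such a mapper follows; the field offsets used to parse the year and temperature are illustrative of the NCDC weather format rather than authoritative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      // Counter field: one enum value per condition we want to count
      enum Temperature { OVER_100 }

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);      // illustrative offsets
        String temp = line.substring(87, 92);
        if (temp.startsWith("+")) {
          temp = temp.substring(1);                // strip leading sign
        }
        int airTemperature = Integer.parseInt(temp);
        if (airTemperature > 1000) {               // over 100°C, in tenths
          System.err.println("Temperature over 100 degrees for input: " + value);
          context.setStatus("Detected possibly corrupt record: see logs.");
          context.getCounter(Temperature.OVER_100).increment(1);
        }
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }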
If the temperature is over 100°C (represented by 1000, since temperatures are in tenths of a degree), we print a line to standard error containing the suspect record, and we update the map's status message using the setStatus() method on Context to direct us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program we have defined a single field, OVER_100, as a way to count the number of records with a temperature of over 100°C.
With this modification, we recompile the code, re-create the JAR file, rerun the job, and while it's running go to the tasks page.
The tasks page
The job page has a number of links for looking at the tasks in a job in more detail. For example, by clicking on the “map” link, you are brought to a page that lists information for all of the
map tasks on one page. You can also see just the completed tasks. The screenshot in Figure 5-7 shows a portion
of this page for the job run with our debugging statements.
Each row in the table is a task, and it provides such information as the start and end times for each task, any
errors reported back from the tasktracker, and a link to view the counters for an individual task.
The “Status” column can be helpful for debugging, since it shows a task’s latest status message. Before a task
starts, it shows its status as “initializing,” then once it starts reading records it shows the split information for the
split it is reading as a filename with a byte offset and length. You can see the status we set for debugging for
task task_200904110811_0003_m_000044, so let’s click through to the logs page to find the associated debug
message. (Notice, too, that there is an extra counter for this task, since our user counter has a nonzero count for
this task.)
The task details page
From the tasks page, you can click on any task to get more information about it. The task
details page, shown in Figure 5-8, shows each task attempt. In this case, there was one task attempt, which
completed successfully. The table provides further useful data, such as the node the task attempt ran on, and
links to task logfiles and counters.
The “Actions” column contains links for killing a task attempt. By default, this is disabled, making the web UI a
read-only interface. Set webinterface.private.actions to true to enable the actions links.
For map tasks, there is also a section showing which nodes the input split was located on. By following
one of the links to the logfiles for the successful task attempt (you can see the last 4 KB or 8 KB of each logfile,
or the entire file), we can find the suspect input record that we logged (the line is wrapped and truncated to fit on
the page):
0335999999433181957042302005+37950+139117SAO+0004RJSNV020113590031500703569999994332019
57010100005+35317+139650SAO +000899999V02002359002650076249N004000599+0067...
This record seems to be in a different format to the others. For one thing, there are spaces in the line, which are
not described in the specification.
When the job has finished, we can look at the value of the counter we defined to see how many records over
100°C there are in the whole dataset. Counters are accessible via the web UI or the command line:
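For example (the job ID matches the task shown earlier; the group name assumes the enum is nested in a mapper class called MaxTemperatureMapper in a package v3, which is illustrative):

    % hadoop job -counter job_200904110811_0003 \
        'v3.MaxTemperatureMapper$Temperature' OVER_100
    3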
The -counter option takes the job ID, counter group name (which is the fully qualified classname here), and the
counter name (the enum name). There are only three malformed records in the entire dataset of over a billion
records.
Throwing out bad records is standard for many big data problems, although we need to be careful in this case,
since we are looking for an extreme value—the maximum temperature rather than an aggregate measure. Still,
throwing away three records is probably not going to change the result.