Job Scheduling in MapReduce
Definition
Job Scheduling in MapReduce is the mechanism by which Hadoop decides the order in which submitted
jobs run and how cluster resources (such as CPU and memory) are allocated to them. It plays a vital role in ensuring that resources are
fairly and efficiently distributed among multiple users and their applications. The goal is to maximize
throughput, minimize response time, and provide fairness and resource guarantees when necessary.
MapReduce Algorithm
MapReduce is a data processing paradigm that allows for distributed computation on large datasets across a
cluster of machines. A job runs in three phases:
- Map Phase: Input splits are processed in parallel, producing intermediate key-value pairs.
- Shuffle and Sort Phase: Intermediate data is sorted and grouped by key.
- Reduce Phase: Aggregates the values associated with each key to produce the final output.
Job scheduling in this context ensures that tasks in each phase are executed efficiently on available nodes.
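To make the phases concrete, the sketch below implements the canonical word-count job against the standard org.apache.hadoop.mapreduce API. Input and output paths come from the command line; the shuffle-and-sort step is performed implicitly by the framework between the mapper and the reducer.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: by now the framework has shuffled and sorted by key,
  // so each call receives one word together with all of its counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}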
Hadoop Schedulers
Schedulers in Hadoop manage how jobs are assigned to resources. They aim to enforce policies such as
fairness, prioritization, and guaranteed capacities. Hadoop supports different types of schedulers to match
different workload patterns and organizational needs.
1. FIFO Scheduler
- Jobs are placed in a single queue and executed in the order of submission.
- Lacks fairness and may delay short jobs if long jobs are submitted earlier.
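A toy simulation (plain Java, not Hadoop code) makes the FIFO drawback visible: with a single queue and one execution slot, a 2-minute job submitted just after a 60-minute job cannot finish before t=62 minutes. The job names and durations below are invented for illustration.

import java.util.ArrayDeque;
import java.util.Queue;

public class FifoSimulation {
  record SimJob(String name, int minutes) {}

  public static void main(String[] args) {
    Queue<SimJob> queue = new ArrayDeque<>();
    queue.add(new SimJob("nightly-etl", 60));  // long job, submitted first
    queue.add(new SimJob("ad-hoc-query", 2));  // short job, submitted second

    int clock = 0;
    while (!queue.isEmpty()) {
      SimJob job = queue.poll();               // strictly in submission order
      clock += job.minutes();
      System.out.printf("%s finishes at t=%d min%n", job.name(), clock);
    }
    // Output: nightly-etl finishes at t=60, ad-hoc-query at t=62,
    // even though the second job needed only 2 minutes of work.
  }
}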
2. Capacity Scheduler
- Developed by Yahoo! to let multiple organizations share a large cluster.
- Resources are divided among named queues, each configured with a guaranteed fraction of cluster capacity.
- Queues can elastically borrow capacity that other queues are not using; borrowed resources are given back as demand rises.
- Within a single queue, jobs are scheduled in FIFO order.
- Best suited for organizations that need predictable capacity guarantees per team or department.
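On the job side, directing work at a particular queue is a one-line configuration. A minimal sketch, assuming an administrator has already defined a queue (the name "analytics" here is hypothetical) via properties such as yarn.scheduler.capacity.root.queues and yarn.scheduler.capacity.root.analytics.capacity in capacity-scheduler.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
  public static Job jobForQueue(String queueName) throws Exception {
    Configuration conf = new Configuration();
    // Standard MapReduce property naming the target scheduler queue;
    // jobs from different teams land in different queues with guaranteed shares.
    conf.set("mapreduce.job.queuename", queueName);
    // The job name "report-generation" is illustrative only.
    Job job = Job.getInstance(conf, "report-generation");
    // ... set mapper/reducer/input/output as usual, then submit the job.
    return job;
  }
}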
3. Fair Scheduler
- Developed by Facebook to provide fair sharing of resources among all running jobs.
- Ensures all users/jobs get approximately equal resource share over time.
- Supports job pools, each with guaranteed minimum and fair shares.
- Allows preemption: tasks of jobs running above their fair share may be killed so that pools below their guaranteed or fair share can reclaim resources.
- Best suited for environments with mixed workloads and multiple users.
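The "approximately equal share" behavior is an instance of max-min fairness: split capacity evenly, give fully satisfied pools only what they demand, and redistribute the leftovers among pools that still want more. The toy Java sketch below (not Hadoop code; pool names and demands are invented) computes such an allocation:

import java.util.LinkedHashMap;
import java.util.Map;

public class MaxMinFairShare {
  // Allocate `capacity` slots among pools according to max-min fairness.
  public static Map<String, Double> allocate(double capacity, Map<String, Double> demand) {
    Map<String, Double> share = new LinkedHashMap<>();
    Map<String, Double> pending = new LinkedHashMap<>(demand);
    double remaining = capacity;
    while (!pending.isEmpty() && remaining > 1e-9) {
      double equal = remaining / pending.size();   // equal split of what's left
      // Pools whose demand fits under the equal split get exactly their demand...
      var satisfied = pending.entrySet().stream()
          .filter(e -> e.getValue() <= equal)
          .map(e -> Map.entry(e.getKey(), e.getValue()))
          .toList();
      if (satisfied.isEmpty()) {
        // ...otherwise every remaining pool gets the equal share and we are done.
        for (String pool : pending.keySet()) share.put(pool, equal);
        return share;
      }
      for (var e : satisfied) {
        share.put(e.getKey(), e.getValue());
        remaining -= e.getValue();
        pending.remove(e.getKey());
      }
    }
    for (String pool : pending.keySet()) share.put(pool, 0.0);
    return share;
  }

  public static void main(String[] args) {
    // 100 slots, three pools: the small pool gets all 10 it asks for,
    // the two big pools split the remaining 90 evenly (45 each).
    System.out.println(allocate(100, Map.of("adhoc", 10.0, "etl", 80.0, "ml", 70.0)));
  }
}

This mirrors the outcome the Fair Scheduler converges toward over time: no pool can gain more without taking capacity from a pool that already has less.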
Advantages
- Elastic resource sharing (e.g., the Capacity Scheduler lets queues borrow unused capacity).
- Short jobs are no longer starved behind long-running jobs, improving average response time.
- Multiple users and organizations can share one cluster with predictable, guaranteed shares.
Disadvantages
- Monitoring and managing multiple queues and pools can add overhead.
- Improper setup may lead to inefficient cluster usage or unfair resource distribution.