Schwarzkopf - Omega Algorithm
Malte Schwarzkopf
Andy Konwinski
Michael Abd-El-Malek
John Wilkes
[email protected]
[email protected]
{mabdelmalek,johnwilkes}@google.com
Abstract
Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control.
We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach, all driven by real-life Google production workloads.
Categories and Subject Descriptors D.4.7 [Operating Systems]: Organization and Design - Distributed systems; K.6.4 [Management of computing and information systems]: System Management - Centralization/decentralization
Keywords Cluster scheduling, optimistic concurrency control
1. Introduction
Large-scale compute clusters are expensive, so it is important to use them well. Utilization and efficiency can be increased by running a mix of workloads on the same machines: CPU- and memory-intensive jobs, small and large ones, and a mix of batch and low-latency jobs (ones that serve end-user requests or provide infrastructure services such as storage, naming or locking). This consolidation reduces the amount of hardware required for a workload, but it makes the scheduling problem (assigning jobs to machines) more complicated: a wider range of requirements and policies have to be taken into account.
                        lightweight simulator      high-fidelity simulator
    λ_jobs              sampled                    actual data
    Task duration       sampled                    actual data
    Sched. constraints  ignored                    obeyed
    Sched. algorithm    randomized first fit       Google algorithm
    Runtime             fast (24h ≈ 5 min.)        slow (24h ≈ 2h)

Table 2: Comparison of the two simulators; "actual data" refers to use of information found in a detailed workload-execution trace taken from a production cluster.
permitted (e.g., a common notion of whether a machine is full), and a common scale for expressing the relative importance of jobs, called precedence. These rules are deliberately kept to a minimum. The two-level scheme's centralized resource allocator component is thus simplified to a persistent data store with validation code that enforces these common rules. Since there is no central policy-enforcement engine for high-level cluster-wide goals, we rely on these goals showing up as emergent behaviors that result from the decisions of individual schedulers. In this, it helps that fairness is not a primary concern in our environment: we are driven more by the need to meet business requirements. In support of these, individual schedulers have configuration settings to limit the total amount of resources they may claim, and to limit the number of jobs they admit. Finally, we also rely on post-facto enforcement, since we are monitoring the system's behavior anyway.
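To make these common rules concrete, the following is a minimal sketch (illustrative Python, not Omega's actual store or API) of the kind of validation a persistent cell-state store might apply when a scheduler commits a placement: it checks a shared notion of machine fullness and the per-scheduler limits on claimed resources and admitted jobs described above. All class names, fields and thresholds are assumptions made for the example.

    from dataclasses import dataclass

    @dataclass
    class Machine:
        cpu_free: float
        ram_free: float

    @dataclass
    class SchedulerLimits:
        max_cpu: float           # total CPU this scheduler may claim
        max_jobs: int            # maximum number of jobs it may admit
        cpu_claimed: float = 0.0
        jobs_admitted: int = 0

    def validate_placement(machine: Machine, cpu: float, ram: float,
                           limits: SchedulerLimits) -> bool:
        """Common rules a shared-state store could enforce (illustrative):
        a machine is 'full' if the request does not fit, and each scheduler
        must stay within its configured resource and job-count limits."""
        if cpu > machine.cpu_free or ram > machine.ram_free:
            return False                  # common notion of "full"
        if limits.cpu_claimed + cpu > limits.max_cpu:
            return False                  # per-scheduler resource cap
        if limits.jobs_admitted + 1 > limits.max_jobs:
            return False                  # per-scheduler admission cap
        return True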
The performance viability of the shared-state approach is ultimately determined by the frequency at which transactions fail and the costs of such failures. The rest of this paper explores these issues for typical cluster workloads at Google.
4. Design comparisons
To understand the tradeoffs between the different approaches described before (monolithic, two-level and shared-state schedulers), we built two simulators:
1. A lightweight simulator driven by synthetic workloads using parameters drawn from empirical workload distributions. We use this to compare the behavior of all three architectures under the same conditions and with identical workloads. By making some simplifications, this lightweight simulator allows us to sweep across a broad range of operating points within a reasonable runtime. The lightweight simulator also does not contain any proprietary Google code and is available as open-source software at https://code.google.com/p/cluster-scheduler-simulator/.
2. A high-fidelity simulator that replays historic workload traces from Google production clusters, and reuses much of the Google production scheduler's code. This gives us behavior closer to the real system, at the price of only supporting the Omega architecture and running a lot more slowly than the lightweight simulator: a single run can take days.
Figure 5: Schedulers' job wait time, as a function of t_job in the monolithic single-path case and of t_job(service) in the monolithic multi-path and shared-state cases; panels (a) single-path, (b) multi-path, (c) shared state. The SLO (horizontal bar) is 30s.

Figure 6: Schedulers' busyness, as a function of t_job in the monolithic single-path case and of t_job(service) in the monolithic multi-path and shared-state cases; panels (a) single-path, (b) multi-path, (c) shared state. The value is the median daily busyness over the 7-day experiment, and error bars are one median absolute deviation (MAD), i.e. the median deviation from the median value, a robust estimator of typical value dispersion.
The rest of this section describes the simulators and our
experimental setup.
Simplifications in the lightweight simulator. In the lightweight simulator, we trade accuracy for speed and flexibility by making some simplifying assumptions, summarized in Table 2.
The simulator is driven by a workload derived from real workloads that ran on the same clusters and during the same time periods discussed in §2.1. While the high-fidelity simulator is driven by the actual workload traces, for the lightweight simulator we analyze the workloads to obtain distributions of parameter values such as the number of tasks per job, the task duration, the per-task resources and job inter-arrival times, and then synthesize jobs and tasks that conform to these distributions.
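As an illustration of this synthesis step (a sketch only; the open-source simulator linked above is the authoritative implementation), the following samples jobs whose parameters follow empirical distributions extracted from a trace. The distribution values, field layout and the exponential inter-arrival assumption are all illustrative assumptions.

    import random

    # Empirical samples extracted from a trace (values here are made up).
    TASKS_PER_JOB   = [1, 1, 2, 10, 50, 200]      # tasks per job
    TASK_DURATION_S = [30, 120, 600, 3600]        # seconds
    CPU_PER_TASK    = [0.1, 0.25, 0.5, 1.0]       # cores
    MEAN_INTERARRIVAL_S = 5.0                     # mean job inter-arrival time

    def sample(dist):
        """Draw uniformly from an empirical sample (a simple bootstrap)."""
        return random.choice(dist)

    def synthesize_jobs(n_jobs, seed=0):
        """Yield (arrival_time, num_tasks, task_duration, cpu_per_task) tuples
        whose marginals follow the empirical distributions above."""
        random.seed(seed)
        t = 0.0
        for _ in range(n_jobs):
            # Inter-arrival times assumed exponential here for simplicity.
            t += random.expovariate(1.0 / MEAN_INTERARRIVAL_S)
            yield (t, sample(TASKS_PER_JOB), sample(TASK_DURATION_S),
                   sample(CPU_PER_TASK))

    if __name__ == "__main__":
        for job in synthesize_jobs(5):
            print(job)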
At the start of a simulation, the lightweight simulator initializes cluster state using task-size data extracted from the relevant trace, but only instantiates sufficiently many tasks to utilize about 60% of cluster resources, which is comparable to the utilization level described in [24]. In production, Google speculatively over-commits resources, but the mechanisms and policies for this are too complicated to be replicated in the lightweight simulator.
The simulator can support multiple scheduler types, but initially we consider just two: batch and service. The two types of job have different parameter distributions, summarized in §2.1.
To improve simulation runtime in pathological situations, we limit any single job to 1,000 scheduling attempts; the simulator abandons the job at this point if some tasks are still unscheduled. In practice, this only matters for the two-level scheduler (see §4.2), and is rarely triggered by the others.
Parameters. We model the scheduler decision time as a linear function of the form t_decision = t_job + t_task × (tasks per job), where t_job is a fixed per-job decision overhead and t_task is the incremental decision cost per task.
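A small worked example of this model (with made-up constants, not measured Google values) shows how a per-job decision time follows from t_job, t_task and the job size; the busyness estimate at the end (arrival rate times decision time, ignoring conflicts and retries) is our own rough simplification for illustration.

    def decision_time(t_job, t_task, tasks_per_job):
        """Linear decision-time model: a fixed per-job overhead plus a
        per-task cost for each task in the job."""
        return t_job + t_task * tasks_per_job

    # Illustrative values only: 0.1s per-job overhead, 5ms per task, 100 tasks.
    t_decision = decision_time(t_job=0.1, t_task=0.005, tasks_per_job=100)

    # If jobs arrive at lambda_jobs jobs/sec, the fraction of time the scheduler
    # spends making decisions (its "busyness") is roughly lambda_jobs * t_decision,
    # ignoring conflicts and retries.
    lambda_jobs = 0.5
    busyness = lambda_jobs * t_decision
    print(t_decision, busyness)   # 0.6s per decision, roughly 30% busy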
Figure 9: Shared-state scheduling (Omega): varying the arrival rate for the batch workload (λ_jobs(batch)) for cluster B; 1.0 is the default rate. Each line represents a different number of batch schedulers (1, 2, 4, 8, 16, 32); panels (a) mean conflict fraction, (b) mean scheduler busyness.
The high-fidelity simulator is driven by real-world workloads. We use it to answer the following questions:
1. How much scheduling interference is present in real-world workloads, and what scheduler decision times can we afford in production (§5.1)?
2. What are the effects of different conflict detection and resolution techniques on real workloads (§5.2)?
3. Can we take advantage of having access to the entire state of the cell in a scheduler (§6)?
Large-scale production systems are enormously complicated, and thus even the high-fidelity simulator employs a few simplifications. It does not model machine failures (as these only generate a small load on the scheduler); it does not model the disparity between resource requests and the actual usage of those resources in the traces (further discussed elsewhere [24]); it fixes the allocations at the initially-requested sizes (a consequence of limitations in the trace data); and it disables preemptions, because we found that they make little difference to the results, but significantly slow down the simulations.
Figure 10: Lightweight simulator: impact of varying t_job(service) (right axis) and t_task(service) (left axis) on scheduler busyness (z-axis) in different scheduling schemes, on cluster B; panels (a) monolithic scheduler, single-path; (b) monolithic scheduler, multi-path; (c) two-level scheduling (Mesos); (d) shared-state (Omega); (e) shared-state with coarse-grained conflict detection and gang scheduling. Red shading of a 3D graph means that part of the workload remained unscheduled.

Figure 11: Shared-state scheduling (Omega): effect on service scheduler busyness of varying t_job(service) and t_task(service), using the high-fidelity simulator and a 29-day trace from cluster C.
As expected, the outputs of the two simulators generally agree. The main difference is that the lightweight simulator's runs experience less interference, which is likely a result of the lightweight simulator's lack of support for placement constraints (which makes picky jobs seem easier to schedule than they are), and its simpler notion of when a machine is considered full (which means it sees fewer conflicts with fine-grained conflict detection, cf. §5.2).
We can nonetheless confirm all the trends the lightweight simulator demonstrates for the Omega shared-state model using the high-fidelity simulator. We believe this confirms that the lightweight simulator experiments provide plausible comparisons between different scheduling architectures under a common set of assumptions.
5.1 Scheduling performance
Figure 11 shows how service scheduler busyness varies as a function of both t_job(service) and t_task(service) for a month-long trace of cluster C (covering the same workload as the public trace). Encouragingly, the scheduler busyness remains low across almost the entire range for both, which means that the Omega architecture scales well to long decision times for service jobs.
Scaling the workload. We also investigate the performance of the shared-state architecture using a 7-day trace from cluster B, which is one of the largest and busiest Google clusters. Again, we vary t_job(service). In Figure 12b, once t_job(service) reaches about 10s, the conflict fraction increases beyond 1.0, so that scheduling a service job requires at least one retry, on average.
At around the same point, we fail to meet the 30s job wait time SLO for the service scheduler (Figure 12a), even though the scheduler itself is not yet saturated: the additional wait time is purely due to the impact of conflicts. To confirm this, we approximate the time that the scheduler would have taken if it had experienced no conflicts or retries (the "no conflict" case in Figure 12c), and find that the service scheduler busyness with conflicts is about 40% higher than in the no-conflict case. This is a higher level of interference compared to cluster C, most likely because of a much higher batch load on cluster B.
Despite these relatively high conflict rates, our experiments show that the shared-state Omega architecture can support service schedulers that take several seconds to make a decision. We also investigated scaling the per-task decision time, and found that we can support a t_task(service) of 1 second (at a t_job(service) of 0.1s), resulting in a conflict fraction of about 0.2. This means that we can support schedulers with a high one-off per-job decision time, as well as ones with a large per-task decision time.
Load-balancing the batch scheduler. With the monolithic single-path scheduler (§4.1), the high batch job arrival rate requires the use of basic, simple scheduling algorithms: it simply is not possible to use smarter, more time-consuming scheduling algorithms for these jobs, as we already miss the SLO on cluster B due to the high load. Batch jobs want to survive failures, too, and their placement quality would doubtless improve if a scheduler could be given a little more time to make a decision. Fortunately, the Omega architecture can easily achieve this by load-balancing the scheduling of batch jobs across multiple batch schedulers.
Figure 12: Shared-state scheduling (Omega): performance effects of varying t_job(service) on a 7-day trace from cluster B, with t_task(service) fixed at 5ms; panels (a) job wait time (batch and service, mean and 90th percentile), (b) mean conflict fraction (batch and service), (c) scheduler busyness (batch, service, and the approximated no-conflict case).

Figure 13: Shared-state scheduling (Omega): performance effects of splitting the batch workload across 3 batch schedulers, varying t_job(batch) in a 24h trace from cluster C, with t_task(batch) fixed at 5ms; panels (a) scheduler busyness (Batch 0, Batch 1, Batch 2, Service, and an approximated single batch scheduler for comparison), (b) job wait time (mean and 90th percentile per scheduler).
To test this, we run an experiment with three parallel batch schedulers, partitioning the workload across them by hashing the job identifiers, akin to the earlier experiment with the simple simulator. We achieve an increase in scalability of 3×, moving the saturation point from a t_job(batch) of about 4s to 15s (Figure 13a). At the same time, the conflict rate remains low (around 0.1), and all schedulers meet the 30s job wait time SLO until the saturation point (Figure 13b).
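The partitioning scheme itself is straightforward; a minimal sketch (hypothetical code, not the production implementation) of hashing job identifiers across three batch schedulers:

    import hashlib

    NUM_BATCH_SCHEDULERS = 3

    def scheduler_for(job_id: str) -> int:
        """Statically partition batch jobs across schedulers by hashing the
        job identifier, so each job is always handled by the same scheduler."""
        digest = hashlib.sha1(job_id.encode()).hexdigest()
        return int(digest, 16) % NUM_BATCH_SCHEDULERS

    print(scheduler_for("job-12345"))  # one of 0, 1 or 2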
In short, load-balancing across multiple schedulers can increase scalability in the face of increasing job arrival rates. Of course, the scale-up must be sub-linear due to the overhead of maintaining and updating the local copies of cell state, and this approach will not easily handle hundreds of schedulers. Our comparison point, however, is a single monolithic scheduler, so even a single-digit speedup is helpful.
In summary, the Omega architecture scales well, and tolerates large decision times on real cluster workloads.
5.2 Dealing with conflicts
We also use the high-fidelity simulator to explore two implementation choices we were considering for Omega.
In the first, coarse-grained conflict detection, a scheduler's placement choice would be rejected if any changes had been made to the target machine since the local copy of cell state was synchronized at the beginning of the transaction. This can be implemented with a simple sequence number in the machine's state object.
In the second, all-or-nothing scheduling, an entire cell state transaction would be rejected if it would cause any machine to be over-committed. The goal here was to support jobs that require gang scheduling, or that cannot perform any useful work until all their tasks are running. (Gang scheduling is supported by Google's current scheduler, but it is only rarely used due to the expectation of machine failures, which disrupt jobs anyway.)
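The following sketch (illustrative Python, not Omega's implementation) contrasts these choices at transaction-commit time: coarse-grained detection compares a per-machine sequence number captured when the scheduler's local copy of cell state was synchronized, fine-grained detection merely re-checks that the placement still fits, and all-or-nothing mode rejects the whole transaction if any single placement fails. All types and fields are assumptions made for the example.

    from dataclasses import dataclass

    @dataclass
    class MachineState:
        seqno: int          # bumped on every change to this machine
        cpu_free: float

    @dataclass
    class Placement:
        machine_id: int
        cpu: float
        seen_seqno: int     # machine seqno in the scheduler's local copy

    def commit(cell, placements, coarse=False, all_or_nothing=False):
        """Try to apply a scheduler's placements to shared cell state.
        Returns the list of placements that conflicted (to be retried)."""
        conflicts, accepted = [], []
        for p in placements:
            m = cell[p.machine_id]
            if coarse:
                # Coarse-grained: any change to the machine since sync conflicts.
                ok = (m.seqno == p.seen_seqno)
            else:
                # Fine-grained: conflict only if the task no longer fits.
                ok = (p.cpu <= m.cpu_free)
            (accepted if ok else conflicts).append(p)

        if all_or_nothing and conflicts:
            return placements            # reject the entire transaction

        for p in accepted:               # incremental commit of the survivors
            m = cell[p.machine_id]
            m.cpu_free -= p.cpu
            m.seqno += 1
        return conflicts

    # Illustrative use: the same placement conflicts under coarse-grained
    # detection (seqno changed) but commits under fine-grained detection.
    cell = {0: MachineState(seqno=7, cpu_free=4.0)}
    print(commit(cell, [Placement(0, 2.0, seen_seqno=6)], coarse=True))   # conflict
    print(commit(cell, [Placement(0, 2.0, seen_seqno=6)], coarse=False))  # accepted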
Not surprisingly, both alternatives lead to additional conflicts and higher scheduler busyness (Figure 14). While turning on all-or-nothing scheduling for all jobs only leads to a minor increase in scheduler busyness when using fine-grained conflict detection (Figure 14b), it does increase the conflict fraction by about 2×, as retries now must re-place all tasks, increasing their chance of failing again (Figure 14a). Thus, this option should only be used at per-job granularity. Relying on coarse-grained conflict detection makes things even worse: spurious conflicts increase the conflict rate, and consequently scheduler busyness, by 2-3×. Clearly, incremental transactions should be the default.
6. Flexibility: a MapReduce scheduler
Finally, we explore how well we can meet two additional design goals of the Omega shared-state model: supporting specialized schedulers, and broadening the kinds of decisions that schedulers can perform compared to the two-level approach. This is somewhat challenging to evaluate quantitatively, so we proceed by way of a case study that adds a specialized scheduler for MapReduce jobs.
Figure 14: Shared-state scheduling (Omega): effect of gang scheduling and coarse-grained conflict detection as a function of t_job(service) (cluster C, 29 days); mean daily values; panels (a) conflict fraction, (b) scheduler busyness, each comparing Coarse/Gang, Coarse/Incr., Fine/Gang and Fine/Incr.
Cluster users at Google currently specify the number of workers for a MapReduce job and their resource requirements at job submission time, and the MapReduce framework schedules map and reduce activities (typically called "tasks" in the literature; we rename them here to avoid confusion with the cluster-scheduler-level tasks that substantiate MapReduce workers) onto these workers. Because the available resources vary over time and between clusters, most users pick the number of workers based on a combination of intuition, trial-and-error and experience: data from a month's worth of MapReduce jobs run at Google showed that frequently observed values were 5, 11, 200 and 1,000 workers.
What if the number of workers could be chosen automatically when additional resources are available, so that jobs could complete sooner? Our specialized MapReduce scheduler does just this by opportunistically using idle cluster resources to speed up MapReduce jobs. It observes the overall resource utilization in the cluster, predicts the benefits of scaling up current and pending MapReduce jobs, and apportions some fraction of the unused resources across those jobs according to some policy.
MapReduce jobs are particularly well-suited to this approach because it is possible to build reasonably accurate models of how a job's resource allocation affects its running time [12, 26]. About 20% of jobs at Google are MapReduce ones, and many of them are run repeatedly, so historical data is available to build models. Many of the jobs are low-priority, "best effort" computations that have to make way for higher-priority service jobs, and so may benefit from exploiting spare resources in the meantime [3].
6.1 Implementation
Since our goal is to investigate scheduler flexibility rather than demonstrate accurate MapReduce modelling, we deliberately use a simple performance model that relies only on historical data about the job's average map and reduce activity durations. It assumes that adding more workers results in an idealized linear speedup (modulo dependencies between mappers and reducers), up to the point where all map activities, and all reduce activities, respectively run in parallel. Since large MapReduce jobs typically have many more of these activities than configured workers, we usually run out of available resources before this point.
We consider three different policies for adding resources: max-parallelism, which keeps on adding workers as long as a benefit is obtained; global cap, which stops the MapReduce scheduler from using idle resources if the total cluster utilization is above a target value; and relative job size, which limits the maximum number of workers to four times as many as the job initially requested. In each case, a set of candidate resource allocations is run through the predictive model, and the allocation leading to the earliest possible finish time is used. More elaborate approaches and objective functions, such as those used in deadline-based scheduling [10], are certainly possible, but are not the focus of this case study.
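A sketch of the calculation involved (a simplified, assumed model; all function names, parameters and numbers are illustrative, not the actual scheduler's code) predicts completion time under idealized linear speedup, restricts the candidate worker counts according to each policy, and picks the allocation with the earliest predicted finish:

    def predicted_runtime(map_activities, reduce_activities,
                          avg_map_s, avg_reduce_s, workers):
        """Idealized linear-speedup model: runtime bottoms out once all map
        (resp. reduce) activities can run in a single parallel wave."""
        map_waves = -(-map_activities // workers)       # ceiling division
        reduce_waves = -(-reduce_activities // workers)
        return map_waves * avg_map_s + reduce_waves * avg_reduce_s

    def choose_workers(policy, requested, idle_slots, cluster_util,
                       map_activities, reduce_activities,
                       avg_map_s, avg_reduce_s, util_cap=0.6):
        """Pick a worker count under one of the three policies sketched above."""
        if policy == "global-cap" and cluster_util >= util_cap:
            return requested                       # no opportunistic resources
        upper = requested + idle_slots             # max-parallelism bound
        if policy == "relative-job-size":
            upper = min(upper, 4 * requested)      # at most 4x the request
        candidates = range(requested, upper + 1)
        return min(candidates,
                   key=lambda w: predicted_runtime(map_activities,
                                                   reduce_activities,
                                                   avg_map_s, avg_reduce_s, w))

    # Illustrative usage: 10,000 maps, 500 reduces, 20 requested workers,
    # 200 idle worker slots, cluster at 55% utilization.
    print(choose_workers("relative-job-size", 20, 200, 0.55,
                         10_000, 500, 30.0, 60.0))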
6.2 Evaluation
We evaluate the three different resource-allocation policies using traces from clusters A and C, plus cluster D, which is a small, lightly-loaded cluster about a quarter of the size of cluster C. Our results suggest that 50-70% of MapReduce jobs can benefit from acceleration using opportunistic resources (Figure 15). The huge speedups seen in the tail should be taken with a pinch of salt due to our simple linear speedup model, but we have more confidence in the values at the 80th percentile, and here our simulations predict a speedup of 3-4× using the eager max-parallelism policy.
Although the max-parallelism policy produces the largest improvements, the relative job size policy also does quite well, and its speedups probably have a higher likelihood of being achieved because it requires fewer new MapReduce workers to be constructed: the time to set up a worker on a new machine is not fully accounted for in the simple model. The global cap policy performs almost as well as max-parallelism in the small, under-utilized cluster D, but achieves little or no benefit elsewhere, since the cluster utilization is usually above the threshold, which was set at 60%.
Adding resources to a MapReduce job will cause the cluster's resource utilization to increase, and should result in the job completing sooner, at which point all of the job's resources will free up. An effect of this is an increase in the variability of the cluster's resource utilization (Figure 16).
To do its work, the MapReduce scheduler relies on being able to see the entire cluster's state, which is straightforward in the Omega architecture. A similar argument can be made for a specialized service scheduler for highly-constrained, high-priority jobs: scheduling them requires determining which machines are applicable, and deciding how best to place the new job while minimizing the number of preemptions caused to lower-priority jobs. The shared-state model is ideally suited to this.
Figure 15: CDF of potential per-job speedups using different policies (max-parallel, rel. job size, global-cap) on clusters A, C and D (a small, lightly-utilized cluster).
Figure 16: Time series of normalized cluster utilization (CPU and RAM) on cluster C without the specialized Omega MapReduce scheduler (top), and in max-parallelism mode (bottom), over a 24-hour experiment.
Our prototype MapReduce scheduler demonstrates that adding specialized functionality to the Omega system is straightforward (unlike with our current production scheduler).
7. Additional related work
Large-scale cluster resource scheduling is not a novel challenge. Many researchers have considered this problem before, and different solutions have been proposed in the HPC, middleware and cloud communities. We discussed several examples in §3, and further discussed the relative merits of these approaches in §4.
The Omega approach builds on many prior ideas. Scheduling using shared state is an example of optimistic concurrency control, which has been explored by the database community for a long time [18] and, more recently, considered for general memory access in the transactional memory community [2].
Exposing the entire cluster state to each scheduler is not unlike the Exokernel approach of removing abstractions and exposing maximal information to applications [9]. The programming language and OS communities have recently revisited application-level scheduling as an alternative to general-purpose thread and process schedulers, arguing that a single, global OS scheduler is neither scalable nor flexible enough for the demands of modern multi-core applications [22].
Amoeba [3] implements opportunistic allocation of spare resources to jobs, with motivation similar to our MapReduce scheduler use-case. However, it achieves this by complex communication between resource and application managers, whereas Omega naturally lends itself to such designs, as it exposes the entire cluster state to all schedulers.
8. Conclusions and future work
This investigation is part of a wider effort to build Omega, Google's next-generation cluster management platform. Here, we specifically focused on a cluster scheduling architecture that uses parallelism, shared state, and optimistic concurrency control. Our performance evaluation of the Omega model, using both lightweight simulations with synthetic workloads and high-fidelity, trace-based simulations of production workloads at Google, shows that optimistic concurrency over shared state is a viable, attractive approach to cluster scheduling.
Although this approach will do strictly more work than a pessimistic locking scheme, since work may need to be re-done, we found the overhead to be acceptable at reasonable operating points, and the resulting benefits of eliminating head-of-line blocking and better scalability often outweigh it. We also found that Omega's approach offers an attractive platform for the development of specialized schedulers, and illustrated its flexibility by adding a MapReduce scheduler with opportunistic resource adjustment.
Future work could usefully focus on ways to provide global guarantees (fairness, starvation avoidance, etc.) in the Omega model: this is an area where centralized control makes life easier. Furthermore, we believe there are some techniques from the database community that could be applied to reduce the likelihood and effects of interference for schedulers with long decision times. We hope to explore some of these in the future.
Acknowledgements
Many people contributed to the work described in this paper. Members of the Omega team at Google who contributed to this design include Brian Grant, David Oppenheimer, Jason Hickey, Jutta Degener, Rune Dahl, Todd Wang and Walfredo Cirne. We would like to thank the Mesos team at UC Berkeley for many fruitful and interesting discussions about Mesos, and Joseph Hellerstein for his early work on modeling scheduler interference in Omega. Derek Murray, Steven Hand and Alexey Tumanov provided valuable feedback on draft versions of this paper. The final version was much improved by comments from the anonymous reviewers.
References
[1] ADAPTIVE COMPUTING ENTERPRISES INC. Maui Scheduler Administrator's Guide, 3.2 ed. Provo, UT, 2011.
[2] ADL-TABATABAI, A.-R., LEWIS, B. T., MENON, V., MURPHY, B. R., SAHA, B., AND SHPEISMAN, T. Compiler and runtime support for efficient software transactional memory. In Proceedings of PLDI (2006), pp. 26-37.
[3] ANANTHANARAYANAN, G., DOUGLAS, C., RAMAKRISHNAN, R., RAO, S., AND STOICA, I. True elasticity in multi-tenant data-intensive compute clusters. In Proceedings of SoCC (2012), p. 24.
[4] APACHE. Hadoop On Demand. http://goo.gl/px8Yd, 2007. Accessed 20/06/2012.
[5] CHANG, F., DEAN, J., GHEMAWAT, S., HSIEH, W. C., WALLACH, D. A., BURROWS, M., CHANDRA, T., FIKES, A., AND GRUBER, R. E. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems 26, 2 (June 2008), 4:1-4:26.
[6] CHEN, Y., ALSPAUGH, S., BORTHAKUR, D., AND KATZ, R. Energy efficiency for large-scale MapReduce workloads with significant interactive analysis. In Proceedings of EuroSys (2012).
[7] CHEN, Y., GANAPATHI, A. S., GRIFFITH, R., AND KATZ, R. H. Design insights for MapReduce from diverse production workloads. Tech. Rep. UCB/EECS-2012-17, UC Berkeley, Jan. 2012.
[8] DEAN, J., AND GHEMAWAT, S. MapReduce: Simplified data processing on large clusters. CACM 51, 1 (2008), 107-113.
[9] ENGLER, D. R., KAASHOEK, M. F., AND O'TOOLE, JR., J. Exokernel: an operating system architecture for application-level resource management. In Proceedings of SOSP (1995), pp. 251-266.
[10] FERGUSON, A. D., BODIK, P., KANDULA, S., BOUTIN, E., AND FONSECA, R. Jockey: guaranteed job latency in data parallel clusters. In Proceedings of EuroSys (2012), pp. 99-112.
[11] GHODSI, A., ZAHARIA, M., HINDMAN, B., KONWINSKI, A., SHENKER, S., AND STOICA, I. Dominant resource fairness: fair allocation of multiple resource types. In Proceedings of NSDI (2011), pp. 323-336.
[12] HERODOTOU, H., DONG, F., AND BABU, S. No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In Proceedings of SoCC (2011).
[13] HINDMAN, B., KONWINSKI, A., ZAHARIA, M., GHODSI, A., JOSEPH, A., KATZ, R., SHENKER, S., AND STOICA, I. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of NSDI (2011).
[14] IQBAL, S., GUPTA, R., AND FANG, Y.-C. Planning considerations for job scheduling in HPC clusters. Dell Power Solutions (Feb. 2005).
[15] ISARD, M., PRABHAKARAN, V., CURREY, J., WIEDER, U., TALWAR, K., AND GOLDBERG, A. Quincy: fair scheduling for distributed computing clusters. In Proceedings of SOSP (2009).
[16] JACKSON, D., SNELL, Q., AND CLEMENT, M. Core algorithms of the Maui scheduler. In Job Scheduling Strategies for Parallel Processing. 2001, pp. 87-102.
[17] KAVULYA, S., TAN, J., GANDHI, R., AND NARASIMHAN, P. An analysis of traces from a production MapReduce cluster. In Proceedings of CCGrid (2010), pp. 94-103.
[18] KUNG, H. T., AND ROBINSON, J. T. On optimistic methods for concurrency control. ACM Transactions on Database Systems 6, 2 (June 1981), 213-226.
[19] MALEWICZ, G., AUSTERN, M., BIK, A., DEHNERT, J., HORN, I., LEISER, N., AND CZAJKOWSKI, G. Pregel: a system for large-scale graph processing. In Proceedings of SIGMOD (2010), pp. 135-146.
[20] MISHRA, A. K., HELLERSTEIN, J. L., CIRNE, W., AND DAS, C. R. Towards characterizing cloud backend workloads: insights from Google compute clusters. SIGMETRICS Performance Evaluation Review 37 (Mar. 2010), 34-41.
[21] MURTHY, A. C., DOUGLAS, C., KONAR, M., O'MALLEY, O., RADIA, S., AGARWAL, S., AND K V, V. Architecture of next generation Apache Hadoop MapReduce framework. Tech. rep., Apache Hadoop, 2011.
[22] PAN, H., HINDMAN, B., AND ASANOVIĆ, K. Lithe: enabling efficient composition of parallel libraries. In Proceedings of HotPar (2009).
[23] PENG, D., AND DABEK, F. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of OSDI (2010).
[24] REISS, C., TUMANOV, A., GANGER, G. R., KATZ, R. H., AND KOZUCH, M. A. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of SoCC (2012).
[25] SHARMA, B., CHUDNOVSKY, V., HELLERSTEIN, J., RIFAAT, R., AND DAS, C. Modeling and synthesizing task placement constraints in Google compute clusters. In Proceedings of SoCC (2011).
[26] VERMA, A., CHERKASOVA, L., AND CAMPBELL, R. SLO-driven right-sizing and resource provisioning of MapReduce jobs. In Proceedings of LADIS (2011).
[27] WILKES, J. More Google cluster data. Google research blog, Nov. 2011. Posted at http://goo.gl/9B7PA.
[28] ZAHARIA, M., BORTHAKUR, D., SEN SARMA, J., ELMELEEGY, K., SHENKER, S., AND STOICA, I. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of EuroSys (2010), pp. 265-278.
[29] ZHANG, Q., HELLERSTEIN, J., AND BOUTABA, R. Characterizing task usage shapes in Google's compute clusters. In Proceedings of LADIS (2011).