
(e.g., the parameter server architecture) and characteristics of deep learning jobs (e.g., iterativeness, convergence properties) for maximal training efficiency.
This paper proposes Optimus, a customized cluster scheduler for deep learning jobs in production clusters, which minimizes job training time and improves resource efficiency as a result. We focus on data-parallel DL training jobs using the parameter server framework (§2). Optimus builds resource-performance models for each job on the go, and dynamically schedules resources to jobs based on job progress and the cluster load to minimize average job completion time and makespan. Specifically, we make the following contributions in developing Optimus.
▷ We build accurate performance models for deep learning jobs (§3). Through execution of a training job, we track the training progress on the go and use online fitting to predict the number of steps/epochs required to achieve model convergence (§3.1). We further build a resource-performance model by exploiting communication patterns in the parameter server architecture and iterativeness of the training process (§3.2). Different from existing detailed modeling of a distributed deep learning job (such as in [69]), our resource-performance model requires no knowledge about internals of the ML model and hardware configuration of the cluster. The basis is an online learning idea: we run a job for a few steps with different resource configurations, learn the training speed as a function of resource configurations using data collected from these steps, and then keep tuning our model on the go (a schematic sketch of this idea follows the contribution list).
▷ Based on the performance models, we design a simple yet effective method for dynamically allocating resources to minimize average job completion time (§4.1). We also propose a task placement scheme for deploying parallel tasks in a job onto the servers, given the job's resource allocation (§4.2). The scheme further optimizes training speed by mitigating communication overhead during training.
▷ We discover a load imbalance issue on parameter servers with the existing parameter server framework (as in MXNet [59]), which significantly lowers the training efficiency. We resolve the issue by reducing communication cost and assigning model slices to parameter servers evenly (§5.3). We integrate our scheduler Optimus with Kubernetes [14], an open-source cluster manager for production-grade container orchestration. We build a deep learning cluster consisting of 7 CPU servers and 6 GPU servers, and run 9 representative DL jobs from different application domains (see Table 1). Evaluation results show that Optimus achieves high job performance and resource efficiency, and outperforms widely adopted cluster schedulers by 139% and 63% in job completion time and makespan, respectively (§6).
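To make the online-learning idea above concrete, the sketch below fits a training-speed model to a handful of profiled (number of parameter servers, number of workers, observed steps per second) samples, and refits whenever a new sample is collected. The functional form in speed_model and all variable names are illustrative placeholders, not the exact model derived in §3.2.

```python
import numpy as np
from scipy.optimize import curve_fit

def speed_model(x, a, b, c):
    # Hypothetical form: speed grows with workers but is throttled by
    # per-worker communication that shrinks as parameter servers are added.
    # The paper's actual resource-performance model is derived in Sec. 3.2.
    num_ps, num_workers = x
    return num_workers / (a + b * num_workers / num_ps + c * num_workers)

def fit_speed_model(samples):
    # samples: dict mapping (num_ps, num_workers) -> observed steps/second
    xs = np.array(list(samples.keys()), dtype=float).T   # shape (2, n)
    ys = np.array(list(samples.values()), dtype=float)
    params, _ = curve_fit(speed_model, xs, ys,
                          p0=[0.1, 0.1, 0.01], bounds=(0, np.inf))
    return params

# Illustrative profiled configurations (made-up numbers).
samples = {(1, 1): 2.0, (1, 2): 3.1, (2, 2): 3.8, (2, 4): 6.0, (4, 8): 10.5}
params = fit_speed_model(samples)

# Keep tuning on the go: refit as each newly profiled configuration arrives.
samples[(4, 16)] = 16.0
params = fit_speed_model(samples)
print("predicted speed with 4 PS, 12 workers:",
      speed_model(np.array([[4.0], [12.0]]), *params)[0])
```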
2 BACKGROUND AND MOTIVATION
2.1 DL Model Training
A deep learning job trains a DL model, such as a deep neural network (DNN), using a large number of training examples, typically to minimize a loss function [48].
Iterativeness. The model training is usually carried out in an iterative fashion, due to the complexity of DNNs (i.e., no closed-form solution) and the large size of the training dataset (e.g., 14 million images in the full ImageNet dataset [12]).
[Figure 1: Training curves of ResNext-110 on the CIFAR10 dataset (training/validation loss and accuracy vs. epoch).]
[Figure 2: Training completion time (hours) of the deep learning models in Table 1 (ResNext, ResNet, Inception, KAGGLE, CNN-Rand, DSSM, RNN-LSTM, Seq2Seq, DS2).]

The dataset is commonly divided into equal-sized data chunks, and each data chunk is further divided
into equal-sized mini-batches. In each training step, we process one mini-batch by computing the changes to be made to the parameters in the DL model to move them toward their optimal values (typically expressed as gradients, i.e., directions of change), using the examples in the mini-batch, and then update the parameters using a formula like new_parameter = old_parameter − learning_rate × gradient. A training performance metric is also computed for each mini-batch, e.g., training loss (the sum of the errors made on each example in the mini-batch), accuracy (the percentage of predictions that match the labels), or validation loss/accuracy (computed on the validation dataset for model evaluation). After all mini-batches in the training dataset have been processed once, one training epoch is done.
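As a concrete illustration of one training step and epoch, the minimal sketch below runs mini-batch SGD on a toy linear model with a squared-error loss; the model, loss, and hyper-parameter values are stand-ins chosen for brevity, not anything used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
learning_rate = 0.01

# Toy "model": a single weight vector; toy data chunk of labeled examples.
params = rng.normal(size=4)
features = rng.normal(size=(256, 4))          # one data chunk, 256 examples
labels = features @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=256)

batch_size = 32
for epoch in range(3):                        # one epoch = every mini-batch processed once
    for start in range(0, len(features), batch_size):
        x = features[start:start + batch_size]
        y = labels[start:start + batch_size]
        errors = x @ params - y
        train_loss = np.mean(errors ** 2)     # per-mini-batch training loss
        gradient = 2 * x.T @ errors / len(x)  # direction of change for the parameters
        params = params - learning_rate * gradient   # new = old - learning_rate * gradient
    print(f"epoch {epoch}: last mini-batch training loss = {train_loss:.4f}")
```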
Convergence. The dataset is usually trained for multiple epochs (tens to hundreds) until the model converges, i.e., the decrease or increase in the performance metric's value between consecutive epochs becomes very small. An illustration of the training curves, i.e., the variation of training/validation loss and accuracy vs. the number of training epochs, is given in Fig. 1, with the example of training ResNext-110 [66] on the CIFAR10 dataset [2]. DNN models are usually non-convex and we cannot always expect convergence [29]. However, different from experimental models, production models are mature and can typically converge to the global/local optimum very well, since all hyper-parameters (e.g., learning rate, i.e., how quickly a DNN adjusts itself, and mini-batch size) have been well-tuned during the experimental phase. In this work, we focus on such production models, and leverage their convergence property to estimate a training job's progress towards convergence.
In particular, we use the convergence of training loss to decide the completion of a DL job. The DL model converges if the decrease of training loss between two consecutive epochs has consistently fallen below a threshold that the job owner specified, for several epochs. Training-loss-based convergence is common in practice [48, 71], and the convergence of training loss often implies the convergence of other metrics (e.g., accuracy) for production models (i.e., no overfitting) [5]. Training/validation accuracy is difficult to define in some scenarios where there is no “right answer”, e.g., language modeling [6]. Validation loss is usually used to prevent model overfitting, and evaluation on the validation dataset is performed only when necessary (e.g., at the end of each epoch), while we can obtain training loss after each step for more accurate curve fitting (§3.1).
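As a sketch, this completion criterion can be checked over the per-epoch training-loss history as below; threshold and window are illustrative placeholders for the job-owner-specified threshold and the number of epochs it must hold.

```python
def has_converged(epoch_losses, threshold=1e-3, window=5):
    """Return True if the epoch-over-epoch decrease in training loss has
    stayed below `threshold` for the last `window` consecutive epochs."""
    if len(epoch_losses) < window + 1:
        return False
    recent = epoch_losses[-(window + 1):]
    decreases = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(d < threshold for d in decreases)

# Example: the loss curve flattens out, so the job is declared complete.
losses = [1.8, 1.2, 0.9, 0.75, 0.70, 0.6995, 0.6991, 0.6988, 0.6986, 0.6985]
print(has_converged(losses))   # True once the last 5 decreases are all below the threshold
```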