
(e.g., the parameter server architecture) and characteristics of deep learning jobs (e.g., iterativeness, convergence properties) for maximal training efficiency.
This paper proposes Optimus, a customized cluster scheduler for deep learning jobs in production clusters, which minimizes job training time and improves resource efficiency as a result. We focus on data-parallel DL training jobs using the parameter server framework (§2). Optimus builds resource-performance models for each job on the go, and dynamically schedules resources to jobs based on job progress and the cluster load to minimize average job completion time and makespan. Specifically, we make the following contributions in developing Optimus.
▷ We build accurate performance models for deep learning jobs (§3). Through execution of a training job, we track the training progress on the go and use online fitting to predict the number of steps/epochs required to achieve model convergence (§3.1). We further build a resource-performance model by exploiting communication patterns in the parameter server architecture and iterativeness of the training process (§3.2). Different from existing detailed modeling of a distributed deep learning job (such as in [69]), our resource-performance model requires no knowledge about internals of the ML model and hardware configuration of the cluster. The basis is an online learning idea: we run a job for a few steps with different resource configurations, learn the training speed as a function of resource configurations using data collected from these steps, and then keep tuning our model on the go (a schematic sketch of this idea follows the contribution list).
▷ Based on the performance models, we design a simple yet effective method for dynamically allocating resources to minimize average job completion time (§4.1). We also propose a task placement scheme for deploying parallel tasks in a job onto the servers, given the job's resource allocation (§4.2). The scheme further optimizes training speed by mitigating communication overhead during training.
▷ We discover a load imbalance issue on parameter servers with the existing parameter server framework (as in MXNet [59]), which significantly lowers the training efficiency. We resolve the issue by reducing communication cost and assigning model slices to parameter servers evenly (§5.3). We integrate our scheduler Optimus with Kubernetes [14], an open-source cluster manager for production-grade container orchestration. We build a deep learning cluster consisting of 7 CPU servers and 6 GPU servers, and run 9 representative DL jobs from different application domains (see Table 1). Evaluation results show that Optimus achieves high job performance and resource efficiency, and outperforms widely adopted cluster schedulers by 139% and 63% in job completion time and makespan, respectively (§6).
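To make the online-learning idea above concrete, the sketch below fits a training-speed model to a handful of profiled (number of parameter servers, number of workers, observed steps per second) samples, and refits whenever a new sample is collected. The functional form in speed_model and all variable names are illustrative placeholders, not the exact model derived in §3.2.

```python
import numpy as np
from scipy.optimize import curve_fit

def speed_model(x, a, b, c):
    # Hypothetical form: speed grows with workers but is throttled by
    # per-worker communication that shrinks as parameter servers are added.
    # The paper's actual resource-performance model is derived in Sec. 3.2.
    num_ps, num_workers = x
    return num_workers / (a + b * num_workers / num_ps + c * num_workers)

def fit_speed_model(samples):
    # samples: dict mapping (num_ps, num_workers) -> observed steps/second
    xs = np.array(list(samples.keys()), dtype=float).T   # shape (2, n)
    ys = np.array(list(samples.values()), dtype=float)
    params, _ = curve_fit(speed_model, xs, ys,
                          p0=[0.1, 0.1, 0.01], bounds=(0, np.inf))
    return params

# Illustrative profiled configurations (made-up numbers).
samples = {(1, 1): 2.0, (1, 2): 3.1, (2, 2): 3.8, (2, 4): 6.0, (4, 8): 10.5}
params = fit_speed_model(samples)

# Keep tuning on the go: refit as each newly profiled configuration arrives.
samples[(4, 16)] = 16.0
params = fit_speed_model(samples)
print("predicted speed with 4 PS, 12 workers:",
      speed_model(np.array([[4.0], [12.0]]), *params)[0])
```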
2 BACKGROUND AND MOTIVATION
2.1 DL Model Training
A deep learning job trains a DL model, such as a deep neural network (DNN), using a large number of training examples, typically to minimize a loss function [48].
Iterativeness. The model training is usually carried out in an iterative fashion, due to the complexity of DNNs (i.e., no closed-form solution) and the large size of the training dataset (e.g., 14 million images in the full ImageNet dataset [12]).
[Figure 1: Training curves of ResNext-110 on the CIFAR10 dataset (training/validation loss and accuracy vs. epoch).]
[Figure 2: Training completion time (hours) of the deep learning models in Table 1 (ResNext, ResNet, Inception, KAGGLE, CNN-Rand, DSSM, RNN-LSTM, Seq2Seq, DS2).]

The dataset is commonly divided into equal-sized data chunks, and each data chunk is further divided
into equal-sized mini-batches. In each training step, we process one mini-batch by computing the changes to be made to the parameters in the DL model to move them toward their optimal values (typically expressed as gradients, i.e., directions of change), using the examples in the mini-batch, and then update the parameters using a formula like new_parameter = old_parameter − learning_rate × gradient. A training performance metric is also computed for each mini-batch, e.g., training loss (the sum of the errors made on each example in the mini-batch), accuracy (the percentage of predictions that match the labels), or validation loss/accuracy (computed on the validation dataset for model evaluation). After all mini-batches in the training dataset have been processed once, one training epoch is done.
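As a concrete illustration of one training step and epoch, the minimal sketch below runs mini-batch SGD on a toy linear model with a squared-error loss; the model, loss, and hyper-parameter values are stand-ins chosen for brevity, not anything used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
learning_rate = 0.01

# Toy "model": a single weight vector; toy data chunk of labeled examples.
params = rng.normal(size=4)
features = rng.normal(size=(256, 4))          # one data chunk, 256 examples
labels = features @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=256)

batch_size = 32
for epoch in range(3):                        # one epoch = every mini-batch processed once
    for start in range(0, len(features), batch_size):
        x = features[start:start + batch_size]
        y = labels[start:start + batch_size]
        errors = x @ params - y
        train_loss = np.mean(errors ** 2)     # per-mini-batch training loss
        gradient = 2 * x.T @ errors / len(x)  # direction of change for the parameters
        params = params - learning_rate * gradient   # new = old - learning_rate * gradient
    print(f"epoch {epoch}: last mini-batch training loss = {train_loss:.4f}")
```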
Convergence. The dataset is usually trained for multiple epochs (tens to hundreds) until the model converges, i.e., the decrease or increase in the performance metric's value between consecutive epochs becomes very small. An illustration of the training curves, i.e., the variation of training/validation loss and accuracy vs. the number of training epochs, is given in Fig. 1, with the example of training ResNext-110 [66] on the CIFAR10 dataset [2]. DNN models are usually non-convex and we cannot always expect convergence [29]. However, different from experimental models, production models are mature and can typically converge to the global/local optimum very well, since all hyper-parameters (e.g., learning rate, i.e., how quickly a DNN adjusts itself, and mini-batch size) have been well-tuned during the experimental phase. In this work, we focus on such production models, and leverage their convergence property to estimate a training job's progress towards convergence.
In particular, we use the convergence of training loss to decide the completion of a DL job. The DL model converges if the decrease of training loss between two consecutive epochs has consistently fallen below a threshold that the job owner specified, for several epochs. Training-loss-based convergence is common in practice [48, 71], and the convergence of training loss often implies the convergence of other metrics (e.g., accuracy) for production models (i.e., no overfitting) [5]. Training/validation accuracy is difficult to define in some scenarios where there is no “right answer”, e.g., language modeling [6]. Validation loss is usually used to prevent model overfitting, and evaluation on the validation dataset is performed only when necessary (e.g., at the end of each epoch), while we can obtain training loss after each step for more accurate curve fitting (§3.1).
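As a sketch, this completion criterion can be checked over the per-epoch training-loss history as below; threshold and window are illustrative placeholders for the job-owner-specified threshold and the number of epochs it must hold.

```python
def has_converged(epoch_losses, threshold=1e-3, window=5):
    """Return True if the epoch-over-epoch decrease in training loss has
    stayed below `threshold` for the last `window` consecutive epochs."""
    if len(epoch_losses) < window + 1:
        return False
    recent = epoch_losses[-(window + 1):]
    decreases = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(d < threshold for d in decreases)

# Example: the loss curve flattens out, so the job is declared complete.
losses = [1.8, 1.2, 0.9, 0.75, 0.70, 0.6995, 0.6991, 0.6988, 0.6986, 0.6985]
print(has_converged(losses))   # True once the last 5 decreases are all below the threshold
```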