Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
Yu Gan∗, Cornell University, Ithaca, New York, USA
Mingyu Liang, Cornell University, Ithaca, New York, USA
Sundar Dev, Google, Sunnyvale, California, USA
David Lo, Google, Sunnyvale, California, USA
Christina Delimitrou, Cornell University, Ithaca, New York, USA
ABSTRACT
Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite the advantages of modularity and elasticity microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. Prior work on performance debugging for cloud services either relies on empirical techniques, or uses supervised learning to diagnose the root causes of performance issues, which requires significant application instrumentation, and is difficult to deploy in practice.
We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices that focuses on practicality and scalability. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service's QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations, and improves performance predictability.
CCS CONCEPTS
• Computer systems organization → Cloud computing; n-tier architectures; • Software and its engineering → Software performance; • Computing methodologies → Causal reasoning and diagnostics; Neural networks.
∗This work was not done at Google.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS '21, April 19–23, 2021, Virtual, USA
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8317-2/21/04...$15.00
https://doi.org/10.1145/3445814.3446700
KEYWORDS
cloud computing, microservices, performance debugging, QoS, counterfactual, Bayesian network, variational autoencoder
ACM Reference Format:
Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou.
2021. Sage: Practical & Scalable ML-Driven Performance Debugging in
Microservices. In Proceedings of the 26th ACM International Conference on
Architectural Support for Programming Languages and Operating Systems
(ASPLOS ’21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA,
17 pages. https://doi.org/10.1145/3445814.3446700
1 INTRODUCTION
Cloud computing has reached proliferation by offering resource flexibility, cost efficiency, and fast deployment [20, 25, 37–43, 52, 77].
As the scale and complexity of cloud services increased, their design
started undergoing a major shift.
In place of large monolithic services that encompassed the entire functionality in a single binary, cloud applications have progressively adopted fine-grained modularity, consisting of hundreds or thousands of single-purpose and loosely-coupled microservices [2, 17, 18, 47–49, 104, 108]. This shift is increasingly pervasive, with cloud-based services, such as Amazon, Twitter, Netflix, and eBay, having already adopted this application model [2, 17, 18]. There are several reasons that make microservices appealing, including the fact that they accelerate and facilitate development, they promote elasticity, and enable software heterogeneity, only requiring a common API for inter-microservice communication.
Despite their advantages, microservices also introduce new system challenges. They especially complicate resource management, as dependencies between tiers introduce backpressure effects, causing unpredictable performance to propagate through the system [48, 49]. Diagnosing such performance issues empirically is both cumbersome and prone to errors, especially as typical microservices deployments include hundreds or thousands of unique tiers. Similarly, current cluster managers [29, 38, 41, 44, 70, 72, 73, 75, 77, 82, 83, 86, 95, 99, 112, 115] are not expressive enough to account for the impact of microservice dependencies, thus putting more pressure on the need for automated root cause analysis systems.
Machine learning-based approaches have been effective in cluster management for batch applications [36], and for batch and interactive, single-tier services [38, 41]. On the performance debugging front, there has been increased attention on trace-based methods to
analyze [30, 46, 85], diagnose [19, 23, 32, 35, 54, 60, 63, 81, 91, 110, 113, 114], and in some cases anticipate [47, 49, 109] performance issues in cloud services. While most such systems target cloud applications, the only one focusing on microservices is Seer [49]. Seer leverages a deep learning model to anticipate upcoming QoS violations, and adjusts the resources per microservice to avoid them. Despite its high accuracy, Seer uses supervised learning, which requires offline and online trace labeling, as well as considerable kernel-level instrumentation and fine-grained tracing to track the number of outstanding requests across the system stack. In a production system this is non-trivial, as it involves injecting resource contention in live applications, which can impact performance and user experience.
We present Sage, a root cause analysis system that leverages unsupervised learning to identify the culprit of unpredictable performance in complex graphs of microservices in a scalable and practical manner. Specifically, Sage uses Causal Bayesian Networks to capture the dependencies between the microservices in an end-to-end application topology, and counterfactuals (events that happen given certain alternative conditions in a hypothetical world) through a Graphical Variational Autoencoder to examine the impact of microservices on end-to-end performance. Sage does not rely on data labeling, hence it can be entirely transparent to both cloud users and application developers, making it practical for large-scale deployments, scales well with the number of microservices and machines, and only relies on lightweight tracing that does not require application changes or kernel instrumentation, which would be difficult to obtain in practice. Sage targets performance issues caused by deployment, configuration, and resource provisioning reasons, as opposed to design bugs.
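As a toy illustration (a sketch only, not Sage's implementation; the span fields and the use of networkx are assumptions made for this example), the service dependency graph that underpins such a Causal Bayesian Network can be derived directly from parent/child relationships in RPC-level traces:

# Illustrative sketch: derive the service-dependency DAG (the skeleton over which
# a Causal Bayesian Network can be defined) from RPC-level spans.
# Field names ("service", "parent_service") are hypothetical.
import networkx as nx

def build_dependency_dag(spans):
    dag = nx.DiGraph()
    for span in spans:
        dag.add_node(span["service"])
        if span.get("parent_service") is not None:
            # Edge from caller to callee; latency propagates back along this edge.
            dag.add_edge(span["parent_service"], span["service"])
    return dag

spans = [
    {"service": "frontend", "parent_service": None},
    {"service": "logic", "parent_service": "frontend"},
    {"service": "backend", "parent_service": "logic"},
    {"service": "db", "parent_service": "backend"},
]
print(list(build_dependency_dag(spans).edges()))
# [('frontend', 'logic'), ('logic', 'backend'), ('backend', 'db')]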
We have evaluated Sage both on dedicated local clusters and large cluster settings on Google Compute Engine (GCE) with several end-to-end microservices [48], and showed that it correctly identifies the microservice(s) and system resources that initiated a QoS violation in over 93% of cases, and improves performance predictability without sacrificing resource efficiency.
2 RELATED WORK
Below we review work on the system implications of microservices,
cluster managers designed for multi-tier services and microservices,
and systems for cloud performance debugging.
2.1 System Implications of Microservices
The increasing popularity of fine-grained modular application design, microservices being an extreme materialization of it, has yielded a large amount of prior work on representative benchmark suites and studies on their characteristics [48, 55, 104]. µSuite [104] is an open-source multi-tier application benchmark suite containing several online data-intensive (OLDI) services, such as image similarity search, key-value stores, set intersections, and recommendation systems. DeathStarBench [48] presents five end-to-end interactive applications built with microservices, leveraging Apache Thrift [1], Spring Framework [12], and gRPC [5]. The services implement popular cloud applications, like social networks, e-commerce sites, and movie reviewing services. DeathStarBench also explores the hardware/software implications of microservices, including their resource bottlenecks, OS/networking overheads, cluster management challenges, and sensitivity to performance unpredictability. Accelerometer [105] characterizes the system overheads of several Facebook microservices, including I/O processing, logging, and compression. They also build an analytical model to predict the potential speedup of a microservice from hardware acceleration.
2.2 Microservices Cluster Management
Microservices have complicated dependency graphs, strict QoS targets, and are sensitive to performance unpredictability. Recent work has started exploring the resource management challenges of microservices. Suresh et al. [108] design Wisp, a dynamic rate limiting system for microservices, which prioritizes requests in the order of their deadline expiration. uTune [107] auto-tunes the threading model of multi-tier applications to improve their end-to-end performance. GrandSLAm [66] improves the resource utilization of ML microservices by estimating the execution time of each tier, and dynamically batching and reordering requests to meet QoS. Finally, SoftSKU [106] characterizes the performance of the same Facebook microservices as [105] across hardware and software configurations, and searches for their optimal resource configurations using A/B testing in production.
2.3 Cloud Performance Debugging
There is extensive prior work on monitoring and debugging performance and efficiency issues in cloud systems. Aguilera et al. [19] built a tool to construct the causal path of a service from RPC messages without access to source code. X-Trace [46] is a tracing framework portable across protocols and software systems that detects runtime performance issues in distributed systems. It can identify faults in several scenarios, including DNS resolution and overlay networks. Mystery Machine [33] leverages a large amount of cloud traces to infer the causal relationships between requests at runtime. There are also several production-level distributed tracing systems, including Dapper [100], Zipkin [16], Jaeger [7], and Google-Wide Profiling (GWP) [90]. Dapper, Zipkin and Jaeger record RPC-level traces for sampled requests across the calling stack, while GWP monitors low-level hardware metrics. These systems aim to facilitate locating performance issues, but are not geared towards taking action to resolve them.
Autopilot [94] is an online cluster management system that adjusts the number of tasks and CPU/memory limits automatically to reduce resource slack while guaranteeing performance. Sage differs from prior work on cloud scheduling, such as [41, 50, 76, 115], in that it locates the root cause of poor performance only using the end-to-end QoS target, without explicitly requiring per-tier performance service level agreements (SLAs) to be defined.
Root cause analysis systems for cloud applications are gaining increased attention, as the number of interactive applications continues to increase. Several of these proposals leverage statistical models to diagnose performance issues [54, 109, 113]. Cohen et al. [35] build tree-augmented Bayesian networks (TANs) to predict whether QoS will be violated, based on the correlation between performance and low-level metrics. Unfortunately, in multi-tier applications, correlation does not always imply causation, given
the existence of backpressure effects between dependent tiers. ExplainIt! [62] leverages a linear regression model to find root causes of poor performance in multi-stage data processing pipelines which optimize for throughput. While the regression model works well for batch jobs, latency is more sensitive to noise, and propagates across dependent tiers.
CauseInfer [28] as well as Microscope [71] build a causality graph using the PC-algorithm, and use it to identify root causes with different anomaly detection algorithms. As with ExplainIt!, they work well for data analytics, but would be impractical for latency-critical applications with tens of tiers, due to the high computation complexity of the PC-algorithm [65]. Finally, Seer [49] is a supervised CNN+LSTM model that anticipates QoS violations shortly before they happen. Because it is proactive, Seer can avoid poor performance altogether; however, it requires considerable kernel-level instrumentation to track the number of outstanding requests across the system stack at fine granularity, which is not practical in large production systems. It also requires data labeling to train its model, which requires injecting QoS violations in active services. This sensitivity to tracing frequency also exists in Sieve [111], which uses the Granger causality test to determine causal relationships between tiers [21, 101].
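For illustration, a minimal, hypothetical version of such a Granger causality check between two tiers' latency series can be written with statsmodels (the synthetic data below is not from any real deployment); the test asks whether one tier's latency history improves the prediction of another tier's latency:

# Hypothetical illustration of a Granger causality test between two tiers' latency
# series (synthetic data). statsmodels tests whether the series in the SECOND column
# Granger-causes the series in the FIRST column.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
tier_a = rng.normal(10.0, 1.0, 500)                      # upstream tier latency (ms)
tier_b = np.roll(tier_a, 2) + rng.normal(0.0, 0.5, 500)  # downstream tier lags A by 2 samples
grangercausalitytests(np.column_stack([tier_b, tier_a]), maxlag=3)
# Small p-values on the F-tests at lag >= 2 suggest tier A's past latency predicts tier B's.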
3 ML FOR PERFORMANCE DEBUGGING
3.1 Overview
Sage is a performance debugging and root cause analysis system for large-scale cloud applications. While the design centers around interactive microservices, where dependencies between tiers further complicate debugging, Sage is also applicable to monolithic architectures. Sage diagnoses the root cause [57] of end-to-end QoS violations, and applies appropriate corrective action to restore performance. Fig. 1 shows an overview of Sage's ML pipeline. Sage relies on two techniques, each of which is described in detail below; first, it automatically captures the dependencies between microservices using a Causal Bayesian Network (CBN) trained on RPC-level distributed traces [16, 100]. The CBN also captures the latency propagation from the backend to the frontend. Second, Sage uses a graphical variational auto-encoder (GVAE) to generate hypothetical scenarios (counterfactuals [51, 79]), which tweak the performance and/or usage of individual microservices to values known to meet QoS, and infers whether the change restores QoS. Using these two techniques, Sage determines which set of microservices initiated a QoS violation, and adjusts their deployment or resource allocation.
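As a schematic illustration of the counterfactual step (a sketch under assumptions, not Sage's implementation: predict_e2e_latency stands in for the trained GVAE, and the metric dictionaries and QoS target are hypothetical), one can substitute each service's observed metrics with values known to meet QoS, re-predict end-to-end latency, and flag the services whose substitution alone restores QoS:

# Schematic counterfactual root-cause scoring. `predict_e2e_latency` stands in for
# the trained GVAE/CBN model; metric values and the QoS target are hypothetical.
def rank_root_causes(observed, known_good, predict_e2e_latency, qos_target_ms):
    """observed / known_good: {service: {metric_name: value}} dictionaries."""
    culprits = []
    for svc in observed:
        # Counterfactual: "what if svc had behaved normally?"
        scenario = {s: dict(metrics) for s, metrics in observed.items()}
        scenario[svc] = dict(known_good[svc])
        predicted_ms = predict_e2e_latency(scenario)
        if predicted_ms <= qos_target_ms:
            culprits.append((svc, predicted_ms))
    # Services whose substitution yields the lowest predicted latency rank first.
    return sorted(culprits, key=lambda pair: pair[1])

In Sage the hypothetical scenarios are generated and evaluated by the GVAE rather than by simple substitution, but the decision rule is the same: a microservice is a root-cause candidate if restoring its behavior to nominal values restores end-to-end QoS.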
While prior work has highlighted the potential of ML for cloud performance debugging [49], such techniques rely exclusively on supervised models, which require injecting resource contention on active services to correctly label the training dataset with root causes of QoS violations [49]. This is problematic in practice, as it disrupts the performance of live services. Additionally, prior work requires high tracing frequency and heavy instrumentation to collect metrics like the number of outstanding requests across the system stack, which is not practical in a production system and can degrade performance.
[Figure 1: Sage's ML pipeline. (1) Build Causal Bayesian Network (CBN) and Graphical Variational Auto-Encoder (GVAE). (2) Process per-tier latency and usage. (3) Generate counterfactuals with GVAE. (4) Identify root cause services & resources.]

Sage instead adheres to the following design principles:

• Unsupervised learning: Sage does not require labeling training data, and it diagnoses QoS violations using low-frequency traces collected during live traffic using tracing systems readily available in most major cloud providers.
• Robustness to sampling frequency: Sage does not require tracking individual requests to detect temporal patterns, making it robust to tracing frequency. This is important, as production tracing systems like Dapper [100] employ aggressive sampling to reduce overheads [34, 96]. In comparison, previous studies [49, 98, 111] collect traces at millisecond granularity, which can introduce significant overheads.
• User-level metrics: Sage only uses user-level metrics, easily obtained through cloud monitoring APIs and service-level traces from distributed tracing frameworks, such as Jaeger [7]. It does not require any kernel-level information, which is expensive, or even inaccessible, in cloud platforms.
• Partial retraining: A major premise of microservices is enabling frequent updates. Retraining the entire system every time the code or deployment of a microservice changes is prohibitively expensive. Instead, Sage implements partial and incremental retraining, whereby only the microservice that changed and its immediate neighbors are retrained (a toy sketch of this rule follows this list).
• Fast resolution: Empirically examining sources of poor performance is costly in time and resources, especially given the ingest delay cloud systems have in consuming monitoring data, causing a change to take time before propagating to recorded traces. Sage models the impact of the different probable root causes concurrently, restoring QoS faster.
3.2 Microservice Latency Propagation
3.2.1 Single RPC Latency Decomposition.
Fig. 2 shows the latency decomposition of an RPC across client (sender) and server (receiver). The client initiates an RPC request via the rpc0_request API at (1). The request then waits in the RPC channel's send queue and gets written to the Linux network stack via the sendmsg syscall at (2). The packets pass through the TCP/IP protocol and are sent out from the client's NIC. They are then transmitted over the wire and switches and arrive at the server's