Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
Yu Gan∗, Cornell University, Ithaca, New York, USA
Mingyu Liang, Cornell University, Ithaca, New York, USA
Sundar Dev, Google, Sunnyvale, California, USA
David Lo, Google, Sunnyvale, California, USA
Christina Delimitrou, Cornell University, Ithaca, New York, USA
ABSTRACT
Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite the advantages of modularity and elasticity microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. Prior work on performance debugging for cloud services either relies on empirical techniques, or uses supervised learning to diagnose the root causes of performance issues, which requires significant application instrumentation, and is difficult to deploy in practice.
We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices that focuses on practicality and scalability. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service's QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations, and improves performance predictability.
CCS CONCEPTS
• Computer systems organization → Cloud computing; n-tier architectures; • Software and its engineering → Software performance; • Computing methodologies → Causal reasoning and diagnostics; Neural networks.
∗This work was not done at Google.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS '21, April 19–23, 2021, Virtual, USA
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8317-2/21/04...$15.00
https://doi.org/10.1145/3445814.3446700
KEYWORDS
cloud computing, microservices, performance debugging, QoS, counterfactual, Bayesian network, variational autoencoder
ACM Reference Format:
Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou.
2021. Sage: Practical & Scalable ML-Driven Performance Debugging in
Microservices. In Proceedings of the 26th ACM International Conference on
Architectural Support for Programming Languages and Operating Systems
(ASPLOS ’21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA,
17 pages. https://doi.org/10.1145/3445814.3446700
1 INTRODUCTION
Cloud computing has reached proliferation by offering resource flexibility, cost efficiency, and fast deployment [20, 25, 37–43, 52, 77].
As the scale and complexity of cloud services increased, their design
started undergoing a major shift.
In place of large monolithic services that encompassed the entire functionality in a single binary, cloud applications have progressively adopted fine-grained modularity, consisting of hundreds or thousands of single-purpose and loosely-coupled microservices [2, 17, 18, 47–49, 104, 108]. This shift is increasingly pervasive, with cloud-based services, such as Amazon, Twitter, Netflix, and eBay, having already adopted this application model [2, 17, 18]. There are several reasons that make microservices appealing, including the fact that they accelerate and facilitate development, they promote elasticity, and enable software heterogeneity, only requiring a common API for inter-microservice communication.
Despite their advantages, microservices also introduce new system challenges. They especially complicate resource management, as dependencies between tiers introduce backpressure effects, causing unpredictable performance to propagate through the system [48, 49]. Diagnosing such performance issues empirically is both cumbersome and prone to errors, especially as typical microservices deployments include hundreds or thousands of unique tiers. Similarly, current cluster managers [29, 38, 41, 44, 70, 72, 73, 75, 77, 82, 83, 86, 95, 99, 112, 115] are not expressive enough to account for the impact of microservice dependencies, thus putting more pressure on the need for automated root cause analysis systems.
Machine learning-based approaches have been effective in cluster management for batch applications [36], and for batch and interactive, single-tier services [38, 41]. On the performance debugging front, there has been increased attention on trace-based methods to
analyze [30, 46, 85], diagnose [19, 23, 32, 35, 54, 60, 63, 81, 91, 110, 113, 114], and in some cases anticipate [47, 49, 109] performance issues in cloud services. While most such systems target cloud applications, the only one focusing on microservices is Seer [49]. Seer leverages a deep learning model to anticipate upcoming QoS violations, and adjusts the resources per microservice to avoid them. Despite its high accuracy, Seer uses supervised learning, which requires offline and online trace labeling, as well as considerable kernel-level instrumentation and fine-grained tracing to track the number of outstanding requests across the system stack. In a production system this is non-trivial, as it involves injecting resource contention in live applications, which can impact performance and user experience.
We present Sage, a root cause analysis system that leverages unsupervised learning to identify the culprit of unpredictable performance in complex graphs of microservices in a scalable and practical manner. Specifically, Sage uses Causal Bayesian Networks to capture the dependencies between the microservices in an end-to-end application topology, and counterfactuals (events that happen given certain alternative conditions in a hypothetical world) through a Graphical Variational Autoencoder to examine the impact of microservices on end-to-end performance. Sage does not rely on data labeling, hence it can be entirely transparent to both cloud users and application developers, making it practical for large-scale deployments, scales well with the number of microservices and machines, and only relies on lightweight tracing that does not require application changes or kernel instrumentation, which would be difficult to obtain in practice. Sage targets performance issues caused by deployment, configuration, and resource provisioning reasons, as opposed to design bugs.
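As a toy illustration (a sketch only, not Sage's implementation; the span fields and the use of networkx are assumptions made for this example), the service dependency graph that underpins such a Causal Bayesian Network can be derived directly from parent/child relationships in RPC-level traces:

# Illustrative sketch: derive the service-dependency DAG (the skeleton over which
# a Causal Bayesian Network can be defined) from RPC-level spans.
# Field names ("service", "parent_service") are hypothetical.
import networkx as nx

def build_dependency_dag(spans):
    dag = nx.DiGraph()
    for span in spans:
        dag.add_node(span["service"])
        if span.get("parent_service") is not None:
            # Edge from caller to callee; latency propagates back along this edge.
            dag.add_edge(span["parent_service"], span["service"])
    return dag

spans = [
    {"service": "frontend", "parent_service": None},
    {"service": "logic", "parent_service": "frontend"},
    {"service": "backend", "parent_service": "logic"},
    {"service": "db", "parent_service": "backend"},
]
print(list(build_dependency_dag(spans).edges()))
# [('frontend', 'logic'), ('logic', 'backend'), ('backend', 'db')]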
We have evaluated Sage both on dedicated local clusters and large cluster settings on Google Compute Engine (GCE) with several end-to-end microservices [48], and showed that it correctly identifies the microservice(s) and system resources that initiated a QoS violation in over 93% of cases, and improves performance predictability without sacrificing resource efficiency.
2 RELATED WORK
Below we review work on the system implications of microservices,
cluster managers designed for multi-tier services and microservices,
and systems for cloud performance debugging.
2.1 System Implications of Microservices
The increasing popularity of fine-grained modular application design, microservices being an extreme materialization of it, has yielded a large amount of prior work on representative benchmark suites and studies on their characteristics [48, 55, 104]. µSuite [104] is an open-source multi-tier application benchmark suite containing several online data-intensive (OLDI) services, such as image similarity search, key-value stores, set intersections, and recommendation systems. DeathStarBench [48] presents five end-to-end interactive applications built with microservices, leveraging Apache Thrift [1], Spring Framework [12], and gRPC [5]. The services implement popular cloud applications, like social networks, e-commerce sites, and movie reviewing services. DeathStarBench also explores the hardware/software implications of microservices, including their resource bottlenecks, OS/networking overheads, cluster management challenges, and sensitivity to performance unpredictability. Accelerometer [105] characterizes the system overheads of several Facebook microservices, including I/O processing, logging, and compression. They also build an analytical model to predict the potential speedup of a microservice from hardware acceleration.
2.2 Microservices Cluster Management
Microservices have complicated dependency graphs, strict QoS targets, and are sensitive to performance unpredictability. Recent work has started exploring the resource management challenges of microservices. Suresh et al. [108] design Wisp, a dynamic rate limiting system for microservices, which prioritizes requests in the order of their deadline expiration. uTune [107] auto-tunes the threading model of multi-tier applications to improve their end-to-end performance. GrandSLAm [66] improves the resource utilization of ML microservices by estimating the execution time of each tier, and dynamically batching and reordering requests to meet QoS. Finally, SoftSKU [106] characterizes the performance of the same Facebook microservices as [105] across hardware and software configurations, and searches for their optimal resource configurations using A/B testing in production.
2.3 Cloud Performance Debugging
There is extensive prior work on monitoring and debugging performance and efficiency issues in cloud systems. Aguilera et al. [19] built a tool to construct the causal path of a service from RPC messages without access to source code. X-Trace [46] is a tracing framework portable across protocols and software systems that detects runtime performance issues in distributed systems. It can identify faults in several scenarios, including DNS resolution and overlay networks. Mystery Machine [33] leverages a large amount of cloud traces to infer the causal relationships between requests at runtime. There are also several production-level distributed tracing systems, including Dapper [100], Zipkin [16], Jaeger [7], and Google-Wide Profiling (GWP) [90]. Dapper, Zipkin and Jaeger record RPC-level traces for sampled requests across the calling stack, while GWP monitors low-level hardware metrics. These systems aim to facilitate locating performance issues, but are not geared towards taking action to resolve them.
Autopilot [94] is an online cluster management system that adjusts the number of tasks and CPU/memory limits automatically to reduce resource slack while guaranteeing performance. Sage differs from prior work on cloud scheduling, such as [41, 50, 76, 115], in that it locates the root cause of poor performance only using the end-to-end QoS target, without explicitly requiring per-tier performance service level agreements (SLAs) to be defined.
Root cause analysis systems for cloud applications are gaining increased attention, as the number of interactive applications continues to increase. Several of these proposals leverage statistical models to diagnose performance issues [54, 109, 113]. Cohen et al. [35] build tree-augmented Bayesian networks (TANs) to predict whether QoS will be violated, based on the correlation between performance and low-level metrics. Unfortunately, in multi-tier applications, correlation does not always imply causation, given
the existence of backpressure effects between dependent tiers. ExplainIt! [62] leverages a linear regression model to find root causes of poor performance in multi-stage data processing pipelines which optimize for throughput. While the regression model works well for batch jobs, latency is more sensitive to noise, and propagates across dependent tiers.
CauseInfer [28] as well as Microscope [71] build a causality graph using the PC-algorithm, and use it to identify root causes with different anomaly detection algorithms. As with ExplainIt!, they work well for data analytics, but would be impractical for latency-critical applications with tens of tiers, due to the high computation complexity of the PC-algorithm [65]. Finally, Seer [49] is a supervised CNN+LSTM model that anticipates QoS violations shortly before they happen. Because it is proactive, Seer can avoid poor performance altogether; however, it requires considerable kernel-level instrumentation to track the number of outstanding requests across the system stack at fine granularity, which is not practical in large production systems. It also requires data labeling to train its model, which requires injecting QoS violations in active services. This sensitivity to tracing frequency also exists in Sieve [111], which uses the Granger causality test to determine causal relationships between tiers [21, 101].
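For illustration, a minimal, hypothetical version of such a Granger causality check between two tiers' latency series can be written with statsmodels (the synthetic data below is not from any real deployment); the test asks whether one tier's latency history improves the prediction of another tier's latency:

# Hypothetical illustration of a Granger causality test between two tiers' latency
# series (synthetic data). statsmodels tests whether the series in the SECOND column
# Granger-causes the series in the FIRST column.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
tier_a = rng.normal(10.0, 1.0, 500)                      # upstream tier latency (ms)
tier_b = np.roll(tier_a, 2) + rng.normal(0.0, 0.5, 500)  # downstream tier lags A by 2 samples
grangercausalitytests(np.column_stack([tier_b, tier_a]), maxlag=3)
# Small p-values on the F-tests at lag >= 2 suggest tier A's past latency predicts tier B's.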
3 ML FOR PERFORMANCE DEBUGGING
3.1 Overview
Sage is a performance debugging and root cause analysis system for large-scale cloud applications. While the design centers around interactive microservices, where dependencies between tiers further complicate debugging, Sage is also applicable to monolithic architectures. Sage diagnoses the root cause [57] of end-to-end QoS violations, and applies appropriate corrective action to restore performance. Fig. 1 shows an overview of Sage's ML pipeline. Sage relies on two techniques, each of which is described in detail below; first, it automatically captures the dependencies between microservices using a Causal Bayesian Network (CBN) trained on RPC-level distributed traces [16, 100]. The CBN also captures the latency propagation from the backend to the frontend. Second, Sage uses a graphical variational auto-encoder (GVAE) to generate hypothetical scenarios (counterfactuals [51, 79]), which tweak the performance and/or usage of individual microservices to values known to meet QoS, and infers whether the change restores QoS. Using these two techniques, Sage determines which set of microservices initiated a QoS violation, and adjusts their deployment or resource allocation.
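As a schematic illustration of the counterfactual step (a sketch under assumptions, not Sage's implementation: predict_e2e_latency stands in for the trained GVAE, and the metric dictionaries and QoS target are hypothetical), one can substitute each service's observed metrics with values known to meet QoS, re-predict end-to-end latency, and flag the services whose substitution alone restores QoS:

# Schematic counterfactual root-cause scoring. `predict_e2e_latency` stands in for
# the trained GVAE/CBN model; metric values and the QoS target are hypothetical.
def rank_root_causes(observed, known_good, predict_e2e_latency, qos_target_ms):
    """observed / known_good: {service: {metric_name: value}} dictionaries."""
    culprits = []
    for svc in observed:
        # Counterfactual: "what if svc had behaved normally?"
        scenario = {s: dict(metrics) for s, metrics in observed.items()}
        scenario[svc] = dict(known_good[svc])
        predicted_ms = predict_e2e_latency(scenario)
        if predicted_ms <= qos_target_ms:
            culprits.append((svc, predicted_ms))
    # Services whose substitution yields the lowest predicted latency rank first.
    return sorted(culprits, key=lambda pair: pair[1])

In Sage the hypothetical scenarios are generated and evaluated by the GVAE rather than by simple substitution, but the decision rule is the same: a microservice is a root-cause candidate if restoring its behavior to nominal values restores end-to-end QoS.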
While prior work has highlighted the potential of ML for cloud performance debugging [49], such techniques rely exclusively on supervised models, which require injecting resource contention on active services to correctly label the training dataset with root causes of QoS violations [49]. This is problematic in practice, as it disrupts the performance of live services. Additionally, prior work requires high tracing frequency and heavy instrumentation to collect metrics like the number of outstanding requests across the system stack, which is not practical in a production system and can degrade performance.
[Figure 1: Sage's ML pipeline. (1) Build Causal Bayesian Network (CBN) and Graphical Variational Auto-Encoder (GVAE). (2) Process per-tier latency and usage. (3) Generate counterfactuals with GVAE. (4) Identify root cause services & resources.]

Sage instead adheres to the following design principles:

• Unsupervised learning: Sage does not require labeling training data, and it diagnoses QoS violations using low-frequency traces collected during live traffic using tracing systems readily available in most major cloud providers.
• Robustness to sampling frequency: Sage does not require tracking individual requests to detect temporal patterns, making it robust to tracing frequency. This is important, as production tracing systems like Dapper [100] employ aggressive sampling to reduce overheads [34, 96]. In comparison, previous studies [49, 98, 111] collect traces at millisecond granularity, which can introduce significant overheads.
• User-level metrics: Sage only uses user-level metrics, easily obtained through cloud monitoring APIs and service-level traces from distributed tracing frameworks, such as Jaeger [7]. It does not require any kernel-level information, which is expensive, or even inaccessible, in cloud platforms.
• Partial retraining: A major premise of microservices is enabling frequent updates. Retraining the entire system every time the code or deployment of a microservice changes is prohibitively expensive. Instead, Sage implements partial and incremental retraining, whereby only the microservice that changed and its immediate neighbors are retrained (a toy sketch of this rule follows this list).
• Fast resolution: Empirically examining sources of poor performance is costly in time and resources, especially given the ingest delay cloud systems have in consuming monitoring data, causing a change to take time before propagating to recorded traces. Sage models the impact of the different probable root causes concurrently, restoring QoS faster.
3.2 Microservice Latency Propagation
3.2.1 Single RPC Latency Decomposition.
Fig. 2 shows the latency decomposition of an RPC across client (sender) and server (receiver). The client initiates an RPC request via the rpc0_request API at (1). The request then waits in the RPC channel's send queue and gets written to the Linux network stack via the sendmsg syscall at (2). The packets pass through the TCP/IP protocol and are sent out from the client's NIC. They are then transmitted over the wire and switches and arrive at the server's