Self-adaptive container monitoring with
performance-aware load-shedding policies
NECST Group Conference 2017 @ Bloomberg
06/01/2017
Rolando Brondolin
rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
Infrastructure monitoring 2
• Easy to follow the behavior of small infrastructures
• But large-scale systems need:
– A systematic approach to monitoring and troubleshooting
– A tradeoff between data granularity and resource consumption
High visibility on system state, at a non-negligible cost
VS
Little information on system state, but cheap monitoring
Sysdig Cloud monitoring 4
http://www.sysdig.org
• Infrastructure for container monitoring
• Collects aggregated metrics and shows the system state:
– “Drill-down” from cluster-level to single-application metrics
– Dynamic network topology
– Alerting and anomaly detection
• Monitoring agent deployed on each machine of the cluster
– Traces system calls in a “streaming fashion”
– Aggregates data for threads, FDs, applications, containers and hosts
Problem definition 5
• The Sysdig Cloud agent can be modelled as a server S with a finite queue Q,
characterized by its arrival rate λ(t) and its service rate μ(t)
• Subject to overloading conditions:
– Cause: events arrive at a very high frequency
– Effect: queues grow indefinitely
– Issues: high usage of system resources, loss of events, output quality degradation
• Stability condition: overloading is avoided if and only if λ(t) ≤ μ(t);
an overloaded server should discard part of its input so that the service rate
matches the arrival rate
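The effect of overloading can be illustrated with a minimal discrete-time simulation of a finite-queue server (a hypothetical sketch, not the Sysdig agent; all names and values are illustrative):

```python
# Minimal discrete-time simulation of a server fed by a finite queue.
# Illustrates the stability condition mu(t) >= lambda(t): when the
# arrival rate exceeds the service rate, the queue saturates and
# events are lost.

def simulate(arrival_rate, service_rate, queue_cap, steps):
    queue, dropped = 0, 0
    for _ in range(steps):
        queue += arrival_rate               # events enqueued this tick
        served = min(queue, service_rate)   # server drains at mu(t)
        queue -= served
        if queue > queue_cap:               # finite queue: excess is lost
            dropped += queue - queue_cap
            queue = queue_cap
    return queue, dropped

# Stable regime: mu >= lambda, the queue stays empty and nothing is lost
assert simulate(1000, 1200, queue_cap=500, steps=100) == (0, 0)
# Overloaded regime: lambda > mu, the queue fills up and events are dropped
queue, dropped = simulate(1500, 1200, queue_cap=500, steps=100)
```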
Proposed solution: FFWD 6
Fast Forward With Degradation (FFWD) is a framework that tackles load peaks
in streaming applications via load-shedding techniques:
a general approach that leverages domain-specific details
self-adaptive container monitoring with performance-aware load-shedding policies
Target: monitoring
• Collects events
• Processes events
• Aggregates data
Decide when to shed load
• Observe application status
• Check system overload
Set where to shed load
• Act directly on the incoming stream
• Probabilistic approach
Choose how much load to shed
• Domain-specific pluggable policies
• Compute the shedding probability
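The “set where to shed load” step (acting probabilistically on the incoming stream) can be sketched as follows; this is an illustrative sketch, not the FFWD implementation, and all names are hypothetical:

```python
import random

# Probabilistic load-shedding filter: each event category gets a drop
# probability from the shedding plan, and events are discarded at the
# input source, before any processing happens.

def shed(events, plan, rng=random.random):
    """Keep an event of category c with probability 1 - plan[c]."""
    kept = []
    for category, payload in events:
        drop_p = plan.get(category, 0.0)   # unknown categories are kept
        if rng() >= drop_p:
            kept.append((category, payload))
    return kept

plan = {"low_prio": 0.8, "high_prio": 0.0}   # drop ~80% of low-priority events
events = ([("low_prio", i) for i in range(1000)]
          + [("high_prio", i) for i in range(100)])
survivors = shed(events, plan)
# All high-priority events survive; roughly 20% of low-priority ones do.
```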
FFWD: load shedding framework 7
Load manager
• Collects queue status and performance
• Heuristic feedback control (Little’s law, queuing theory)
• Implementation goals:
• Response time
• CPU utilization
Metrics correction
• Counts computed and dropped events
• Rescales aggregated metrics
Load shedding filter
• Placed at the input source
• Uses the shedding plan and input categories
• Probabilistic approach to event dropping
Policy wrapper
• Hosts domain-specific policies:
• Baseline policy
• Fair policy
• Priority-based policy
(the policies produce the shedding plan used by the filter)
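As a sketch of how a fair policy could turn the Load manager’s target throughput into a shedding plan, and how metrics correction could rescale aggregated metrics by the kept fraction (illustrative code under those assumptions, not the FFWD implementation):

```python
# Fair policy: one drop probability shared by all event categories, chosen
# so that the surviving throughput matches the target mu. Metrics
# correction: rescale an aggregated counter by the fraction of events
# that were actually computed. All names are hypothetical.

def fair_plan(arrival_rates, mu):
    """Same drop probability for every event category."""
    total = sum(arrival_rates.values())
    drop_p = max(0.0, 1.0 - mu / total) if total > 0 else 0.0
    return {cat: drop_p for cat in arrival_rates}

def correct_metric(observed_sum, computed_events, dropped_events):
    """Rescale an aggregated metric to account for dropped events."""
    kept_fraction = computed_events / (computed_events + dropped_events)
    return observed_sum / kept_fraction

plan = fair_plan({"nginx": 800_000, "fio": 400_000}, mu=600_000)
# 1.2M evt/s arriving vs. a 600K evt/s target: shed half of every category
assert abs(plan["nginx"] - 0.5) < 1e-9

# 300 bytes observed over 3 computed of 9 total events -> estimate 900 bytes
assert correct_metric(300, 3, 6) == 900.0
```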
Experimental setup 8
• We evaluated FFWD within Sysdig with two goals:
• System stability
• Output quality
• Results compared with the reference filtering system of Sysdig
• Evaluation setup:
• 2x Xeon E5-2650 v3, 20 cores (40 w/HT) @ 2.3 GHz
• 128 GB DDR4 RAM
• Tests selected from the Phoronix test suite

Homogeneous, syscall-intensive benchmarks:
test ID  name        priority  # evts/s
A        nginx       3         800K
B        postmark    4         1.2M
C        fio         4         1.3M
D        simplefile  2         1.5M
E        apache      2         1.9M

Heterogeneous benchmarks:
test ID  instances                        # evts/s
F        3x nginx, 1x fio                 1.3M
G        1x nginx, 1x simplefile          1.3M
H        1x apache, 2x postmark, 1x fio   1.8M
Experimental results 9
• System stability and control error
• MAPE between QoS requirement and CPU traces of FFWD and reference
• 3.51x average MAPE improvement; average MAPE below 5%

Ut = 1.1% — Control error MAPE
Test  reference  fair    priority
A     7.12%      1.78%   3.78%
B     34.06%     4.37%   4.46%
C     28.03%     2.27%   2.24%
D     11.52%     1.41%   1.54%
E     26.02%     8.51%   8.99%
F     22.67%     8.11%   3.74%
G     16.42%     3.37%   2.73%
H     19.92%     8.41%   8.01%
• Output quality
• MAPE between the exact output metrics and the approximated ones from FFWD and the reference
• The domain-specific policies outperform the reference in the majority of cases
[Figure: MAPE (%) on a log scale (0.1–100000) for latency metrics and volume metrics
(bytes r/w), file and net, per container (nginx-1, nginx-2, nginx-3, fio), comparing
the reference (kernel-drop), fair, and priority policies; workload: 1x fio, 3x nginx,
1.3M evt/s]
Conclusion 10
• We saw the main challenges of load shedding for container monitoring:
– Low-overhead monitoring
– High quality and granularity of metrics
• Fast Forward With Degradation (FFWD):
– Heuristic controller for bounded CPU usage
– Pluggable policies for domain-specific load shedding
– Accurate computation of output metrics
– Load shedding filter for fast dropping of events
11
Questions?
Rolando Brondolin, rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
NGC VIII 2017 @ SF
FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D.
Santambrogio. In Proceedings of the 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)
12
BACKUP SLIDES
13
Proposed solution: FFWD 14
Fast Forward With Degradation (FFWD) is a framework that tackles load peaks
in streaming applications via load-shedding techniques:
a general approach that leverages domain-specific details
Load Manager: *when*
Policy wrapper: *how much* (produces the shedding plan)
LS Filter: *where*
Aggregated metrics correction
Load Manager 15
Components: Load Manager, Policies, LS Filter, Metrics correction (SP = shedding plan)
• The Load Manager computes the throughput μ(t) that ensures stability such that:
λ(t) ≤ μ(t)
• The arrival rate λ(t) can vary unpredictably and exceed the system capacity μc(t)
(the rate of events executed per second); the service rate is then the sum of the
estimated capacity and the dropping rate μd(t) of the load shedder:
μ(t) = μc(t−1) + μd(t)
• Two Load Managers for two goals: response time and CPU utilization

Utilization-based Load Manager
• For streaming applications that must operate with a limited overhead
• The system can be characterized by its utilization and its queue size
(Little’s law, queuing theory):
N(t) = λ(t) · R(t)
Q(t) = Q(t−1) + λ(t) − μ(t)
U(t) = λ(t)/μmax + Q(t)/μmax
(CPU utilization = arrived events + residual events, over the maximum throughput)
Q(t) = μmax · U(t) − λ(t)
• Control error against the utilization goal Ū: e(t) = U(t) − Ū
• Requested throughput: μ(t+1) = λ(t) + μmax · e(t)
• The throughput at time t+1 is the arrival rate plus the maximum available
throughput times the feedback error: when e(t) goes to zero the stability
condition is met, while the term μmax · e(t) ensures a fast actuation in case of
a significant deviation from equilibrium
t), we can estimate the current system capacity as the number of events analyzed
last time period. Thus, for a given time t, equation (4.8) shows that the service
s the sum of the system capacity estimated and the number of events that we need
p to achieve the required stability:
µ(t) = µc(t 1) + µd(t) (4.8)
Utilization
s section we describe the Utilization based Load Manager, which becomes of use
e of streaming applications which should operate with a limited overhead. The
ation based Load Manager, which is showed in Figure 4.4, resorts to queuing theory
del the behavior of an application with a CPU usage constraint, relying on an
Response time CPU Utilization
Current utilization Target utilization
Utilization-based Load Manager
A. Shedding problem
The system in Figure 1 can be modeled by means of
Queuing Theory: the application is a single server node fed
by a queue, which provides the input jobs at a variable arrival
rate (t); the application is able to serve jobs at a service
rate µ(t). The system measures (t) and µ(t) in events per
second, where the events are respectively the input tweets and
the serviced tweets.
Starting from this, the simplest way to model the system
behavior is by means of the Little’s law (1), which states that
the number of jobs inside a system is equal to the input arrival
rate times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
he system in Figure 1 can be modeled by means of
euing Theory: the application is a single server node fed
a queue, which provides the input jobs at a variable arrival
(t); the application is able to serve jobs at a service
µ(t). The system measures (t) and µ(t) in events per
ond, where the events are respectively the input tweets and
serviced tweets.
tarting from this, the simplest way to model the system
avior is by means of the Little’s law (1), which states that
number of jobs inside a system is equal to the input arrival
times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
S:
Control error:
4.3. Policy wrapper and
equation (4.13). This leads to the final formulation of the Loa
(4.14), where the throughput at time t + 1 is a function of th
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is compose
the one hand, when the contribution of the feedback error e(
Requested throughput:
4.3. Policy wrapper and L
equation (4.13). This leads to the final formulation of the Load
(4.14), where the throughput at time t + 1 is a function of the
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is composed
the one hand, when the contribution of the feedback error e(t
condition of equation (4.15) is met; on the other hand, the secon
The system can be characterized
by its utilization and its queue size
Load Manager 15
Load Manager
LS Filter
Policies
SP
Metrics
correction
• The Load Manager computes the throughput μ(t) that
ensures stability such that:
• Two Load Managers for two goals:
tice that it is a sum of two different contributions. On the one hand, as the error e(t)
to zero, the stability condition (4.7) is met. On the other hand, the contribution:
(t) ensures a fast actuation in case of a significant deviation from the actual system
brium.
(t)  µ(t) (4.7)
course, during the lifetime of the system, the arrival rate (t) can vary unpre-
bly and can be greater than the system capacity µc(t), defined as the rate of events
uted per second. Given the control action µ(t) (i.e., the throughput of the system)
he system capacity, we can define µd(t) as the dropping rate of the LS. As we did
t), we can estimate the current system capacity as the number of events analyzed
last time period. Thus, for a given time t, equation (4.8) shows that the service
s the sum of the system capacity estimated and the number of events that we need
p to achieve the required stability:
µ(t) = µc(t 1) + µd(t) (4.8)
Utilization
s section we describe the Utilization based Load Manager, which becomes of use
e of streaming applications which should operate with a limited overhead. The
ation based Load Manager, which is showed in Figure 4.4, resorts to queuing theory
del the behavior of an application with a CPU usage constraint, relying on an
Response time CPU Utilization
Arrival rate
Max theoretical
throughput
Control error
Utilization-based Load Manager
A. Shedding problem
The system in Figure 1 can be modeled by means of
Queuing Theory: the application is a single server node fed
by a queue, which provides the input jobs at a variable arrival
rate (t); the application is able to serve jobs at a service
rate µ(t). The system measures (t) and µ(t) in events per
second, where the events are respectively the input tweets and
the serviced tweets.
Starting from this, the simplest way to model the system
behavior is by means of the Little’s law (1), which states that
the number of jobs inside a system is equal to the input arrival
rate times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
he system in Figure 1 can be modeled by means of
euing Theory: the application is a single server node fed
a queue, which provides the input jobs at a variable arrival
(t); the application is able to serve jobs at a service
µ(t). The system measures (t) and µ(t) in events per
ond, where the events are respectively the input tweets and
serviced tweets.
tarting from this, the simplest way to model the system
avior is by means of the Little’s law (1), which states that
number of jobs inside a system is equal to the input arrival
times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
S:
Control error:
4.3. Policy wrapper and
equation (4.13). This leads to the final formulation of the Loa
(4.14), where the throughput at time t + 1 is a function of th
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is compose
the one hand, when the contribution of the feedback error e(
Requested throughput:
4.3. Policy wrapper and L
equation (4.13). This leads to the final formulation of the Load
(4.14), where the throughput at time t + 1 is a function of the
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is composed
the one hand, when the contribution of the feedback error e(t
condition of equation (4.15) is met; on the other hand, the secon
The system can be characterized
by its utilization and its queue size
Load Manager 15
Load Manager
LS Filter
Policies
SP
Metrics
correction
• The Load Manager computes the throughput μ(t) that
ensures stability such that:
• Two Load Managers for two goals:
tice that it is a sum of two different contributions. On the one hand, as the error e(t)
to zero, the stability condition (4.7) is met. On the other hand, the contribution:
(t) ensures a fast actuation in case of a significant deviation from the actual system
brium.
(t)  µ(t) (4.7)
course, during the lifetime of the system, the arrival rate (t) can vary unpre-
bly and can be greater than the system capacity µc(t), defined as the rate of events
uted per second. Given the control action µ(t) (i.e., the throughput of the system)
he system capacity, we can define µd(t) as the dropping rate of the LS. As we did
t), we can estimate the current system capacity as the number of events analyzed
last time period. Thus, for a given time t, equation (4.8) shows that the service
s the sum of the system capacity estimated and the number of events that we need
p to achieve the required stability:
µ(t) = µc(t 1) + µd(t) (4.8)
Utilization
s section we describe the Utilization based Load Manager, which becomes of use
e of streaming applications which should operate with a limited overhead. The
ation based Load Manager, which is showed in Figure 4.4, resorts to queuing theory
del the behavior of an application with a CPU usage constraint, relying on an
Response time CPU Utilization
The requested throughput is used by the load shedding policies to derive the LS probabilities
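As an illustrative sketch (function name, parameter names and the clamping are assumptions of this example, not the agent's actual code), one control step of the utilization-based Load Manager could look like:

```python
def control_step(lam, utilization, target_u, mu_max):
    """One step of the utilization-based Load Manager (illustrative).

    lam:         measured arrival rate lambda(t), in events/s
    utilization: measured CPU utilization U(t) (e.g. 0.011 for 1.1%)
    target_u:    utilization set point (the QoS requirement Ut)
    mu_max:      maximum theoretical throughput, in events/s
    """
    e = target_u - utilization      # control error e(t)
    mu_next = lam + mu_max * e      # mu(t+1) = lambda(t) + mu_max * e(t)
    # clamp to the physically meaningful range (an assumption of this sketch)
    return max(0.0, min(mu_next, mu_max))
```

When the agent runs below the set point, e(t) > 0 and the requested throughput exceeds λ(t), so nothing is shed; above the set point, μ(t+1) falls below λ(t) and the policies translate the gap into load-shedding probabilities.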
Policy wrapper and policies 16
[Agent pipeline: SP → Load Manager → Policies → LS Filter → metrics correction]
• The policy wrapper provides access to statistics of processes, the requested throughput μ(t+1) and the system capacity μc(t)

Baseline policy
• Compute one LS probability for all processes (with μ(t+1) and μc(t))

Fair policy
• Assign to each process the “same” number of events
• Save metrics of small processes, still accurate results on big ones

Priority-based policy
• Assign a static priority to each process
• Compute a weighted priority to partition the system capacity
• Assign a partition to each process and compute the probabilities
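A minimal sketch of how a fair policy could split the requested throughput μ(t+1) among processes. The water-filling style redistribution (small processes keep all their events, the leftover is split among the big ones) is an assumption of this example; FFWD's actual bookkeeping may differ:

```python
def fair_drop_probabilities(event_rates, capacity):
    """Fair shedding sketch: give every process the same share of the
    capacity; processes below their share keep all events, and the
    unused budget is redistributed among the remaining processes.

    event_rates: dict process -> observed events/s
    capacity:    requested throughput mu(t+1), in events/s
    """
    shares = {}
    remaining = capacity
    pending = dict(event_rates)
    while pending:
        fair = remaining / len(pending)
        small = {p: r for p, r in pending.items() if r <= fair}
        if not small:
            # every remaining process exceeds the fair share: cap them all
            for p in pending:
                shares[p] = fair
            break
        for p, r in small.items():
            shares[p] = r          # small process keeps its full rate
            remaining -= r
            del pending[p]
    # drop probability: fraction of each process's events above its share
    return {p: max(0.0, 1.0 - shares[p] / r)
            for p, r in event_rates.items() if r > 0}
```

With rates {a: 100, b: 1000} evt/s and a capacity of 600 evt/s, the small process a is untouched while b is shed with probability 0.5.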
Load Shedding Filter 17
[Agent pipeline: SP → Load Manager → Policies → LS Filter → metrics correction]
• The Load Shedding Filter applies the probabilities computed by the policies to the input stream
• For each event:
  • look up the load shedding probability for its input class in the shedding plan
  • if no data is found, we can drop the event
  • otherwise, apply the load shedding probability computed by the policy
• The dropped events are reported to the application for metrics correction
[Event Capture → event buffers → Load Shedding Filter, which consults the Shedding Plan and marks each event ok or ko according to the drop probability]
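The per-event decision above can be sketched as follows (function and structure names are illustrative, not the agent's API; the dropped-event counts are what metrics correction would consume):

```python
import random

def ls_filter(events, shedding_plan, default_drop=True):
    """Apply per-class drop probabilities to an event stream.

    events:        iterable of (input_class, payload) tuples
    shedding_plan: dict input_class -> drop probability in [0, 1]
    """
    kept, dropped = [], {}
    for input_class, payload in events:
        p = shedding_plan.get(input_class)
        if p is None:
            # no data for this class: we can drop the event
            if default_drop:
                dropped[input_class] = dropped.get(input_class, 0) + 1
                continue
            p = 0.0
        if random.random() < p:
            # shed the event, but account for it for metrics correction
            dropped[input_class] = dropped.get(input_class, 0) + 1
        else:
            kept.append((input_class, payload))
    return kept, dropped
```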
Output quality - homogeneous 18
• QoS requirement Ut = 1.1%, the standard set-point for the agent
• MAPE (lower is better) between exact and approximated metrics
• Output metrics on latency and volume for file and network operations
• Similar or better results of the FFWD fair policy w.r.t. the reference
• FFWD stays accurate even though it drops more events
• Predictable and repetitive behavior of nginx, fio and apache
[Bar charts: MAPE (%) on a log scale for latency-file, latency-net, volume-file and volume-net, comparing kernel-drop, fair and reference on apache (1.9M evt/s), postmark (1.2M evt/s), simplefile (1.5M evt/s), fio (1.3M evt/s) and nginx (800K evt/s)]
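For reference, the error metric used throughout these result slides is the standard Mean Absolute Percentage Error, which can be sketched as:

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error (in %) between the exact metrics
    and the metrics reconstructed after load shedding; lower is better."""
    pairs = list(zip(exact, approx))
    return 100.0 * sum(abs((e - a) / e) for e, a in pairs) / len(pairs)
```

For example, exact metrics [100, 200] approximated as [90, 220] give a MAPE of 10%.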
System stability 19
• We evaluated the Load Manager with all the tests (A–H)
• With 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity)
• Measuring the CPU load of the sysdig agent with:
  • the reference implementation
  • FFWD with the fair and priority policies
• We compared the actual CPU load with the QoS requirement (Ut)
• Error measured with MAPE (lower is better), obtained running each benchmark 20 times
• 3.51x average MAPE improvement, average MAPE below 5%

Test (Ut = 1.1%) | reference | fair | priority
A | 7.12% | 1.78% | 3.78%
B | 34.06% | 4.37% | 4.46%
C | 28.03% | 2.27% | 2.24%
D | 11.52% | 1.41% | 1.54%
E | 26.02% | 8.51% | 8.99%
F | 22.67% | 8.11% | 3.74%
G | 16.42% | 3.37% | 2.73%
H | 19.92% | 8.41% | 8.01%
Output quality - heterogeneous 20
• We mixed the homogeneous tests to:
  • simulate a co-located environment
  • add OS scheduling uncertainty and noise
• QoS requirement Ut = 1.1%
• MAPE (lower is better) between exact and approximated metrics
• Compare metrics from the reference, FFWD fair and FFWD priority
• Three tests with different syscall mixes:
  • Network-based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s
  • Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s
  • Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
1x Fio, 3x Nginx, 1.3M evt/s 21
[Bar charts, MAPE (%) on a log scale (lower is better): latency metrics (latency-file, latency-net) and volume metrics (byte r/w: volume-file, volume-net) for fio, nginx-1, nginx-2 and nginx-3, comparing kernel-drop, fair, priority and reference]
1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 22
[Bar charts, MAPE (%) on a log scale (lower is better): volume metrics (byte r/w) and latency metrics for apache, fio, postmark-1 and postmark-2, comparing kernel-drop, fair, priority and reference]
23
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
reference
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
reference
Volume metrics (byte r/w)
Latency metrics
MAPE lower is better
Test H, mixed workloads: 1x apache, 1x fio, 2x postmark, 1.8M evt/s
• The Fair policy outperforms the reference in almost all cases
• the LS filter works at the single-event level
• the reference drops events in batches
• The Priority policy improves on the Fair policy results in most cases
• the prioritized processes are privileged
• the other processes are treated as “best-effort”
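The two policies above can be sketched as shedding-plan computations. This is a minimal illustration, not the actual Sysdig agent code: the function names, the plan format (process name to drop probability), and the "lower number = higher priority" convention are all assumptions.

```python
# Sketch of the Fair and Priority load-shedding policies: given the
# per-process arrival rates (events/s) and a sustainable throughput mu,
# compute per-process drop probabilities for the shedding plan.

def fair_plan(arrivals, mu):
    """Fair policy: the same drop probability for every process."""
    lam = sum(arrivals.values())
    p_drop = max(0.0, 1.0 - mu / lam) if lam > 0 else 0.0
    return {proc: p_drop for proc in arrivals}

def priority_plan(arrivals, priorities, mu):
    """Priority policy: serve high-priority processes first, the rest
    are best-effort (assumption: lower number = higher priority)."""
    plan = {}
    budget = mu
    for proc in sorted(arrivals, key=lambda p: priorities[p]):
        lam_p = arrivals[proc]
        kept = min(lam_p, budget)       # events/s this process may keep
        plan[proc] = 1.0 - kept / lam_p if lam_p > 0 else 0.0
        budget -= kept
    return plan
```

With `arrivals = {"apache": 500, "fio": 300}` and `mu = 600`, the fair plan drops 25% of every process, while the priority plan keeps apache intact and sheds fio harder: this is the "prioritized processes are privileged" behavior described above.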
1x simplefile, 1x nginx, 1.3M evt/s 24
[Figure: MAPE (%, log scale) of volume metrics (byte r/w) and latency metrics for the reference (kernel-drop), fair, and priority policies on the nginx and simplefile workloads; lower is better]
Response time Load Manager 25
• The system S can be characterized by its response time and by the number of jobs in the system (Little’s Law)
• Control error: computed from the old (measured) response time and the target response time Rt
• Requested throughput: derived from the control error and the arrival rate λ(t)
• The requested throughput is used by the load shedding policies to derive the LS probabilities
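The equations on these slides were rendered as images and did not survive extraction. A plausible reconstruction from Little's Law and the annotated terms (old vs. target response time for the control error; arrival rate and control error for the requested throughput) is sketched below; the exact control law and gains used by FFWD may differ.

```latex
\begin{aligned}
N(t) &= \lambda(t)\, R(t) && \text{(Little's Law: jobs in the system)} \\
e(t) &= R(t-1) - R_t && \text{(control error: old vs.\ target response time)} \\
\mu(t+1) &\approx \lambda(t)\left(1 + \frac{e(t)}{R_t}\right) && \text{(requested throughput)}
\end{aligned}
```

Under this reading, when the measured response time exceeds the target ($e(t) > 0$) the requested throughput exceeds the arrival rate, and the shedding policies translate the gap between $\mu(t+1)$ and the sustainable service rate into drop probabilities.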
Real-time sentiment analysis 28
• Real-time sentiment analysis makes it possible to:
– Track the sentiment of a topic over time
– Correlate real-world events with the related sentiment, e.g.
• Toyota crisis (2010) [1]
• 2012 US Presidential Election Cycle [2]
– Track the online evolution of companies’ reputation, derive social profiling, and enable enhanced social marketing strategies
[1] Bifet, Albert, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research: Workshop and Conference Proceedings Series, 2011.
[2] Wang, Hao, et al. "A system for real-time Twitter sentiment analysis of 2012 US presidential election cycle." Proceedings of the ACL 2012 System Demonstrations, 2012.
Sentiment analysis: case study 29
• Simple Twitter streaming sentiment analyzer with Stanford NLP
• System components:
– Event producer
– RabbitMQ queue
– Event consumer
• Consumer components:
– Event Capture
– Sentiment Analyzer
– Sentiment Aggregator
• Real-time queue consumption; aggregated metrics (keyword and hashtag sentiment) emitted every second
FFWD: Sentiment analysis 30
• FFWD adds four components:
– Load shedding filter at the beginning of the pipeline
– Shedding plan used by the filter
– Domain-specific policy wrapper
– Application controller (Load Manager) to detect load peaks
[Diagram: input tweets flow from the Producer through the Load Shedding Filter, which drops events according to the drop probabilities in the Shedding Plan and forwards the rest to the Event Capture, Sentiment Analyzer, and Sentiment Aggregator; the Load Manager receives the stream stats (λ(t), R(t), ko count) and the target Rt, computes μ(t+1), and the Policy Wrapper produces the updated plan; real-time and batch queues, ok/ko paths, and output metrics are also shown]
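The filter at the head of the pipeline can be very small. Below is a minimal sketch of a probabilistic load-shedding filter consulting a shedding plan; class and method names and the plan layout (event category to drop probability) are assumptions for illustration, not the actual FFWD implementation.

```python
import random

class LoadSheddingFilter:
    """Drop incoming events probabilistically, per category.

    The shedding plan maps an event category (e.g. a keyword or a
    process name) to a drop probability in [0, 1]; the plan is replaced
    wholesale whenever the policy wrapper recomputes it.
    """

    def __init__(self):
        self.plan = {}       # category -> drop probability
        self.kept = 0
        self.dropped = 0     # the "ko count" fed back to the Load Manager

    def update_plan(self, plan):
        self.plan = dict(plan)

    def accept(self, category, rng=random.random):
        p_drop = self.plan.get(category, 0.0)  # unknown categories pass
        if rng() < p_drop:
            self.dropped += 1
            return False     # shed this event
        self.kept += 1
        return True          # forward it down the pipeline
```

Only events whose `accept()` returns `True` reach the analyzer; the kept/dropped counters can later rescale the aggregated metrics, as in the metrics-correction step.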
Sentiment - experimental setup 31
• Separate tests to understand FFWD behavior:
– System stability
– Output quality
• Dataset: 900K tweets from the 35th week of the Premier League
• Performed tests:
– Controller: synthetic and real tweets at various λ(t)
– Policy: real tweets at various λ(t)
• Evaluation setup
– Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC
– 8 GB RAM @ 1600 MHz
System stability 32
λ(t) estimation:
case A: λ(t) = λ(t-1)
case B: λ(t) = avg(λ(t))
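The two estimators can be sketched as follows; this is a minimal illustration (the sampling period and window handling in the real agent may differ):

```python
from collections import deque

def estimate_last(history):
    """Case A: use the last observed arrival rate, lambda(t) = lambda(t-1)."""
    return history[-1]

def estimate_avg(history, window=None):
    """Case B: average the observed arrival rates, lambda(t) = avg(lambda),
    optionally over a sliding window of the most recent samples."""
    xs = list(history)[-window:] if window else list(history)
    return sum(xs) / len(xs)

# One arrival-rate sample per control period, bounded history.
history = deque(maxlen=60)
for lam in (100.0, 200.0, 400.0):
    history.append(lam)
```

Case A reacts instantly to load peaks but is noisy; case B smooths transients at the cost of lagging behind a sudden λ(t) increase.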
Load Manager showcase (1)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– response time:
33
[Plot “Controller performance”: response time R (s) vs. time (s) over 300 s, against the QoS = 5 s target]
Load Manager showcase (2)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– throughput:
34
[Plot “Actuation”: #events vs. time (s) over 300 s, with the lambda, dropped, computed, and mu series]
Output Quality 35
• Real tweets, μc(t) ≃ 40 evt/s
• Evaluated policies:
• Baseline
• Fair
• Priority
• R = 5s, λ(t) = 100 evt/s, 200 evt/s, 400 evt/s
• Error metric: Mean Absolute Percentage Error (MAPE %), lower is better
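The error metric is standard; for reference, a minimal MAPE implementation over an exact and an approximated metric series (skipping zero reference values is an assumption of this sketch):

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error between two equal-length series.

    Zero reference values are skipped to avoid division by zero.
    """
    pairs = [(e, a) for e, a in zip(exact, approx) if e != 0]
    return 100.0 * sum(abs((e - a) / e) for e, a in pairs) / len(pairs)
```

For example, `mape([100, 200], [90, 220])` averages a 10% and a 10% error into a MAPE of 10%.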
[Bar charts: MAPE (%) of the baseline, fair, and priority policies for groups A-D at λ(t) = 100, 200, and 400 evt/s]

Self-adaptive container monitoring with performance-aware Load-Shedding policies

  • 1. 1 Self-adaptive container monitoring with performance-aware load-shedding policies NECST Group Conference 2017 @ Bloomberg 06/01/2017 Rolando Brondolin [email protected] DEIB, Politecnico di Milano
  • 2. Infrastructure monitoring 2 • Easy to follow behavior of small infrastructures • But on large-scale systems: – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption high visibility on system state non negligible cost few information on system state cheap monitoring VS
  • 3. Infrastructure monitoring 3 • Easy to follow behavior of small infrastructures • But on large-scale systems: – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption high visibility on system state non negligible cost few information on system state cheap monitoring VS
  • 4. Sysdig Cloud monitoring 4 http://www.sysdig.org • Infrastructure for container monitoring • Collects aggregated metrics and shows system state: – “Drill-down” from cluster to single application metrics – Dynamic network topology – Alerting and anomaly detection • Monitoring agent deployed on each machine in the cluster – Traces system calls in a “streaming fashion” – Aggregates data for Threads, FDs, applications, containers and hosts
  • 5. Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 S λ(t) φ(t) μ(t) Λ Φ Q
  • 6. Cause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 Events arrives at really high frequency S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If
  • 7. EffectCause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 Events arrives at really high frequency Queues grow indefinitely S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If
  • 8. IssuesEffectCause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 Events arrives at really high frequency Queues grow indefinitely High usage of system resources Loss of events S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If Output quality degradation
  • 9. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques 6
  • 10. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6
  • 11. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies
  • 12. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data
  • 13. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data Decide when to shed load • Observe application status • Check system overload
  • 14. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data Decide when to shed load • Observe application status • Check system overload Choose how much load to shed • domain-specific pluggable policies • compute shedding probability
  • 15. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data Decide when to shed load • Observe application status • Check system overload Set where to shed load • act directly on incoming stream • probabilistic approach Choose how much load to shed • domain-specific pluggable policies • compute shedding probability
  • 16. FFWD: load shedding framework 7
  • 17. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization
  • 18. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy
  • 19. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy shedding plan
  • 20. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Load shedding filter • Placed at the input source • Uses shedding plan and input categories •Probabilistic approach to events drop Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy shedding plan
  • 21. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Metrics correction • Counts computed and dropped events • Rescale aggregated metrics Load shedding filter • Placed at the input source • Uses shedding plan and input categories •Probabilistic approach to events drop Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy shedding plan
  • 22. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8
  • 23. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 24. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 25. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 26. Experimental results • System stability and control error • MAPE between QoS requirement and CPU traces of FFWD and reference • 3.51x average MAPE improvement,
 average MAPE below 5% 9 Test Ut = 1.1% - Control error MAPE reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 27. Experimental results • System stability and control error • MAPE between QoS requirement and CPU traces of FFWD and reference • 3.51x average MAPE improvement,
 average MAPE below 5% 9 Test Ut = 1.1% - Control error MAPE reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 28. Experimental results • System stability and control error • MAPE between QoS requirement and CPU traces of FFWD and reference • 3.51x average MAPE improvement,
 average MAPE below 5% 9 Test Ut = 1.1% - Control error MAPE reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 29. Experimental results • System stability and control error: MAPE between the QoS requirement and the CPU traces of FFWD and the reference; 3.51x average MAPE improvement, average MAPE below 5% • Control-error MAPE (Ut = 1.1%):
 Test | reference | fair | priority
 A | 7.12% | 1.78% | 3.78%
 B | 34.06% | 4.37% | 4.46%
 C | 28.03% | 2.27% | 2.24%
 D | 11.52% | 1.41% | 1.54%
 E | 26.02% | 8.51% | 8.99%
 F | 22.67% | 8.11% | 3.74%
 G | 16.42% | 3.37% | 2.73%
 H | 19.92% | 8.41% | 8.01%
 • Output quality: MAPE between the exact output metrics and the approximated ones from FFWD and the reference; the domain-specific policies outperform the reference in the majority of cases
 [Figure: MAPE (%, log scale) of latency and volume (bytes r/w) metrics for fio and nginx-1/2/3 under the kernel-drop, fair, and priority policies vs. the reference; workload: 1x Fio, 3x Nginx, 1.3M evt/s]
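For reference, the MAPE metric used throughout these results can be computed in a few lines (a hypothetical helper for illustration, not the agent's code):

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error (in %) between an exact series
    and its approximation; lower is better."""
    assert len(exact) == len(approx) and all(x != 0 for x in exact)
    return 100.0 * sum(abs((x - a) / x) for x, a in zip(exact, approx)) / len(exact)

# e.g. a CPU trace compared against a constant 1.1% utilization set-point
target = [1.1, 1.1, 1.1, 1.1]
trace = [1.2, 1.0, 1.15, 1.05]
error = mape(target, trace)
```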
  • 34. Conclusion • We saw the main challenges of load shedding for container monitoring: – Low-overhead monitoring – High quality and granularity of metrics • Fast Forward With Degradation (FFWD): – Heuristic controller for bounded CPU usage – Pluggable policies for domain-specific load shedding – Accurate computation of output metrics – Load Shedding Filter for fast drop of events
  • 35. 11 Questions? Rolando Brondolin, [email protected] DEIB, Politecnico di Milano NGC VIII 2017 @ SF FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D. Santambrogio. In Proceedings of 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)
  • 43. Proposed solution: FFWD • Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques: a general approach that leverages domain-specific details • Components: Load Manager (decides *when* to shed) – Policy wrapper (decides *how much*) – LS Filter (decides *where*, driven by the shedding plan) – aggregated-metrics correction
  • 44. Load Manager • The Load Manager computes the throughput μ(t) that ensures stability, i.e. λ(t) ≤ μ(t) • The arrival rate λ(t) can vary unpredictably and exceed the system capacity μc(t), defined as the rate of events computed per second; estimating the capacity as the events analyzed in the last period, the service rate is the estimated capacity plus the events to drop: μ(t) = μc(t−1) + μd(t), where μd(t) is the dropping rate of the load shedder • Two Load Managers for two goals: response time and CPU utilization
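The capacity/drop-rate decomposition can be sketched numerically; `required_throughput` and `min_drop_rate` are hypothetical helper names illustrating μ(t) = μc(t−1) + μd(t) and the stability condition λ(t) ≤ μ(t):

```python
def required_throughput(capacity_prev, drop_rate):
    """mu(t) = mu_c(t-1) + mu_d(t): service rate as the previously
    estimated capacity plus the events dropped per second."""
    return capacity_prev + drop_rate

def min_drop_rate(lam, capacity_prev):
    """Smallest mu_d(t) that keeps lam(t) <= mu(t) given the
    estimated capacity; zero when the system is not overloaded."""
    return max(0.0, lam - capacity_prev)
```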
  • 45. Utilization-based Load Manager • Used for streaming applications that must operate with a limited overhead; the application is modeled via queuing theory as a single server fed by a queue, with arrival rate λ(t) and service rate μ(t), both in events per second • Little's law: N(t) = λ(t) · R(t) • Residual (arrived minus served) events: Q(t) = Q(t−1) + λ(t) − μ(t) • CPU utilization, with μmax the maximum theoretical throughput: U(t) = λ(t)/μmax + Q(t)/μmax, i.e. Q(t) = μmax · U(t) − λ(t) • Control error between the current and the target utilization: e(t) = Ūt − U(t) • Requested throughput: μ(t+1) = λ(t) + μmax · e(t), equivalently μmax · Ūt − Q(t) • The requested throughput is used by the load-shedding policies to derive the LS probabilities
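A single step of this utilization-based controller can be sketched as follows (a minimal sketch; the names are ours, and we take the control error as target minus current utilization so that over-utilization lowers the requested throughput):

```python
def utilization_lm_step(lam, q_prev, mu_prev, mu_max, u_target):
    """One control step of a utilization-based load manager (sketch).
    lam: arrival rate lambda(t); q_prev: residual events Q(t-1);
    mu_prev: throughput mu(t); mu_max: max theoretical throughput;
    u_target: utilization set-point (e.g. 0.011 for 1.1%)."""
    q = max(0.0, q_prev + lam - mu_prev)   # Q(t) = Q(t-1) + lam(t) - mu(t)
    u = (lam + q) / mu_max                 # U(t) = lam(t)/mu_max + Q(t)/mu_max
    e = u_target - u                       # e(t): error vs. target utilization
    mu_next = lam + mu_max * e             # mu(t+1) = lam(t) + mu_max * e(t)
    return q, u, e, mu_next
```

Note that the update simplifies to mu(t+1) = mu_max * u_target − Q(t): serve up to the target capacity minus the backlog.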
  • 53. Policy wrapper and policies • The policy wrapper provides the policies with per-process statistics, the requested throughput μ(t+1), and the system capacity μc(t) • Baseline policy: compute one LS probability for all processes (from μ(t+1) and μc(t)) • Fair policy: assign each process the “same” number of events; saves the metrics of small processes while keeping accurate results on the big ones • Priority-based policy: assign a static priority to each process, compute a weighted priority to partition the system capacity, then assign a partition to each process and compute the probabilities
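The fair policy's even split of the event budget can be sketched as a small water-filling routine (hypothetical names and logic for illustration; FFWD's actual implementation may differ):

```python
def fair_drop_probabilities(arrivals, budget):
    """Split an event budget 'fairly' across processes: processes below the
    fair share keep all their events, the surplus is re-split among the rest,
    and each process p gets drop probability 1 - kept_p / arrived_p."""
    remaining = dict(arrivals)            # processes still competing for budget
    kept = {p: 0.0 for p in arrivals}     # events each process may keep
    budget_left = float(budget)
    while remaining and budget_left > 1e-9:
        share = budget_left / len(remaining)
        satisfied = [p for p, a in remaining.items() if a <= share]
        if not satisfied:                 # everyone exceeds the share: split evenly
            for p in remaining:
                kept[p] += share
            break
        for p in satisfied:               # small processes keep everything
            kept[p] = remaining.pop(p)
        budget_left = budget - sum(kept.values())
    return {p: 1.0 - min(1.0, kept[p] / arrivals[p]) for p in arrivals}
```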
  • 56. Load Shedding Filter • The Load Shedding Filter applies the probabilities computed by the policies to the input stream • For each event: look up the load-shedding probability for its input class in the shedding plan; if no entry is found the event can be dropped, otherwise apply the probability computed by the policy • Dropped events are reported to the application for metrics correction
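The per-event lookup-then-drop logic can be sketched as follows (hypothetical helper, assuming the shedding plan maps input classes to drop probabilities and that drop counts are returned for metrics correction):

```python
import random

def ls_filter(events, shedding_plan, rng=random.random):
    """Apply the shedding plan to a stream of (input_class, payload) events.
    Unknown classes are dropped; otherwise the event is dropped with the
    class's probability. Returns kept events and per-class drop counts."""
    kept, dropped = [], {}
    for cls, payload in events:
        p_drop = shedding_plan.get(cls)
        if p_drop is None or rng() < p_drop:
            dropped[cls] = dropped.get(cls, 0) + 1   # reported for correction
        else:
            kept.append((cls, payload))
    return kept, dropped
```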
  • 57. Output quality - homogeneous • QoS requirement Ut = 1.1%, the standard set-point for the agent • MAPE (lower is better) between exact and approximated metrics • Output metrics on latency and volume for file and network operations • Similar or better results for the FFWD fair policy w.r.t. the reference • FFWD remains accurate even though it drops more events, thanks to the predictable and repetitive behavior of nginx, fio, and apache
 [Figure: MAPE (%, log scale) of latency-file, latency-net, volume-file, and volume-net under the kernel-drop and fair policies vs. the reference, per workload: apache 1.9M evt/s, postmark 1.2M evt/s, simplefile 1.5M evt/s, fio 1.3M evt/s, nginx 800K evt/s]
  • 61. Output quality - homogeneous • QoS requirement Ut 1.1%, standard set-point for the agent • MAPE (lower is better) between exact and approximated metrics • Output metrics on latency and volume for file and network operations • Similar or better results of FFWD fair policy w.r.t reference • FFWD accurate even if drops more events • Predictable and repetitive behavior of nginx, fio and apache 18 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair reference 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair apache 1.9M evt/s postmark 1.2M evt/s simplefile 1.5M evt/s fio 1.3M evt/s nginx 800K evt/s
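The error metric used throughout these slides is the standard Mean Absolute Percentage Error; a straightforward sketch:

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error (%) between an exact metric
    series and its approximated counterpart; lower is better.
    Assumes the exact values are non-zero."""
    pairs = list(zip(exact, approx))
    return 100.0 * sum(abs((e - a) / e) for e, a in pairs) / len(pairs)
```

For example, `mape([100, 200], [110, 180])` gives 10.0: the per-point relative errors are 10% and 10%, averaged.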
 • 62. System stability 19 • We evaluated the Load Manager with all the tests (A, B, C, D, E, F, G, H) • With 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity) • Measuring the CPU load of the sysdig agent with: • the reference implementation • FFWD with the fair and priority policies • We compared the actual CPU load with the QoS requirement (Ut) • Error measured with MAPE (lower is better), obtained by running each benchmark 20 times • 3.51x average MAPE improvement, average MAPE below 5%

 Test (Ut = 1.1%)   reference   fair    priority
 A                   7.12%      1.78%    3.78%
 B                  34.06%      4.37%    4.46%
 C                  28.03%      2.27%    2.24%
 D                  11.52%      1.41%    1.54%
 E                  26.02%      8.51%    8.99%
 F                  22.67%      8.11%    3.74%
 G                  16.42%      3.37%    2.73%
 H                  19.92%      8.41%    8.01%
 • 65. Output quality - heterogeneous • We mixed the homogeneous tests to: • simulate a co-located environment • add OS scheduling uncertainty and noise • QoS requirement Ut = 1.1% • MAPE (lower is better) between exact and approximated metrics • Metrics compared across the reference, FFWD fair and FFWD priority • Three tests with different syscall mixes: • Network-based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s • Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s • Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 20
 • 67. 1x Fio, 3x Nginx, 1.3M evt/s 21 [Plots: MAPE (%, log scale, lower is better) for latency and volume (byte r/w) metrics, file and net, comparing the reference (kernel-drop), fair and priority policies across fio, nginx-1, nginx-2 and nginx-3]
 • 70. 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 22 [Plots: MAPE (%, log scale, lower is better) for volume (byte r/w) and latency metrics, file and net, comparing the reference (kernel-drop), fair and priority policies across apache, fio, postmark-1 and postmark-2]
 • 74. Test H, mixed workloads: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 23 • The fair policy outperforms the reference in almost all cases • the LS Filter works at the single-event level • the reference drops events in batches • The priority policy improves on the fair policy results in most cases • the prioritized processes are privileged • other processes are treated as “best-effort” [Plots: MAPE (%, log scale, lower is better) for volume (byte r/w) and latency metrics, comparing the reference (kernel-drop), fair and priority policies]
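The difference between the two policies can be sketched as follows; a hedged reconstruction from the slides' description (fair splits the requested throughput equally across input classes, priority serves the prioritized classes first and treats the rest as best-effort), not the exact FFWD formulas:

```python
def fair_plan(arrival_rates, mu_next):
    """Fair policy sketch: share the requested throughput mu_next
    equally among the input classes, then turn each per-class budget
    into a drop probability."""
    budget = mu_next / len(arrival_rates)
    return {c: max(0.0, 1.0 - budget / lam) if lam > 0 else 0.0
            for c, lam in arrival_rates.items()}

def priority_plan(arrival_rates, mu_next, prioritized):
    """Priority policy sketch: serve the prioritized classes first,
    then share the residual throughput fairly among the best-effort
    classes."""
    plan = {}
    residual = mu_next
    for c in prioritized:
        lam = arrival_rates[c]
        served = min(lam, max(0.0, residual))
        plan[c] = 1.0 - served / lam if lam > 0 else 0.0
        residual -= served
    rest = [c for c in arrival_rates if c not in prioritized]
    if rest:
        budget = max(0.0, residual) / len(rest)
        for c in rest:
            lam = arrival_rates[c]
            plan[c] = max(0.0, 1.0 - budget / lam) if lam > 0 else 0.0
    return plan
```

With two classes at 100 evt/s each and a 100 evt/s budget, the fair plan drops 50% of each class, while prioritizing one class keeps it intact and sheds the other entirely.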
 • 75. 1x simplefile, 1x nginx, 1.3M evt/s 24 [Plots: MAPE (%, log scale, lower is better) for latency and volume (byte r/w) metrics, file and net, comparing the reference (kernel-drop), fair and priority policies across simplefile and nginx]
 • 79. Response time Load Manager 25 • The system can be characterized by its response time and the jobs in the system, via Little's Law applied to the server S • Control error: gap between the target response time and the old (measured) response time • Requested throughput: derived from the control error and the arrival rate, and used by the load shedding policies to derive the LS probabilities
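The controller equations on these slides are rendered as images in the deck; a hedged reconstruction, consistent with the queueing model of the earlier slides (queue length q(t), service rate μ(t), target response time Rt), could read:

```latex
% Little's Law: response time as jobs in the system over service rate
R(t) = \frac{q(t)}{\mu(t)}
% Control error w.r.t. the target response time R_t
e(t) = R_t - R(t)
% Requested throughput for the next interval, handed to the
% load-shedding policies to derive the drop probabilities
\mu(t+1) = \frac{q(t)}{R_t}
```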
 • 82. Real-time sentiment analysis 28 • Real-time sentiment analysis makes it possible to: – Track the sentiment of a topic over time – Correlate real-world events and the related sentiment, e.g. • Toyota crisis (2010) [1] • 2012 US Presidential Election Cycle [2] – Track the online evolution of companies' reputation, derive social profiling and enable enhanced social marketing strategies [1] Bifet Figuerol, Albert Carles, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research: Workshop and Conference Proceedings Series. 2011. [2] Wang, Hao, et al. "A system for real-time twitter sentiment analysis of 2012 us presidential election cycle." Proceedings of the ACL 2012 System Demonstrations.
 • 83. Sentiment analysis: case study 29 • Simple Twitter streaming sentiment analyzer built with Stanford NLP • System components: – Event producer – RabbitMQ queue – Event consumer • Consumer components: – Event Capture – Sentiment Analyzer – Sentiment Aggregator • Real-time queue consumption, aggregated metrics emitted each second (keyword and hashtag sentiment)
 • 84. FFWD: Sentiment analysis 30 • FFWD adds four components: – Load shedding filter at the beginning of the pipeline – Shedding plan used by the filter – Domain-specific policy wrapper – Application control manager to detect load peaks [Diagram: input tweets flow from the Producer through the real-time and batch queues into the Load Shedding Filter (ok / ko, drop probability), then Event Capture, Sentiment Analyzer and Sentiment Aggregator emit the output metrics; the Load Manager and Policy Wrapper update the Shedding Plan from λ(t), R(t), Rt, μ(t+1) and stream stats]
 • 85. Sentiment - experimental setup 31 • Separate tests to understand FFWD behavior: – System stability – Output quality • Dataset: 900K tweets from the 35th week of the Premier League • Performed tests: – Controller: synthetic and real tweets at various λ(t) – Policy: real tweets at various λ(t) • Evaluation setup – Intel Core i7 3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC – 8 GB RAM @ 1600 MHz
 • 86. System stability 32 • λ(t) estimation: – case A: λ(t) = λ(t-1) – case B: λ(t) = avg(λ(t))
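The two estimation cases can be sketched as follows; a hypothetical helper for illustration, not the agent's code:

```python
def estimate_lambda(history, case="A"):
    """Estimate the arrival rate for the next control interval from
    the observed per-interval rates.

    case A: lambda(t) = lambda(t-1)   (last observed rate)
    case B: lambda(t) = avg(lambda)   (average of the observed rates)
    """
    if case == "A":
        return history[-1]
    return sum(history) / len(history)
```

Case A reacts immediately to load changes but is noisy; case B smooths transients at the cost of lagging behind load peaks.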
 • 87. Load Manager showcase (1) • Load Manager demo (Rt = 5s): – λ(t) increased after 60s and 240s – response time: 33 [Plot: controller performance, response time (s) over 300 s against the QoS target R = 5 s]
 • 88. Load Manager showcase (2) • Load Manager demo (Rt = 5s): – λ(t) increased after 60s and 240s – throughput: 34 [Plot: actuation, events per second over 300 s for lambda, dropped events and computed mu]
 • 89. Output Quality 35 • Real tweets, μc(t) ≃ 40 evt/s • Evaluated policies: • Baseline • Fair • Priority • R = 5s, λ(t) = 100 evt/s, 200 evt/s, 400 evt/s • Error metric: Mean Absolute Percentage Error (MAPE %) (lower is better) [Plots: MAPE (%) per group A-D for the baseline, fair and priority policies at λ(t) = 100, 200 and 400 evt/s]