Self-adaptive container monitoring with
performance-aware load-shedding policies
NECST Group Conference 2017 @ Bloomberg
06/01/2017
Rolando Brondolin
rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
Infrastructure monitoring 2
• Easy to follow the behavior of small infrastructures
• But large-scale systems need:
– A systematic approach to monitoring and troubleshooting
– A tradeoff between data granularity and resource consumption
High visibility on system state, at a non-negligible cost
VS
Little information on system state, but cheap monitoring
Sysdig Cloud monitoring 4
http://www.sysdig.org
• Infrastructure for container monitoring
• Collects aggregated metrics and shows the system state:
– “Drill-down” from cluster-level to single-application metrics
– Dynamic network topology
– Alerting and anomaly detection
• Monitoring agent deployed on each machine of the cluster
– Traces system calls in a “streaming fashion”
– Aggregates data for threads, FDs, applications, containers and hosts
Problem definition 5
• The Sysdig Cloud agent can be modelled as a server S with a finite queue Q,
characterized by its arrival rate λ(t) and its service rate μ(t)
• Subject to overloading conditions:
– Cause: events arrive at a very high frequency
– Effect: queues grow indefinitely
– Issues: high usage of system resources, loss of events, output quality degradation
• Stability condition: overloading is avoided if and only if λ(t) ≤ μ(t);
an overloaded server should discard part of its input so that the service rate
matches the arrival rate
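The effect of overloading can be illustrated with a minimal discrete-time simulation of a finite-queue server (a hypothetical sketch, not the Sysdig agent; all names and values are illustrative):

```python
# Minimal discrete-time simulation of a server fed by a finite queue.
# Illustrates the stability condition mu(t) >= lambda(t): when the
# arrival rate exceeds the service rate, the queue saturates and
# events are lost.

def simulate(arrival_rate, service_rate, queue_cap, steps):
    queue, dropped = 0, 0
    for _ in range(steps):
        queue += arrival_rate               # events enqueued this tick
        served = min(queue, service_rate)   # server drains at mu(t)
        queue -= served
        if queue > queue_cap:               # finite queue: excess is lost
            dropped += queue - queue_cap
            queue = queue_cap
    return queue, dropped

# Stable regime: mu >= lambda, the queue stays empty and nothing is lost
assert simulate(1000, 1200, queue_cap=500, steps=100) == (0, 0)
# Overloaded regime: lambda > mu, the queue fills up and events are dropped
queue, dropped = simulate(1500, 1200, queue_cap=500, steps=100)
```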
Proposed solution: FFWD 6
Fast Forward With Degradation (FFWD) is a framework that tackles load peaks
in streaming applications via load-shedding techniques:
a general approach that leverages domain-specific details
self-adaptive container monitoring with performance-aware load-shedding policies
Target: monitoring
• Collects events
• Processes events
• Aggregates data
Decide when to shed load
• Observe application status
• Check system overload
Set where to shed load
• Act directly on the incoming stream
• Probabilistic approach
Choose how much load to shed
• Domain-specific pluggable policies
• Compute the shedding probability
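The “set where to shed load” step (acting probabilistically on the incoming stream) can be sketched as follows; this is an illustrative sketch, not the FFWD implementation, and all names are hypothetical:

```python
import random

# Probabilistic load-shedding filter: each event category gets a drop
# probability from the shedding plan, and events are discarded at the
# input source, before any processing happens.

def shed(events, plan, rng=random.random):
    """Keep an event of category c with probability 1 - plan[c]."""
    kept = []
    for category, payload in events:
        drop_p = plan.get(category, 0.0)   # unknown categories are kept
        if rng() >= drop_p:
            kept.append((category, payload))
    return kept

plan = {"low_prio": 0.8, "high_prio": 0.0}   # drop ~80% of low-priority events
events = ([("low_prio", i) for i in range(1000)]
          + [("high_prio", i) for i in range(100)])
survivors = shed(events, plan)
# All high-priority events survive; roughly 20% of low-priority ones do.
```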
FFWD: load shedding framework 7
Load manager
• Collects queue status and performance
• Heuristic feedback control (Little’s law, queuing theory)
• Implementation goals:
• Response time
• CPU utilization
Metrics correction
• Counts computed and dropped events
• Rescales aggregated metrics
Load shedding filter
• Placed at the input source
• Uses the shedding plan and input categories
• Probabilistic approach to event dropping
Policy wrapper
• Hosts domain-specific policies:
• Baseline policy
• Fair policy
• Priority-based policy
(the policies produce the shedding plan used by the filter)
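As a sketch of how a fair policy could turn the Load manager’s target throughput into a shedding plan, and how metrics correction could rescale aggregated metrics by the kept fraction (illustrative code under those assumptions, not the FFWD implementation):

```python
# Fair policy: one drop probability shared by all event categories, chosen
# so that the surviving throughput matches the target mu. Metrics
# correction: rescale an aggregated counter by the fraction of events
# that were actually computed. All names are hypothetical.

def fair_plan(arrival_rates, mu):
    """Same drop probability for every event category."""
    total = sum(arrival_rates.values())
    drop_p = max(0.0, 1.0 - mu / total) if total > 0 else 0.0
    return {cat: drop_p for cat in arrival_rates}

def correct_metric(observed_sum, computed_events, dropped_events):
    """Rescale an aggregated metric to account for dropped events."""
    kept_fraction = computed_events / (computed_events + dropped_events)
    return observed_sum / kept_fraction

plan = fair_plan({"nginx": 800_000, "fio": 400_000}, mu=600_000)
# 1.2M evt/s arriving vs. a 600K evt/s target: shed half of every category
assert abs(plan["nginx"] - 0.5) < 1e-9

# 300 bytes observed over 3 computed of 9 total events -> estimate 900 bytes
assert correct_metric(300, 3, 6) == 900.0
```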
Experimental setup 8
• We evaluated FFWD within Sysdig with two goals:
• System stability
• Output quality
• Results compared with the reference filtering system of Sysdig
• Evaluation setup:
• 2x Xeon E5-2650 v3, 20 cores (40 w/HT) @ 2.3 GHz
• 128 GB DDR4 RAM
• Tests selected from the Phoronix test suite

Homogeneous, syscall-intensive benchmarks:
test ID  name        priority  # evts/s
A        nginx       3         800K
B        postmark    4         1.2M
C        fio         4         1.3M
D        simplefile  2         1.5M
E        apache      2         1.9M

Heterogeneous benchmarks:
test ID  instances                        # evts/s
F        3x nginx, 1x fio                 1.3M
G        1x nginx, 1x simplefile          1.3M
H        1x apache, 2x postmark, 1x fio   1.8M
Experimental results 9
• System stability and control error
• MAPE between QoS requirement and CPU traces of FFWD and reference
• 3.51x average MAPE improvement; average MAPE below 5%

Ut = 1.1% — Control error MAPE
Test  reference  fair    priority
A     7.12%      1.78%   3.78%
B     34.06%     4.37%   4.46%
C     28.03%     2.27%   2.24%
D     11.52%     1.41%   1.54%
E     26.02%     8.51%   8.99%
F     22.67%     8.11%   3.74%
G     16.42%     3.37%   2.73%
H     19.92%     8.41%   8.01%
• Output quality
• MAPE between the exact output metrics and the approximated ones from FFWD and the reference
• The domain-specific policies outperform the reference in the majority of cases
[Figure: MAPE (%) on a log scale (0.1–100000) for latency metrics and volume metrics
(bytes r/w), file and net, per container (nginx-1, nginx-2, nginx-3, fio), comparing
the reference (kernel-drop), fair, and priority policies; workload: 1x fio, 3x nginx,
1.3M evt/s]
Conclusion 10
• We saw the main challenges of load shedding for container monitoring:
– Low-overhead monitoring
– High quality and granularity of metrics
• Fast Forward With Degradation (FFWD):
– Heuristic controller for bounded CPU usage
– Pluggable policies for domain-specific load shedding
– Accurate computation of output metrics
– Load shedding filter for fast dropping of events
11
Questions?
Rolando Brondolin, rolando.brondolin@polimi.it
DEIB, Politecnico di Milano
NGC VIII 2017 @ SF
FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D.
Santambrogio. In Proceedings of the 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)
12
BACKUP SLIDES
13
Proposed solution: FFWD 14
Fast Forward With Degradation (FFWD) is a framework that tackles load peaks
in streaming applications via load-shedding techniques:
a general approach that leverages domain-specific details
Load Manager: *when*
Policy wrapper: *how much* (produces the shedding plan)
LS Filter: *where*
Aggregated metrics correction
Load Manager 15
Components: Load Manager, Policies, LS Filter, Metrics correction (SP = shedding plan)
• The Load Manager computes the throughput μ(t) that ensures stability such that:
λ(t) ≤ μ(t)
• The arrival rate λ(t) can vary unpredictably and exceed the system capacity μc(t)
(the rate of events executed per second); the service rate is then the sum of the
estimated capacity and the dropping rate μd(t) of the load shedder:
μ(t) = μc(t−1) + μd(t)
• Two Load Managers for two goals: response time and CPU utilization

Utilization-based Load Manager
• For streaming applications that must operate with a limited overhead
• The system can be characterized by its utilization and its queue size
(Little’s law, queuing theory):
N(t) = λ(t) · R(t)
Q(t) = Q(t−1) + λ(t) − μ(t)
U(t) = λ(t)/μmax + Q(t)/μmax
(CPU utilization = arrived events + residual events, over the maximum throughput)
Q(t) = μmax · U(t) − λ(t)
• Control error against the utilization goal Ū: e(t) = U(t) − Ū
• Requested throughput: μ(t+1) = λ(t) + μmax · e(t)
• The throughput at time t+1 is the arrival rate plus the maximum available
throughput times the feedback error: when e(t) goes to zero the stability
condition is met, while the term μmax · e(t) ensures a fast actuation in case of
a significant deviation from equilibrium
t), we can estimate the current system capacity as the number of events analyzed
last time period. Thus, for a given time t, equation (4.8) shows that the service
s the sum of the system capacity estimated and the number of events that we need
p to achieve the required stability:
µ(t) = µc(t 1) + µd(t) (4.8)
Utilization
s section we describe the Utilization based Load Manager, which becomes of use
e of streaming applications which should operate with a limited overhead. The
ation based Load Manager, which is showed in Figure 4.4, resorts to queuing theory
del the behavior of an application with a CPU usage constraint, relying on an
Response time CPU Utilization
Current utilization Target utilization
Utilization-based Load Manager
A. Shedding problem
The system in Figure 1 can be modeled by means of
Queuing Theory: the application is a single server node fed
by a queue, which provides the input jobs at a variable arrival
rate (t); the application is able to serve jobs at a service
rate µ(t). The system measures (t) and µ(t) in events per
second, where the events are respectively the input tweets and
the serviced tweets.
Starting from this, the simplest way to model the system
behavior is by means of the Little’s law (1), which states that
the number of jobs inside a system is equal to the input arrival
rate times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
he system in Figure 1 can be modeled by means of
euing Theory: the application is a single server node fed
a queue, which provides the input jobs at a variable arrival
(t); the application is able to serve jobs at a service
µ(t). The system measures (t) and µ(t) in events per
ond, where the events are respectively the input tweets and
serviced tweets.
tarting from this, the simplest way to model the system
avior is by means of the Little’s law (1), which states that
number of jobs inside a system is equal to the input arrival
times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
S:
Control error:
4.3. Policy wrapper and
equation (4.13). This leads to the final formulation of the Loa
(4.14), where the throughput at time t + 1 is a function of th
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is compose
the one hand, when the contribution of the feedback error e(
Requested throughput:
4.3. Policy wrapper and L
equation (4.13). This leads to the final formulation of the Load
(4.14), where the throughput at time t + 1 is a function of the
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is composed
the one hand, when the contribution of the feedback error e(t
condition of equation (4.15) is met; on the other hand, the secon
The system can be characterized
by its utilization and its queue size
Load Manager 15
Load Manager
LS Filter
Policies
SP
Metrics
correction
• The Load Manager computes the throughput μ(t) that
ensures stability such that:
• Two Load Managers for two goals:
tice that it is a sum of two different contributions. On the one hand, as the error e(t)
to zero, the stability condition (4.7) is met. On the other hand, the contribution:
(t) ensures a fast actuation in case of a significant deviation from the actual system
brium.
(t)  µ(t) (4.7)
course, during the lifetime of the system, the arrival rate (t) can vary unpre-
bly and can be greater than the system capacity µc(t), defined as the rate of events
uted per second. Given the control action µ(t) (i.e., the throughput of the system)
he system capacity, we can define µd(t) as the dropping rate of the LS. As we did
t), we can estimate the current system capacity as the number of events analyzed
last time period. Thus, for a given time t, equation (4.8) shows that the service
s the sum of the system capacity estimated and the number of events that we need
p to achieve the required stability:
µ(t) = µc(t 1) + µd(t) (4.8)
Utilization
s section we describe the Utilization based Load Manager, which becomes of use
e of streaming applications which should operate with a limited overhead. The
ation based Load Manager, which is showed in Figure 4.4, resorts to queuing theory
del the behavior of an application with a CPU usage constraint, relying on an
Response time CPU Utilization
Arrival rate
Max theoretical
throughput
Control error
Utilization-based Load Manager
A. Shedding problem
The system in Figure 1 can be modeled by means of
Queuing Theory: the application is a single server node fed
by a queue, which provides the input jobs at a variable arrival
rate (t); the application is able to serve jobs at a service
rate µ(t). The system measures (t) and µ(t) in events per
second, where the events are respectively the input tweets and
the serviced tweets.
Starting from this, the simplest way to model the system
behavior is by means of the Little’s law (1), which states that
the number of jobs inside a system is equal to the input arrival
rate times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
he system in Figure 1 can be modeled by means of
euing Theory: the application is a single server node fed
a queue, which provides the input jobs at a variable arrival
(t); the application is able to serve jobs at a service
µ(t). The system measures (t) and µ(t) in events per
ond, where the events are respectively the input tweets and
serviced tweets.
tarting from this, the simplest way to model the system
avior is by means of the Little’s law (1), which states that
number of jobs inside a system is equal to the input arrival
times the system response time:
N(t) = (t) · R(t) (1)
Q(t) = Q(t 1) + (t) µ(t) (2)
U(t) =
(t)
µmax
+
Q(t)
µmax
(3)
Q(t) = µmax · U(t) (t) (4)
e(t) = U(t) U(t 1) (5)
S:
Control error:
4.3. Policy wrapper and
equation (4.13). This leads to the final formulation of the Loa
(4.14), where the throughput at time t + 1 is a function of th
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is compose
the one hand, when the contribution of the feedback error e(
Requested throughput:
4.3. Policy wrapper and L
equation (4.13). This leads to the final formulation of the Load
(4.14), where the throughput at time t + 1 is a function of the
the maximum available throughput times the feedback error.
e(t) = U(t) ¯U
µ(t + 1) = (t) + µmax · e(t)
The Load Manager formulation just obtained is composed
the one hand, when the contribution of the feedback error e(t
condition of equation (4.15) is met; on the other hand, the secon
The system can be characterized
by its utilization and its queue size
Load Manager 15
Load Manager
LS Filter
Policies
SP
Metrics
correction
• The Load Manager computes the throughput μ(t) that
ensures stability such that:
• Two Load Managers for two goals:
tice that it is a sum of two different contributions. On the one hand, as the error e(t)
to zero, the stability condition (4.7) is met. On the other hand, the contribution:
(t) ensures a fast actuation in case of a significant deviation from the actual system
brium.
(t)  µ(t) (4.7)
course, during the lifetime of the system, the arrival rate (t) can vary unpre-
bly and can be greater than the system capacity µc(t), defined as the rate of events
uted per second. Given the control action µ(t) (i.e., the throughput of the system)
he system capacity, we can define µd(t) as the dropping rate of the LS. As we did
t), we can estimate the current system capacity as the number of events analyzed
last time period. Thus, for a given time t, equation (4.8) shows that the service
s the sum of the system capacity estimated and the number of events that we need
p to achieve the required stability:
µ(t) = µc(t 1) + µd(t) (4.8)
Utilization
s section we describe the Utilization based Load Manager, which becomes of use
e of streaming applications which should operate with a limited overhead. The
ation based Load Manager, which is showed in Figure 4.4, resorts to queuing theory
del the behavior of an application with a CPU usage constraint, relying on an
Response time CPU Utilization
The requested throughput is used by the load shedding policies to derive the LS probabilities
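As an illustrative sketch (function name, parameter names and the clamping are assumptions of this example, not the agent's actual code), one control step of the utilization-based Load Manager could look like:

```python
def control_step(lam, utilization, target_u, mu_max):
    """One step of the utilization-based Load Manager (illustrative).

    lam:         measured arrival rate lambda(t), in events/s
    utilization: measured CPU utilization U(t) (e.g. 0.011 for 1.1%)
    target_u:    utilization set point (the QoS requirement Ut)
    mu_max:      maximum theoretical throughput, in events/s
    """
    e = target_u - utilization      # control error e(t)
    mu_next = lam + mu_max * e      # mu(t+1) = lambda(t) + mu_max * e(t)
    # clamp to the physically meaningful range (an assumption of this sketch)
    return max(0.0, min(mu_next, mu_max))
```

When the agent runs below the set point, e(t) > 0 and the requested throughput exceeds λ(t), so nothing is shed; above the set point, μ(t+1) falls below λ(t) and the policies translate the gap into load-shedding probabilities.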
Policy wrapper and policies 16
[Agent pipeline: SP → Load Manager → Policies → LS Filter → metrics correction]
• The policy wrapper provides access to statistics of processes, the requested throughput μ(t+1) and the system capacity μc(t)

Baseline policy
• Compute one LS probability for all processes (with μ(t+1) and μc(t))

Fair policy
• Assign to each process the “same” number of events
• Save metrics of small processes, still accurate results on big ones

Priority-based policy
• Assign a static priority to each process
• Compute a weighted priority to partition the system capacity
• Assign a partition to each process and compute the probabilities
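A minimal sketch of how a fair policy could split the requested throughput μ(t+1) among processes. The water-filling style redistribution (small processes keep all their events, the leftover is split among the big ones) is an assumption of this example; FFWD's actual bookkeeping may differ:

```python
def fair_drop_probabilities(event_rates, capacity):
    """Fair shedding sketch: give every process the same share of the
    capacity; processes below their share keep all events, and the
    unused budget is redistributed among the remaining processes.

    event_rates: dict process -> observed events/s
    capacity:    requested throughput mu(t+1), in events/s
    """
    shares = {}
    remaining = capacity
    pending = dict(event_rates)
    while pending:
        fair = remaining / len(pending)
        small = {p: r for p, r in pending.items() if r <= fair}
        if not small:
            # every remaining process exceeds the fair share: cap them all
            for p in pending:
                shares[p] = fair
            break
        for p, r in small.items():
            shares[p] = r          # small process keeps its full rate
            remaining -= r
            del pending[p]
    # drop probability: fraction of each process's events above its share
    return {p: max(0.0, 1.0 - shares[p] / r)
            for p, r in event_rates.items() if r > 0}
```

With rates {a: 100, b: 1000} evt/s and a capacity of 600 evt/s, the small process a is untouched while b is shed with probability 0.5.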
Load Shedding Filter 17
[Agent pipeline: SP → Load Manager → Policies → LS Filter → metrics correction]
• The Load Shedding Filter applies the probabilities computed by the policies to the input stream
• For each event:
  • look up the load shedding probability for its input class in the shedding plan
  • if no data is found, we can drop the event
  • otherwise, apply the load shedding probability computed by the policy
• The dropped events are reported to the application for metrics correction
[Event Capture → event buffers → Load Shedding Filter, which consults the Shedding Plan and marks each event ok or ko according to the drop probability]
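The per-event decision above can be sketched as follows (function and structure names are illustrative, not the agent's API; the dropped-event counts are what metrics correction would consume):

```python
import random

def ls_filter(events, shedding_plan, default_drop=True):
    """Apply per-class drop probabilities to an event stream.

    events:        iterable of (input_class, payload) tuples
    shedding_plan: dict input_class -> drop probability in [0, 1]
    """
    kept, dropped = [], {}
    for input_class, payload in events:
        p = shedding_plan.get(input_class)
        if p is None:
            # no data for this class: we can drop the event
            if default_drop:
                dropped[input_class] = dropped.get(input_class, 0) + 1
                continue
            p = 0.0
        if random.random() < p:
            # shed the event, but account for it for metrics correction
            dropped[input_class] = dropped.get(input_class, 0) + 1
        else:
            kept.append((input_class, payload))
    return kept, dropped
```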
Output quality - homogeneous 18
• QoS requirement Ut = 1.1%, the standard set-point for the agent
• MAPE (lower is better) between exact and approximated metrics
• Output metrics on latency and volume for file and network operations
• Similar or better results of the FFWD fair policy w.r.t. the reference
• FFWD stays accurate even though it drops more events
• Predictable and repetitive behavior of nginx, fio and apache
[Bar charts: MAPE (%) on a log scale for latency-file, latency-net, volume-file and volume-net, comparing kernel-drop, fair and reference on apache (1.9M evt/s), postmark (1.2M evt/s), simplefile (1.5M evt/s), fio (1.3M evt/s) and nginx (800K evt/s)]
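For reference, the error metric used throughout these result slides is the standard Mean Absolute Percentage Error, which can be sketched as:

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error (in %) between the exact metrics
    and the metrics reconstructed after load shedding; lower is better."""
    pairs = list(zip(exact, approx))
    return 100.0 * sum(abs((e - a) / e) for e, a in pairs) / len(pairs)
```

For example, exact metrics [100, 200] approximated as [90, 220] give a MAPE of 10%.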
System stability 19
• We evaluated the Load Manager with all the tests (A–H)
• With 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity)
• Measuring the CPU load of the sysdig agent with:
  • the reference implementation
  • FFWD with the fair and priority policies
• We compared the actual CPU load with the QoS requirement (Ut)
• Error measured with MAPE (lower is better), obtained running each benchmark 20 times
• 3.51x average MAPE improvement, average MAPE below 5%

Test (Ut = 1.1%) | reference | fair | priority
A | 7.12% | 1.78% | 3.78%
B | 34.06% | 4.37% | 4.46%
C | 28.03% | 2.27% | 2.24%
D | 11.52% | 1.41% | 1.54%
E | 26.02% | 8.51% | 8.99%
F | 22.67% | 8.11% | 3.74%
G | 16.42% | 3.37% | 2.73%
H | 19.92% | 8.41% | 8.01%
Output quality - heterogeneous 20
• We mixed the homogeneous tests to:
  • simulate a co-located environment
  • add OS scheduling uncertainty and noise
• QoS requirement Ut = 1.1%
• MAPE (lower is better) between exact and approximated metrics
• Compare metrics from the reference, FFWD fair and FFWD priority
• Three tests with different syscall mixes:
  • Network-based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s
  • Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s
  • Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s
1x Fio, 3x Nginx, 1.3M evt/s 21
[Bar charts, MAPE (%) on a log scale (lower is better): latency metrics (latency-file, latency-net) and volume metrics (byte r/w: volume-file, volume-net) for fio, nginx-1, nginx-2 and nginx-3, comparing kernel-drop, fair, priority and reference]
1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 22
[Bar charts, MAPE (%) on a log scale (lower is better): volume metrics (byte r/w) and latency metrics for apache, fio, postmark-1 and postmark-2, comparing kernel-drop, fair, priority and reference]
23
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
volum
e-filevolum
e-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
reference
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log kernel-drop
fair
priority
postmark-2postmark-1fioapache
0.1
1
10
100
1000
10000
100000
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
latency-filelatency-net
MAPE(%)log
kernel-drop
fair
priority
postmark-2postmark-1fioapache
reference
Volume metrics (byte r/w)
Latency metrics
MAPE lower is better
Test H, mixed workloads: 1x apache, 1x fio, 2x postmark, 1.8M evt/s
• The Fair policy outperforms the reference in almost all cases
• the LS filter works at the single-event level
• the reference drops events in batches
• The Priority policy improves on the Fair policy results in most cases
• the prioritized processes are privileged
• the other processes are treated as “best-effort”
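The two policies above can be sketched as shedding-plan computations. This is a minimal illustration, not the actual Sysdig agent code: the function names, the plan format (process name to drop probability), and the "lower number = higher priority" convention are all assumptions.

```python
# Sketch of the Fair and Priority load-shedding policies: given the
# per-process arrival rates (events/s) and a sustainable throughput mu,
# compute per-process drop probabilities for the shedding plan.

def fair_plan(arrivals, mu):
    """Fair policy: the same drop probability for every process."""
    lam = sum(arrivals.values())
    p_drop = max(0.0, 1.0 - mu / lam) if lam > 0 else 0.0
    return {proc: p_drop for proc in arrivals}

def priority_plan(arrivals, priorities, mu):
    """Priority policy: serve high-priority processes first, the rest
    are best-effort (assumption: lower number = higher priority)."""
    plan = {}
    budget = mu
    for proc in sorted(arrivals, key=lambda p: priorities[p]):
        lam_p = arrivals[proc]
        kept = min(lam_p, budget)       # events/s this process may keep
        plan[proc] = 1.0 - kept / lam_p if lam_p > 0 else 0.0
        budget -= kept
    return plan
```

With `arrivals = {"apache": 500, "fio": 300}` and `mu = 600`, the fair plan drops 25% of every process, while the priority plan keeps apache intact and sheds fio harder: this is the "prioritized processes are privileged" behavior described above.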
1x simplefile, 1x nginx, 1.3M evt/s 24
[Figure: MAPE (%, log scale) of volume metrics (byte r/w) and latency metrics for the reference (kernel-drop), fair, and priority policies on the nginx and simplefile workloads; lower is better]
Response time Load Manager 25
• The system S can be characterized by its response time and by the number of jobs in the system (Little’s Law)
• Control error: computed from the old (measured) response time and the target response time Rt
• Requested throughput: derived from the control error and the arrival rate λ(t)
• The requested throughput is used by the load shedding policies to derive the LS probabilities
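The equations on these slides were rendered as images and did not survive extraction. A plausible reconstruction from Little's Law and the annotated terms (old vs. target response time for the control error; arrival rate and control error for the requested throughput) is sketched below; the exact control law and gains used by FFWD may differ.

```latex
\begin{aligned}
N(t) &= \lambda(t)\, R(t) && \text{(Little's Law: jobs in the system)} \\
e(t) &= R(t-1) - R_t && \text{(control error: old vs.\ target response time)} \\
\mu(t+1) &\approx \lambda(t)\left(1 + \frac{e(t)}{R_t}\right) && \text{(requested throughput)}
\end{aligned}
```

Under this reading, when the measured response time exceeds the target ($e(t) > 0$) the requested throughput exceeds the arrival rate, and the shedding policies translate the gap between $\mu(t+1)$ and the sustainable service rate into drop probabilities.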
Real-time sentiment analysis 28
• Real-time sentiment analysis makes it possible to:
– Track the sentiment of a topic over time
– Correlate real-world events with the related sentiment, e.g.
• Toyota crisis (2010) [1]
• 2012 US Presidential Election Cycle [2]
– Track the online evolution of companies’ reputation, derive social profiling, and enable enhanced social marketing strategies
[1] Bifet, Albert, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research: Workshop and Conference Proceedings Series, 2011.
[2] Wang, Hao, et al. "A system for real-time Twitter sentiment analysis of 2012 US presidential election cycle." Proceedings of the ACL 2012 System Demonstrations, 2012.
Sentiment analysis: case study 29
• Simple Twitter streaming sentiment analyzer with Stanford NLP
• System components:
– Event producer
– RabbitMQ queue
– Event consumer
• Consumer components:
– Event Capture
– Sentiment Analyzer
– Sentiment Aggregator
• Real-time queue consumption; aggregated metrics (keyword and hashtag sentiment) emitted every second
FFWD: Sentiment analysis 30
• FFWD adds four components:
– Load shedding filter at the beginning of the pipeline
– Shedding plan used by the filter
– Domain-specific policy wrapper
– Application controller (Load Manager) to detect load peaks
[Diagram: input tweets flow from the Producer through the Load Shedding Filter, which drops events according to the drop probabilities in the Shedding Plan and forwards the rest to the Event Capture, Sentiment Analyzer, and Sentiment Aggregator; the Load Manager receives the stream stats (λ(t), R(t), ko count) and the target Rt, computes μ(t+1), and the Policy Wrapper produces the updated plan; real-time and batch queues, ok/ko paths, and output metrics are also shown]
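The filter at the head of the pipeline can be very small. Below is a minimal sketch of a probabilistic load-shedding filter consulting a shedding plan; class and method names and the plan layout (event category to drop probability) are assumptions for illustration, not the actual FFWD implementation.

```python
import random

class LoadSheddingFilter:
    """Drop incoming events probabilistically, per category.

    The shedding plan maps an event category (e.g. a keyword or a
    process name) to a drop probability in [0, 1]; the plan is replaced
    wholesale whenever the policy wrapper recomputes it.
    """

    def __init__(self):
        self.plan = {}       # category -> drop probability
        self.kept = 0
        self.dropped = 0     # the "ko count" fed back to the Load Manager

    def update_plan(self, plan):
        self.plan = dict(plan)

    def accept(self, category, rng=random.random):
        p_drop = self.plan.get(category, 0.0)  # unknown categories pass
        if rng() < p_drop:
            self.dropped += 1
            return False     # shed this event
        self.kept += 1
        return True          # forward it down the pipeline
```

Only events whose `accept()` returns `True` reach the analyzer; the kept/dropped counters can later rescale the aggregated metrics, as in the metrics-correction step.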
Sentiment - experimental setup 31
• Separate tests to understand FFWD behavior:
– System stability
– Output quality
• Dataset: 900K tweets from the 35th week of the Premier League
• Performed tests:
– Controller: synthetic and real tweets at various λ(t)
– Policy: real tweets at various λ(t)
• Evaluation setup
– Intel Core i7-3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC
– 8 GB RAM @ 1600 MHz
System stability 32
λ(t) estimation:
case A: λ(t) = λ(t-1)
case B: λ(t) = avg(λ(t))
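The two estimators can be sketched as follows; this is a minimal illustration (the sampling period and window handling in the real agent may differ):

```python
from collections import deque

def estimate_last(history):
    """Case A: use the last observed arrival rate, lambda(t) = lambda(t-1)."""
    return history[-1]

def estimate_avg(history, window=None):
    """Case B: average the observed arrival rates, lambda(t) = avg(lambda),
    optionally over a sliding window of the most recent samples."""
    xs = list(history)[-window:] if window else list(history)
    return sum(xs) / len(xs)

# One arrival-rate sample per control period, bounded history.
history = deque(maxlen=60)
for lam in (100.0, 200.0, 400.0):
    history.append(lam)
```

Case A reacts instantly to load peaks but is noisy; case B smooths transients at the cost of lagging behind a sudden λ(t) increase.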
Load Manager showcase (1)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– response time:
33
[Plot “Controller performance”: response time R (s) vs. time (s) over 300 s, against the QoS = 5 s target]
Load Manager showcase (2)
• Load Manager demo (Rt = 5s):
– λ(t) increased after 60s and 240s
– throughput:
34
[Plot “Actuation”: #events vs. time (s) over 300 s, with the lambda, dropped, computed, and mu series]
Output Quality 35
• Real tweets, μc(t) ≃ 40 evt/s
• Evaluated policies:
• Baseline
• Fair
• Priority
• R = 5s, λ(t) = 100 evt/s, 200 evt/s, 400 evt/s
• Error metric: Mean Absolute Percentage Error (MAPE %), lower is better
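The error metric is standard; for reference, a minimal MAPE implementation over an exact and an approximated metric series (skipping zero reference values is an assumption of this sketch):

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error between two equal-length series.

    Zero reference values are skipped to avoid division by zero.
    """
    pairs = [(e, a) for e, a in zip(exact, approx) if e != 0]
    return 100.0 * sum(abs((e - a) / e) for e, a in pairs) / len(pairs)
```

For example, `mape([100, 200], [90, 220])` averages a 10% and a 10% error into a MAPE of 10%.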
[Bar charts: MAPE (%) of the baseline, fair, and priority policies for groups A-D at λ(t) = 100, 200, and 400 evt/s]

Self-adaptive container monitoring with performance-aware Load-Shedding policies

  • 1. 1 Self-adaptive container monitoring with performance-aware load-shedding policies NECST Group Conference 2017 @ Bloomberg 06/01/2017 Rolando Brondolin [email protected] DEIB, Politecnico di Milano
  • 2. Infrastructure monitoring 2 • Easy to follow behavior of small infrastructures • But on large-scale systems: – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption high visibility on system state non negligible cost few information on system state cheap monitoring VS
  • 3. Infrastructure monitoring 3 • Easy to follow behavior of small infrastructures • But on large-scale systems: – Systematic approach for monitoring and troubleshooting – Tradeoff on data granularity and resource consumption high visibility on system state non negligible cost few information on system state cheap monitoring VS
  • 4. Sysdig Cloud monitoring 4 http://www.sysdig.org • Infrastructure for container monitoring • Collects aggregated metrics and shows system state: – “Drill-down” from cluster to single application metrics – Dynamic network topology – Alerting and anomaly detection • Monitoring agent deployed on each machine in the cluster – Traces system calls in a “streaming fashion” – Aggregates data for Threads, FDs, applications, containers and hosts
  • 5. Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 S λ(t) φ(t) μ(t) Λ Φ Q
  • 6. Cause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 Events arrives at really high frequency S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If
  • 7. EffectCause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 Events arrives at really high frequency Queues grow indefinitely S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If
  • 8. IssuesEffectCause Problem definition • The Sysdig Cloud agent can be modelled as a server with a finite queue • characterized by its arrival rate λ(t) and its service rate μ(t) • Subject to overloading conditions 5 Events arrives at really high frequency Queues grow indefinitely High usage of system resources Loss of events S λ(t) φ(t) μ(t) Λ Φ Q S φ(t) μ(t) Φ Q of a streaming system with queue, processing element and streaming output flow . A server S, fed by a queue Q, is in overloading eater than the service rate µ(t). The stability condition stated he necessary and sufficient condition to avoid overloading. A ncing overloading should discard part of the input to increase to match the arrival rate (t). µ(t)  (t) (2.1) rmalizing is twofold, as we are interested not only in controlling t also in maximizing the accuracy of the estimated metrics. To which represents the input flow at a given time t; and ˜x, which ut flow considered in case of overloading at the same time t. If Output quality degradation
  • 9. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques 6
  • 10. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6
  • 11. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies
  • 12. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data
  • 13. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data Decide when to shed load • Observe application status • Check system overload
  • 14. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data Decide when to shed load • Observe application status • Check system overload Choose how much load to shed • domain-specific pluggable policies • compute shedding probability
  • 15. Proposed solution: FFWD Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques general approach but leveraging domain-specific details 6 self-adaptive container monitoring with performance-aware load-shedding policies Target: monitoring • Collects events • Process events • Aggregates data Decide when to shed load • Observe application status • Check system overload Set where to shed load • act directly on incoming stream • probabilistic approach Choose how much load to shed • domain-specific pluggable policies • compute shedding probability
  • 16. FFWD: load shedding framework 7
  • 17. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization
  • 18. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy
  • 19. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy shedding plan
  • 20. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Load shedding filter • Placed at the input source • Uses shedding plan and input categories •Probabilistic approach to events drop Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy shedding plan
  • 21. FFWD: load shedding framework 7 Load manager • Collect queue status and performance • Heuristic feedback control 
 (Little’s law, queuing theory) • Implementation goals: • Response time • CPU utilization Metrics correction • Counts computed and dropped events • Rescale aggregated metrics Load shedding filter • Placed at the input source • Uses shedding plan and input categories •Probabilistic approach to events drop Policy wrapper • Hosts domain-specific policies: • Baseline policy • Fair policy • Priority-based policy shedding plan
  • 22. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8
  • 23. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 24. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 25. • We evaluated FFWD within Sysdig with 2 goals: • System stability (slide 8) • Output quality (slides 9 10 11 12) • Results compared with the reference filtering system of Sysdig • Evaluation setup • 2x Xeon E5-2650 v3, 
 20 cores (40 w/HT) @ 2.3Ghz • 128 GB DDR4 RAM • Test selected from Phoronix test suite Experimental setup 8 test ID name priority # evts/s A nginx 3 800K B postmark 4 1,2M C fio 4 1,3M D simplefile 2 1,5M E apache 2 1,9M test ID instances # evts/s F 3x nginx, 1x fio 1,3M G 1x nginx, 1x simplefile 1,3M H 1x apache, 2x postmark, 1x fio 1,8M Homogeneous benchmarks Heterogeneous benchmarks Syscall intensive benchmarks from Phoronix test suite
  • 26. Experimental results • System stability and control error • MAPE between QoS requirement and CPU traces of FFWD and reference • 3.51x average MAPE improvement,
 average MAPE below 5% 9 Test Ut = 1.1% - Control error MAPE reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 27. Experimental results • System stability and control error • MAPE between QoS requirement and CPU traces of FFWD and reference • 3.51x average MAPE improvement,
 average MAPE below 5% 9 Test Ut = 1.1% - Control error MAPE reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 28. Experimental results • System stability and control error • MAPE between QoS requirement and CPU traces of FFWD and reference • 3.51x average MAPE improvement,
 average MAPE below 5% 9 Test Ut = 1.1% - Control error MAPE reference fair priority A 7,12% 1,78% 3,78% B 34,06% 4,37% 4,46% C 28,03% 2,27% 2,24% D 11,52% 1,41% 1,54% E 26,02% 8,51% 8,99% F 22,67% 8,11% 3,74% G 16,42% 3,37% 2,73% H 19,92% 8,41% 8,01%
  • 29. Experimental results • System stability and control error: MAPE between the QoS requirement and the CPU traces of FFWD and the reference; 3.51x average MAPE improvement, average MAPE below 5% • Control-error MAPE (Ut = 1.1%):
 Test | reference | fair | priority
 A | 7.12% | 1.78% | 3.78%
 B | 34.06% | 4.37% | 4.46%
 C | 28.03% | 2.27% | 2.24%
 D | 11.52% | 1.41% | 1.54%
 E | 26.02% | 8.51% | 8.99%
 F | 22.67% | 8.11% | 3.74%
 G | 16.42% | 3.37% | 2.73%
 H | 19.92% | 8.41% | 8.01%
 • Output quality: MAPE between the exact output metrics and the approximated ones from FFWD and the reference; the domain-specific policies outperform the reference in the majority of cases
 [Figure: MAPE (%, log scale) of latency and volume (bytes r/w) metrics for fio and nginx-1/2/3 under the kernel-drop, fair, and priority policies vs. the reference; workload: 1x Fio, 3x Nginx, 1.3M evt/s]
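For reference, the MAPE metric used throughout these results can be computed in a few lines (a hypothetical helper for illustration, not the agent's code):

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error (in %) between an exact series
    and its approximation; lower is better."""
    assert len(exact) == len(approx) and all(x != 0 for x in exact)
    return 100.0 * sum(abs((x - a) / x) for x, a in zip(exact, approx)) / len(exact)

# e.g. a CPU trace compared against a constant 1.1% utilization set-point
target = [1.1, 1.1, 1.1, 1.1]
trace = [1.2, 1.0, 1.15, 1.05]
error = mape(target, trace)
```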
  • 34. Conclusion • We saw the main challenges of load shedding for container monitoring: – Low-overhead monitoring – High quality and granularity of metrics • Fast Forward With Degradation (FFWD): – Heuristic controller for bounded CPU usage – Pluggable policies for domain-specific load shedding – Accurate computation of output metrics – Load Shedding Filter for fast drop of events
  • 35. 11 Questions? Rolando Brondolin, [email protected] DEIB, Politecnico di Milano NGC VIII 2017 @ SF FFWD: Latency-aware event stream processing via domain-specific load-shedding policies. R. Brondolin, M. Ferroni, M. D. Santambrogio. In Proceedings of 14th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2016)
  • 43. Proposed solution: FFWD • Fast Forward With Degradation (FFWD) is a framework that tackles load peaks in streaming applications via load-shedding techniques: a general approach that leverages domain-specific details • Components: Load Manager (decides *when* to shed) – Policy wrapper (decides *how much*) – LS Filter (decides *where*, driven by the shedding plan) – aggregated-metrics correction
  • 44. Load Manager • The Load Manager computes the throughput μ(t) that ensures stability, i.e. λ(t) ≤ μ(t) • The arrival rate λ(t) can vary unpredictably and exceed the system capacity μc(t), defined as the rate of events computed per second; estimating the capacity as the events analyzed in the last period, the service rate is the estimated capacity plus the events to drop: μ(t) = μc(t−1) + μd(t), where μd(t) is the dropping rate of the load shedder • Two Load Managers for two goals: response time and CPU utilization
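The capacity/drop-rate decomposition can be sketched numerically; `required_throughput` and `min_drop_rate` are hypothetical helper names illustrating μ(t) = μc(t−1) + μd(t) and the stability condition λ(t) ≤ μ(t):

```python
def required_throughput(capacity_prev, drop_rate):
    """mu(t) = mu_c(t-1) + mu_d(t): service rate as the previously
    estimated capacity plus the events dropped per second."""
    return capacity_prev + drop_rate

def min_drop_rate(lam, capacity_prev):
    """Smallest mu_d(t) that keeps lam(t) <= mu(t) given the
    estimated capacity; zero when the system is not overloaded."""
    return max(0.0, lam - capacity_prev)
```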
  • 45. Utilization-based Load Manager • Used for streaming applications that must operate with a limited overhead; the application is modeled via queuing theory as a single server fed by a queue, with arrival rate λ(t) and service rate μ(t), both in events per second • Little's law: N(t) = λ(t) · R(t) • Residual (arrived minus served) events: Q(t) = Q(t−1) + λ(t) − μ(t) • CPU utilization, with μmax the maximum theoretical throughput: U(t) = λ(t)/μmax + Q(t)/μmax, i.e. Q(t) = μmax · U(t) − λ(t) • Control error between the current and the target utilization: e(t) = Ūt − U(t) • Requested throughput: μ(t+1) = λ(t) + μmax · e(t), equivalently μmax · Ūt − Q(t) • The requested throughput is used by the load-shedding policies to derive the LS probabilities
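A single step of this utilization-based controller can be sketched as follows (a minimal sketch; the names are ours, and we take the control error as target minus current utilization so that over-utilization lowers the requested throughput):

```python
def utilization_lm_step(lam, q_prev, mu_prev, mu_max, u_target):
    """One control step of a utilization-based load manager (sketch).
    lam: arrival rate lambda(t); q_prev: residual events Q(t-1);
    mu_prev: throughput mu(t); mu_max: max theoretical throughput;
    u_target: utilization set-point (e.g. 0.011 for 1.1%)."""
    q = max(0.0, q_prev + lam - mu_prev)   # Q(t) = Q(t-1) + lam(t) - mu(t)
    u = (lam + q) / mu_max                 # U(t) = lam(t)/mu_max + Q(t)/mu_max
    e = u_target - u                       # e(t): error vs. target utilization
    mu_next = lam + mu_max * e             # mu(t+1) = lam(t) + mu_max * e(t)
    return q, u, e, mu_next
```

Note that the update simplifies to mu(t+1) = mu_max * u_target − Q(t): serve up to the target capacity minus the backlog.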
  • 53. Policy wrapper and policies • The policy wrapper provides the policies with per-process statistics, the requested throughput μ(t+1), and the system capacity μc(t) • Baseline policy: compute one LS probability for all processes (from μ(t+1) and μc(t)) • Fair policy: assign each process the “same” number of events; saves the metrics of small processes while keeping accurate results on the big ones • Priority-based policy: assign a static priority to each process, compute a weighted priority to partition the system capacity, then assign a partition to each process and compute the probabilities
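The fair policy's even split of the event budget can be sketched as a small water-filling routine (hypothetical names and logic for illustration; FFWD's actual implementation may differ):

```python
def fair_drop_probabilities(arrivals, budget):
    """Split an event budget 'fairly' across processes: processes below the
    fair share keep all their events, the surplus is re-split among the rest,
    and each process p gets drop probability 1 - kept_p / arrived_p."""
    remaining = dict(arrivals)            # processes still competing for budget
    kept = {p: 0.0 for p in arrivals}     # events each process may keep
    budget_left = float(budget)
    while remaining and budget_left > 1e-9:
        share = budget_left / len(remaining)
        satisfied = [p for p, a in remaining.items() if a <= share]
        if not satisfied:                 # everyone exceeds the share: split evenly
            for p in remaining:
                kept[p] += share
            break
        for p in satisfied:               # small processes keep everything
            kept[p] = remaining.pop(p)
        budget_left = budget - sum(kept.values())
    return {p: 1.0 - min(1.0, kept[p] / arrivals[p]) for p in arrivals}
```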
  • 56. Load Shedding Filter • The Load Shedding Filter applies the probabilities computed by the policies to the input stream • For each event: look up the load-shedding probability for its input class in the shedding plan; if no entry is found the event can be dropped, otherwise apply the probability computed by the policy • Dropped events are reported to the application for metrics correction
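The per-event lookup-then-drop logic can be sketched as follows (hypothetical helper, assuming the shedding plan maps input classes to drop probabilities and that drop counts are returned for metrics correction):

```python
import random

def ls_filter(events, shedding_plan, rng=random.random):
    """Apply the shedding plan to a stream of (input_class, payload) events.
    Unknown classes are dropped; otherwise the event is dropped with the
    class's probability. Returns kept events and per-class drop counts."""
    kept, dropped = [], {}
    for cls, payload in events:
        p_drop = shedding_plan.get(cls)
        if p_drop is None or rng() < p_drop:
            dropped[cls] = dropped.get(cls, 0) + 1   # reported for correction
        else:
            kept.append((cls, payload))
    return kept, dropped
```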
  • 57. Output quality - homogeneous • QoS requirement Ut = 1.1%, the standard set-point for the agent • MAPE (lower is better) between exact and approximated metrics • Output metrics on latency and volume for file and network operations • Similar or better results for the FFWD fair policy w.r.t. the reference • FFWD remains accurate even though it drops more events, thanks to the predictable and repetitive behavior of nginx, fio, and apache
 [Figure: MAPE (%, log scale) of latency-file, latency-net, volume-file, and volume-net under the kernel-drop and fair policies vs. the reference, per workload: apache 1.9M evt/s, postmark 1.2M evt/s, simplefile 1.5M evt/s, fio 1.3M evt/s, nginx 800K evt/s]
  • 61. Output quality - homogeneous • QoS requirement Ut 1.1%, standard set-point for the agent • MAPE (lower is better) between exact and approximated metrics • Output metrics on latency and volume for file and network operations • Similar or better results of FFWD fair policy w.r.t reference • FFWD accurate even if drops more events • Predictable and repetitive behavior of nginx, fio and apache 18 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair reference 1 10 100 1000 latency-file latency-net volum e-file volum e-net MAPE(%)log kernel-drop fair apache 1.9M evt/s postmark 1.2M evt/s simplefile 1.5M evt/s fio 1.3M evt/s nginx 800K evt/s
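The error metric used throughout these slides is the standard Mean Absolute Percentage Error; a straightforward sketch:

```python
def mape(exact, approx):
    """Mean Absolute Percentage Error (%) between an exact metric
    series and its approximated counterpart; lower is better.
    Assumes the exact values are non-zero."""
    pairs = list(zip(exact, approx))
    return 100.0 * sum(abs((e - a) / e) for e, a in pairs) / len(pairs)
```

For example, `mape([100, 200], [110, 180])` gives 10.0: the per-point relative errors are 10% and 10%, averaged.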
 • 62. System stability 19 • We evaluated the Load Manager with all the tests (A, B, C, D, E, F, G, H) • With 3 different set points (Ut = 1.0%, 1.1%, 1.2% w.r.t. system capacity) • Measuring the CPU load of the sysdig agent with: • the reference implementation • FFWD with the fair and priority policies • We compared the actual CPU load with the QoS requirement (Ut) • Error measured with MAPE (lower is better), obtained by running each benchmark 20 times • 3.51x average MAPE improvement, average MAPE below 5%

 Test (Ut = 1.1%)   reference   fair    priority
 A                   7.12%      1.78%    3.78%
 B                  34.06%      4.37%    4.46%
 C                  28.03%      2.27%    2.24%
 D                  11.52%      1.41%    1.54%
 E                  26.02%      8.51%    8.99%
 F                  22.67%      8.11%    3.74%
 G                  16.42%      3.37%    2.73%
 H                  19.92%      8.41%    8.01%
 • 65. Output quality - heterogeneous • We mixed the homogeneous tests to: • simulate a co-located environment • add OS scheduling uncertainty and noise • QoS requirement Ut = 1.1% • MAPE (lower is better) between exact and approximated metrics • Metrics compared across the reference, FFWD fair and FFWD priority • Three tests with different syscall mixes: • Network-based mid-throughput: 1x Fio, 3x Nginx, 1.3M evt/s • Mixed mid-throughput: 1x Simplefile, 1x Nginx, 1.3M evt/s • Mixed high-throughput: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 20
 • 67. 1x Fio, 3x Nginx, 1.3M evt/s 21 [Plots: MAPE (%, log scale, lower is better) for latency and volume (byte r/w) metrics, file and net, comparing the reference (kernel-drop), fair and priority policies across fio, nginx-1, nginx-2 and nginx-3]
 • 70. 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 22 [Plots: MAPE (%, log scale, lower is better) for volume (byte r/w) and latency metrics, file and net, comparing the reference (kernel-drop), fair and priority policies across apache, fio, postmark-1 and postmark-2]
 • 74. Test H, mixed workloads: 1x Apache, 1x Fio, 2x Postmark, 1.8M evt/s 23 • The fair policy outperforms the reference in almost all cases • the LS Filter works at the single-event level • the reference drops events in batches • The priority policy improves on the fair policy results in most cases • the prioritized processes are privileged • other processes are treated as “best-effort” [Plots: MAPE (%, log scale, lower is better) for volume (byte r/w) and latency metrics, comparing the reference (kernel-drop), fair and priority policies]
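The difference between the two policies can be sketched as follows; a hedged reconstruction from the slides' description (fair splits the requested throughput equally across input classes, priority serves the prioritized classes first and treats the rest as best-effort), not the exact FFWD formulas:

```python
def fair_plan(arrival_rates, mu_next):
    """Fair policy sketch: share the requested throughput mu_next
    equally among the input classes, then turn each per-class budget
    into a drop probability."""
    budget = mu_next / len(arrival_rates)
    return {c: max(0.0, 1.0 - budget / lam) if lam > 0 else 0.0
            for c, lam in arrival_rates.items()}

def priority_plan(arrival_rates, mu_next, prioritized):
    """Priority policy sketch: serve the prioritized classes first,
    then share the residual throughput fairly among the best-effort
    classes."""
    plan = {}
    residual = mu_next
    for c in prioritized:
        lam = arrival_rates[c]
        served = min(lam, max(0.0, residual))
        plan[c] = 1.0 - served / lam if lam > 0 else 0.0
        residual -= served
    rest = [c for c in arrival_rates if c not in prioritized]
    if rest:
        budget = max(0.0, residual) / len(rest)
        for c in rest:
            lam = arrival_rates[c]
            plan[c] = max(0.0, 1.0 - budget / lam) if lam > 0 else 0.0
    return plan
```

With two classes at 100 evt/s each and a 100 evt/s budget, the fair plan drops 50% of each class, while prioritizing one class keeps it intact and sheds the other entirely.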
 • 75. 1x simplefile, 1x nginx, 1.3M evt/s 24 [Plots: MAPE (%, log scale, lower is better) for latency and volume (byte r/w) metrics, file and net, comparing the reference (kernel-drop), fair and priority policies across simplefile and nginx]
 • 79. Response time Load Manager 25 • The system can be characterized by its response time and the jobs in the system, via Little's Law applied to the server S • Control error: gap between the target response time and the old (measured) response time • Requested throughput: derived from the control error and the arrival rate, and used by the load shedding policies to derive the LS probabilities
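The controller equations on these slides are rendered as images in the deck; a hedged reconstruction, consistent with the queueing model of the earlier slides (queue length q(t), service rate μ(t), target response time Rt), could read:

```latex
% Little's Law: response time as jobs in the system over service rate
R(t) = \frac{q(t)}{\mu(t)}
% Control error w.r.t. the target response time R_t
e(t) = R_t - R(t)
% Requested throughput for the next interval, handed to the
% load-shedding policies to derive the drop probabilities
\mu(t+1) = \frac{q(t)}{R_t}
```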
 • 82. Real-time sentiment analysis 28 • Real-time sentiment analysis makes it possible to: – Track the sentiment of a topic over time – Correlate real-world events and the related sentiment, e.g. • Toyota crisis (2010) [1] • 2012 US Presidential Election Cycle [2] – Track the online evolution of companies' reputation, derive social profiling and enable enhanced social marketing strategies [1] Bifet Figuerol, Albert Carles, et al. "Detecting sentiment change in Twitter streaming data." Journal of Machine Learning Research: Workshop and Conference Proceedings Series. 2011. [2] Wang, Hao, et al. "A system for real-time twitter sentiment analysis of 2012 us presidential election cycle." Proceedings of the ACL 2012 System Demonstrations.
 • 83. Sentiment analysis: case study 29 • Simple Twitter streaming sentiment analyzer built with Stanford NLP • System components: – Event producer – RabbitMQ queue – Event consumer • Consumer components: – Event Capture – Sentiment Analyzer – Sentiment Aggregator • Real-time queue consumption, aggregated metrics emitted each second (keyword and hashtag sentiment)
 • 84. FFWD: Sentiment analysis 30 • FFWD adds four components: – Load shedding filter at the beginning of the pipeline – Shedding plan used by the filter – Domain-specific policy wrapper – Application control manager to detect load peaks [Diagram: input tweets flow from the Producer through the real-time and batch queues into the Load Shedding Filter (ok / ko, drop probability), then Event Capture, Sentiment Analyzer and Sentiment Aggregator emit the output metrics; the Load Manager and Policy Wrapper update the Shedding Plan from λ(t), R(t), Rt, μ(t+1) and stream stats]
 • 85. Sentiment - experimental setup 31 • Separate tests to understand FFWD behavior: – System stability – Output quality • Dataset: 900K tweets from the 35th week of the Premier League • Performed tests: – Controller: synthetic and real tweets at various λ(t) – Policy: real tweets at various λ(t) • Evaluation setup – Intel Core i7 3770, 4 cores @ 3.4 GHz + HT, 8 MB LLC – 8 GB RAM @ 1600 MHz
 • 86. System stability 32 • λ(t) estimation: – case A: λ(t) = λ(t-1) – case B: λ(t) = avg(λ(t))
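The two estimation cases can be sketched as follows; a hypothetical helper for illustration, not the agent's code:

```python
def estimate_lambda(history, case="A"):
    """Estimate the arrival rate for the next control interval from
    the observed per-interval rates.

    case A: lambda(t) = lambda(t-1)   (last observed rate)
    case B: lambda(t) = avg(lambda)   (average of the observed rates)
    """
    if case == "A":
        return history[-1]
    return sum(history) / len(history)
```

Case A reacts immediately to load changes but is noisy; case B smooths transients at the cost of lagging behind load peaks.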
 • 87. Load Manager showcase (1) • Load Manager demo (Rt = 5s): – λ(t) increased after 60s and 240s – response time: 33 [Plot: controller performance, response time (s) over 300 s against the QoS target R = 5 s]
 • 88. Load Manager showcase (2) • Load Manager demo (Rt = 5s): – λ(t) increased after 60s and 240s – throughput: 34 [Plot: actuation, events per second over 300 s for lambda, dropped events and computed mu]
 • 89. Output Quality 35 • Real tweets, μc(t) ≃ 40 evt/s • Evaluated policies: • Baseline • Fair • Priority • R = 5s, λ(t) = 100 evt/s, 200 evt/s, 400 evt/s • Error metric: Mean Absolute Percentage Error (MAPE %) (lower is better) [Plots: MAPE (%) per group A-D for the baseline, fair and priority policies at λ(t) = 100, 200 and 400 evt/s]