variety of configurations. The variety, likely resulting from infrastructure upgrades, can cause inconsistent function performance. To estimate the fraction of different types of VM in a given service, we examined the configurations of the host VMs of 50,000 unique function instances in each service.

In AWS, we checked the model name and the processor numbers in /proc/cpuinfo, and the MemTotal in /proc/meminfo, and found five types of VMs: two E5-2666 vCPUs (2.90 GHz), two E5-2680 vCPUs (2.80 GHz), two E5-2676 vCPUs (2.40 GHz), two E5-2686 vCPUs (2.30 GHz), and one E5-2676 vCPU. These types account for 59.3%, 37.5%, 3.1%, 0.09%, and 0.01% of 20,447 distinct VMs, respectively.
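Such a procfs probe can be sketched in a few lines of Python run inside a function; the handler name and return format below are illustrative, and only the three /proc fields follow the text above.

# Minimal sketch of a procfs-based host fingerprint, run inside a function.
# Only the three /proc fields come from the methodology above; the handler
# name and the return format are illustrative.
import re

def host_fingerprint():
    with open("/proc/cpuinfo") as f:
        cpuinfo = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    models = re.findall(r"model name\s*:\s*(.+)", cpuinfo)
    n_procs = len(re.findall(r"^processor\s*:", cpuinfo, re.MULTILINE))
    mem_total_kb = int(re.search(r"MemTotal:\s*(\d+) kB", meminfo).group(1))
    return {
        "cpu_model": models[0].strip() if models else "unknown",
        "num_processors": n_procs,
        "mem_total_kb": mem_total_kb,
    }

def handler(event, context):   # hypothetical function entry point
    return host_fingerprint()

if __name__ == "__main__":
    print(host_fingerprint())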
Azure shows a greater diversity of VM configurations. The instances in Azure report various vCPU counts: of 4,104 unique VMs, 54.1% use 1 vCPU, 24.6% use 2 vCPUs, and 21.3% use 4 vCPUs. For a given vCPU count, there are three CPU models: two Intel and one AMD. Thus, at least nine different types of VMs are being used in Azure. Performance may vary substantially based on what kind of host (more specifically, the number of vCPUs) runs the function. See §6 for more details.

In Google, the model name is always "unknown", but there are 4 unique model versions (79, 85, 63, 45), corresponding to 47.1%, 44.7%, 4.2%, and 4.0% of selected function instances.

Figure 4: The total number of VMs being used after sending a given number of concurrent requests in AWS.

4.5 Discussion

Being able to identify VMs in AWS is essential for our measurements. It helps to reduce noise in experiments and get more accurate results. For the sake of comparison, we evaluated the heuristic designed by Lloyd et al. [33]. The heuristic assumes that different VMs have distinct boot times, which can be obtained from /proc/stat, and groups function instances based on the boot time. We sent 10–50 concurrent requests at a time to 1536 MB functions for 100 rounds, used our methodology (instance root ID + IP) to label the VMs, and compared against the heuristic. The heuristic identified 940 VMs as 600 VMs, so 340 (36%) VMs were incorrectly labeled. We conclude that this heuristic is not reliable.

None of these serverless providers completely hides runtime information from tenants. More knowledge of the instance runtime and the backend infrastructure could make finding vulnerabilities in function instances easier for an adversary. In prior studies, procfs has been used as a side channel [9, 21, 46]. In the serverless setting, one can actually use it to monitor the activity of coresident instances; while seemingly harmless, a dedicated adversary might use it as a steppingstone to more sophisticated attacks. Overall, access to runtime information, unless necessary, should be restricted for security purposes. Additionally, providers should expose such information in an auditable way, i.e., via API calls, so that they are able to detect and block suspicious behaviors.

5 Resource Scheduling

We examine how instances and VMs are scheduled in the three serverless platforms in terms of instance coldstart latency, lifetime, scalability, and more.

5.1 Scalability and instance placement

Elastic, automatic scaling in response to changes in demand is a main advertised benefit of the serverless model. We measure how well platforms scale up.

We created 40 measurement functions of the same memory size, f1, f2, ..., f40, and invoked each fi with 5i concurrent requests. We paused for 10 seconds between batches of invocations to cope with rate limits in the platforms. All measurement functions simply sleep for 15 seconds and then return. For each configuration we performed 50 rounds of measurements.
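Such an invocation driver can be approximated with a simple thread pool, as in the sketch below; the endpoint URL is a placeholder, and in practice each function would be invoked through an HTTP trigger or the provider's SDK.

# Sketch of a concurrency driver: fire 5*i concurrent requests at function f_i,
# pausing 10 s between batches. The endpoint URL scheme is a placeholder.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://example.invalid/functions"   # placeholder endpoint

def invoke_once(function_name):
    with urllib.request.urlopen(f"{BASE_URL}/{function_name}") as resp:
        return resp.status

def invoke_concurrently(function_name, n):
    # One thread per request so all n invocations are in flight together.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(invoke_once, [function_name] * n))

if __name__ == "__main__":
    for i in range(1, 41):                 # f_1 .. f_40
        invoke_concurrently(f"f{i}", 5 * i)
        time.sleep(10)                     # cope with platform rate limits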
AWS. AWS is the best among the three services with regard to supporting concurrent execution. In our measurements, N concurrent invocations always produced N concurrently running function instances. AWS could easily scale up to 200 (the maximum measured concurrency level) fresh function instances.

We observed that 3,328 MB was the maximum aggregate memory that can be allocated across all function instances on any VM in AWS Lambda. AWS Lambda appears to treat instance placement as a bin-packing problem, and tries to place a new function instance on an existing active VM to maximize the VM memory utilization rate, i.e., the sum of instance memory sizes divided by 3,328. We invoked a single function with sets of concurrent requests, increasing from 5 to 200 with a step of 5, and recorded the total number of VMs being used after each number of requests. A few examples are shown in Figure 4.
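Under this placement policy, the number of VMs needed for N concurrent instances of a single function has a simple lower bound; the sketch below assumes perfect packing against the observed 3,328 MB cap.

# Lower bound on VMs implied by the observed 3,328 MB per-VM memory cap,
# assuming the scheduler packs instances of one function as tightly as possible.
import math

VM_MEMORY_CAP_MB = 3328   # maximum aggregate instance memory observed per VM

def min_vms(num_instances, instance_memory_mb):
    per_vm = VM_MEMORY_CAP_MB // instance_memory_mb   # instances that fit on one VM
    return math.ceil(num_instances / per_vm)

if __name__ == "__main__":
    # e.g., 200 concurrent 128 MB instances fit 26 per VM -> at least 8 VMs
    for n in (5, 50, 100, 200):
        print(n, "x 128 MB ->", min_vms(n, 128), "VMs at minimum")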
Figure 6: Median coldstart latency with min-max error bars (across 1,000 rounds) under different combinations of function languages and memory sizes in AWS. Y-axis is truncated at 1,000 ms.

Figure 8: Coldstart latency (in ms) over 168 hours. All the measurements were started right after midnight on a Sunday. Each data point is the median of all coldstart latencies collected in a given hour. For clarity, the y-axes use different ranges for each service.

So, AWS has a pool of ready VMs. The extra delays in case (1) are more likely introduced by scheduling (e.g., selecting a VM) rather than launching a VM.

Our results are consistent with prior observations: function memory and language affect coldstart latency [10], as shown in Figure 6. Python 2.7 achieves the lowest median coldstart latencies (167–171 ms), while Java functions have significantly higher latencies than other languages (824–974 ms). Coldstart latency generally decreases as function memory increases. One possible explanation is that AWS allocates CPU power proportionally to the memory size; with more CPU power, environment setup becomes faster (see §6.1).

A number of function instances may be launched on the same VM concurrently, due to AWS's instance placement strategy. In this case, the coldstart latency increases as more instances are launched simultaneously. For example, launching 20 function instances of a Python 2.7-based function with 128 MB memory on a given VM took 1,321 ms on average, which is about 7 times slower than launching 1 function instance on the same VM (186 ms).

Azure and Google. The median coldstart latency in Google ranged from 110 ms to 493 ms (see Table 7). Google also allocates CPU proportionally to memory, but in Google memory size has a greater impact on coldstart latency than in AWS. It took much longer to launch a function instance in Azure, though their instances are always assigned 1.5 GB memory. The median coldstart latency was 3,640 ms in Azure. Anecdotes online [3] suggest that the long latency is caused by design and engineering issues in the platform that Azure is both aware of and working to improve.

Latency variance. We collected the coldstart latencies of 128 MB, Python 2.7 (AWS) or Nodejs 6.* (Google and Azure) based functions every 10 seconds for over 168 hours (7 days), and calculated the median of the coldstart latencies collected in a given hour. The changes of coldstart latency are shown in Figure 8. The coldstart latencies in AWS were relatively stable, as were those in Google (except for a few spikes). Azure had the highest variation over time, ranging from about 1.5 seconds up to 16 seconds.
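The hourly aggregation behind Figure 8 can be reproduced from raw samples as follows; the (timestamp, latency) input format is an assumption.

# Reduce raw (unix_timestamp, coldstart_latency_ms) samples to one median per
# hour, as in Figure 8. The input format is an assumption; only the
# median-per-hour aggregation mirrors the text.
from collections import defaultdict
from statistics import median

def hourly_medians(samples):
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[int(ts // 3600)].append(latency_ms)   # bucket by hour
    return {hour: median(vals) for hour, vals in sorted(buckets.items())}

if __name__ == "__main__":
    demo = [(0, 170), (1200, 182), (2400, 169), (3700, 3100), (4000, 2900)]
    print(hourly_medians(demo))   # {0: 170, 1: 3000.0}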
We repeated our coldstart measurements in May 2018. We did not find significant changes in coldstart latency in AWS. But the coldstart latencies became 4x slower on average in Google, probably due to its infrastructure update in February 2018 [15], and 15x better in Azure. This result demonstrates the importance of developing a measurement platform for serverless systems (similar to [39] for IaaS) to do continuous measurements for better performance characterization.

5.3 Instance lifetime

A serverless provider may terminate a function instance even if still in active use. We define the longest time a function instance stays active as instance lifetime. Tenants prefer long lifetimes because their applications will be able to maintain in-memory state (e.g., database connections) longer and suffer less from coldstarts.
Figure 9: The CDFs of instance lifetime in AWS, Google, and Azure under different memory sizes and request frequencies.

To estimate instance lifetime, we set up functions of different memory sizes and languages, and invoked them at different frequencies (one request per 5/30/60 seconds). The lifetime of a function instance is the difference between the first time and the last time we saw the instance. We ran the experiment for 7 days (AWS and Google) or longer (Azure) so that we could collect at least 50 lifetimes under a given setting.

In general, Azure function instances have significantly longer lifetimes than AWS and Google, as shown in Figure 9. In AWS, the median instance lifetime across all settings was 6.2 hours, with the maximum being 8.3 hours. The host VMs in AWS usually live longer: the longest observed VM kernel uptime was 9.2 hours. When request frequency increases, instance lifetime tends to become shorter. Other factors have little effect on lifetime, except in Google, where instances of larger memory tend to have longer lifetimes. For example, when being invoked every five seconds, the lifetimes were 3–31 minutes and 19–580 minutes for 90% of the instances of 128 MB and 2,048 MB memory in Google, respectively. So, for functions with small memory under a heavy workload, Google seems to launch new instances aggressively rather than reusing existing instances. This can increase the performance penalty from coldstarts.
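Given the probe logs, per-instance lifetime reduces to the difference between the first and last sighting of each instance identifier; the log format below is an assumption.

# Compute per-instance lifetime as (last time seen - first time seen), given a
# log of (instance_id, unix_timestamp) observations collected by the probes.
def lifetimes(observations):
    first, last = {}, {}
    for instance_id, ts in observations:
        first[instance_id] = min(first.get(instance_id, ts), ts)
        last[instance_id] = max(last.get(instance_id, ts), ts)
    return {i: last[i] - first[i] for i in first}

if __name__ == "__main__":
    log = [("inst-a", 0), ("inst-a", 600), ("inst-b", 100), ("inst-a", 7200)]
    print(lifetimes(log))   # {'inst-a': 7200, 'inst-b': 0}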
5.4 Idle instance recycling

To efficiently use resources, serverless providers shut down idle instances to recycle allocated resources (see, e.g., [32]). We define the longest time an instance can stay idle before getting shut down as the instance maximum idle time. There is a trade-off between long and short idle times, as maintaining more idle instances is a waste of VM memory resources, while fewer ready-to-serve instances cause more coldstarts.

We performed a binary search on the minimum delay t_idle between two invocations of the function that resulted in distinct function instances. We created a function, invoked it twice with some delay between 1 and 120 minutes, and determined whether the two requests used the same function instance. We repeated until we identified t_idle. We confirmed t_idle (to minute granularity) by repeating the measurement 100 times for delays close to t_idle.
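The binary search can be sketched as follows; same_instance_after is a placeholder for the actual probe (invoke, wait the given number of minutes, invoke again, and compare instance identifiers).

# Binary search for the maximum idle time t_idle (in minutes), assuming a probe
# same_instance_after(delay) that invokes the function, waits `delay` minutes,
# invokes again, and reports whether both requests hit the same instance.
def find_t_idle(same_instance_after, lo=1, hi=120):
    # Invariant: a delay of `lo` minutes still reuses the instance,
    # while a delay of `hi` minutes yields a fresh instance.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if same_instance_after(mid):
            lo = mid
        else:
            hi = mid
    return lo   # largest tested delay that still reused the instance

if __name__ == "__main__":
    # Stand-in probe: pretend instances are recycled after 27 idle minutes.
    print(find_t_idle(lambda delay: delay <= 27))   # -> 27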
AWS. An instance could usually stay inactive for at most 27 minutes. In fact, in 80% of the rounds instances were shut down after 26 minutes. When their host VM is "idle", i.e., there are no active instances on that VM, idle function instances are recycled in the following way. Assume that the function instances of N functions f1, ..., fN are coresident on a VM, and that k_fi instances are from fi. For a given function fi, AWS will shut down ⌊k_fi/2⌋ of the idle instances of fi every 300 (more or less) seconds until there are two or three instances left, and eventually shut down the remaining instances after 27 minutes (we have tested with k_fi = 5, 10, 15, 20). AWS performs these operations on f1, ..., fN on a given VM independently, and also on individual VMs independently. Function memory or language does not affect maximum idle time.
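This recycling pattern can be illustrated with a toy simulation; the halving step, the roughly 300-second interval, and the 27-minute cutoff follow the text, while everything else is a simplification.

# Toy simulation of the idle-instance recycling pattern described above:
# roughly every 300 s, half of a function's idle instances (rounded down) are
# shut down, until two or three remain; those go away after ~27 idle minutes.
def recycle_schedule(k, step_s=300):
    t, alive, events = 0, k, []
    while alive > 3:
        t += step_s
        killed = alive // 2                  # floor(k_fi / 2)
        alive -= killed
        events.append((t, killed, alive))
    events.append((27 * 60, alive, 0))       # remaining instances after ~27 min idle
    return events

if __name__ == "__main__":
    for t, killed, left in recycle_schedule(20):
        print(f"t={t:>4}s  shut down {killed:>2}  remaining {left}")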
If there are active instances on the VM, instances can stay inactive for a longer time. We kept one instance active on a given VM by sending a request every 10 seconds and found: (1) AWS still adopted the same strategy to recycle the idle instances of the same function, but (2) somehow idle time was reset for other coresident instances. We observed some idle instances could stay idle in such cases for 1–3 hours.

Azure and Google. In Azure, we could not find a consistent maximum instance idle time. We repeated the experiment several times on different days and found the maximum idle times of 22, 40, and more than 120 minutes. In Google, the idle time of instances could be more than 120 minutes. After 120 minutes, instances remained active in 18% of our experiments.
5.5 Inconsistent function usage

Tenants expect the requests following a function update to be handled by the new function code, especially if the update is security-critical. However, we found that in AWS there was a small chance that requests could be handled by an old version of the function. We call such cases inconsistent function usage. In the experiment, we sent k = 1 or k = 50 concurrent requests to a function, and did this again without delay after updating one of the following aspects of the function: IAM role, memory size, environment variables, or function code. For a given setting, we performed these operations for 100 rounds. When k = 1, 1%–4% of the tests used an inconsistent function. When there were more associated instances before the update (k = 50), 80% of our tests did. Among these inconsistent cases, we found two situations: (1) AWS launched new instances of the outdated function (2% of all the cases), and (2) AWS reused existing instances of the outdated function. Inconsistent instances never handle more than one request before terminating (note that max execution time is 300 s in AWS), but still, a considerable fraction of requests may fail to get desired results.
requests may fail to get desired results. Figure 10: The median instance CPU utilization rates
As we waited for a longer time after the function with min-max error bars in AWS and Google as function
update to send requests, we found fewer inconsistent memory increases, averaged across 1,000 instances for a
cases, and eventually zero cases with a 6-second given memory size.
waiting time. So, we suspect that the inconsistency
issues are caused by race conditions in the instance 1 vCPU 2 vCPUs 4 vCPUs 1 vCPU 2 vCPUs 4 vCPUs
scheduler. The results suggest coordinating function
Fraction
0.6
as the scheduler cannot do an atomic update. 0.5
0.4
5.6 Discussion 0
0.2
Figure 12: Aggregate I/O and network throughput across coresident instances as concurrency level increases. The
coresident instances perform the same task simultaneously. The values are the median values across 50 rounds.
Figure 10a. Instances with higher memory get more increased, the CPU utilization of instances on 1-vCPU
CPU cycles. The median instance CPU utilization rate VMs drops more dramatically, as shown in Figure 11b.
increased from 7.7% to 92.3% as memory increased
from 128 to 1,536 MB, and the corresponding standard 6.2 I/O and network
deviations (SD) were 0.7% and 8.7%. When there is To measure I/O throughput, our measurement functions
no contention from other coresident instances, the CPU in AWS and Google used the dd command to write
utilization rate of an instance can vary significantly, 512 KB of data to the local disk 1,000 times (with
resulting in inconsistent application performance. That fdatasync and dsync flags to ensure the data is
said, an upper bound on CPU share is approximated by written to disk). In Azure, we performed the same
2 ∗ m/3328, where m is the memory size. operations using a Python script (which used os.fsync
We further examine how CPU time is allocated among to ensure data is written to disk). For network
coresident instances. We let colevel be the number of throughput measurement, the function used iperf 3.13
coresident instances and a colevel of 1 indicates only with default configurations to run the throughput test for
a single instance on the VM. For memory size m, we 10 seconds with different same-region iperf servers, so
selected a colevel in the range 2 to b3328/mc. We that iperf server-side bandwidth was not a bottleneck.
then measured the CPU utilization rate in each of the The iperf servers used the same types of VMs as the
coresident instances. Examining the results over 20 vantage points.
rounds of tests, we found that the currently running
AWS. Figure 12 shows aggregate I/O and network
instances share CPU fairly, since they had nearly the
throughput across a given number of coresident in-
same CPU utilization rate (SD <0.5%). With more
stances, averaged across 50 rounds. All the coresident
coresident instances, each instance’s CPU share becomes
instances performed the same measurement concur-
slightly less than, but still close to 2 ∗ m/3328 (SD
rently. Though the aggregate I/O and network throughput
<2.5% in any setting).
remains relatively stable, each instance gets a smaller
The above results indicate that AWS tries to allocate a
share of the I/O and network resources as colevel
fixed amount of CPU cycles to an instance based only on
increases. When colevel increased from 1 to 20, the
function memory.
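From inside a function, the CPU share can be estimated by busy-spinning for a fixed wall-clock window and comparing consumed CPU time to elapsed time; the sketch below also prints the approximate 2 ∗ m/3328 bound for a few AWS memory sizes.

# Estimate the CPU share granted to this instance: busy-spin for a fixed
# wall-clock window and divide CPU time used by wall time elapsed. The
# expected_share line reproduces the approximate 2*m/3328 bound from the text.
import time

def cpu_share(window_s=1.0):
    wall_start, cpu_start = time.monotonic(), time.process_time()
    while time.monotonic() - wall_start < window_s:
        pass                                   # burn cycles
    wall = time.monotonic() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall

def expected_share(memory_mb):
    return 2 * memory_mb / 3328                # approximate AWS upper bound

if __name__ == "__main__":
    print(f"measured share on this host: {cpu_share():.2f}")
    for m in (128, 512, 1024, 1536):
        print(f"{m:>5} MB -> expected share ~ {expected_share(m):.2f}")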
Azure and Google. Google adopts the same mechanism as AWS to allocate CPU cycles based on function memory [13]. In Google, the median instance CPU utilization rates ranged from 11.1% to 100% as function memory increased. For a given memory size, the standard deviations of the rates across different instances are very low (Figure 10b), ranging from 0.62% to 2.30%.

Azure has a relatively high variance in the CPU utilization rates (14.1%–90%), while the median was 66.9% and the SD was 16%. This is true even though the instances are allocated the same amount of memory. The breakdown by vCPU number shows that the instances on 4-vCPU VMs tend to gain higher CPU shares, ranging from 47% to 90% (Figure 11a). The distributions of utilization rates of instances on 1-vCPU VMs and 2-vCPU VMs are in fact similar; however, when colevel increased, the CPU utilization of instances on 1-vCPU VMs drops more dramatically, as shown in Figure 11b.

6.2 I/O and network

To measure I/O throughput, our measurement functions in AWS and Google used the dd command to write 512 KB of data to the local disk 1,000 times (with fdatasync and dsync flags to ensure the data is written to disk). In Azure, we performed the same operations using a Python script (which used os.fsync to ensure data is written to disk). For network throughput measurement, the function used iperf 3.13 with default configurations to run the throughput test for 10 seconds with different same-region iperf servers, so that iperf server-side bandwidth was not a bottleneck. The iperf servers used the same types of VMs as the vantage points.
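The dd-based probe can be sketched as follows; the output path, flags, and write size mirror the description above, while the round count is a parameter and the throughput arithmetic assumes 512 KB = 0.5 MB per write.

# Sketch of the dd-based I/O probe: write 512 KB to local disk repeatedly with
# the fdatasync/dsync flags mentioned above, and report average throughput.
import subprocess
import time

def io_throughput_mb_s(path="/tmp/ddtest", rounds=1000, block="512k"):
    start = time.monotonic()
    for _ in range(rounds):
        subprocess.run(
            ["dd", "if=/dev/zero", f"of={path}", f"bs={block}", "count=1",
             "oflag=dsync", "conv=fdatasync"],
            check=True, capture_output=True,
        )
    elapsed = time.monotonic() - start
    return (0.5 * rounds) / elapsed            # 512 KB = 0.5 MB per round

if __name__ == "__main__":
    print(f"avg write throughput: {io_throughput_mb_s(rounds=50):.1f} MB/s")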
AWS. Figure 12 shows aggregate I/O and network throughput across a given number of coresident instances, averaged across 50 rounds. All the coresident instances performed the same measurement concurrently. Though the aggregate I/O and network throughput remains relatively stable, each instance gets a smaller share of the I/O and network resources as colevel increases. When colevel increased from 1 to 20, the average I/O throughput per 128 MB instance dropped by 4x, from 13.1 Mbps to 2.9 Mbps, and network throughput by 19x, from 538.6 MB/s to 28.7 MB/s.

Coresident instances get less share of the network with more contention. We calculate the Coefficient of Variation (CV), defined as the SD divided by the mean, for each colevel. A higher CV suggests that the performance of instances differs more. For 128 MB instances, the CV of network throughput ranged from 9% to 83% across all colevels, suggesting significant performance variability due to contention with coresident instances. In contrast, the I/O performance was similar between instances (CV of 1% to 6% across all colevels). However, the I/O performance is affected by function memory (CPU) for small memory sizes (≤ 512 MB), and therefore the I/O throughput of an instance could degrade more when competing with instances of higher memory.
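The CV computation itself is straightforward; the sketch below assumes a map from colevel to the per-instance throughput samples measured at that level.

# Coefficient of variation (CV = SD / mean) per colevel, the dispersion metric
# used above to compare coresident instances.
from statistics import mean, stdev

def cv_by_colevel(samples):
    return {level: stdev(vals) / mean(vals) for level, vals in samples.items()}

if __name__ == "__main__":
    demo = {1: [538.0, 531.0, 540.0], 20: [30.1, 22.5, 41.0, 19.8]}
    for level, cv in cv_by_colevel(demo).items():
        print(f"colevel {level:>2}: CV = {cv:.0%}")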