A Performance Perspective on Web Optimized Protocol Stacks: TCP+TLS+HTTP/2 vs. QUIC

ABSTRACT
By neglecting available TCP improvements inherently included in QUIC, existing comparisons do not shed light on the performance of current Web stacks. In this paper, we show that tuning TCP parameters is not negligible and directly yields significant improvements. Nevertheless, QUIC still outperforms even our tuned variant of TCP. This performance advantage is mostly caused by QUIC’s reduced-RTT design during connection establishment and, in the case of lossy networks, by its ability to circumvent head-of-line blocking.

CCS CONCEPTS
• Networks → Network measurement;

ACM Reference Format:
Konrad Wolsing, Jan Rüth, Klaus Wehrle, and Oliver Hohlfeld. 2019. A Performance Perspective on Web Optimized Protocol Stacks: TCP+TLS+HTTP/2 vs. QUIC. In ANRW ’19: Applied Networking Research Workshop (ANRW ’19), July 22, 2019, Montreal, QC, Canada. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3340301.3341123

∗ Now at Brandenburg University of Technology.

1 INTRODUCTION
The advancement of Web applications and services has resulted in an ongoing evolution of the Web protocol stack. Driving reasons are security and privacy or the realization of latency-sensitive Web services. Today, the typical Web stack involves using HTTP/2 over TLS over TCP, making it practically one (ossified) protocol. While parts of the protocols have been designed to account for the others, this protocol stacking still suffers from inefficiencies, e.g., head-of-line blocking. Even though protocol extensions promise higher efficiency (e.g., TCP Fast Open or TLS 1.3 early-data), QUIC goes a step further: it combines TCP, TLS, and HTTP/2, tightly coupled into a new protocol that makes it possible to utilize cross-layer information and to evolve without ossification. While it fixes some of TCP’s shortcomings, like head-of-line blocking when used with HTTP, its design is, first and foremost, meant to enable evolution.

A number of studies showed that QUIC outperforms the classical TCP-based stack [2, 7, 8, 13, 17, 30]. However, these studies compare QUIC to an unoptimized TCP-based stack, a limitation that we address in this paper. Current QUIC implementations were specifically designed and parameterized for the Web. In contrast, stock TCP implementations, as in the Linux kernel, are not specialized and are built to perform well on a large set of devices, networks, and workloads. However, we have shown [26] that large content providers fine-tune their TCP stacks (e.g., by tuning the initial window size) to improve content delivery. All studies known to us neglect this fact and indeed compare an out-of-the-box TCP with a highly-tuned QUIC Web stack, showing that the optimized version is superior. Furthermore, they often utilize simple Web performance metrics like page load time (PLT) to reason about page loading speed, even though it has long been known that PLT does not correlate with user-perceived speed [3, 14, 31].

In this paper, we seek to close this gap by parameterizing TCP similarly to QUIC to enable a fair comparison. This includes increasing the initial congestion window, enabling pacing, disabling slow start after idle, and tuning the kernel buffers to match QUIC’s defaults. We further enable BBR instead of CUBIC as the congestion control algorithm in one scenario. We show that this previously neglected tuning of TCP impacts its performance. We find that for broadband access, QUIC’s RTT-optimized connection establishment indeed increases the loading speed, but QUIC otherwise compares to TCP. If optimizations such as TLS 1.3 early-data or TCP Fast Open were deployed, QUIC and TCP would compare well. In lossy networks, QUIC clearly outperforms the current Web stack, which we mainly attribute to its ability to progress streams independently of head-of-line blocking. Our comparison is based on visual Web performance metrics that better correlate with human perception than traditionally used loading times. To evaluate real-world websites, we extend
the Mahimahi framework to utilize the Google QUIC Web stack and to perform reproducible comparisons between TCP and QUIC across a large set of settings. This work does not raise any ethical issues and makes the following contributions:
• We provide the first study that performs an eye-level comparison of TCP+TLS+HTTP/2 and QUIC.
• Our study highlights that QUIC can indeed outperform TCP in a variety of settings, but so does a tuned TCP.
• Tuning TCP closes the gap to QUIC and shows that TCP is still very competitive with QUIC.
• Our study further highlights the immense impact of the choice of congestion control, especially in lossy environments.
• We add QUIC support to Mahimahi to enable reproducible QUIC research. It replays real-world websites in a testbed subject to different protocols and network settings.
Structure. Section 2 examines related work, highlights the evaluation metrics, and introduces the Mahimahi framework. Section 3 explains our testbed, network configuration, and protocol considerations. Section 4 shows the results of the measurement. Finally, Section 5 concludes this paper.

2 RELATED WORK AND BACKGROUND
QUIC is the subject of a body of studies [2, 7, 8, 13, 17, 20, 30], most of which compare QUIC against some combination of TCP+TLS+HTTP/1.1 or HTTP/2. But to the best of our knowledge, all use stock TCP configurations, comparing a likely unoptimized TCP version to a QUIC version that inherently contains available TCP optimizations. Yu et al. [30] provide the only study of the impact of packet pacing for QUIC as a tuning option. However, no further comparison to TCP is made.

Generally, the related work can be divided into two categories depending on the measurement approach. One body of research [8, 17, 27] measures against websites hosted on public servers utilizing both QUIC and TCP, usually operated by Google. Thus, these studies do not have any access to the servers, which makes tuning the protocols impossible, and the configurations in use are unknown. The second body [2, 7, 13, 20] uses self-hosted servers, in principle allowing for tuning; however, none of them does so.

One critical difference between TCP and QUIC is their connection establishment, since QUIC by design needs fewer RTTs than traditional TCP+TLS until actual website payload can be exchanged. Cook et al. [8] already take into account that there is a difference between first and repeated connections, with repeated connections requiring one RTT less for both protocols. Nevertheless, QUIC retains a one-RTT advantage for both first and repeated connections, and again this fact is not dealt with any further.
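To illustrate this RTT arithmetic, the following sketch counts round trips before the first HTTP request can be sent. It is our illustration, not taken from the paper: it assumes TLS 1.3 for the TCP stack, TCP Fast Open for repeated TCP connections, and Google QUIC's 0-RTT resumption; actual counts depend on the deployed TLS version and server support.

```python
# Illustrative handshake arithmetic: network round trips before the
# first HTTP request can be sent (assumptions: TLS 1.3; TCP Fast Open
# and QUIC 0-RTT for repeated connections).
RTTS_BEFORE_REQUEST = {
    "TCP+TLS, first":  1 + 1,  # TCP handshake + TLS 1.3 handshake
    "TCP+TLS, repeat": 0 + 1,  # TFO piggybacks data on the SYN
    "QUIC, first":     1,      # combined transport/crypto handshake
    "QUIC, repeat":    0,      # 0-RTT with cached server config
}

# QUIC keeps a one-RTT advantage in both cases:
for kind in ("first", "repeat"):
    diff = (RTTS_BEFORE_REQUEST[f"TCP+TLS, {kind}"]
            - RTTS_BEFORE_REQUEST[f"QUIC, {kind}"])
    print(kind, diff)  # -> 1 for both first and repeat
```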
Since today’s websites consist of various resources hosted by several providers, many connections to different servers are established even for fetching a single website. Many studies consider websites with varying resources but deployed by a single server only [2, 7, 17]. To study realistic websites, the Mahimahi framework [21] was designed to replicate the multi-server nature of current websites in a testbed (see Section 2.2). Nepomuceno et al. [20] perform a study with Mahimahi but find that QUIC is outperformed by TCP, which does not coincide with our findings or related work. We believe this is due to the use of the Caddy QUIC server, which is known to not (yet) perform very well [19]. Also, they did not configure any bandwidth limitations.

2.1 Web Performance Metrics
We aim to evaluate the performance of the different protocol stacks on a broad set of standard Web performance metrics. Besides network characteristics like goodput or link utilization, as measured in [7, 30], Page Load Time (PLT) is the most used metric. But PLT does not always match user-perceived performance [3, 14, 31]; e.g., it includes the loading performance of below-the-fold content that is not displayed and thus not reflected in end-user perception. This is why we decide to focus more closely on state-of-the-art visual metrics that are known to better correlate with human perception. These metrics are derived from video recordings of the page’s above-the-fold loading process, as recommended by [5, 9]. Metrics of interest are the time of the First Visual Change (FVC), the Last Visual Change (LVC), and the time the website reaches visual completeness at a desired threshold in percent, in our case Visual Complete 85 (VC85), which corresponds to the point in time, measured from navigation start, when the currently rendered website’s above-the-fold content matches the final website picture to 85 %. Only navigation start can be used as the start point since visual metrics are derived from video recordings only (see Section 3.2 for how we deal with DNS impacting the measurement). Lastly, we also take into account the Speed Index (SI) [11].
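To make these definitions concrete, the following sketch (our illustration, not the authors' tooling) computes VC85 and SI from a per-frame visual-completeness series as one would extract it from such a video recording; the frame timestamps and values are made up. SI is the area above the visual-progress curve, so lower is better.

```python
# Sketch: deriving VC85 and the Speed Index (SI) from per-frame visual
# completeness of the above-the-fold content. Frame data is made up.

def visual_complete(frames, threshold=0.85):
    """Time of the first frame whose visual completeness reaches the
    threshold (VC85 for threshold=0.85), from navigation start."""
    for t_ms, completeness in frames:
        if completeness >= threshold:
            return t_ms
    return None  # page never reached the threshold

def speed_index(frames, end_ms):
    """Speed Index: area above the visual-progress curve,
    SI = integral of (1 - completeness) dt, as a per-frame step sum."""
    si, prev_t, prev_c = 0.0, 0, 0.0
    for t_ms, completeness in frames + [(end_ms, frames[-1][1])]:
        si += (1.0 - prev_c) * (t_ms - prev_t)
        prev_t, prev_c = t_ms, completeness
    return si

# (time since navigation start [ms], fraction of final pixels rendered)
frames = [(0, 0.0), (400, 0.1), (900, 0.6), (1500, 0.9), (2100, 1.0)]
print(visual_complete(frames))    # VC85 -> 1500 (ms)
print(speed_index(frames, 2100))  # -> 1150.0 (ms); lower = faster
```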
2.2 Website Replay with Mahimahi
Mahimahi [21] is a framework designed to replicate real-world websites with their multi-server structure in a testbed. It uses HTTP traffic recordings that are later replayed. Mahimahi preserves the multi-server nature with the help of virtualized Web servers. Mahimahi is built upon multiple shell commands that can be stacked to create a virtual network. Each shell allows for modifying a single aspect of the traversing network flow, e.g., generating loss or limiting the bandwidth. Mahimahi yields realistic conditions for performance measurements [21]. This way, it enables repeatable and controllable studies with real-world websites.
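As an illustration of this shell stacking, the sketch below nests Mahimahi's tools (mm-delay, mm-loss, mm-link, mm-webreplay) around a browser run. This is our example, not the paper's setup: the delay and loss values, trace files, recording directory, and inner command are placeholders, and the exact argument syntax should be checked against the Mahimahi man pages.

```python
# Sketch: nesting Mahimahi shells, each modifying one aspect of the
# traversing traffic, around a replayed page load. All paths and
# parameter values below are illustrative placeholders.
import subprocess

cmd = (
    ["mm-delay", "12"]                       # fixed per-direction delay [ms]
    + ["mm-loss", "downlink", "0.033"]       # 3.3 % random downlink loss
    + ["mm-link", "up.trace", "down.trace"]  # bandwidth-shaping traces
    + ["mm-webreplay", "recorded-site"]      # serve the recorded website
    + ["chromium", "http://example.com/"]    # command run inside the shells
)
subprocess.run(cmd, check=True)
```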
3 TESTBED SETUP
We will now explain how we design our testbed to perform eye-level comparisons of TCP and QUIC.
Protocol   Description
TCP+       IW 32, pacing, CUBIC, tuned buffers, no slow start after idle
TCP+BBR    TCP+, but with BBR as congestion control

Table 1: Protocol configurations.
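The TCP+ tuning of Table 1 maps onto standard Linux knobs. A minimal sketch of applying it follows (requires root; the gateway, device name, and buffer sizes are illustrative placeholders, not the authors' exact values):

```python
# Sketch: applying TCP+-style tuning on Linux. IW 32, pacing, no slow
# start after idle, and tuned buffers follow Table 1; the route and
# buffer values themselves are placeholders for a concrete host.
import subprocess

def sh(*args):
    subprocess.run(args, check=True)

# Initial congestion (and receive) window of 32 segments on the default route.
sh("ip", "route", "change", "default", "via", "192.0.2.1", "dev", "eth0",
   "initcwnd", "32", "initrwnd", "32")

# Do not fall back to slow start after an idle period.
sh("sysctl", "-w", "net.ipv4.tcp_slow_start_after_idle=0")

# The fq qdisc enables packet pacing for TCP.
sh("sysctl", "-w", "net.core.default_qdisc=fq")

# Enlarged kernel buffers (min/default/max in bytes; values illustrative).
sh("sysctl", "-w", "net.ipv4.tcp_rmem=4096 131072 16777216")
sh("sysctl", "-w", "net.ipv4.tcp_wmem=4096 131072 16777216")

# TCP+BBR additionally swaps the congestion control from CUBIC to BBR.
sh("sysctl", "-w", "net.ipv4.tcp_congestion_control=bbr")
```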
[Figure 1: Total size (Size [MB]) and number of server IPs (IPs [#]) per tested website. Domains: apache.org, pinterest.com, youtube.com, wordpress.com, researchgate.net, w3.org, google.com, ed.gov, gov.uk, msn.com, intel.com, gnu.org, spotify.com, bit.ly, imgur.com, etsy.com, imdb.com, facebook.com, reddit.com, github.com, nytimes.com, telegraph.com, phpbb.com, gravatar.com, statcounter.com, dotdash.com, nature.com, harvard.edu, sciencemag.org, opera.com, joomla.com, vtm.be, sciencedirect.com, academia.edu, demorgen.be, columbia.edu, canvas.be, wikipedia.org]
Network   Uplink       Downlink     Delay    Loss
DSL       5 Mbps       25 Mbps      24 ms    0.0 %
LTE       2.8 Mbps     10.5 Mbps    74 ms    0.0 %
DA2GC     0.468 Mbps   0.468 Mbps   262 ms   3.3 %
MSS       1.89 Mbps    1.89 Mbps    760 ms   6.0 %

Table 2: Network configurations. Queue size is set to 200 ms except for DSL with 12 ms.
[Figure 2: Boxplot of server download speeds in our testbed (31 repetitions and no bandwidth limitation): download time [ms] of the QUIC and NGINX servers for file sizes of 2 B, 10 KB, 1 MB, and 10 MB.]

The first two networks are modeled on configurations from [4]; we assume no additional loss here. The last two networks emulate slow links measured from in-flight WLAN services [25]. Except for the DSL link with 12 ms maximal queueing delay, we assume rather bloated buffers of 200 ms. Thus, our configured delay is the minimum delay, and queuing further adds jitter up to the configured buffer size.

Validation. Before conducting measurements, we validated the implemented testbed regarding the network and protocol parameters, ensuring the correct protocol choice. We found that the Chromium browser’s DNS timeout of 5 s significantly distorts a measurement when a DNS packet is lost, and we thus moved the DNS server such that no traffic shaping is applied to DNS traffic. Moreover, Figure 2 shows that both server variants yield similar performance for files ≤ 1 MB. This suggests that our results are not biased by the servers’ implementations. For this test, we repeated 31 downloads of a single file with the Chromium browser under static network conditions: only 10 ms minimum delay, no loss, and no bandwidth limits. The gap between the NGINX and QUIC servers emerging at a file size of 10 MB is not relevant since our website sizes are much smaller (see Figure 1). Independent resources are even smaller, the largest being 4 MB.

3.3 Performing Measurements
The actual measurements are performed inside a virtual machine equipped with 6 cores and 8 GB of memory running Arch Linux (kernel version 4.18.16). To measure a single setting consisting of one website, network, and protocol configuration, a Mahimahi replay shell with the described network stack is used. A single setting is measured over 31 runs to gain statistical significance and at the same time keep the number of runs/videos manageable. We utilize the Browsertime [28] framework to instrument the browser. It records videos of the loading process that we subsequently evaluate for the visual metrics. For each run, Browsertime opens up a fresh Chromium browser (Version 70.0.3538.77). In total, this leads to 760 configurations (38 domains, 4 networks, and 5 protocol settings). We validated that each run completed successfully by reviewing the video recordings manually.
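The resulting measurement matrix can be sketched as follows; this is purely illustrative, and run_one is a hypothetical driver that would wrap the Mahimahi shells and Browsertime:

```python
# Sketch: the measurement matrix of Section 3.3 (38 x 4 x 5 = 760
# configurations, 31 runs each). run_one() is hypothetical.
from itertools import product

websites  = ["apache.org", "wikipedia.org"]  # 38 domains in the study
networks  = ["DSL", "LTE", "DA2GC", "MSS"]
protocols = ["TCP", "TCP+", "TCP+BBR", "QUIC", "QUIC+BBR"]

for site, network, protocol in product(websites, networks, protocols):
    for run in range(31):  # 31 runs per setting for statistical power
        pass  # run_one(site, network, protocol) -> video -> visual metrics
```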
4 QUIC VS. TCP PERFORMANCE
We evaluate the performance difference with all metrics in the different network settings (across all tested websites) by means of a performance gain. The following equation gives the calculation of the performance gain between a reference protocol, e.g., TCP, and a protocol to compare with, like QUIC, where $\bar{X}$ denotes the mean over the 31 runs:

\[ \text{performance gain}^{\mathrm{QUIC}}_{\mathrm{TCP}} = \frac{\bar{X}_{\mathrm{QUIC}} - \bar{X}_{\mathrm{TCP}}}{\bar{X}_{\mathrm{TCP}}} \]

If not stated otherwise, numbers provided in the text are mean performance gains over all websites for SI; we write, e.g., -0.05 (TCP+ vs. TCP) for the gain of TCP+ over the reference TCP. Besides comparing means, we also utilize an ANOVA test to tell whether there is a statistically significant difference between the distributions of the 31 runs of two protocols. If the ANOVA test for two settings yields p < 0.01 (significance level), we count the setting with the lower mean as significantly faster; otherwise, no conclusion can be drawn.
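The following sketch mirrors this procedure for one website; it is our illustration (the paper does not name its statistics tooling), using scipy's one-way ANOVA, and the sample values are made up.

```python
# Sketch: per-website comparison of two protocols: mean-based
# performance gain plus a one-way ANOVA over the runs (31 in the
# paper; fewer, made-up samples here).
from statistics import mean
from scipy.stats import f_oneway

def performance_gain(runs_cmp, runs_ref):
    """(mean(cmp) - mean(ref)) / mean(ref); negative = cmp is faster."""
    return (mean(runs_cmp) - mean(runs_ref)) / mean(runs_ref)

def significantly_faster(runs_cmp, runs_ref, alpha=0.01):
    """True iff the distributions differ (ANOVA, p < alpha) and the
    compared protocol has the lower mean; None if inconclusive."""
    p = f_oneway(runs_cmp, runs_ref).pvalue
    if p >= alpha:
        return None  # no conclusion can be drawn
    return mean(runs_cmp) < mean(runs_ref)

si_quic = [820.0, 790.0, 845.0]  # SI samples [ms], one per run
si_tcp  = [910.0, 950.0, 905.0]
print(performance_gain(si_quic, si_tcp))      # -> approx. -0.11
print(significantly_faster(si_quic, si_tcp))  # True / False / None
```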
The results of our measurements are depicted in Figure 3. We show the CDFs of the performance gain for the different metrics, comparing stock TCP to the other protocol stacks. LVC is left out of this figure because it shows no relevant difference compared to PLT.

[Figure 3: CDF of the performance gain over all websites with TCP as reference protocol. If the performance gain is < 0 (left side of plot), the compared protocol is faster than TCP. Panels: networks DSL, LTE, MSS, and DA2GC, each with metrics FVC, SI, VC85, and PLT; protocols: QUIC, QUIC+BBR, TCP, TCP+, TCP+BBR.]

DSL and LTE. For the lossless DSL and LTE scenarios, the protocols separate into two groups, both yielding similar performance gains. TCP+ (DSL: -0.05, LTE: -0.08, vs. TCP) and TCP+BBR (DSL: -0.05, LTE: -0.09, vs. TCP) perform almost indistinguishably, but against TCP, there is a noticeable improvement visible throughout all metrics. Similarly, QUIC (DSL: -0.09, LTE: -0.14, vs. TCP+) and QUIC+BBR (DSL: -0.09, LTE: -0.13, vs. TCP+BBR) perform equally but are still quite a bit faster than the two tuned TCP variants. For these two networks, the congestion control choice does not make a significant difference, which is likely due to the small queue. Stock TCP indeed lags behind all other protocols, showing that stock TCP should not be used to compare against QUIC here. QUIC decreases the average SI by -131.3 ms (DSL) and -344.9 ms (LTE) against TCP, but also against TCP+ by still -87.1 ms (DSL) and -215.9 ms (LTE).
In a second step, we take a look at the ANOVA test results, focusing on DSL (LTE yields equivalent results). When comparing the runs of TCP+ and QUIC in DSL with PLT as the metric, 30 of the 38 websites yield
a significant improvement with QUIC. For the remaining 8 websites, none was significantly faster than TCP+. For SI, even 32 are faster and only 6 show no significant difference. Similar results can be seen when comparing QUIC+BBR with TCP+BBR this way. For TCP+ and TCP in the same scenario with PLT as the metric, 25 websites are faster with TCP+, for 12 there is no significant difference, and only 1 website was significantly slower. Again, when comparing TCP+BBR with TCP+, and similarly QUIC+BBR with QUIC, for DSL and LTE throughout all metrics, we find no difference for the majority of the websites. These results line up with the results shown in Figure 3. Moreover, the steep incline of the CDFs for QUIC and TCP+ indicates that the website size or structure seems to have little influence on the achievable gain. Only looking at SI and VC85 do we see a small percentage of measurements where QUIC has a significantly higher gain.

In-Flight Wifi. For the networks MSS and DA2GC, the overall picture is quite similar, meaning QUIC as well as QUIC+BBR are usually faster than TCP+ (MSS: -0.36, DA2GC: -0.14, QUIC vs. TCP+) and TCP+BBR (MSS: -0.18, DA2GC: -0.10, QUIC+BBR vs. TCP+BBR). But there are some important differences: for the MSS link with the highest loss rate (6 %), TCP+BBR operates much better than TCP+ (-0.26, TCP+BBR vs. TCP+). Since BBR does not use loss as a congestion signal, it increases its rate regardless of this random loss. This means that in this case, the choice of congestion control has a greater impact on performance than the protocol choice itself. At the time of the FVC, TCP+BBR is already -2866.2 ms (avg.) quicker than TCP+, but with each later metric the gap widens, so that at PLT, TCP+BBR is 11395.4 ms (0.21×) quicker than TCP+ and can even keep up the pace against QUIC. This shows that TCP with BBR needs some time to catch up and thus affects the FVC much more than the later PLT. For the QUIC protocols, the picture is similar. At first, QUIC and QUIC+BBR are similarly fast and mostly better than TCP+BBR. But as the loading process commences, QUIC+BBR outperforms QUIC slightly, e.g., a -1828.3 ms better SI (QUIC+BBR vs. QUIC). QUIC with CUBIC is nevertheless reasonably fast, still being a legitimate option to use. The shapes of the performance gain CDFs of QUIC+BBR and TCP+BBR are very similar, especially for PLT, highlighting the influence of the congestion control once again. We believe that QUIC with CUBIC is still competitive due to QUIC’s ability to circumvent head-of-line blocking and its large SACK ranges. For the MSS network, QUIC reduces the SI by -8364.8 ms (avg.) compared to TCP+ and by -2091.5 ms when taking both BBR protocols into account (QUIC+BBR vs. TCP+BBR).

The last network, DA2GC, also has a high loss rate (3.3 %) but a much lower bandwidth. This is the only scenario where we observe no significant difference for most websites among all TCP configurations, even with the ANOVA test. We also see that in a small fraction of our measurements, stock TCP outperforms QUIC and the tuned TCP variants. Nevertheless, the QUIC variants are again generally significantly faster, with a higher performance gain at the FVC (e.g., -0.14, QUIC vs. TCP+) that persists towards the PLT (e.g., -0.16, QUIC vs. TCP+). The choice of the congestion control algorithm does not seem to have a significant impact here, likely due to the low bandwidth. Only for PLT do we find QUIC with CUBIC to be slightly superior to QUIC with BBR. There is not a single website where QUIC+BBR yields significantly faster performance. The SI decreases with QUIC by -2632.5 ms vs. TCP+ and by -1372.5 ms for BBR (QUIC+BBR vs. TCP+BBR).

Discussing Metrics. Some of the websites exhibit very poor performance regarding the visual metrics VC85 and SI. We observe this behavior especially for the DA2GC network, with performance gains of up to +1.0 compared to stock TCP (not shown; plots cropped for readability). The reason for these outliers is that the protocol choice has such a substantial