Advanced Research in Computing and Software Science
Fundamental Approaches
to Software Engineering
24th International Conference, FASE 2021
Held as Part of the European Joint Conferences
on Theory and Practice of Software, ETAPS 2021
Luxembourg City, Luxembourg, March 27 – April 1, 2021
Proceedings
Lecture Notes in Computer Science 12649
Founding Editors
Gerhard Goos, Germany
Juris Hartmanis, USA
Editors

Esther Guerra
Universidad Autónoma de Madrid
Madrid, Spain

Mariëlle Stoelinga
University of Twente
Enschede, The Netherlands
Radboud University
Nijmegen, The Netherlands
© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
ETAPS Foreword
Welcome to the 24th ETAPS! ETAPS 2021 was originally planned to take place in
Luxembourg in its beautiful capital Luxembourg City. Because of the Covid-19 pan-
demic, this was changed to an online event.
ETAPS 2021 was the 24th instance of the European Joint Conferences on Theory
and Practice of Software. ETAPS is an annual federated conference established in
1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each
conference has its own Program Committee (PC) and its own Steering Committee
(SC). The conferences cover various aspects of software systems, ranging from theo-
retical computer science to foundations of programming languages, analysis tools, and
formal approaches to software engineering. Organising these conferences in a coherent,
highly synchronised conference programme enables researchers to participate in an
exciting event, having the possibility to meet many colleagues working in different
directions in the field, and to easily attend talks of different conferences. On the
weekend before the main conference, numerous satellite workshops take place that
attract many researchers from all over the globe.
ETAPS 2021 received 260 submissions in total, 115 of which were accepted,
yielding an overall acceptance rate of 44.2%. I thank all the authors for their interest in
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con-
tributions, and in particular the PC (co-)chairs for their hard work in running this entire
intensive process. Last but not least, my congratulations to all authors of the accepted
papers!
ETAPS 2021 featured the unifying invited speakers Scott Smolka (Stony Brook
University) and Jane Hillston (University of Edinburgh) and the conference-specific
invited speakers Işil Dillig (University of Texas at Austin) for ESOP and Willem Visser
(Stellenbosch University) for FASE. Invited tutorials were provided by Erika Ábrahám
(RWTH Aachen University) on analysis of hybrid systems and Madhusudan
Parthasarathy (University of Illinois at Urbana-Champaign) on combining machine
learning and formal methods.
ETAPS 2021 was originally supposed to take place in Luxembourg City, Luxembourg,
organized by the SnT - Interdisciplinary Centre for Security, Reliability and
Trust, University of Luxembourg. Founded in 2003, the university is one of the
best and most international young universities, with 6,700 students from 129
countries and 1,331 academics from all over the globe. The local
organisation team consisted of Peter Y.A. Ryan (general chair), Peter B. Roenne (or-
ganisation chair), Joaquin Garcia-Alfaro (workshop chair), Magali Martin (event
manager), David Mestel (publicity chair), and Alfredo Rial (local proceedings chair).
ETAPS 2021 was further supported by the following associations and societies:
ETAPS e.V., EATCS (European Association for Theoretical Computer Science),
EAPLS (European Association for Programming Languages and Systems), and EASST
(European Association of Software Science and Technology).
Preface
This volume contains the papers presented at FASE 2021, the 24th International
Conference on Fundamental Approaches to Software Engineering. FASE 2021 was
organized as part of the annual European Joint Conferences on Theory and Practice of
Software (ETAPS 2021).
FASE is concerned with the foundations on which software engineering is built,
including topics like software engineering as an engineering discipline, requirements
engineering, software architectures, software quality, model-driven development,
software processes, software evolution, search-based software engineering, and the
specification, design, and implementation of particular classes of systems, such as
(self-)adaptive, collaborative, intelligent, embedded, distributed, mobile, pervasive,
cyber-physical, or service-oriented applications.
FASE 2021 received 51 submissions. The submissions came from the following
countries (in alphabetical order): Argentina, Australia, Austria, Belgium, Brazil,
Canada, China, France, Germany, Iceland, India, Ireland, Italy, Luxembourg, Mace-
donia, Malta, Netherlands, Norway, Russia, Singapore, South Korea, Spain, Sweden,
Taiwan, United Kingdom, and United States. FASE used a double-blind reviewing
process. Each submission was reviewed by three Program Committee members. After
an online discussion period, the Program Committee accepted 16 papers as part of the
conference program (31% acceptance rate).
FASE 2021 hosted the 3rd International Competition on Software Testing
(Test-Comp 2021). Test-Comp is an annual comparative evaluation of testing tools.
This edition contained 11 participating tools, from academia and industry. These
proceedings contain the competition report and three system descriptions of partici-
pating tools. The system-description papers were reviewed and selected by a separate
program committee: the Test-Comp jury. Each paper was assessed by at least three
reviewers. Two sessions in the FASE program were reserved for the presentation of the
results: the summary by the Test-Comp chair and the participating tools by the
developer teams in the first session, and the community meeting in the second session.
A lot of people contributed to the success of FASE 2021. We are grateful to the
Program Committee members and reviewers for their thorough reviews and con-
structive discussions. We thank the ETAPS 2021 organizers, in particular,
Peter Y. A. Ryan (General Chair), Joaquin Garcia-Alfaro (Workshops Chair), Peter
Roenne (Organization Chair), Magali Martin (Event Manager), David Mestel (Publicity
Chair) and Alfredo Rial (Local Proceedings Chair). We also thank Marieke Huisman
(Steering Committee Chair of ETAPS 2021) for managing the process, and Gabriele
Taentzer (Steering Committee Chair of FASE 2021) for her feedback and support. Last
but not least, we would like to thank the authors for their excellent work.
Steering Committee
Wil van der Aalst RWTH Aachen, Germany
Jordi Cabot ICREA - Universitat Oberta de Catalunya, Spain
Marsha Chechik University of Toronto, Canada
Reiner Hähnle Technische Universität Darmstadt, Germany
Reiko Heckel University of Leicester, UK
Tiziana Margaria University of Limerick, Ireland
Fernando Orejas Universitat Politècnica de Catalunya, Spain
Julia Rubin University of British Columbia, Canada
Alessandra Russo Imperial College London, UK
Andy Schürr Technische Universität Darmstadt, Germany
Perdita Stevens University of Edinburgh, UK
Gabriele Taentzer Philipps-Universität Marburg, Germany
Andrzej Wąsowski IT University of Copenhagen, Denmark
Heike Wehrheim Universität Paderborn, Germany
On Benchmarking for Concurrent Runtime Verification

1 University of Malta, Msida, Malta {duncan.attard.01,afra1}@um.edu.mt
2 Reykjavík University, Reykjavík, Iceland {luca,duncanpa17,annai}@ru.is
3 Gran Sasso Science Institute, L'Aquila, Italy {luca.aceto}@gssi.it
1 Introduction
Large-scale software design has shifted from the classic monolithic architecture
to one where applications are structured in terms of independently-executing
asynchronous components [17]. This shift poses new challenges to the validation
of such systems. Runtime Verification (RV) [9,27] is a post-deployment technique
that is used to complement other methods such as testing [46] to assess the func-
tional (e.g. correctness) and non-functional (e.g. quality of service) aspects of
concurrent software. RV relies on instrumenting the system to be analysed with
monitors, which inevitably introduce runtime overhead that should be kept min-
imal [9]. While the worst-case complexity bounds for monitor-induced overheads
can be calculated via standard methods (see, e.g. [40,14,1,28]), benchmarking is,
by far, the preferred method for assessing these overheads [9,27]. One reason for
Supported by the doctoral student grant (No: 207055-051) and the TheoFoMon
project (No: 163406-051) under the Icelandic Research Fund, the BehAPI project
funded by the EU H2020 RISE under the Marie Skłodowska-Curie action
(No: 778233), the ENDEAVOUR Scholarship Scheme (Group B, national funds),
and the MIUR project PRIN 2017FTXR7S IT MATTERS.
© The Author(s) 2021
E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 3–23, 2021.
https://doi.org/10.1007/978-3-030-71500-7_1
The state of the art in benchmarking for concurrent RV suffers from an-
other issue. Existing benchmarks—conceived for validating other tools—are re-
purposed for RV and often fail to cater for concurrent scenarios where RV is
realistically put to use. SPECjvm2008, DaCapo, and ScalaBench lack workloads
that leverage the JVM concurrency primitives [52]; meanwhile, [12] shows that
the Savina microbenchmarks are essentially sequential, and that the rest of the
programs in the suite are sufficiently simple to be regarded as microbenchmarks
too. The CRV suite mostly targets monolithic software with limited concurrency,
where the potential for scaling up to high loads is, therefore, severely curbed.
This paper presents a benchmarking framework for evaluating runtime mon-
itoring tools written for verification purposes. Our tool focusses on component
systems for asynchronous message-passing concurrency. It generates synthetic
system models following the master-slave architecture [61]. The master-slave ar-
chitecture is pervasive in distributed (e.g. DNS, IoT) and concurrent (e.g. web
servers, thread pools) systems [61,29], and lies at the core of the MapReduce
model [22] supported by Big Data frameworks such as Hadoop [63]. This justi-
fies our aim to build a benchmarking tool targeting this architecture. Concretely:
– We detail the design of a configurable benchmark that emulates various
master-slave models under commonly-observed load profiles, and gathers dif-
ferent metrics that give a multi-faceted view of runtime overhead, Sec. 2.
– We demonstrate that our synthetic benchmarks can be engineered to ap-
proximate the realistic behaviour of web server traffic with high degrees of
precision and repeatability, Sec. 3.1.
– We present a case study that (i) shows how the load profiles and parametris-
ability of our benchmarks can produce edge cases that can be measured
through our performance metrics to assess runtime monitoring tools in a
comprehensive manner, and (ii) confirms that the results from (i) coincide
with those obtained via a real-world use case using OTS software, Sec. 3.2.
2.1 Approach
We target concurrent applications that execute on a single node. Nevertheless,
our design adheres to three criteria that facilitate its extension to a distributed
setting. Specifically, components: (i) share neither a common clock, (ii) nor
memory, and (iii) communicate via asynchronous messages. Our present set-up
assumes that communication is reliable and components do not fail.
Load generation. Load on the system is induced by the master when it creates
slave processes and allocates tasks. The total number of slaves in one run can be
set via the parameter n. Tasks are allocated to slave processes by the master,
and consist of one or more work requests that a slave receives, handles, and relays
back. A slave terminates its execution when all of its allocated work requests have
been processed and acknowledged by the master. The number of work requests
that can be batched in a task is controlled by the parameter w; the actual batch
size per slave is then drawn randomly from a normal distribution with mean
μ = w and standard deviation σ = μ×0.02. This induces a degree of variability in
the amount of work requests exchanged between master and slaves. The master
and slaves communicate asynchronously: an allocated work request is delivered
to a slave process’ incoming work queue where it is eventually handled. Work
responses issued by a slave are queued and processed similarly on the master.
Load configuration. We consider three load profiles (see fig. 3 for examples) that
determine how the creation of slaves is distributed along the load timeline t.
The timeline is modelled as a sequence of discrete logical time units representing
instants at which a new set of slaves is created by the master. Steady loads
replicate executions where a system operates under stable conditions. These are
modelled on a homogeneous Poisson distribution with rate λ, specifying the mean
number of slaves that are created at each time instant along the load timeline
with duration t=n/λ. Pulse loads emulate settings where a system experiences
gradually increasing load peaks. The Pulse load shape is parametrised by t and
the spread, s, that controls how slowly or sharply the system load increases as it
approaches its maximum peak, halfway along t. Pulses are modelled on a normal
distribution with μ = t/2 and σ = s. Burst loads capture scenarios where a system
is stressed due to load spikes; these are based on a log-normal distribution with
μ = ln(m²/√(p² + m²)) and σ = √(ln(1 + p²/m²)), where m = t/2, and parameter p
is the pinch controlling the concentration of the initial load burst.
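To make the three shapes concrete, the following sketch (our own reconstruction in Erlang; names are hypothetical) draws a per-instant slave count for Steady loads, and a creation instant along the timeline for Pulse and Burst loads:

```erlang
-module(load_profiles).
-export([steady/1, pulse/2, burst/2]).

%% Steady: slaves per time unit ~ Poisson(Lambda), sampled with
%% Knuth's inversion method (adequate for moderate per-instant
%% rates; large Lambda warrants a different sampler).
steady(Lambda) ->
    steady(math:exp(-Lambda), 0, 1.0).
steady(L, K, P0) ->
    P = P0 * rand:uniform(),
    if P > L -> steady(L, K + 1, P);
       true  -> K
    end.

%% Pulse: creation instant ~ Normal(t/2, s^2); rand:normal/2 takes
%% the mean and the variance.
pulse(T, S) ->
    rand:normal(T / 2, S * S).

%% Burst: creation instant ~ LogNormal(Mu, Sigma) using the
%% moment-matching parameters above, with m = t/2 and pinch P.
burst(T, P) ->
    M = T / 2,
    Mu = math:log(M * M / math:sqrt(P * P + M * M)),
    Sigma = math:sqrt(math:log(1 + P * P / (M * M))),
    math:exp(rand:normal(Mu, Sigma * Sigma)).
```

In a real run, sampled instants would additionally be clamped to the interval [0, t].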
Wall-clock time. A load profile created for a logical timeline t is put into effect
by the master process when the system starts running. The master does not
create the slave processes that are set to execute in a particular time unit in one
go, since this naïve strategy risks saturating the system and artificially inflating
the load: the system may then become overloaded not because the mean
request rate is high, but because the created slaves overwhelm the master when
they send their requests all at once. We address this issue by introducing the
notion of concrete time that maps one discrete time unit in t to a real time period,
π. The parameter π is given in milliseconds (ms), and defaults to 1000 ms.
Slave scheduling. The master process employs a scheduling scheme to distribute
the creation of slaves uniformly across the time period π. It makes use of three
queues: the Order queue, Ready queue, and Await queue, denoted by QO , QR ,
and QA respectively. QO is initially populated with the load profile, step 1 in
fig. 1a. The load profile consists of an array with t elements—each corresponding
to a discrete time instant in t—where the value l of every element indicates the
number of slaves to be created at that instant. Slaves, S1 ,S2 ,...,Sn , are scheduled
and created in rounds, as follows. The master picks the first element from QO
[Fig. 1: Slave scheduling rounds over a load timeline of t = 4 units. (a) Master schedules the first batch of four slaves for execution in QR. (b) Slaves S1 and S2 created and added to QA; a work request is sent to S1. (c) Slaves S3 and S4 created and added to QA; slave S2 completes its execution. (d) QR becomes empty; master schedules the next batch of two slaves.]
to compute the upcoming schedule, step 2 , that starts at the current time,
c, and finishes at c + π. A series of l time points, p1 ,p2 ,...,pl , in the schedule
period π are cumulatively calculated by drawing the next pi from a normal
distribution with μ = π/l and σ = μ×0.1. Each time point stipulates a moment
in wall-clock time when a new slave Sj is to be created; this set of time points
is monotonic, and constitutes the Ready queue, QR , step 3 . The master checks
QR , step 4 in fig. 1b, and creates the slaves whose time point pi is smaller
than or equal to the current wall-clock time⁴, steps 5 and 6 in fig. 1b. The
time point pi of a newly-created slave is removed from QR, and an entry for
the corresponding slave Sj is appended to the Await queue QA ; this is shown
in step 7 for S1 and S2 . Slaves in QA are now ready to receive work requests
from the master process, e.g. step 8 . QA is traversed by the master at this
stage so that work requests can be allocated to existing slaves. The master
continues processing queue QR in subsequent rounds, creating slaves, issuing
work requests, and updating QR and QA accordingly as shown in steps 9 – 13
⁴ We assume that the platform scheduling the master and slave processes is fair.
in fig. 1c. At any point, the master can receive responses, e.g. step 17 in fig. 1d;
these are buffered inside the masters’ incoming work queue and handled once
the scheduling and work allocation phases are complete. A fresh batch of slaves
from QO is scheduled by the master whenever QR becomes empty, step 15 , and
the described procedure is repeated. The master stops scheduling slaves when all
the entries in QO are processed. It then transitions to work-only mode, where it
continues allocating work requests and handling incoming responses from slaves.
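The per-round computation of QR can be sketched as follows (our own illustration; the helper is hypothetical). Each of the l time points is the previous one plus a normal increment with μ = π/l and σ = 0.1μ, which makes the schedule monotonic by construction:

```erlang
-module(schedule).
-export([ready_queue/2]).

%% Spread L slave-creation points across one period of Pi ms,
%% starting from the current wall-clock time.
ready_queue(Pi, L) when L > 0 ->
    Mu = Pi / L,
    Sigma = 0.1 * Mu,
    Now = float(erlang:monotonic_time(millisecond)),
    {Points, _} =
        lists:mapfoldl(
          fun(_, Acc) ->
                  %% Negative deviates are clamped so that the
                  %% cumulative sum (and thus QR) stays monotonic.
                  Step = max(0.0, rand:normal(Mu, Sigma * Sigma)),
                  Next = Acc + Step,
                  {Next, Next}
          end, Now, lists:seq(1, L)),
    Points.
```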
Reactiveness and task allocation. Systems generally respond to load with dif-
fering rates, due to the computational complexity of the task at hand, IO, or
slowdown when the system itself becomes gradually loaded. We simulate these
phenomena using the parameters Pr(send) and Pr(recv). The master interleaves
the processing of work requests to allocate them uniformly among the various
slaves: Pr(send) and Pr(recv) bias this behaviour. Specifically, Pr(send) con-
trols the probability that a work request is sent by the master to a slave, whereas
Pr(recv) determines the probability that a work response received by the master
is processed. Sending and receiving is turn-based and modelled on a Bernoulli
trial. The master picks a slave Sj from QA and sends at least one work request
when X ≤ Pr(send), i.e., the Bernoulli trial succeeds; X is drawn from a uni-
form distribution on the interval [0,1]. Further requests to the same slave are
allocated following this scheme (steps 8 , 13 and 20 in fig. 1) and the entry for
Sj in QA is updated accordingly with the number of work requests remaining.
When X > Pr(send), i.e., the Bernoulli trial fails, the slave misses its turn, and
the next slave in QA is picked. The master also queries its incoming work queue
to determine whether a response can be processed. It dequeues one response
when X ≤ Pr(recv), and the attempt is repeated for the next response in the
queue until X > Pr(recv). The master signals slaves to terminate once it ac-
knowledges all of their work responses (e.g. step 14 ). Due to the load imbalance
that may occur when the master becomes overloaded with work responses re-
layed by slaves, dequeuing is repeated |QA | times. This encourages an even load
distribution in the system as the number of slaves fluctuates at runtime.
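The two trials can be mocked up as follows (our own sketch; the message shape {resp, _, _} and the names are hypothetical):

```erlang
-module(turns).
-export([should_send/1, drain/2]).

%% A Bernoulli trial succeeds when X =< Pr, with X uniform on [0,1].
should_send(PrSend) ->
    rand:uniform() =< PrSend.

%% Dequeue buffered work responses while trials succeed, attempting
%% at most QALen dequeues, mirroring the master's receive phase.
drain(_PrRecv, 0) ->
    ok;
drain(PrRecv, QALen) ->
    case rand:uniform() =< PrRecv of
        true ->
            receive {resp, _Slave, _Payload} -> ok
            after 0 -> ok
            end,
            drain(PrRecv, QALen - 1);
        false ->
            ok
    end.
```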
2.2 Realisability
The set-up detailed in sec. 2.1 is easily translatable to the actor model of compu-
tation [2]. In this model, the basic units of decomposition are actors: concurrent
entities that do not share mutable memory with other actors. Instead, they in-
teract via asynchronous messaging. Each actor owns an incoming message buffer
called the mailbox. Besides sending and receiving messages, an actor can also fork
other child actors. Actors are uniquely addressable via a dynamically-assigned
identifier, often referred to as the PID. Actor frameworks such as Erlang [16],
Akka [55] for Scala [51], and Thespian [53] for Python [44] implement actors as
lightweight processes to enable highly-scalable architectures that span multiple
machines. The terms actor and process are used interchangeably henceforth.
Implementation. We use Erlang to implement the set-up of sec. 2.1. Our im-
plementation maps the master and slave processes to actors, where slaves are
On Benchmarking for Concurrent Runtime Verification 9
forked by the master via the Erlang function spawn(); in Akka and Thespian
ActorContext.spawn() and Actor.createActor() can be respectively used to
the same effect. The work request queues for both master and slave processes co-
incide with actor mailboxes. We abstract the task computation and model work
requests as Erlang messages. Slaves emulate no delay, but respond instantly to
work requests once these have been processed; delay in the system can be in-
duced via parameters Pr(send) and Pr(recv). To maximise efficiency, the Order,
Ready and Await queues used by our scheduling scheme are maintained locally
within the master. The master process keeps track of other details, such as the
total number of work requests sent and received, to determine when the system
should stop executing. We extend the parameters in sec. 2.1 with a seed parame-
ter, r, to fix the Erlang pseudorandom number generator to output reproducible
number sequences.
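For concreteness, a toy version of this mapping (ours; the actual tool [4] is considerably more elaborate) fits in a dozen lines, including the constant seed r:

```erlang
-module(mini_ms).
-export([run/1]).

%% Seed the PRNG with a constant r for reproducible runs, fork one
%% slave, allocate a single work request and await its response.
run(Seed) ->
    rand:seed(exsss, {Seed, Seed, Seed}),
    Master = self(),
    Slave = spawn(fun() -> slave_loop(Master) end),
    Slave ! {req, 42},
    receive {resp, 43} -> Slave ! stop, ok end.

%% A slave handles each work request and relays it back incremented,
%% matching the Req + 1 payload convention used later in sec. 3.2.
slave_loop(Master) ->
    receive
        {req, N} -> Master ! {resp, N + 1}, slave_loop(Master);
        stop     -> ok
    end.
```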
Fig. 2: Collector tracking the round-trip time for work requests and responses
3 Evaluation
We evaluate our synthetic benchmarking tool, developed as described in Sec. 2,
in a number of ways. In sec. 3.1, we discuss sanity checks for its measurement
collection mechanisms, and assess the repeatability of the results obtained from
the synthetic system executions. Crucially, sec. 3.1 provides evidence that the
benchmarking tool is sufficiently expressive to cover a number of execution pro-
files that are shown to emulate realistic scenarios. Sec. 3.2 demonstrates the
utility of the features offered by our tool for the purposes of assessing RV tools.
Experiment set-up. We define an experiment to consist of ten benchmarks, each
performed by running the system set-up with incremental loads. Our experiments
were performed on an Intel Core i7 M620 64-bit machine with 8GB of memory,
running Ubuntu 18.04 LTS and Erlang/OTP 22.2.1.
calibrated by taking various window sizes over numerous runs for different load
profiles of ≈ 1M slaves. The results were compared to the actual mean calcu-
lated on all work request and response messages exchanged between master and
slaves. Window sizes close to 10 % yielded the best results (≈ ±1.4% discrep-
ancy from the actual RT). Smaller window sizes produced excessive discrepancy;
larger sizes induced noticeably higher system loads. We also cross-checked the
precision of our sampling method of the scheduler utilisation against readings
obtained via the Erlang Observer tool [16] to confirm that these coincide.
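Independently of the window size chosen, per-window statistics can be maintained in constant space, e.g. via Welford's online update [62]; the sketch below is our own and its names are hypothetical:

```erlang
-module(welford).
-export([new/0, add/2, stats/1]).

%% State: {count, running mean, sum of squared deviations}.
new() -> {0, 0.0, 0.0}.

add(X, {N, Mean, M2}) ->
    N1 = N + 1,
    Delta = X - Mean,
    Mean1 = Mean + Delta / N1,
    {N1, Mean1, M2 + Delta * (X - Mean1)}.

stats({N, Mean, M2}) when N > 1 -> {Mean, M2 / (N - 1)};
stats({_, Mean, _})             -> {Mean, 0.0}.
```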
Experiment repeatability. Data variability affects the repeatability of experi-
ments. It also plays a role when determining the number of repeated readings, k,
required before the data measured is deemed sufficiently representative. Choos-
ing the lowest k is crucial when experiment runs are time consuming. The coef-
ficient of variation (CV)—i.e., the ratio of the standard deviation to the mean,
CV = (σ/x̄) × 100—can be used to establish the value of k empirically, as follows.
Initially, the CVk for one batch of experiments for some number of repetitions k
is calculated. The result is then compared to the CVk′ for the next batch of repe-
titions k′ = k + b, where b is the step size. When the difference between the
successive metrics CVk and CVk′ is sufficiently small (below some percentage ε),
the value of k is chosen; otherwise, the described procedure is repeated with k′.
Crucially, this condition must hold for all variables measured in the experiment before k can
be fixed. For the results presented next, the CV values were calculated manually.
The mechanism that determines the CV automatically is left for future work.
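The manual procedure amounts to the following check (our sketch; ε is in percentage points and the mean is assumed non-zero):

```erlang
-module(cv).
-export([cv/1, converged/3]).

%% Coefficient of variation of one batch of repeated readings.
cv(Xs) ->
    N = length(Xs),
    Mean = lists:sum(Xs) / N,
    Var = lists:sum([(X - Mean) * (X - Mean) || X <- Xs]) / N,
    math:sqrt(Var) / Mean * 100.

%% k suffices when the CVs of batches k and k' = k + b differ by
%% at most Epsilon; this must hold for every measured variable.
converged(BatchK, BatchKPrime, Epsilon) ->
    abs(cv(BatchK) - cv(BatchKPrime)) =< Epsilon.
```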
Data variability. The data variability between experiments can be reduced by
seeding the Erlang pseudorandom number generator (parameter r in sec. 2.2)
with a constant value. This, in turn, tends to require fewer repeated runs be-
fore the metrics of interest—scheduler utilisation, memory consumption, RT,
and execution duration—converge to an acceptable CV. We conduct experiment
sets with three, six and nine repetitions. For the majority of cases, the CV for
our metrics is lower when a fixed seed is used, by comparison to its unseeded
counterpart. In fact, very low CV values for the scheduler utilisation, memory
consumption, RT, and execution duration, 0.17 %, 0.15 %, 0.52 % and 0.47 % re-
spectively, were obtained with three repeated runs. We thus set the number of
repetitions to three for all experiment runs in the sequel. Note that fixing the
seed still permits the system to exhibit a modicum of variability that stems from
the inherent interleaved execution of components due to process scheduling.
Load profiles. Our tool is expressive enough to generate the load profiles intro-
duced in sec. 2.1 (see fig. 3), enabling us to gauge the behaviour of monitoring
set-ups under varying forms of loads. These loads make it possible to mock spe-
cific system scenarios that test different implementation aspects. For example, a
benchmark configured with load surges could uncover buffer overflows in a par-
ticular monitoring implementation that only arise under stress when the length
of the request queue exceeds some preset length.
System reactivity. The reactivity of the master-slave system correlates with the
idle time of each slave which, in turn, affects the capacity of the system to absorb
Fig. 3: Steady, Pulse and Burst load distributions of 500 k slaves for 100 s
overheads. Since this can skew the results obtained when assessing overheads, it is
imperative that the benchmarking tool provides methods to control this aspect.
The parameters Pr(send) and Pr(recv) regulate the speed with which the system
reacts to load. We study how these parameters affect the overall performance of
system models set up with Pr(send) = Pr(recv) ∈ {0.1,0.5,0.9}. The results are
shown in fig. 4, where each metric (e.g. memory consumption) is plotted against
the total number of slaves. At Pr(send)=Pr(recv)=0.1, the system has the lowest
RT out of the three configurations (bottom left), as indicated by the gentle linear
increase of the plot. One may expect the RT to be lower for the system models
configured with probability values of 0.5 and 0.9. However, we recall that with
Pr(send) = 0.1, work requests are allocated infrequently by the master, so that
slaves are often idle, and can readily respond to (low numbers of) incoming work
requests. At the same time, this prolongs the execution duration, when compared
to that of the system set with Pr(send) = Pr(recv) ∈ {0.5,0.9} (bottom right).
This effect of slave idling can be gleaned from the relatively lower scheduler
utilisation as well (top left). Idling increases memory consumption (top right),
since slaves created by the master typically remain alive for extended periods.
By contrast, the plots set with Pr(send)=Pr(recv)∈{0.5,0.9} exhibit markedly
gentler gradients in the memory consumption and execution duration charts;
corresponding linear slopes can be observed in the RT chart. This indicates that
values between 0.5 and 0.9 yield system models that: (i) consume reasonable
amounts of memory, (ii) execute in respectable amounts of time, and (iii) main-
tain tolerable RT. Since master-slave architectures are typically employed in
settings where high throughput is demanded, choosing values smaller than 0.5
goes against this principle. In what follows, we opt for Pr(send)=Pr(recv)=0.9.
[Fig. 4: Scheduler utilisation (%), memory consumption (GB), response time (ms) and execution duration (s) against the total number of slaves (in thousands) for Pr(send) = Pr(recv) ∈ {0.1, 0.5, 0.9}]
establish whether the RT in our system set-ups resembles the aforementioned dis-
tributions. Our results, summarised in fig. 5, were obtained by estimating the pa-
rameters for a set of candidate probability distributions (e.g. normal, log-normal,
gamma, etc.) using maximum likelihood estimation [56] on the RT obtained from
each experiment. We then performed goodness-of-fit tests on these parametrised
distributions using the Kolmogorov-Smirnov test, selecting the most appropriate
RT fit for each of the three experiments. The fitted distributions in fig. 5 indi-
cate that the RT of our system models follows the findings reported in [31,20,37].
This makes a strong case in favour of our benchmarking tool striking a balance
between the realism of benchmarks based on OTS programs and the controlla-
bility offered by synthetic benchmarking. Lastly, we point out that fig. 5 matches
the observations made in fig. 4, which show an increase in the mean RT as the
system becomes more reactive. This is evident in the histogram peaks that grow
shorter as Pr(send) = Pr(recv) progresses from 0.1 to 0.9.
[Fig. 5: Histograms and fitted distributions of the mean response time (ms) for Pr(send) = Pr(recv) ∈ {0.1, 0.5, 0.9}]
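For the log-normal candidate, for instance, the MLE has a closed form: the parameters are the mean and standard deviation of the logged samples. A sketch (ours):

```erlang
-module(fit).
-export([lognormal_mle/1]).

%% Closed-form maximum likelihood estimate of log-normal parameters
%% from a list of positive RT samples.
lognormal_mle(Samples) ->
    Logs = [math:log(X) || X <- Samples, X > 0],
    N = length(Logs),
    Mu = lists:sum(Logs) / N,
    Var = lists:sum([(L - Mu) * (L - Mu) || L <- Logs]) / N,
    {Mu, math:sqrt(Var)}.
```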
The key construct in sHML is the modal formula [p]ϕ, stating that whenever a
satisfying system exhibits an event e matching pattern p, its continuation then
satisfies ϕ. In property ϕs, the invariant—denoted by the recursion binder max X—
asserts that a slave Slv does not crash, specified by sub-formula 1. It further
stipulates in sub-formula 2 that when a request-carrying payload Req is re-
ceived, 2.1 , Slv cannot crash, 3.1 , and if the slave replies to Req with the pay-
load Req + 1, the property recurses on variable X, 3.2 . Action patterns use two
types of value variables: binders, \x , that are pattern-matched to concrete values
learnt at runtime, and variable instances, x , that are bound by the respective
binders and instantiated to concrete data via pattern matching at runtime. This
On Benchmarking for Concurrent Runtime Verification 15
induces the usual notion of free and bound value variables; we assume closed
terms. For example, when checking property ϕs against the trace event pid?42,
the analysis unfolds the sub-formula guarded by max X, matching the event with
the pattern \Slv ? \Req in 2.1 . Variables Slv and Req are substituted with pid
and 42 respectively in property ϕs , leaving the residual formula:
[pid ⚡]ff ∧ [pid ! (42 + 1)] max X.([\Slv ⚡]ff ∧ [\Slv ? \Req]([Slv ⚡]ff ∧ [Slv ! (Req + 1)]X))

where ⚡ stands for a slave crash event.
The RV tool under scrutiny produces inlined monitor code that executes in the
same process space of system components (see fig. 6a), yielding the lowest pos-
sible amount of runtime overhead. This enables us to scale our benchmarks to
considerably high loads. Our experiments focus on correctness properties that
are parametric w.r.t. system components [7,19,54,48]: with this approach,
monitors need not interact with one another and can reach verdicts indepen-
dently. Verdicts are communicated by monitors to a central entity that records
the expected number of verdicts in order to determine when the experiment can
be stopped. The set of properties used in our benchmarks translate to monitors
that loop continually to exert the maximum level of runtime overhead possible.
Fig. 6b shows the monitor synthesised from property ϕs , consisting of states
Q0, Q1, the rejection state ✗, and the inconclusive state ?. The rejection state cor-
responds to a violation of the property, i.e., ff, whereas the inconclusive state
is reached when the analysed trace events do not contain enough information
to enable the monitor to transition to any other state. Both of these states are
sinks, modelling the irrevocability of verdicts [24,26]. The modality [\Slv ? \Req]
in property ϕs corresponds to the transition between Q0 and Q1 in fig. 6b. The
monitor follows this transition when it analyses the trace event pid1 ?d1 exhibited
by the slave with PID pid1 when it receives data payload d1 from the master;
as a side effect, the transition binds the variable Slv to pid1 and Req to d1 in
[Fig. 6: (a) Inlined monitors M executing within the master and slave processes S1, S2, ..., Sn. (b) Monitor synthesised from ϕs with states Q0 and Q1, rejection state ✗ and inconclusive state ?; the transition from Q0 to Q1 on pid1?d1 binds {Slv ↦ pid1, Req ↦ d1}, and the reply pid1!d1 + 1 returns the monitor to Q0.]
state Q1 . From Q1 , the monitor transitions to Q0 only when the event pid1 !d2
is analysed, where d2 = d1 + 1 and pid1 is the slave PID (previously) bound to
Slv. From Q0 and Q1, the rejection state ✗ can be reached when a crash event
is analysed. In the case of Q0, the transition to ✗ is followed for a crash event of
any process _ ⚡ (the wildcard _ denotes the anonymous variable). By contrast, the monitor
reaches ✗ from Q1 only when the slave with PID pid1 crashes; otherwise it tran-
sitions to the inconclusive state ?. Other transitions from Q0 and Q1 leading to
? follow a similar reasoning. Interested readers are encouraged to consult [25,6,5]
for more information on the specification logic and monitor synthesis.
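Purely for illustration (this is our own encoding, not the inlined code synthesised by the tool [4]), the automaton of fig. 6b can be approximated by two tail-recursive Erlang functions over trace events, with event shapes {recv, Slv, Req}, {send, Slv, Resp} and {crash, Pid}, and verdicts no (rejection) and inc (inconclusive):

```erlang
-module(mon_phi_s).
-export([q0/0]).

q0() ->
    receive
        {recv, Slv, Req} -> q1(Slv, Req); %% [\Slv ? \Req]
        {crash, _}       -> no;           %% any crash refutes the property
        _                -> inc
    end.

q1(Slv, Req) ->
    receive
        %% The matching reply Req + 1 from the same slave recurses on X.
        {send, Slv, Resp} when Resp =:= Req + 1 -> q0();
        {crash, Slv}                            -> no;
        _                                       -> inc
    end.
```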
Synthetic Benchmarks We set the total number of slaves to n = 20k for mod-
erate loads and n = 500k for high loads; Pr(send) = Pr(recv) is fixed at 0.9 as in
sec. 3.1. These configurations generate ≈ n × w work requests and as many responses,
i.e., 4M and 100M messages respectively, to produce 8M and 200M analysable trace
events per run. The pseudorandom number generator is seeded with a constant
value and three experiment repetitions are performed for the Steady, Pulse and
Burst load profiles (see fig. 3). A loading time of t=100s is used. Our results are
summarised in figs. 7 and 8. Each chart in these figures plots the particular per-
formance metric (e.g. memory consumption) for the system without monitors,
i.e., the baseline, together with the overhead induced by the RV monitors.
Moderate loads. Fig. 7 shows the plots for the system set with n = 20k. These
loads are similar to those employed by the state-of-the-art frameworks to evalu-
ate component-based runtime monitoring, e.g. [57,7,10,23,48] (ours are slightly
higher). We remark that none of the benchmarks used in these works consider
different load profiles: they either model load on a Poisson process, or fail to
specify the kind of load used. In fig. 7, the execution duration chart (bottom
right) shows that, regardless of the load profile used, the running time of each
experiment is comparable to the baseline. With the moderate size of 20k slaves,
the execution duration on its own does not give a detailed enough view of run-
time overhead, despite the fact that our benchmarks provide a broad coverage in
terms of the Steady, Pulse and Burst load profiles. This trend is mirrored in the
scheduler utilisation plot (top left), where both baseline and monitored system
induce a constant load of ≈ 17.5%. On this account, we deem these results to
be inconclusive. By contrast, our three load profiles induce different overhead
for the RT (bottom left), and, to a lesser extent, the memory consumption plots
(top right). Specifically, when the system is subjected to a Burst load, it exhibits
a surge in the RT for the baseline and monitored system alike, at ≈ 16k slaves.
While this is not reflected in the consumption of memory, the Burst plots do
exhibit a larger—albeit linear—rate of increase in memory when compared to
their Steady and Pulse counterparts. The latter two plots once again show anal-
ogous trends, indicating that both Steady and Pulse loads exact similar memory
requirements and exhibit comparable responsiveness under the respectable load
of 20k slaves. Crucially, the data plots in fig. 7 do not enable us to confidently
extrapolate our results. The edge case in the RT chart for Burst plots raises the
question of whether the surge in the trend observed at ≈ 16k remains consistent
Fig. 7: Mean runtime overhead for master and slave processes (20 k slaves)
when the number of slaves goes beyond 20k. Similarly, although for a different
reason, the execution duration plots do not allow us to distinguish between the
overhead induced by monitors for different loads on this small scale—this occurs
due to the perturbations introduced by the underlying OS (e.g. scheduling other
processes, IO, etc.) that affect the sensitive time keeping of benchmarks.
High loads. We increase the load to n = 500k slaves to determine whether our
benchmark set-up can adequately scale, and show how the monitored system per-
forms under stress. The RT chart in fig. 8 indicates that for Burst loads (bottom
left), the overhead induced by monitors grows linearly in the number of slaves.
This contradicts the results in fig. 7, confirming our supposition that moderate
loads may provide scant empirical evidence to extrapolate to general conclu-
sions. However, the memory consumption for Burst loads (top right) exhibits
similar trends to the ones in fig. 7. Subjecting the system to high loads renders
discernible the discrepancy between the RT and memory consumption gradients
for the Steady and Pulse plots that appeared to be similar under the moderate
loads of 20k slaves. Considering the execution duration chart (bottom right of
fig. 8) as the sole indicator of overhead could misleadingly suggest that runtime
monitoring induces virtually identical overhead for the distinct load profiles of
fig. 3. However, this erroneous observation is easily refuted by the memory con-
sumption and RT plots that show otherwise. This stresses the merit of gathering
multi-faceted metrics to assist in the interpretation of runtime overhead.
We extend the argument for multi-faceted views to the scheduler utilisation
metric in fig. 8 that reveals a subtle aspect of our concurrent set-up. Specifically,
Fig. 8: Mean runtime overhead for master and slave processes (500 k slaves)
the charts show that while the execution duration, RT and memory consumption
plots grow in the number of slave processes, scheduler utilisation stabilises at ≈
22.7%. This is partly caused by the master-slave design that becomes susceptible
to bottlenecks when the master is overloaded with requests [61]. In addition,
the preemptive scheduling of the EVM [16] ensures that the master shares the
computational resources of the same machine with the rest of the slaves. We
conjecture that, in a distributed set-up where the master resides on a dedicated
node, the overall system throughput may be further pushed. Fig. 8 also attests
to the utility of having a benchmarking framework that scales considerably well
to increase the chances of detecting potential trends. For instance, the evidence
gathered earlier in fig. 7 could have misled one to assert that the RV tool under
scrutiny scales poorly under Burst loads of moderate and larger sizes.
Fig. 9: Mean overhead for synthetic and Cowboy benchmarks (20 k threads)
HTTP protocol parsing. We generate load on Cowboy using the popular stress
testing tool JMeter [3] to issue HTTP requests from a dedicated machine resid-
ing on the same network where Cowboy is hosted. The latter machine is the one
used in the experiments discussed earlier. To emulate the typical behaviour of
web clients (e.g. browsers) that fetch resources via multiple HTTP requests, our
Cowboy application serves files of various sizes that are randomly accessed by
JMeter during the benchmark. In our experiments, we monitored fragments of
the Cowboy and Ranch communication protocol used to handle client requests.
Moderate loads. Fig. 9 plots our results for Steady loads from fig. 7, together
with the ones obtained from the Cowboy benchmarks; JMeter did not enable
us to reproduce the Pulse and Burst load profiles. For our Cowboy benchmarks,
we fixed the total number of JMeter request threads to 20k over the span of
100s, where each thread issued 100 HTTP requests. This configuration coincides
with parameter settings used in the experiments of fig. 7. In fig. 9, the sched-
uler utilisation, memory consumption and RT charts (top, bottom left) show
a correspondence between the baseline plots of our synthetic benchmarks and
those taken with Cowboy and JMeter. This indicates that, for these metrics,
our synthetic system model exhibits analogous characteristics to the ones of the
OTS system, under the chosen load profile. The argument can be extended to
the monitored versions of these systems which follow identical trends. We point
out the similarity in the RT trends of our synthetic and Cowboy benchmarks,
despite the fact that the latter set of experiments were conducted over a local
network. This suggests that, for our single-machine configuration, the synthetic
4 Conclusion
Concurrent RV necessitates benchmarking tools that can scale dynamically to
accommodate considerable load sizes, and are able to provide a multi-faceted view
of runtime overhead. This paper presents a benchmarking tool that fulfils these
requirements. We demonstrate its implementability in Erlang, arguing that the
design is easily instantiatable to other actor frameworks such as Akka and Thes-
pian. Our set-up emulates various system models through configurable parame-
ters, and scales to reveal behaviour that emerges only when software is pushed
to its limit. The benchmark harness gathers different performance metrics, offer-
ing a multi-faceted view of runtime overhead that, to the best of our knowledge,
other state-of-the-art tools do not currently offer. Our experiments demonstrate
that these metrics benefit the interpretation of empirical measurements: they
increase visibility and may spare one from drawing insufficiently general or even erroneous
conclusions. We establish that—despite its synthetic nature—our master-slave
model faithfully approximates the mean response times observed in realistic web
server traffic. We also compare the results of our synthetic benchmarks against
those obtained from a real-world use case to confirm that our tool captures the
behaviour of this realistic set-up. It is worth noting that, while our empirical
measurements of secs. 3.1 and 3.2 depend on the implementation language, our
conclusions are transferable to other frameworks, e.g. Akka and Play [42].
Related work. There are other less popular benchmarks targeting the JVM be-
sides those mentioned in sec. 1. Renaissance [52] employs workloads that leverage
the concurrency primitives of the JVM, focussing on the performance of com-
piler optimisations similar to DaCapo and ScalaBench. These benchmarks gather
metrics that measure software quality and complexity, as opposed to metrics that
gauge runtime overhead. The CRV suite [8] aims to standardise the evaluation
of RV tools, and mainly focusses on RV for monolithic programs. We are un-
aware of RV-centric benchmarks for concurrent systems such as ours. In [43], the
authors propose a queueing model to analyse web server traffic, and develop a
benchmarking tool to validate it. Their model coincides with our master-slave
set-up, and considers loads based on a Poisson process. A study of message-
passing communication on parallel computers conducted in [31] uses systems
loaded with different numbers of processes; this is similar to our approach. Im-
portantly, we were able to confirm the findings reported in [43] and [31] (sec. 3.1).
References
1. Aceto, L., Achilleos, A., Francalanza, A., Ingólfsdóttir, A., Kjartansson, S.Ö.: De-
terminizing Monitors for HML with Recursion. JLAMP 111, 100515 (2020)
2. Agha, G., Mason, I.A., Smith, S.F., Talcott, C.L.: A Foundation for Actor Com-
putation. JFP 7(1), 1–72 (1997)
3. Apache Software Foundation: JMeter (2020), https://jmeter.apache.org
4. Attard, D.P.: detectEr (2020), https://github.com/duncanatt/detecter-inline
5. Attard, D.P., Cassar, I., Francalanza, A., Aceto, L., Ingólfsdóttir, A.: Introduction
to Runtime Verification. In: Behavioural Types: from Theory to Tools, pp. 49–76.
Automation, Control and Robotics, River (2017)
6. Attard, D.P., Francalanza, A.: A Monitoring Tool for a Branching-Time Logic. In:
RV. LNCS, vol. 10012, pp. 473–481 (2016)
7. Attard, D.P., Francalanza, A.: Trace Partitioning and Local Monitoring for Asyn-
chronous Components. In: SEFM. LNCS, vol. 10469, pp. 219–235 (2017)
8. Bartocci, E., Falcone, Y., Bonakdarpour, B., Colombo, C., Decker, N., Havelund,
K., Joshi, Y., Klaedtke, F., Milewicz, R., Reger, G., Rosu, G., Signoles, J., Thoma,
D., Zalinescu, E., Zhang, Y.: First International Competition on Runtime Verifi-
cation: Rules, Benchmarks, Tools, and Final Results of CRV 2014. Int. J. Softw.
Tools Technol. Transf. 21(1), 31–70 (2019)
9. Bartocci, E., Falcone, Y., Francalanza, A., Reger, G.: Introduction to Runtime
Verification. In: Lectures on RV, LNCS, vol. 10457, pp. 1–33. Springer (2018)
10. Berkovich, S., Bonakdarpour, B., Fischmeister, S.: Runtime Verification with Min-
imal Intrusion through Parallelism. FMSD 46(3), 317–348 (2015)
11. Blackburn, S.M., Garner, R., Hoffmann, C., Khan, A.M., McKinley, K.S., Bentzur,
R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., Hirzel, M., Hosking, A.L.,
Jump, M., Lee, H.B., Moss, J.E.B., Phansalkar, A., Stefanovic, D., VanDrunen, T.,
von Dincklage, D., Wiedermann, B.: The DaCapo Benchmarks: Java Benchmarking
Development and Analysis. In: OOPSLA. pp. 169–190 (2006)
12. Blessing, S., Fernandez-Reyes, K., Yang, A.M., Drossopoulou, S., Wrigstad,
T.: Run, Actor, Run: Towards Cross-Actor Language Benchmarking. In:
AGERE!@SPLASH. pp. 41–50 (2019)
13. Bodden, E., Hendren, L.J., Lam, P., Lhoták, O., Naeem, N.A.: Collaborative Run-
time Verification with Tracematches. J. Log. Comput. 20(3), 707–723 (2010)
14. Bonakdarpour, B., Finkbeiner, B.: The Complexity of Monitoring Hyperproperties.
In: CSF. pp. 162–174 (2018)
15. Buyya, R., Broberg, J., Goscinski, A.M.: Cloud Computing: Principles and
Paradigms. Wiley-Blackwell (2011)
16. Cesarini, F., Thompson, S.: Erlang Programming: A Concurrent Approach to Soft-
ware Development. O’Reilly Media (2009)
17. Chappell, D.: Enterprise Service Bus: Theory in Practice. O’Reilly Media (2004)
18. Chen, F., Rosu, G.: MOP: An Efficient and Generic Runtime Verification Frame-
work. In: OOPSLA. pp. 569–588 (2007)
19. Chen, F., Rosu, G.: Parametric Trace Slicing and Monitoring. In: TACAS. LNCS,
vol. 5505, pp. 246–261 (2009)
20. Ciemiewicz, D.M.: What Do You Mean? - Revisiting Statistics for Web Response
Time Measurements. In: CMG. pp. 385–396 (2001)
21. Cornejo, O., Briola, D., Micucci, D., Mariani, L.: In the Field Monitoring of Inter-
active Application. In: ICSE-NIER. pp. 55–58 (2017)
22 L. Aceto et al.
22. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clus-
ters. Commun. ACM 51(1), 107–113 (2008)
23. El-Hokayem, A., Falcone, Y.: Monitoring Decentralized Specifications. In: ISSTA.
pp. 125–135 (2017)
24. Francalanza, A.: A Theory of Monitors (Extended Abstract). In: FoSSaCS. LNCS,
vol. 9634, pp. 145–161 (2016)
25. Francalanza, A., Aceto, L., Achilleos, A., Attard, D.P., Cassar, I., Della Monica,
D., Ingólfsdóttir, A.: A Foundation for Runtime Monitoring. In: RV. LNCS, vol.
10548, pp. 8–29 (2017)
26. Francalanza, A., Aceto, L., Ingólfsdóttir, A.: Monitorability for the Hennessy-
Milner Logic with Recursion. FMSD 51(1), 87–116 (2017)
27. Francalanza, A., Pérez, J.A., Sánchez, C.: Runtime Verification for Decentralised
and Distributed Systems. In: Lectures on RV, LNCS, vol. 10457, pp. 176–210.
Springer (2018)
28. Francalanza, A., Xuereb, J.: On Implementing Symbolic Controllability. In: CO-
ORDINATION. LNCS, vol. 12134, pp. 350–369 (2020)
29. Ghosh, S.: Distributed Systems: An Algorithmic Approach. CRC (2014)
30. Gray, J.: The Benchmark Handbook for Database and Transaction Processing Sys-
tems. Morgan Kaufmann (1993)
31. Grove, D.A., Coddington, P.D.: Analytical Models of Probability Distributions
for MPI Point-to-Point Communication Times on Distributed Memory Parallel
Computers. In: ICA3PP. LNCS, vol. 3719, pp. 406–415 (2005)
32. Harman, M., O’Hearn, P.W.: From Start-ups to Scale-ups: Opportunities and Open
Problems for Static and Dynamic Program Analysis. In: SCAM. pp. 1–23 (2018)
33. Hoguin, L.: Cowboy (2020), https://ninenines.eu
34. Hoguin, L.: Ranch (2020), https://ninenines.eu
35. Imam, S.M., Sarkar, V.: Savina - An Actor Benchmark Suite: Enabling Empirical
Evaluation of Actor Libraries. In: AGERE!@SPLASH. pp. 67–80 (2014)
36. Jin, D., Meredith, P.O., Lee, C., Rosu, G.: JavaMOP: Efficient Parametric Runtime
Monitoring Framework. In: ICSE. pp. 1427–1430 (2012)
37. Kayser, B.: What is the expected distribution of website response times?
(2017, last accessed, 19th Jan 2021), https://blog.newrelic.com/engineering/
expected-distributions-website-response-times
38. Kim, M., Viswanathan, M., Kannan, S., Lee, I., Sokolsky, O.: Java-mac: A Run-
Time Assurance Approach for Java Programs. FMSD 24(2), 129–155 (2004)
39. Kshemkalyani, A.D.: Distributed Computing: Principles, Algorithms, and Systems.
Cambridge University Press (2011)
40. Kuhtz, L., Finkbeiner, B.: LTL Path Checking is Efficiently Parallelizable. In:
ICALP (2). LNCS, vol. 5556, pp. 235–246 (2009)
41. Larsen, K.G.: Proof Systems for Satisfiability in Hennessy-Milner Logic with Re-
cursion. TCS 72(2&3), 265–288 (1990)
42. Lightbend: Play framework (2020), https://www.playframework.com
43. Liu, Z., Niclausse, N., Jalpa-Villanueva, C.: Traffic Model and Performance Eval-
uation of Web Servers. Perform. Evaluation 46(2-3), 77–100 (2001)
44. Matthes, E.: Python Crash Course: A Hands-On, Project-Based Introduction to
Programming. No Starch Press (2019)
45. Meredith, P.O., Jin, D., Griffith, D., Chen, F., Rosu, G.: An Overview of the MOP
Runtime Verification Framework. STTT 14(3), 249–289 (2012)
46. Myers, G.J., Sandler, C., Badgett, T.: The Art of Software Testing. Wiley (2011)
On Benchmarking for Concurrent Runtime Verification 23
47. Navabpour, S., Joshi, Y., Wu, C.W.W., Berkovich, S., Medhat, R., Bonakdarpour,
B., Fischmeister, S.: RiTHM: A Tool for Enabling Time-Triggered Runtime Veri-
fication for C Programs. In: ESEC/SIGSOFT FSE. pp. 603–606. ACM (2013)
48. Neykova, R., Yoshida, N.: Let it Recover: Multiparty Protocol-Induced Recovery.
In: CC. pp. 98–108 (2017)
49. Niclausse, N.: Tsung (2017), http://tsung.erlang-projects.org
50. Nielsen, J.: Usability Engineering. Morgan Kaufmann (1993)
51. Odersky, M., Spoon, L., Venners, B.: Programming in Scala. Artima Inc. (2020)
52. Prokopec, A., Rosà, A., Leopoldseder, D., Duboscq, G., Tuma, P., Studener, M.,
Bulej, L., Zheng, Y., Villazón, A., Simon, D., Würthinger, T., Binder, W.: Renais-
sance: Benchmarking Suite for Parallel Applications on the JVM. In: PLDI. pp.
31–47 (2019)
53. Quick, K.: Thespian (2020), http://thespianpy.com
54. Reger, G., Cruz, H.C., Rydeheard, D.E.: MarQ: Monitoring at Runtime with QEA.
In: TACAS. LNCS, vol. 9035, pp. 596–610 (2015)
55. Roestenburg, R., Bakker, R., Williams, R.: Akka in Action. Manning (2015)
56. Rossi, R.J.: Mathematical Statistics: An Introduction to Likelihood Based Infer-
ence. Wiley (2018)
57. Scheffel, T., Schmitz, M.: Three-Valued Asynchronous Distributed Runtime Veri-
fication. In: MEMOCODE. pp. 52–61 (2014)
58. Seow, S.C.: Designing and Engineering Time: The Psychology of Time Perception
in Software. Addison-Wesley (2008)
59. Sewe, A., Mezini, M., Sarimbekov, A., Binder, W.: DaCapo con Scala: design and
analysis of a Scala benchmark suite for the JVM. In: OOPSLA. pp. 657–676 (2011)
60. SPEC: SPECjvm2008 (2008), https://www.spec.org/jvm2008
61. Tarkoma, S.: Overlay Networks: Toward Information Networking. Auerbach (2010)
62. Welford, B.P.: Note on a Method for Calculating Corrected Sums of Squares and
Products. Technometrics 4(3), 419–420 (1962)
63. White, T.: Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale.
O’Reilly Media (2015)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/
4.0/), which permits use, sharing, adaptation, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
Certified Abstract Cost Analysis

© The Author(s) 2021
E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 24–45, 2021.
https://doi.org/10.1007/978-3-030-71500-7_2
1 Introduction
We present a generalization of automated cost analysis that can handle pro-
grams containing placeholders for unspecified statements. Consider the program
Q ≡ “i = 0; while (i < t) {P; i++;}”, where P is any statement not modifying
i or t. We call P an abstract statement; a program like Q containing abstract
statements is called abstract program. The (exact or upper bound) cost of execut-
ing P is described by a function acP (x) depending on the variables x occurring
in P. We call this function the abstract cost of P. Assuming that executing any
statement has unit cost and that t ≥ 0, one can compute the (abstract) cost of
Q as 2 + t · (acP (x) + 2) depending on acP and t. For any concrete instance of P,
we can derive its concrete cost as usual and then obtain the concrete cost of Q
simply by instantiating acP . In this paper, we define and implement an abstract
cost analysis to infer abstract cost bounds. Our implementation consists of an
automatic abstract cost analysis tool and an automatic certifier for the correct-
ness of inferred abstract bounds. Both steps are performed with an approach
called Quantitative Abstract Execution (QAE).
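To make the bound above concrete, here is a minimal Python sketch (our own illustration, not part of the QAE toolchain) that evaluates the abstract cost of Q and one concrete instantiation:

def abstract_cost_Q(t, ac_P):
    # Unit-cost model: "i = 0" plus the final guard check account for 2;
    # each of the t iterations pays ac_P plus one guard check and "i++".
    return 2 + t * (ac_P + 2)

# Instantiating P as "x = t + 1;" gives the constant abstract cost ac_P = 1:
assert abstract_cost_Q(5, 1) == 17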
Fine, but what is this good for? Abstract programs occur in program trans-
formation rules used in compilation, optimization, parallelization, refactoring,
etc.: Transformations are specified as rules over program schemata which are
nothing but abstract programs. If we can perform cost analysis of abstract pro-
grams, we can analyze the cost effect of program transformations. Our approach
is the first method to analyze the cost impact of program transformations. It may seem
counterintuitive that this is possible: after all, nothing is known about an ab-
stract symbol. But this is not quite true: one can equip an abstract symbol with
an abstract description of the behavior of its instances: a set of memory loca-
tions its behavior may depend on, commonly called footprint, and a (possibly
different) set of memory locations it can change, commonly called frame [21].
Cost Invariants. In automated cost analysis, one often infers cost bounds from
loop invariants, ranking functions, and size relations computed during symbolic
execution (SE) [3, 11, 16, 40]. For abstract programs, we need a more general concept, namely a loop
invariant expressing a valid abstract cost bound at the beginning of any iteration
(e.g., 2 + i · (acP(x) + 2) for the program Q above). We call this a cost invariant.
This is an important technical innovation of this paper, increasing the modularity
of cost analysis, because each loop can be verified and certified separately.
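The invariant can be checked against a simulated run; a small illustrative sketch (unit-cost model, with the concrete values of ac_P and t assumed):

ac_P, t = 3, 4        # assumed: a concrete instance of P with cost 3
cost, i = 1, 0        # "i = 0" has executed: cost 1
cost += 1             # first guard evaluation
assert cost == 2 + i * (ac_P + 2)       # cost invariant at entry
while i < t:
    cost += ac_P + 1  # loop body: P, then "i++"
    i += 1
    cost += 1         # guard re-evaluation
    assert cost == 2 + i * (ac_P + 2)   # invariant at each iteration
assert cost == 2 + t * (ac_P + 2)       # the abstract cost bound of Q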
Certification. Cost annotations inferred by abstract cost analysis, i.e., cost in-
variants and abstract cost bounds, are automatically certified by a deductive ver-
ification system, extending the approach reported in [4] to abstract cost and ab-
stract programs. This is possible because the specification (i.e., the cost bound)
and the loop (cost) invariants are inferred by the cost analyzer—the verification
system does not need to generate them.
Arguing the correctness of an abstract cost analysis is complex, because the analysis must
be valid for an infinite set of concrete programs. For this reason alone, it is
useful to certify the abstract cost inferred for a given abstract program: during
development of the abstract cost analysis reported here, several errors in abstract
cost computation were detected—analysis of the failed verification attempt gave
immediate feedback on the cause. We built a test suite of problems so that any
change in the cost analyzer can be validated in the future.
Certification is crucial for the correctness of quantitative relational prop-
erties: The inferred cost invariants might not be precise enough to establish,
e.g., that a program transformation does not increase cost for any possible pro-
gram instance and run. This is only established at the certification stage, where
relational properties are formally verified. A relational setting requires provably
precise cost bounds. This feature is not offered by existing cost analysis methods.
2 QAE by Example
We introduce our approach and terminology informally by means of a motivat-
ing example: Code Motion [1] is a compiler optimization technique moving a
statement not affected by a loop from the beginning of the loop body to before
the loop. This code transformation should preserve behavior provided the loop
is executed at least once, but can be expected to improve computation effort,
i.e. quantitative properties of the program, such as execution time and memory
for the certifier. As usual, loop invariants (keyword “loop invariant”) are needed
to describe the behavior of loops with symbolic bounds. The loop invariant in
Fig. 1 allows inferring the final value t of loop counter i after loop termination.
To prove termination, the loop variant (keyword “decreases”) is inferred.
So far, this is standard automated cost analysis [3]. The ability to infer
automatically the remaining annotations represents our main contribution: Each
AS P has an associated abstract cost function parametric in the locations of its
footprint, represented by an abstract cost symbol acP. The symbol acP(t, w) in
the “assert” statement in Fig. 1 can be instantiated with any concrete function
parametric in t, w being a valid cost bound for the instance of P. For example,
for the instantiation “P ≡ x=t+1;” the constant function acP (t, w) = 1 is the
correct exact cost, while acP (t, w) = t with t ≥ 1 is a correct upper bound cost.
As pointed out in Sect. 1 we require cost invariants to capture the cost of each
loop iteration. They are declared by the keyword “cost invariant”. To generate
them, it is necessary to infer for each loop of an abstract program a growth function
that bounds the number of loop iterations executed so far. In Sect. 4 we describe the
automated inference of cost invariants, including the generation of growth for all loops.
Our technique is compositional and also works in the presence of nested loops.
The QAE framework can express and prove quantitative relational properties.
The assertions in the last lines in Fig. 1 use the expression \cost referring to the
total accumulated cost of the program, i.e., the quantitative postcondition. We
support quantitative relational postconditions such as \cost₁ ≥ \cost₂, where
\cost₁, \cost₂ refer to the total cost of the original (on the left) and trans-
formed (on the right) program, respectively. To prove relational properties, one
must be able to deduce exact cost invariants for loops such that the comparison
of the invariants allows concluding that the programs from which the invariants
are obtained fulfill the proven relational property. Otherwise, over-approximation
introduced by cost analysis could make the relation for the postconditions hold,
while the relational property does not necessarily hold for the programs.
To obtain a formal account of QAE with correctness guarantees we require a
mathematically rigorous semantic foundation of abstract cost. This is provided
in the following section.
3 (Quantitative) Abstract Execution
Abstract Execution [37, 38] extends symbolic execution by permitting abstract
statements to occur in programs. Thus AE reasons about an infinite set of
concrete programs. An abstract program contains at least one AS. The semantics
of an AS is given by the set of concrete programs it represents, its set of legal
instances. To simplify presentation, we only consider normally completing Java
code as instances: an instance may not throw an exception, break from a loop,
etc. Each AS has an identifier and a specification consisting of its frame and
footprint. Semantically, instances of an AS with identifier P may at most write
to memory locations specified in P’s frame and may only read the values of
locations in its footprint. All occurrences of an AS with the same identifier
symbol have the same legal instances (possibly modulo renaming of variables,
if variable names in frame and footprint specifications differ). For example, by
to the program logic and calculus underlying AE; (ii) translate non-functional
(cost) properties to functional ones. We opt for the second, as it is less prone to
introduce soundness issues stemming from the addition of new concepts to the
existing framework. It is also faster to realize and allows early testing.
The translation consists of three elements: (a) A global “ghost” variable
“cost” (representing keyword “\cost”) for tracking accumulated cost; (b) explicit
encoding of a chosen cost model by suitable ghost setter methods that update this
variable; (c) functional loop invariants and method postconditions expressing
cost invariants and cost postconditions.
Regarding item (c), we support three kinds of cost specification. These are,
descending in the order of their strength: exact, upper bound, and asymptotic
cost. At the analysis stage, it is usually impossible to determine the best match.
For this reason, there is merely one cost invariant keyword, not three. However,
when translating cost to functional properties, a decision has to be made. A
natural strategy is to start with the strongest kind of specification, then proceed
towards the weaker ones when a proof fails.
An exact cost invariant has the shape “cost == expr”; an upper-bound cost
invariant is specified by “cost <= expr”; asymptotic cost is ex-
pressed by the idiom “asymptotic(cost) <= asymptotic(expr)”. The function
“asymptotic” abstracts from constant symbols in the argument. For example,
the (exact) cost postcondition of the abstract program on the right in Fig. 1 is:
cost == 2 + acP(t, w) + t · (acQ(t, z) + 2)    (†)
Asymptotic cost would be expressed as asymptotic(cost) <= asymptotic(2 +
acP(t, w) + t · (acQ(t, z) + 2)), where the right-hand side is equiv-
alent to asymptotic(acP(t, w) + t · acQ(t, z)).
Listing 2 shows the result of translating the cost invariant in Fig. 1 to a
functional loop invariant (highlighted lines), using cost model Minstr in ghost
setters and postconditions of AS (“ensures” clauses). ASs P, Q must include
the ghost variable “cost” in their frame, because they update its value. The
keyword \before in the postcondition of an AS refers to the value a variable
had just before executing the AS. In loops we use “inner” cost variables “iCost”
tracking the cost inside the loop. When the loop terminates, we add the final
value of “iCost” to “cost”. After every evaluation of the guard of the loop, the
cost is incremented accordingly. Using the translation in Listing 2 of the inferred
annotations in Fig. 1, the AE system proves cost postcondition (†) automatically.
Size relations. We assume that for each loop sets of size constraints have been
computed. These sets capture the size relation among the variables in the loop
upon exit (called base case, denoted ϕB ), and when moving from one iteration to
the next (denoted ϕI ). ASs are ignored by the size analysis. While this would be
where:
(1) CB ≥ 0 is the cost of exiting the loop (executing the base case) w.r.t. M.
(2) Each acj(·) ≥ 0 represents the abstract cost of the abstract statement Aj
in L w.r.t. M. Each acj is parameterized with the variables in the cost
footprint of the corresponding Aj, as it may depend on any of them.
(3) Each CNi ≥ 0 is the cost of the non-abstract statement Ni w.r.t. M.
(4) C is a recursive call.
(5) x′ are the variables x renamed after executing the loop.
(6) The assignable variables wj,∗ in the acj get an unknown value in x′ (denoted
by “_” in the examples below).
Ignoring the abstract statements, one can apply a complete algorithm for cost re-
lation systems [6] to an ACRS to automatically obtain a linear³ ranking function
f for loop L: f is a linear, non-negative function over x that decreases strictly
at every loop iteration. Function f directly yields the “//@ decreases f;” anno-
tation required for QAE.
As in Sect. 3, the definition of ACRS assumes a generic cost model M and
uses C to refer in a generic way to cost according to M. For example, to infer
the number of executed steps, C is set to 1 per instruction, while for memory
usage C records the amount of memory allocated by an instruction.
² For complex data structures, one would need heap analyses [35] to infer size relations.
³ There exist (more expensive) algorithms to obtain also polynomial ranking functions [5], but for the sake of efficiency we do not use them in our system.
General Case of ACRS. The definition of ACRS was simplified for presenta-
tion. The following generalizations, not requiring any new concept, are possible:
(1) We assume an ACRS for a loop has only two equations, one for the base case
(the guard G does not hold) and one for the iterative case (G holds). In general,
there might be more than one equation for the base case, e.g., if the guard in-
volves multiple conditions and the cost varies depending on the condition that
holds on the exit. Similarly, there might be multiple equations in the iterative
case, e.g., if the loop body contains conditional statements and each iteration
has different cost depending on the taken branch. This issue is orthogonal to
the extension to abstract cost. (2) A loop might contain method calls that in
turn contain ASs. In the absence of recursion, such calls can be inlined. For recur-
sive methods, it is possible to compute the call graph and solve the equations
in reverse topological order such that the abstract cost of the (inner) method
calls is obtained first and then inserted into the surrounding equations. (3) The
cost of code fragments not part of any loop (before, after, and in between loops)
is defined as well by abstract cost equations accumulating the cost of all in-
structions these fragments include, just as for concrete programs. This aspect
does not require changes to the framework for concrete programs, so we do not
formalize it, but just illustrate it in the next example.
Example 3. The ACRSs of the programs in Fig. 1 are (left program above line,
right program below):
Notation c refers to the generic cost that can be instantiated to a chosen cost
model M. Cost equation Cbefore for the first program is composed of the cost cbefore
of the instructions appearing before the loop plus the cost of executing the while loop
Cw0. The size constraint fixes the initial value of i. Following Def. 3, there are two
equations corresponding to the base case of the loop and executing one iteration,
respectively. Observe that assignable variables in ASs have unknown values in
the ACRS (according to item (6) in Def. 3). Program after has a similar struc-
ture. A ranking function for both loops is t − i, which is used to generate the
annotation “//@ decreases t−i;” inserted just before each loop in Fig. 1.
assignable variables in an AS, then the ACRS will not be solvable (i.e., the analy-
sis returns “unbound cost”). The ACRS in the example contains “ ” in equations
that do not prevent solvability of the system nor its evaluation, because they
do not interfere with cost. However, if we had “forgotten” a cost-relevant vari-
able (such as t), we would be unable to solve or evaluate the equations: without
knowing t, the equation guard cannot be evaluated. Requirement (ii) is ensured by the
following definition, which guarantees that variables in the cost footprint are not modified
by other statements in the loop.
Definition 4 (Cost neutral AS). Given a loop L, where
– W (L) is the set of variables written by the non-abstract statements of L.
– Abstr(L) is the set of all ASs in loop L.
– Frame(Abstr(L)) is the set of variables assigned by any AS A ∈ Abstr(L).
– CostFootprint(A) is the set of variables which the cost of A depends on.
L is a loop with cost neutral ASs if, for all A ∈ Abstr(L), it is the case that
(W (L) ∪ Frame(Abstr(L))) ∩ CostFootprint(A) = ∅.
The definition above constitutes a sufficient, but not necessary criterion that
could be tightened by a more expensive analysis. For instance, our framework
easily extends to allow conditions in the cost footprint that the concretizations
of the AS must fulfill. In our example, the cost footprint might include the condition
i′ ≥ i, where i′ is the value of i after executing the AS. This permits the abstract
statement to modify i provided it does not decrease its value. Thus, the AS is
not cost neutral, but the upper bound remains sound. The formalization of this
generalization is left to future work.
Example 4. It is easy to check that both loops in Fig. 1 have cost neutral ASs. On
the left: W (L) = {i}, Frame({P, Q}) = {x, y}, CostFootprint(P ) = {t, w}, and
CostFootprint (Q) = {t, z}, so (W (L) ∪ Frame({P, Q})) ∩ CostFootprint(P ) = ∅,
and (W (L) ∪ Frame({P, Q})) ∩ CostFootprint (Q) = ∅. The program on the right
is checked analogously.
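The check of Example 4 is mechanical; a minimal Python sketch (our own illustration, not part of the tool) could look as follows:

def cost_neutral(W_L, frame, footprint):
    # W_L: variables written by non-abstract statements of loop L;
    # frame/footprint: per-AS sets of assigned/cost-relevant variables.
    written = set(W_L).union(*frame.values())
    return all(written.isdisjoint(fp) for fp in footprint.values())

# The left loop of Fig. 1, exactly as in Example 4:
assert cost_neutral({"i"},
                    {"P": {"x"}, "Q": {"y"}},
                    {"P": {"t", "w"}, "Q": {"t", "z"}})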
Given a program P with variables x and an ACRS with initial equation Cini(x),
we denote by eval(Cini(x), σ0) the evaluation of the ACRS for a given initial
assignment σ0 of the variables. This is a standard evaluation of recurrence equa-
tions performed by instantiating the right-hand side of the equations with the
values of the variables in σ0 and checking the satisfiability of the size constraints
(if the expression being checked or accumulated contains “ ”, the evaluation re-
turns “unbound”). As usual, the process is repeated until an equation without
calls is reached.
Example 5. Consider the ACRS of the left program in Fig. 1 with variables
(t, x, w, y, z), initial state σ0 = (2, 0, 0, 0, 0), and cost model Minst (thus cbefore ,
cBw0 and cw0 take values 1, 1 and 2 respectively). The evaluation of the ACRS
results in eval(Cini (t, x, w, y, z), (2, 0, 0, 0, 0)) = 6 + 2 · acP (2, 0) + 2 · acQ (2, 0).
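This evaluation can be replayed symbolically; a hedged sketch (the equation shapes are our reading of Def. 3 together with Example 5's constants cbefore = 1, cBw0 = 1, cw0 = 2):

from sympy import symbols, expand

acP, acQ = symbols("acP acQ")      # stand-ins for acP(2, _) and acQ(2, _)

def C_w0(i, t):
    if i >= t:                     # base case costs cBw0 = 1
        return 1
    return 2 + acP + acQ + C_w0(i + 1, t)   # iteration: cw0 + acP + acQ

C_ini = 1 + C_w0(0, 2)             # cbefore = 1 accounts for "i = 0"
assert expand(C_ini) == 6 + 2*acP + 2*acQ  # matches Example 5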
The following theorem states soundness of the ACRS obtained by applying Def. 3
provided that all loops satisfy Def. 4.
Example 6. We look at four simple loops with their ranking function (“decreases”)
and the growth inferred automatically by applying Def. 5:
We can now define the concept of an abstract cost invariant (ACI), which relies on
the abstract cost relations defined in Sect. 4.1 and the growth defined above.
Definition 6 (Abstract Cost Invariant). Given an ACRS as in Def. 3
and its growth as in Def. 5, an abstract cost invariant is defined as follows:

    cinv(x) = CB^max + growth · ( ∑_{j=1}^{n} acj(cj,1, . . . , cj,hcj) + ∑_{i=1}^{m} CNi^max )

where CB^max stands for the maximal value that the expression CB can take under the
constraints ϕB, and CNi^max for the maximal value of CNi under ϕI. We generate the
annotation “//@ cost invariant cinv(x);”.
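As a plausibility check (this instantiation is our own reading of Def. 6), consider the loop of Q from Sect. 1: taking CB^max = 2 for the cost accrued outside the loop body (initialization plus a guard evaluation), growth = i, the single abstract statement contributing acP(x), and the non-abstract statements of the body (guard and increment) contributing 2, Def. 6 yields

    cinv(x) = 2 + i · (acP(x) + 2),

which coincides with the cost invariant stated for Q in Sect. 1.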
5 Experimental Evaluation
We implemented a prototype of our approach, downloadable from
https://tinyurl.com/qae-impl (including required libraries). The archive contains the bench-
marks of this section and additional examples as well as build and usage instruc-
tions. The prototype is a command-line implementation backed by an existing
cost analysis library for (non-abstract) Java bytecode as well as the deductive
verification system KeY [2] including the AE framework [37, 38]. Our implemen-
tation consists of three components: (1) An extension of a cost analyzer (written
in Python) to handle abstract Java programs, (2) a conversion tool (written
in Java) translating the output of the analyzer to a set of input files for KeY,
(3) a bash script orchestrating the whole tool chain, specifically, the interplay
between item (1), item (2) and the two libraries. In case of a failed certification
attempt, our script offers the choice to open the generated proof in KeY for fur-
ther debugging. In total, our implementation (excluding the libraries) consists
of 1,802 lines of Python, 703 lines of Java, and 389 lines of bash code (without
blank lines and comments).
To assess effectiveness and efficiency of our approach, we used our QAE im-
plementation to analyze seven typical code optimization rules using cost models
Minstr (rows “1∗”–“6∗” in Table 1) and Mheap (rows “7∗”). While Minstr counts
the number of instructions, Mheap measures heap consumption. The first column
identifies the benchmark (“a” refers to the original program, “b” to the trans-
formed one), the second column (P) gives the kind of proven cost result (asymptotic
“a”, exact “e”, upper “u”), column three shows the inferred growth function for
each loop in the program (separated by “,” if there are two or more loops), in
the fourth column we list the cost postcondition obtained by the analysis (ex-
pressions indicating the number of loop iterations are highlighted), and columns
five to eight display performance metrics. Time tcost , given in milliseconds, is
the time needed to perform the cost analysis. The proof generation time tproof
is given in seconds. We also display the time tcheck needed for checking integrity
of an already generated proof certificate. Finally, sproof is the size of the gener-
ated KeY proof in terms of number of proof steps. Even though the time needed
for certification is significantly higher than for cost analysis (which is to be ex-
pected), each analysis can be performed within one minute. The time to check
a proof certificate amounts to approximately one fourth to one third of the time
needed to generate it. We stress that all analyses are fully automatic.
6 Related Work
The present paper builds on the original AE framework [37,38], which we extend
to Quantitative AE. At the moment no other approach or tool is able to analyze
and certify the cost of schematic programs, specifically relational properties, so
a direct comparison is impossible.
Cost Analysis. There are many resource analysis tools, including: [20], based
on introducing counters and inferring loop invariants; [23], based on an analysis
over the depth of functional programs formalized by means of type systems.
Approaches that bound the number of execution steps include [19,29], working at
the level of compilers. Systems such as AProVE [17] analyze the complexity of
Java programs by transforming them to integer transition systems; COSTA [3]
and CoFloCo [16] are based on the generation of cost recurrence equations
from which upper bounds can be inferred. That is also the basis of the approach
we pursue to infer abstract upper bounds in Sect. 4.1, hence our technique can be
viewed as a generalization of these systems. Approaches based on type systems
could also be generalized to work on abstract programs by introducing abstract
cost as in Sect. 4.1.
For our work it is crucial to use ranking functions to infer growth of cost
invariants. Ranking functions were used to generate bounds on the number of
loop iterations in several systems, but none used them to define growth: [10]
obtain runtime complexity bounds via symbolic representation from ranking
functions, likewise PUBS [3], Loopus [40], and ABC [8]. PUBS analyses all
loop transitions at once, Loopus uses an iterative procedure where bounds are
propagated from inner to outer loops, ABC deals with nested, but not sequential
loops. In our work, when inferring upper bounds, we solve all transitions at once
and handle nested as well as sequential loops.
References
1. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Tech-
niques, and Tools. Addison-Wesley, 1986.
2. Wolfgang Ahrendt, Bernhard Beckert, Richard Bubel, Reiner Hähnle, Peter H.
Schmitt, and Mattias Ulbrich, editors. Deductive Software Verification - The KeY
Book - From Theory to Practice, volume 10001 of LNCS. Springer, 2016.
3. Elvira Albert, Puri Arenas, Samir Genaim, German Puebla, and Damiano Zanar-
dini. Cost analysis of object-oriented bytecode programs. Theor. Comput. Sci.,
413(1):142–159, 2012.
4. Elvira Albert, Richard Bubel, Samir Genaim, Reiner Hähnle, Germán Puebla, and
Guillermo Román-Díez. A formal verification framework for static analysis - as
well as its instantiation to the resource analyzer COSTA and formal verification
tool KeY. Software and Systems Modeling, 15(4):987–1012, 2016.
5. Roberto Bagnara, Patricia M. Hill, and Enea Zaffanella. The Parma polyhedra
library: Toward a complete set of numerical abstractions for the analysis and ver-
ification of hardware and software systems. Sci. Comput. Program., 72(1-2):3–21,
2008.
6. Roberto Bagnara, Fred Mesnard, Andrea Pescetti, and Enea Zaffanella. A new look
at the automatic synthesis of linear ranking functions. Inf. Comput., 215:47–67,
2012.
7. Yves Bertot and Pierre Castéran. Interactive Theorem Proving and Program Devel-
opment - Coq’Art: The Calculus of Inductive Constructions. Texts in Theoretical
Computer Science. An EATCS Series. Springer, 2004.
8. Régis Blanc, Thomas A. Henzinger, Thibaud Hottelier, and Laura Kovács. ABC:
algebraic bound computation for loops. In Edmund M. Clarke and Andrei
Voronkov, editors, Logic for Programming, Artificial Intelligence, and Reasoning -
16th International Conference, LPAR-16, Dakar, Senegal, April 25-May 1, 2010,
Revised Selected Papers, volume 6355 of LNCS, pages 103–118. Springer, 2010.
9. Robert S. Boyer, Bernard Elspas, and Karl N. Levitt. SELECT—A formal sys-
tem for testing and debugging programs by symbolic execution. ACM SIGPLAN
Notices, 10(6):234–245, June 1975.
10. Marc Brockschmidt, Fabian Emmes, Stephan Falke, Carsten Fuhs, and Jürgen
Giesl. Alternating runtime and size complexity analysis of integer programs. In
Erika Ábrahám and Klaus Havelund, editors, Tools and Algorithms for the Con-
struction and Analysis of Systems - 20th Intl. Conf., TACAS, Grenoble, France,
volume 8413 of LNCS, pages 140–155. Springer, 2014.
11. Marc Brockschmidt, Richard Musiol, Carsten Otto, and Jürgen Giesl. Automated
termination proofs for Java programs with cyclic data. In P. Madhusudan and
Sanjit A. Seshia, editors, Computer Aided Verification - 24th International Con-
ference, CAV 2012, Berkeley, CA, USA, July 7-13, 2012 Proceedings, volume 7358
of LNCS, pages 105–122. Springer, 2012.
12. Richard Bubel, Andreas Roth, and Philipp Rümmer. Ensuring the Correctness of
Lightweight Tactics for JavaCard Dynamic Logic. Electr. Notes Theor. Comput.
Sci., 199:107–128, 2008.
13. Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear restraints
among variables of a program. In Alfred V. Aho, Stephen N. Zilles, and Thomas G.
Szymanski, editors, Conference Record of the Fifth Annual ACM Symposium on
Principles of Programming Languages, Tucson, Arizona, USA, January 1978,
pages 84–96. ACM Press, 1978.
14. Karl Crary and Stephanie Weirich. Resource bound certification. In Mark N.
Wegman and Thomas W. Reps, editors, POPL 2000, Proceedings of the 27th ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Boston,
Massachusetts, USA, January 19-21, 2000, pages 184–198. ACM, 2000.
15. Jean-Christophe Filliâtre and Claude Marché. The Why/Krakatoa/Caduceus plat-
form for deductive program verification. In Werner Damm and Holger Hermanns,
editors, Computer Aided Verification, 19th Intl. Conf., CAV, Berlin, Germany,
volume 4590 of LNCS, pages 173–177. Springer, 2007.
16. Antonio Flores-Montoya and Reiner Hähnle. Resource analysis of complex pro-
grams with cost equations. In Jacques Garrigue, editor, Programming Languages
and Systems - 12th Asian Symposium, APLAS 2014, Singapore, November 17-19,
2014, Proceedings, volume 8858 of LNCS, pages 275–295. Springer, 2014.
17. Jürgen Giesl, Marc Brockschmidt, Fabian Emmes, Florian Frohn, Carsten Fuhs,
Carsten Otto, Martin Plücker, Peter Schneider-Kamp, Thomas Ströder, Stephanie
Swiderski, and René Thiemann. Proving termination of programs automatically
with AProVE. In Stéphane Demri, Deepak Kapur, and Christoph Weidenbach,
editors, Automated Reasoning - 7th Intl. Joint Conf., IJCAR, Vienna, Austria,
volume 8562 of LNCS, pages 184–191. Springer, 2014.
18. Benny Godlin and Ofer Strichman. Regression Verification: Proving the Equiva-
lence of Similar Programs. Softw. Test., Verif. Reliab., 23(3):241–258, 2013.
19. Neville Grech, Kyriakos Georgiou, James Pallister, Steve Kerrison, and Ker-
stin Eder. Static energy consumption analysis of LLVM IR programs. CoRR,
abs/1405.4565, 2014.
20. Sumit Gulwani, Krishna K. Mehra, and Trishul M. Chilimbi. SPEED: precise and
efficient static estimation of program computational complexity. In Zhong Shao
and Benjamin C. Pierce, editors, Proceedings of the 36th ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, POPL 2009, Savannah, GA,
USA, January 21-23, 2009, pages 127–139. ACM, 2009.
21. Reiner Hähnle and Marieke Huisman. Deductive verification: from pen-and-paper
proofs to industrial tools. In Bernhard Steffen and Gerhard Woeginger, editors,
Computing and Software Science: State of the Art and Perspectives, volume 10000
of LNCS, pages 345–373. Springer, 2019.
22. Reiner Hähnle and Dominic Steinhöfel. Modular, correct compilation with au-
tomatic soundness proofs. In Tiziana Margaria and Bernhard Steffen, editors,
Leveraging Applications of Formal Methods, Verification and Validation: Founda-
tional Techniques, 8th Intl. Symp., Proc. Part I, ISoLA, Cyprus, volume 11244 of
LNCS, pages 424–447. Springer, 2018.
23. Jan Hoffmann and Martin Hofmann. Amortized resource analysis with polynomial
potential. In Andrew D. Gordon, editor, Programming Languages and Systems,
19th European Symposium on Programming, ESOP, Paphos, Cyprus, volume 6012
of LNCS, pages 287–306. Springer, 2010.
24. John Hughes, Lars Pareto, and Amr Sabry. Proving the correctness of reactive
systems using sized types. In Proceedings of the 23rd ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, POPL ’96, page 410–423,
New York, NY, USA, 1996. Association for Computing Machinery.
25. James C. King. Symbolic execution and program testing. Communications of the
ACM, 19(7):385–394, July 1976.
26. Sudipta Kundu, Zachary Tatlock, and Sorin Lerner. Proving Optimizations Correct
Using Parameterized Program Equivalence. In Proc. PLDI 2009, pages 327–337,
2009.
27. Gary T. Leavens, Erik Poll, Curtis Clifton, Yoonsik Cheon, Clyde Ruby, David
Cok, Peter Müller, Joseph Kiniry, Patrice Chalin, Daniel M. Zimmerman, and
Werner Dietl. JML Reference Manual, May 2013. Draft revision 2344.
28. Rustan Leino. Dafny: An automatic program verifier for functional correctness. In
16th International Conference, LPAR-16, Dakar, Senegal, pages 348–370. Springer
Berlin Heidelberg, April 2010.
29. Umer Liqat, Kyriakos Georgiou, Steve Kerrison, Pedro López-García, John P. Gal-
lagher, Manuel V. Hermenegildo, and Kerstin Eder. Inferring parametric energy
consumption functions at different software levels: ISA vs. LLVM IR. In Marko
C. J. D. van Eekelen and Ugo Dal Lago, editors, Foundational and Practical As-
pects of Resource Analysis - 4th Intl. Workshop, FOPARA, London, UK, Revised
Selected Papers, volume 9964 of LNCS, pages 81–100, 2015.
30. Nuno P. Lopes, David Menendez, Santosh Nagarakatte, and John Regehr. Practical
Verification of Peephole Optimizations with Alive. Commun. ACM, 61(2):84–91,
2018.
31. Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL - A Proof
Assistant for Higher-Order Logic, volume 2283 of LNCS. Springer, 2002.
32. Ivan Radiček, Gilles Barthe, Marco Gaboardi, Deepak Garg, and Florian Zuleger.
Monadic refinements for relational cost analysis. Proc. ACM Program. Lang.,
2(POPL), December 2017.
33. Wolfgang Reif. The KIV-approach to software verification. In KORSO - Methods,
Languages, and Tools for the Construction of Correct Software, volume 1009 of
LNCS, pages 339–370. Springer, 1995.
34. Jan Smans, Bart Jacobs, Frank Piessens, and Wolfram Schulte. An automatic
verifier for Java-like programs based on dynamic frames. In José Luiz Fiadeiro
and Paola Inverardi, editors, Fundamental Approaches to Software Engineering,
11th Intl. Conf., FASE, Budapest, Hungary, volume 4961 of LNCS, pages 261–275.
Springer, 2008.
35. Fausto Spoto, Fred Mesnard, and Étienne Payet. A termination analyzer for Java
bytecode based on path-length. ACM Trans. Program. Lang. Syst., 32(3):8:1–8:70,
2010.
36. Dominic Steinhöfel. REFINITY to Model and Prove Program Transformation
Rules. In Bruno C. d. S. Oliveira, editor, Proc. 18th Asian Symposium on Pro-
gramming Languages and Systems (APLAS), LNCS. Springer, 2020.
37. Dominic Steinhöfel and Reiner Hähnle. Abstract execution. In Maurice H. ter
Beek, Annabelle McIver, and José N. Oliveira, editors, Formal Methods - The
Next 30 Years - Third World Congress, FM 2019, Porto, Portugal, October 7-11,
2019, Proceedings, volume 11800 of LNCS, pages 319–336. Springer, 2019.
38. Dominic Steinhöfel. Abstract Execution: Automatically Proving Infinitely Many
Programs. PhD thesis, Technical University of Darmstadt, Department of Com-
puter Science, Darmstadt, Germany, 2020.
39. Ben Wegbreit. Mechanical program analysis. Commun. ACM, 18(9):528–539, 1975.
40. Florian Zuleger, Sumit Gulwani, Moritz Sinn, and Helmut Veith. Bound analysis of
imperative programs with the size-change abstraction (extended version). CoRR,
abs/1203.5303, 2012.
Bootstrapping Automated Testing
for RESTful Web Services
1 Introduction
string-typed, about 32% are number-typed, and the remaining 1% are boolean-
typed or object-typed. Overusing primitive data types significantly increases
the possible input value space. For example, a string-typed parameter can
take values varying from a specific URL to a comment about a YouTube video.
This poses difficulties for generating effective test cases. Consequently, many
automated REST testing tools are ineffective while RESTful web services suffer
from various input-related attacks, such as integer overflow attacks and SQL
injection attacks [18]. We call this phenomenon the type collapse problem.
Our solution is to bridge this gap by giving automated testing tools a better
understanding of parameters. We observe that though parameter types are weak,
their values usually have distinct formats. For example, a datetime parameter
may require an ISO8601 date string. This motivates us to introduce the FET
(Format-encoded Type) which combines data types and value formats to describe
parameters in fine grains. For instance, the SHA1 FET represents 40-digit-hex
string-typed parameters. Furthermore, we introduce the FET lattice which
hierarchically organizes a set of FETs by a partial order, along with the FET
inference which seeks suitable FETs among a FET lattice for parameters in an
unambiguous manner.
To manifest how to enhance automated REST testing by FET techniques, we
implement Leif, a trace-driven fuzz testing tool. Leif gains fine-grained parameter
information by performing FET inference on HTTP traffic and then mutates
parameter values to mimic real attacks based on the inferred results. We apply
Leif to real-world web services, and the experiment results are encouraging. FET
techniques provide better bug-finding capability and bring 72% ∼ 86% fuzzing
time reduction for Leif when compared to state-of-the-art fuzzing tools.
In particular, this paper makes the following contributions:
– We introduce FET techniques, including the FET, the FET lattice, and the
FET inference, to remedy the type collapse problem and serve as a cornerstone
for high-level automated testing tools.
– We implement Leif, a FET-enhanced fuzzing tool which showcases how to
construct a ubiquitous FET lattice for common RESTful APIs and embed
FET techniques in an existing testing workflow.
– We evaluate the accuracy of FET inference, and the result is encouraging
(67% exact matches, 32% partial matches, and 1% mismatches on average).
– We evaluate Leif’s bug-finding capability (11 distinct bugs detected in 27
commercial web services) as well as its testing efficiency (72% ∼ 86% fuzzing
time reduction as compared to existing fuzzing tools).
The remainder of the paper is organized as follows. Section 2 analyzes the type
collapse problem in detail. Section 3 introduces FET techniques to solve the type
collapse problem. Section 4 introduces Leif as a proof-of-concept implementation
of FET techniques. Section 5 presents the evaluation of FET techniques and Leif.
Section 6 discusses related work and Section 7 concludes.
2 Motivation
It is essential for automated REST testing tools to generate test cases by filling
parameters with automatically generated values. This procedure requires ade-
quate information about parameters. Otherwise, the possible candidate space
would become enormous even for one single parameter. Therefore, a majority of
state-of-the-art automated testing tools focus on reducing the candidate space
by sophisticated methodologies. For instance, RESTler [13] arranges multiple
APIs in the producer-consumer order, and uses response data gained from the
previous APIs to request the next. Chizpurfle [23] and EvoMaster [12] generate
optimal candidate values based on evolutionary algorithms.
Nevertheless, the previous works have not focused on the root cause of the
candidate space explosion. Since most RESTful APIs are designed for exchang-
ing data between programs implemented in different languages (e.g., Java for
mobile applications and Python for the service), only a few common primitive
data types can be used to represent API parameters. For example, Amazon’s
online shopping web service takes about 2,400 parameters, among which 748
are number-typed (31%) and 1,581 are string-typed (66%) [19]. That is, types,
which are supposed to be diversified, now collapse into very limited cases. Conse-
quently, existing automated testing tools encounter a huge candidate space, e.g.,
solely knowing a parameter is string-typed spans a boundless candidate space
from paragraphs of Shakespeare to specific datetime strings. In addition, it is
difficult to pick effective values that can pass parameter checking, then reach
the actual business logic, and finally trigger bugs. Figure 1 shows a code sample of
a RESTful API (requires four parameters: string-typed start, string-typed
end, number-typed amount, and number-typed interest). In order to generate
an effective value which can reach business logic for the parameter start, a
testing tool has to know it is an ISO8601 datetime string. Unfortunately, since
parameters are mainly in primitive data types, this information is usually hard
to obtain. Therefore, the testing tool may treat it as an ordinary string and
generate arbitrary strings which are all rejected by the parameter checking and
thus are basically useless.
def calculate_monthly_installment():
    try:
        start = parse(request.get("start"), "YYYY-MM-DDTHH:MM:SSZ")
        end = parse(request.get("end"), "YYYY-MM-DDTHH:MM:SSZ")
        amount = float(request.get("amount"))
        interest = float(request.get("interest"))
    except Exception:
        return make_response("Invalid Parameter", 400, "Bad Request")
    # business logic
    ...
The type collapse problem is the major obstacle to obtaining adequate pa-
rameter information and leads to inefficient automated testing. Therefore, our
solution is to provide a fine-grained description method for parameters by ex-
ploiting both its data type and its value format. Leveraging such information,
we are able to bootstrap and enhance automated testing techniques to gain
efficiency improvements when testing RESTful web services.
3 FET Techniques
To address the type collapse problem, we introduce FET techniques, including
the FET (Format-encoded Type), the FET lattice, and the FET inference. A
FET models an API parameter by its data type and its value format. A FET
lattice hierarchically organizes a set of FETs based on a partial order. We design
FET inference algorithms to seek suitable FETs among a FET lattice for pa-
rameters, and the inferred results are the critical information for bootstrapping
test case generation strategies.
(Figure: an example type lattice with nodes such as Interface, Iterable, and NoType.)
holds if and only if tψi is type-convertible to tψj and fψi is a subset of fψj,
denoted by tψi ⊑ tψj and fψi ⊆ fψj. A FET ψi covered by ψj implies that ψi
describes parameter features in a finer grain than ψj. ψ⊤ and ψ⊥ are defined
as (AnyType, U) and (NoType, ∅), where U is the set containing arbitrary values.
Figure 3 depicts an example FET lattice (a FET’s name describes its value
format, and FETs at the same level are identically colored).
FET Acceptance for Parameter Values. Similar to type lattices, FET
lattices help to determine FETs for given parameter values. To achieve this, we
define that a value v is accepted by a FET ψ if and only if typeof(v) ⊑ tψ and
v ∈ fψ, denoted by ψ ∈ acceptance(v). Otherwise v is said to be rejected by
ψ, denoted by ψ ∉ acceptance(v). Naturally, ψ⊤ accepts all values while
ψ⊥ accepts none. A value v can be accepted by more than one FET, while the
greatest lower bound of the acceptances describes the value in the finest grain.
We call such an acceptance the minimum acceptance of v. The predecessors
of the minimum acceptance accept v but describe it in a coarser grain, while
the siblings reject v but describe other similar values in the same grain. The
minimum acceptance, the predecessors, and the siblings of v compose a tree,
denoted by ψ-tree(v). For example, for a SHA1 string v, its minimum acceptance
(the SHA1 FET in Figure 3), the predecessors (Hash, String, and ψ⊤), and the
siblings (MD5 and SHA256) compose ψ-tree(v).
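A hedged sketch of acceptance over a toy lattice fragment (the regexes and names are our illustration, not Leif's built-in lattice):

import re

# name -> (data type, value-format regex); a fragment of Figure 3
FETS = {
    "MD5":    ("string", r"\A[0-9a-f]{32}\Z"),
    "SHA1":   ("string", r"\A[0-9a-f]{40}\Z"),
    "SHA256": ("string", r"\A[0-9a-f]{64}\Z"),
}

def acceptance(value):
    # every FET whose data type matches and whose format contains the value
    return {name for name, (t, fmt) in FETS.items()
            if isinstance(value, str) and t == "string"
            and re.match(fmt, value)}

assert acceptance("a" * 40) == {"SHA1"}   # a 40-digit-hex string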
Avoiding the Ambiguity of FET Lattices. As seen in Figure 3, if a sin-
gle value is accepted by two sibling FETs (e.g. MD5 and SHA1), the minimum
acceptance will fall into the trivial ψ⊥. Generally, a FET lattice is said to be
ambiguous if there exist two FETs with the same predecessor that can both accept
the same value. To avoid ambiguity, a validation procedure is obligatory after
a FET lattice is constructed, which ensures that the value formats of every two
sibling FETs with the same data type are always disjoint, as sketched below.
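A minimal sketch of this validation, assuming value formats are represented as finite sets (real format languages such as regexes would need automaton intersection instead, e.g. via dk.brics.automaton [37]):

from itertools import combinations

def unambiguous(sibling_groups):
    # sibling_groups: lists of (data_type, format_set) sharing a predecessor
    for group in sibling_groups:
        for (ta, fa), (tb, fb) in combinations(group, 2):
            if ta == tb and fa & fb:   # same type, overlapping formats
                return False
    return True

assert unambiguous([[("string", {"05/21"}), ("string", {"2021-05-21"})]])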
(Fig. 3. An example FET lattice: ψ⊤ at the top and ψ⊥ at the bottom; beneath ψ⊤ sit the data types String, Integer, Decimal, and Boolean; format FETs such as Datetime (ISO8601, Epoch, Date Only), Identifier (URI, UUID, Package Name, Version Tag), and Hash (MD5, SHA1, SHA256) refine them.)
determined, so the ψ-tree for every FET can be computed before inference; (3)
merging two ψ-trees is equivalent to performing a bitwise OR operation on their
corresponding bitfields.
Hence, we give the forward computation algorithm and the bitfield-boosting
FET inference. The forward computation traverses the lattice in breadth-first
order, assigns a unique bitfield ID per FET, and computes the ψ-tree, as shown
in Algorithm 1. Leveraging the forward computation, the bitfield-boosting
inference only needs to find the minimum acceptance by depth-first search,
yield the bitfield tree, and merge it into ψ-treei−1(Vi−1), as shown
in Algorithm 2. Therefore, ψ-treen(Vn) can be efficiently computed by a
series of bitwise OR operations instead of graph computations, reducing the time
complexity from O(n · (m + l)) to O(n · m).
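A hedged reconstruction of the bitfield idea in Python (our own sketch; the paper's Algorithms 1 and 2 contain the authoritative details):

from collections import deque

def assign_bitfields(children, top):
    # Breadth-first traversal of the lattice; every FET receives a
    # distinct one-hot bitfield ID (children: FET -> direct successors).
    ids, queue = {}, deque([top])
    while queue:
        fet = queue.popleft()
        if fet not in ids:
            ids[fet] = 1 << len(ids)
            queue.extend(children.get(fet, []))
    return ids

ids = assign_bitfields({"TOP": ["Hash"], "Hash": ["MD5", "SHA1"]}, "TOP")
# A psi-tree is then a single integer, and merging two psi-trees is one
# bitwise OR instead of a graph union:
merged = (ids["Hash"] | ids["MD5"]) | (ids["Hash"] | ids["SHA1"])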
(Figure: Leif's deployment — an HTTP tracer records the HTTP traffic, requests and responses, exchanged between applications and the RESTful web service; Leif consumes the recorded traffic together with a built-in or user-specified FET lattice.)
(Figure: flattening a structured parameter. (a) The original parameter:

    {
      "title": "A Brief History of Time",
      "price": 45.00,
      "catalogue": {
        "main": "Science",
        "sub": { "main": "Cosmology" }
      }
    }

(b) its tree structure; (c) the flattening result:

    <string> $.title: "A Brief History of Time"
    <number> $.price: 45.00
    <string> $.catalogue.main: "Science"
    <string> $.catalogue.sub.main: "Cosmology" )
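A sketch of such a JSONPath-style flattening (our illustration; Leif's actual traversal may differ):

def flatten(obj, path="$"):
    # yield (JSONPath, JSON type, value) for every leaf of a JSON object
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from flatten(val, f"{path}.{key}")
    elif isinstance(obj, list):
        for idx, val in enumerate(obj):
            yield from flatten(val, f"{path}[{idx}]")
    else:
        kind = {bool: "boolean", int: "number", float: "number",
                str: "string"}.get(type(obj), "null")
        yield path, kind, obj

book = {"title": "A Brief History of Time", "price": 45.00,
        "catalogue": {"main": "Science", "sub": {"main": "Cosmology"}}}
assert ("$.catalogue.sub.main", "string", "Cosmology") in list(flatten(book))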
the Datetime and the Integer FETs are included in the final ψ-treen of an
epoch datetime parameter.
5 Evaluation
In this section, we evaluate Leif with real-world RESTful web services, and the
complete dataset of our evaluation is publicly available [19]. Specifically, we
design three experiments to answer the following research questions:
RQ-1 How accurately do FET inference results describe RESTful API param-
eters of complicated real-world web services?
RQ-2 Can Leif generate effective test cases and therefore help developers to
detect web service vulnerabilities in practice?
RQ-3 Does Leif have better bug-finding capability with reduced fuzzing time
when compared to existing state-of-the-art trace-driven and specification-
driven fuzz testing tools?
We answer RQ-1 by comparing the inferred results with the ground truth. We choose GitHub
and Twitter, and randomly pick 50 RESTful APIs (25 from each). We
extract two pieces of information from the document text: (1) parameter data types,
as explicitly listed in the documents; (2) parameter value formats, as provided
in the detailed descriptions (e.g. “This [the parameter since] is a timestamp in
ISO8601 format.”). We feed example requests gained from the documents to
FET inference, compare the inferred FETs with the ground truth, and observe
three levels of matching (a small sketch follows the list):
(1) exact match, the inferred FET is said to be an exact match if it has the
exactly same data type and the value format as the ground truth;
(2) partial match, the inferred FET is said to be a partial match if it has
the exact data type, but its value format is a proper superset of the ground
truth;
(3) mismatch, for the remaining cases.
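These three matching levels can be mechanized; a minimal sketch (our illustration, with formats modeled as Python sets):

def match_level(inferred, truth):
    # inferred/truth: (data_type, format_set) pairs
    (ti, fi), (tt, ft) = inferred, truth
    if ti == tt and fi == ft:
        return "exact match"
    if ti == tt and fi > ft:   # format is a proper superset of the truth
        return "partial match"
    return "mismatch"

assert match_level(("string", {"a", "b"}), ("string", {"a"})) == "partial match"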
(Figure: (a) FET inference accuracy — about 67% exact matches, 32% partial matches, and 1% mismatches; (b) the distribution of exact matches over FETs such as String, Integer, Decimal, Boolean, UUID, ISO8601, Epoch, URI, Version, MD5, and SHA1, grouped by the primitive types number, string, and boolean.)
response data of bug 10 contains the full Java exception stack trace without
any obfuscation. From the stack trace, attackers can obtain that the service uses
an outdated Spring Framework⁶ version, which suffers from numerous security
vulnerabilities [5,6,8–11]. By exploiting CVE-2020-5421 and CVE-2020-5398 [10,
11], attackers can initiate reflected file download attacks [31] to mislead users
into downloading malware. And by exploiting CVE-2018-1257 [5], attackers can
expose STOMP over WebSocket and then initiate denial of service attacks [17].
They can also learn that the service uses the com.alibaba.fastjson library⁷ to
deserialize user inputs. Therefore attackers can launch remote code executions
by exploiting known defects in that specific library version [7, 32].
Upon such cases, we suggest developers should first avoid information leakage
problems by checking the service data flow, ensuring that no sensitive methods
⁶ Spring Framework, https://spring.io/projects/spring-framework
⁷ Fastjson, https://github.com/alibaba/fastjson
APIs. We suggest that developers should capture application traffic and apply
Leif to test untrusted third-party APIs. In addition, they should design proper
exception handling logic for third-party code and timely upgrade to the latest
API versions with known bugs fixed.
Bugs with Limited Information. We obtain very limited information from
bugs 8 and 9, because their responses solely contain HTTP status codes. These
bugs could be as critical as the security bugs since they involve a private API
and cause the service to crash. Therefore service developers can debug such APIs
by following the analysis methods for the security bugs as mentioned.
(Figure: (a) bug-finding capabilities (number of bugs found in total) and (b) fuzzing time of the compared tools on Sina News, Toutiao, and Amazon.)
primitive data types and uses a plain candidate dictionary (consisting of 0, 1, "",
and "sampleString"). Yet none of the bugs found by Leif can be triggered by
these values, indicating that running RESTler would fail to detect any of the
bugs. TnT-Fuzzer generates candidate values simply based on the Python
random() function (i.e. purely random fuzzing). We configure it to generate
1,000 test cases per parameter (about 5× NaiveFuzzer's budget and 30× Leif's). Still,
TnT-Fuzzer fails to find any bugs in the three services. We conclude that the
two fuzzers’ effectiveness is limited by the practical hardness of finding well-
written OpenAPI specifications and the quality of their candidates. These are
also the main shortcomings of all specification-driven fuzzers. Besides, many
modern APIs require short-lived session tokens for access control or throttling.
Specification-driven fuzzers require manual configuration or even repeated re-
configuration for such parameters. In contrast, it is easy for trace-driven fuzzers
to achieve this requirement by mutating freshly captured requests.
6 Related Work
Model-driven Testing. Model-driven testing [15, 26, 27, 47, 48] is usually
white-box and requires using some specific modeling method (e.g. UML or a
DSL) throughout the whole development lifecycle, which is human-intensive and
technically limited for services spanning multiple servers and micro-services from
different vendors. Essentially, FET techniques are also model-driven (i.e. driven
by the lattice model) but only intervene in the test phase. Thus FET techniques
can be practically employed to test diversified RESTful web services in black-box
approaches.
Trace-driven Fuzzing. Trace-driven fuzzing generates test cases by mutating
recorded requests. Fuzzapi [3], BurpSuite [2], AppSpider [1] and Leif all fall
into this category. Existing trace-driven fuzzers mainly focus on improving the
ability to capture and replay HTTP traffic. However, Leif demonstrates that FET
techniques provide fundamental parameter information to fuzzers, bringing the
enhanced bug-finding capability and significant fuzzing time reduction.
Specification-driven Fuzzing. Another main class of fuzz testing techniques
is specification-driven fuzzing, such as TnT-Fuzzer [4], EvoMaster [12], and
RESTler [13], which avoids the type collapse problem by assuming developers
provide well-defined specifications with detailed parameter information. How-
ever, the OpenAPI [40] is the only well-established standard up to now, yet is
not widely used. A survey [41] reveals that 71% developers lack the knowledge of
the OpenAPI framework. Therefore, the specification-driven fuzzing is still too
idealistic for testing real-world RESTful web services. In comparison, instead of
asking developers for good specifications, FET techniques generate fine-grained
specifications (i.e. ψ-treesn of parameters) on its own.
Security Penetration Testing. Fuzz testing techniques are also commonly
purposed for security penetration testing. Commercial security penetration tools,
such as BurpSuite [2], use values of SQL injections, unescaped HTML charac-
ters, XML/JSON external entities, etc., to expose system vulnerabilities. FET
techniques can also be employed in security penetration testing, as demonstrated
in Section 5.2. Our goal, however, is not limited to security testing of RESTful
web services, because FET techniques improve the value selection strategy for
general-purpose REST fuzzing.
7 Conclusion
In this paper, we analyze the type collapse problem and propose FET tech-
niques to remedy this problem. As a proof-of-concept, we design and implement
Leif, a FET-enhanced trace-driven fuzzing tool. We demonstrate that using FET
techniques greatly improves a fuzzer’s understanding of parameters, resulting in
more effective fuzz testing. Our experiment results show that Leif unveils 11 new
bugs in application-specific web services as well as general third-party open API
platforms with 72% ∼ 86% fuzzing time reduction.
FET techniques are capable of effectively bootstrapping automated testing
tools. We believe they are also helpful for parameter validity checking because
these two technical problems are isomorphic in a sense. Thus we are beginning to
study how to automatically generate or enhance parameter checking code based
on FET techniques for RESTful web services.
Acknowledgments
We would like to thank the anonymous reviewers for their valuable comments.
This work was supported in part by the National Key Research and Development
Program of China (No. 2016YFB1000502), the National NSF of China (No. 61672344,
61525204, and 61732010), Shanghai Pujiang Program (No. 19PJ1430900), and
Shanghai Key Laboratory of Scalable Computing and Systems.
References
1. AppSpider. https://www.rapid7.com/products/appspider
2. BurpSuite. https://portswigger.net/burp
3. Fuzzapi. https://github.com/Fuzzapi/fuzzapi
4. TnT-Fuzzer. https://github.com/Teebytes/TnT-Fuzzer
5. CVE-2018-1257. Available from MITRE, CVE-ID CVE-2018-1257 (Dec 6 2017),
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-1257
6. CVE-2018-1275. Available from MITRE, CVE-ID CVE-2018-1275 (Dec 6 2017),
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-1275
7. CVE-2017-18349. Available from MITRE, CVE-ID CVE-2017-18349 (Oct 23
2018), https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-18349
8. CVE-2018-15756. Available from MITRE, CVE-ID CVE-2018-15756 (Aug 23
2018), https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-15756
9. CVE-2020-5397. Available from MITRE, CVE-ID CVE-2020-5397 (Jan 3 2020),
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-5397
10. CVE-2020-5398. Available from MITRE, CVE-ID CVE-2020-5398 (Jan 3 2020),
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-5398
11. CVE-2020-5421. Available from MITRE, CVE-ID CVE-2020-5421 (Jan 3 2020),
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-5421
12. Arcuri, A.: RESTful API automated test case generation with EvoMaster. ACM
Trans. Softw. Eng. Methodol. 28(1), 3:1–3:37 (2019), https://doi.org/10.1145/
3293455
13. Atlidakis, V., Godefroid, P., Polishchuk, M.: RESTler: Stateful REST API fuzzing.
In: Atlee, J.M., Bultan, T., Whittle, J. (eds.) Proceedings of the 41st International
Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-
31, 2019. pp. 748–758. IEEE/ACM (2019), https://doi.org/10.1109/ICSE.2019.
00083
14. Aycock, J.: A brief history of just-in-time. ACM Comput. Surv. 35(2), 97–113
(2003), https://doi.org/10.1145/857076.857077
15. Baker, P., Dai, Z.R., Grabowski, J., Schieferdecker, I., Williams, C.: Model-driven
Testing: Using the UML Testing Profile. Springer Science & Business Media (2007)
16. Berners-Lee, T., Fielding, R., Masinterm, L.: RFC3986: Uniform Resource Iden-
tifier (URI): Generic Syntax. Internet Engineering Task Force (Jan 2005), https:
//www.rfc-editor.org/info/rfc3986
17. Breslaw, D., Bekerman, D.: How Mirai uses STOMP protocol to launch DDoS
attacks. Tech. rep., Imperva Inc. (Nov 15 2016), https://www.imperva.com/blog/
mirai-stomp-protocol-ddos/
18. Chandrashekhar, R., Mardithaya, M., Thilagam, S., Saha, D.: SQL injection attack
mechanisms and prevention techniques. In: International Conference on Advanced
Computing, Networking and Security. pp. 524–533. Springer (2011)
19. Chen, Y., Yang, Y., Lei, Z., Xia, M., Qi, Z.: The public dataset of Leif evaluation
(Jan 2021), https://doi.org/10.6084/m9.figshare.12377150
20. Chen, Y., Yang, Y., Lei, Z., Xia, M., Qi, Z.: The ubiquitous FET lattice model
and verification (Jan 2021), https://doi.org/10.6084/m9.figshare.13622720
21. Chodorow, K.: MongoDB: The Definitive Guide: Powerful and Scalable Data Stor-
age. O’Reilly Media, Inc. (2013)
22. Cortesi, A., Hils, M., Kriechbaumer, T.: MitmProxy: A free and open source in-
teractive HTTPS proxy (2010), https://mitmproxy.org
23. Cotroneo, D., Iannillo, A.K., Natella, R.: Evolutionary fuzzing of android OS
vendor system services. Empirical Software Engineering 24(6), 3630–3658 (2019),
https://doi.org/10.1007/s10664-019-09725-6
24. Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In: Graham,
R.M., Harrison, M.A., Sethi, R. (eds.) Conference Record of the Fourth ACM
Symposium on Principles of Programming Languages, Los Angeles, California,
USA, January 1977. pp. 238–252. ACM (1977), https://doi.org/10.1145/512950.
512973
25. Cox, N.: Directory Services: Design, Implementation and Management. Elsevier
(2001)
26. Ed-Douibi, H., Izquierdo, J.L.C., Cabot, J.: Automatic generation of test cases
for REST APIs: A specification-based approach. In: 22nd IEEE International En-
terprise Distributed Object Computing Conference, EDOC 2018, Stockholm, Swe-
den, October 16-19, 2018. pp. 181–190. IEEE Computer Society (2018), https:
//doi.org/10.1109/EDOC.2018.00031
27. Fertig, T., Braun, P.: Model-driven testing of RESTful APIs. In: Gangemi, A.,
Leonardi, S., Panconesi, A. (eds.) Proceedings of the 24th International Confer-
ence on World Wide Web Companion, WWW 2015, Florence, Italy, May 18-22,
2015 - Companion Volume. pp. 1497–1502. ACM (2015), https://doi.org/10.1145/
2740908.2743045
28. Fielding, R.: Representational state transfer. Architectural Styles and the Design
of Netowork-based Software Architecture pp. 76–85 (2000)
29. Goessner, S.: JSONPath - XPath for JSON. http://goessner.net/articles/JsonPath
p. 48 (2007)
30. Google: Android Monkey. https://developer.android.com/studio/test/monkey
31. Hafif, O., Spiderlabs, T.: Reflected file download: A new web attack vector. Trust-
wave. Retrieved March 15, 2016 (2014), https://bit.ly/2F8YZEp
32. Hao, M.: Fastjson 1.2.68 and earlier remote code execution vulnerability threat
alert. Tech. rep., NSFOCUS, Inc. (Jun 2020), https://bit.ly/3iG0jwh
33. Jensen, S.H., Møller, A., Thiemann, P.: Type analysis for JavaScript. In: Pals-
berg, J., Su, Z. (eds.) Static Analysis, 16th International Symposium, SAS 2009,
Los Angeles, CA, USA, August 9-11, 2009. Proceedings. Lecture Notes in Com-
puter Science, vol. 5673, pp. 238–255. Springer (2009), https://doi.org/10.1007/
978-3-642-03237-0 17
34. Joy, B., Steele, G., Gosling, J., Bracha, G.: The Java language specification (2000)
35. Klyne, G., Newman, C.: RFC3339: Date and Time on the Internet: Timestamps. In-
ternet Engineering Task Force (Jul 2002), https://www.rfc-editor.org/info/rfc3339
36. Martin-Lopez, A., Segura, S., Ruiz-Cortés, A.: A catalogue of inter-parameter
dependencies in RESTful web APIs. In: Yangui, S., Rodriguez, I.B., Drira, K.,
Tari, Z. (eds.) Service-Oriented Computing - 17th International Conference, IC-
SOC 2019, Toulouse, France, October 28-31, 2019, Proceedings. Lecture Notes in
Computer Science, vol. 11895, pp. 399–414. Springer (2019), https://doi.org/10.
1007/978-3-030-33702-5 31
37. Møller, A., Bakic, A., Moran, J., et al.: Package dk.brics.automaton. Aarhus Uni-
versity (Jul 4 2017), https://www.brics.dk/automaton/
38. Møller, A., Schwartzbach, M.I.: Static program analysis. Notes. Feb (2012)
39. Morlitz, D.: HTTP archive file (May 2002), US Patent App. 09/726,985
40. OAI (OpenAPI Initiative): The OpenAPI specification. https://github.com/OAI/
OpenAPI-Specification
66 Y. Chen et al.
41. Open API CSA Working Group: Open API survey report. Tech. rep., Cloud
Security Alliance (Sep 2019), https://cloudsecurityalliance.org/blog/2019/09/11/
open-api-survey-report/
42. Ouyang, L.: Bayesian inference of regular expressions from human-generated ex-
ample strings. CoRR abs/1805.08427 (2018), http://arxiv.org/abs/1805.08427
43. Pham, V., Böhme, M., Roychoudhury, A.: Model-based whitebox fuzzing for
program binaries. In: Lo, D., Apel, S., Khurshid, S. (eds.) Proceedings of the
31st IEEE/ACM International Conference on Automated Software Engineering,
ASE 2016, Singapore, September 3-7, 2016. pp. 543–553. ACM (2016), https:
//doi.org/10.1145/2970276.2970316
44. Raychev, V., Vechev, M.T., Krause, A.: Predicting program properties from “big
code”. In: Rajamani, S.K., Walker, D. (eds.) Proceedings of the 42nd Annual ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL
2015, Mumbai, India, January 15-17, 2015. pp. 111–124. ACM (2015), https://doi.
org/10.1145/2676726.2677009
45. Scheurer, D., Hähnle, R., Bubel, R.: A general lattice model for merging symbolic
execution branches. In: Ogata, K., Lawford, M., Liu, S. (eds.) Formal Methods
and Software Engineering - 18th International Conference on Formal Engineering
Methods, ICFEM 2016, Tokyo, Japan, November 14-18, 2016, Proceedings. Lecture
Notes in Computer Science, vol. 10009, pp. 57–73 (2016), https://doi.org/10.1007/
978-3-319-47846-3 5
46. Thompson, K.: Programming techniques: Regular expression search algorithm.
Commun. ACM 11(6), 419–422 (Jun 1968), https://doi.org/10.1145/363347.
363387
47. Vu, H., Fertig, T., Braun, P.: Towards model-driven hypermedia testing for REST-
ful systems. In: Majchrzak, T.A., Traverso, P., Krempels, K..H., é rie Monfort, V.
(eds.) Proceedings of the 13th International Conference on Web Information Sys-
tems and Technologies, WEBIST 2017, Porto, Portugal, April 25-27, 2017. pp.
340–343. SciTePress (2017), https://doi.org/10.5220/0006353403400343
48. Yuan, Q., Wu, J., Liu, C., Zhang, L.: A model driven approach toward busi-
ness process test case generation. In: Liu, C., Ricca, F. (eds.) Proceedings of
the 10th IEEE International Symposium on Web Systems Evolution, WSE 2010,
3-4 October 2008, Beijing, China. pp. 41–44. IEEE Computer Society (2008),
https://doi.org/10.1109/WSE.2008.4655394
A Decision Tree Lifted Domain for Analyzing
Program Families with Numerical Features

Aleksandar S. Dimovski¹, Sven Apel², and Axel Legay³

¹ Mother Teresa University, 12 Udarna Brigada 2a, 1000 Skopje, North Macedonia
[email protected]
² Saarland University, Saarland Informatics Campus, E1.1, 66123 Saarbrücken, Germany
³ Université catholique de Louvain, 1348 Ottignies-Louvain-la-Neuve, Belgium
1 Introduction
Many software systems today are configurable [6]: they use features (or configurable
options) to control the presence and absence of functionality. Different
family members, called variants, are derived by switching features on and off, while
the reuse of common code is maximized, leading to productivity gains, shorter
time to market, greater market coverage, etc. Program families (e.g., software
product lines) are common in the development of commercial embedded
software, such as in cars, phones, avionics, medicine, and robotics. Configurable
options (features) are used either to support different application scenarios for
embedded components, to provide portability across different hardware platforms
and configurations, or to produce variations of products for different market
segments or customers. We consider here program families implemented using
#if directives from the C preprocessor CPP [20]. They use #if-s to specify under
which conditions parts of code should be included in or excluded from a variant.
Classical program families use only Boolean features, which have two values: on and
off. However, Boolean features are insufficient for real-world program families,
as there are features that take a range of numbers as possible values. These
features are called numerical features [25]. For instance, the Linux kernel, BusyBox,
the Apache web server, and the Java garbage collector are real-world program
families with numerical features. Analyzing such program families is very chal-
lenging, since even a small number of features gives rise to a huge number of variants.
In this paper, we are concerned with the verification of program families with
Boolean and numerical features using abstract interpretation-based static analysis.
Abstract interpretation [7,24] is a general theory for approximating the semantics
of programs. It provides sound (all confirmative answers are correct) and efficient
(with a good trade-off between precision and cost) static analyses of run-time
properties of real programs. It has been used as the foundation for various
successful industrial-scale static analyzers, such as ASTRÉE [8]. Still, the static
analysis of program families is harder than the static analysis of single programs,
because the number of possible variants can be very large (often huge) in practice.
The simplest brute-force approach that uses a preprocessor to generate all variants
of a family, and then applies an existing off-the-shelf single-program analyzer to
each individual variant, one-by-one, is very inefficient [3,27]. Therefore, we use
so-called lifted (family-based) static analyses [3,22,27], which analyze all variants
of the family simultaneously without generating any of the variants explicitly.
They take as input the common code base, which encodes all variants of a
program family, and produce precise analysis results corresponding to all variants.
They use a lifted analysis domain, which represents an n-fold product of an
existing single-program analysis domain used for expressing program properties
(where n is the number of valid configurations). That is, the lifted analysis
domain maintains one property element per valid variant in tuples. The problem
is that this explicit property enumeration in tuples becomes computationally
intractable with larger program families because the number of variants (i.e.,
configurations) grows exponentially with the number of features. This problem
has been successfully addressed for program families that contain only Boolean
features [1,2,11], by using sharing through binary decision diagrams (BDDs).
However, the fundamental limitation of existing lifted analysis techniques is that
they are not able to handle numerical features.
To overcome this limitation, we present a new, refined lifted abstract domain
for effectively analyzing program families with numerical features by means of
abstract interpretation. The elements of the lifted abstract domain are constraint-
based decision trees, where the decision nodes are labelled with linear constraints
over numerical features, whereas the leaf nodes belong to a single-program analysis
domain. The decision trees recursively partition the space of configurations (i.e.,
the space of possible combinations of feature values), whereas the program
properties at the leaves provide analysis information corresponding to each
partition, i.e. to the variants (configurations) that satisfy the constraints along
the path to the given leaf node. The partitioning is dynamic, which means that
partitions are split by feature-based tests (at #if directives), and joined when
merging the corresponding control flows again. In terms of decision trees, this
means that new decision nodes are added by feature-based tests and removed
when merging control flows. In fact, the partitioning of the set of configurations
is semantics-based, which means that linear constraints over numerical features
that occur in decision nodes are automatically inferred by the analysis and do
not necessarily occur syntactically in the code base.
Our lifted abstract domain is parametric in the choice of numerical property
domain [7,24] that underlies the linear constraints over numerical features labelling
decision nodes, and the choice of the single-program analysis domain for leaf
nodes. In fact, in our implementation, we also use numerical property domains
for leaf nodes, which encode linear constraints over program variables. We
rely on the well-known numerical domains, such as intervals [7], octagons [23],
polyhedra [10], from the APRON library [19] to obtain a concrete decision
tree-based implementation of the lifted abstract domain. This way, we have
implemented a forward reachability analysis of C program families with numerical
(and Boolean) features for the automatic inference of invariants. Our tool, called
SPLNum²Analyzer (Num² refers to its ability both to handle Numerical features
and to perform Numerical client analyses of SPLs, i.e., program families), computes
a set of possible invariants, which represent linear constraints over program
variables. We can use the implemented lifted static analyzer to check invariance
properties of C program families, such as assertions, buffer overflows, null pointer
dereferences, and division by zero [8].
In summary, we make several contributions: (1) We propose a new, param-
eterized lifted analysis domain based on decision trees for analyzing program
families with numerical features; (2) We implement a prototype lifted static
analyzer, SPLNum²Analyzer, that performs a forward analysis of #if-enriched
C programs, where numerical property domains from the APRON library are
used as parameters in the lifted analysis domain; (3) We evaluate our approach for
automatic inference of invariants by comparing performances of lifted analyzers
based on tuples and decision trees.
2 Motivating Example

Consider the following program family, called SIMPLE:
1 int x := 10, y := 0;
2 while (x != 0) {
3 x := x-1;
4 #if (SIZE ≤ 3) y := y+1; #else y := y-1; #endif
5 #if (!B) y := 0; #else skip; #endif }
6
7 assert (y > 1);
The set F of features is {B, SIZE}, where B is a Boolean feature and SIZE is a
numerical feature whose domain is [1, 4] = {1, 2, 3, 4}. Thus, the set of valid
configurations is K = {B ∧ (SIZE = 1), B ∧ (SIZE = 2), B ∧ (SIZE = 3), B ∧ (SIZE =
4), ¬B ∧ (SIZE = 1), ¬B ∧ (SIZE = 2), ¬B ∧ (SIZE = 3), ¬B ∧ (SIZE = 4)}. The
code of SIMPLE contains two #if directives, which change the value assigned
to y, depending on how features from F are set at compile-time. For each
configuration from K, a different variant (single program) can be generated
by appropriately resolving #if-s. For example, the variant corresponding to
configuration B ∧ (SIZE = 1) will have B and SIZE set to true and 1, so that the
assignment y := y+1 and skip in program locations 4 and 5, respectively, will
be included in this variant. The variant for configuration ¬B ∧ (SIZE = 4) will have
features B and SIZE set to false and 4, so the assignments y := y-1 and y := 0 in
program locations 4 and 5, respectively, will be included in this variant. There
are |K| = 8 variants that can be derived from the family SIMPLE.
Assume that we want to perform lifted polyhedra analysis of SIMPLE using
the Polyhedra numerical domain [10]. The standard lifted analysis domain used
in the literature [3,22] is defined as the Cartesian product of |K| copies of the basic
analysis domain (e.g. polyhedra). Hence, elements of the lifted domain are tuples
containing one component for each valid configuration from K, where each
component represents a polyhedra linear constraint over program variables (x
and y in this case). The lifted analysis result in location 7 of SIMPLE is an
8-sized tuple shown in Fig. 1. Note that the first component of the tuple in
Fig. 1 corresponds to configuration B ∧ (SIZE = 1), the second to B ∧ (SIZE = 2),
the third to B ∧ (SIZE = 3), and so on. We can see in Fig. 1 that the polyhedra
analysis discovers very precise results for the variable y: (y = 10) for configurations
B ∧ (SIZE = 1), B ∧ (SIZE = 2), and B ∧ (SIZE = 3); (y = −10) for configuration
B ∧ (SIZE = 4); and (y = 0) for all other configurations. This is due to the fact that
the polyhedra domain is fully relational and is able to track all relations between
program variables x and y. Using this result at location 7, we can successfully
conclude that the assertion is valid for configurations B∧(SIZE = 1), B∧(SIZE = 2),
and B ∧ (SIZE = 3), whereas the assertion fails for all other configurations.
If we perform lifted polyhedra analysis based on the decision tree domain
proposed in this work, then the corresponding decision tree inferred in the final
program location 7 of SIMPLE is depicted in Fig. 2. Notice that the inner
nodes of the decision tree in Fig. 2 are labeled with Interval linear constraints
over features (SIZE and B), while the leaves are labeled with the Polyhedra
linear constraints over program variables x and y. Hence, we use two different
numerical abstract domains in our decision trees: Interval domain [7] for expressing
properties in decision nodes, and Polyhedra domain [10] for expressing properties
in leaf nodes. The edges of decision trees are labeled with the truth value of
the decision on the parent node; we use solid edges for true (i.e. the constraint
in the parent node is satisfied) and dashed edges for false (i.e. the negation of
the constraint in the parent node is satisfied). As decision nodes partition the
space of valid configurations K, we implicitly assume the correctness of linear
constraints that take into account domains of numerical features. For example,
the node with constraint (SIZE ≤ 3) is satisfied when (SIZE ≤ 3) ∧ (1 ≤ SIZE ≤ 4),
whereas its negation is satisfied when (SIZE > 3) ∧ (1 ≤ SIZE ≤ 4). The constraints
(1 ≤ SIZE ≤ 4) represent the domain [1, 4] of SIZE. We can see that decision trees
offer more possibilities for sharing and interaction between analysis properties
corresponding to different configurations, as they provide a symbolic and compact
representation of lifted analysis elements. For example, Fig. 2 presents polyhedra
properties of two program variables x and y, which are partitioned with respect
to features B and SIZE. When (B ∧ (SIZE ≤ 3)) is true the shared property is
(y = 10, x = 0), whereas when (B ∧ ¬(SIZE ≤ 3)) is true the shared property is
(y = −10, x = 0). When ¬B is true, the property is independent from the value
of SIZE, hence a node with a constraint over SIZE is not needed. Therefore, all
such cases are identical, so they share the same leaf node (y = 0, x = 0). In
effect, the decision tree-based representation uses only three leaves, whereas the
tuple-based representation uses eight properties. This ability to share is the
key motivation behind the decision tree-based representation.
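To make the sharing concrete, the following sketch (illustrative Python of our own, not the paper's OCaml implementation; leaf properties are strings standing for polyhedral constraints) encodes both the 8-entry tuple of Fig. 1 and the 3-leaf decision tree of Fig. 2:

# Illustrative sketch (ours): tuple-based vs. tree-based lifted results.

# Tuple-based result: one property per valid configuration in K.
tuple_result = {
    ("B", 1): "y = 10, x = 0", ("B", 2): "y = 10, x = 0",
    ("B", 3): "y = 10, x = 0", ("B", 4): "y = -10, x = 0",
    ("!B", 1): "y = 0, x = 0", ("!B", 2): "y = 0, x = 0",
    ("!B", 3): "y = 0, x = 0", ("!B", 4): "y = 0, x = 0",
}  # 8 entries, one per configuration

# Tree-based result: decision nodes over features, shared leaves.
class Leaf:
    def __init__(self, prop):
        self.prop = prop

class Node:
    def __init__(self, constraint, left, right):  # left subtree: constraint holds
        self.constraint, self.left, self.right = constraint, left, right

tree_result = Node("B", Node("SIZE <= 3", Leaf("y = 10, x = 0"),
                                          Leaf("y = -10, x = 0")),
                        Leaf("y = 0, x = 0"))     # only 3 leaves

The leaf for ¬B is shared by all four values of SIZE, which is exactly the sharing that the tuple representation cannot express.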
A configuration k is a mapping that assigns a value from dom(A) to each feature
A, i.e., k(A) ∈ dom(A) for any A ∈ F. We assume that only a subset K of all possible
configurations are valid. An alternative representation of configurations is based
upon propositional formulae. Each configuration k ∈ K can be represented by
a formula: (A1 = k(A1 )) ∧ . . . ∧ (Ak = k(Ak )). We often abbreviate (B = 1)
with B and (B = 0) with ¬B, for a Boolean feature B ∈ F. The set of valid
configurations K can be also represented as a formula: ∨k∈K k.
We define feature expressions, denoted FeatExp(F), as the set of propositional
logic formulas over linear constraints of F. The syntax of statements and expressions
of our language is given by the grammar:
s ::= skip | x:=e | s; s | if (e) then s else s | while (e) do s | #if (θ) s #endif,
e ::= n | [n, n′] | x | e ⊕ e

where n ranges over integers, [n, n′] over integer intervals, x over program variables
Var, and ⊕ over binary arithmetic operators. Integer intervals [n, n′] denote a
random choice of an integer in the interval. The set of all statements s is denoted
by Stm; the set of all expressions e is denoted by Exp.
A program family is evaluated in two stages. First, the C preprocessor CPP
takes a program family s and a configuration k ∈ K as inputs, and produces a
variant (without #if-s) corresponding to k as the output. Second, the obtained
variant is evaluated using the standard single-program semantics. The first
stage is specified by the projection function Pk , which is an identity for all
basic statements and recursively pre-processes all sub-statements of compound
statements. Hence, Pk(skip) = skip and Pk(s;s′) = Pk(s);Pk(s′). The interesting
case is “#if (θ) s #endif”, where statement s is included in the variant if k |= θ,
otherwise s is removed (see footnote 5):

Pk(#if (θ) s #endif) = Pk(s) if k |= θ, and skip if k ⊭ θ

For example, the variants PB∧(SIZE=1)(SIMPLE), PB∧(SIZE=4)(SIMPLE),
P¬B∧(SIZE=1)(SIMPLE), and P¬B∧(SIZE=4)(SIMPLE), shown in Fig. 3a, Fig. 3b, Fig. 3c,
and Fig. 3d, respectively, are derived from the SIMPLE family defined in Section 2.
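The projection is easy to realize programmatically; the following sketch (our own Python over a hypothetical miniature AST of nested tuples; holds(k, theta) is an assumed decision procedure for k |= θ) mirrors the definition of Pk:

# Sketch (ours) of the projection function P_k resolving #if-s.
def holds(k, theta):
    # Assumed helper deciding k |= theta; theta is a predicate here.
    return theta(k)

def project(k, stmt):
    kind = stmt[0]
    if kind in ("skip", "assign"):
        return stmt                              # identity on basic statements
    if kind == "seq":                            # P_k(s; s') = P_k(s); P_k(s')
        return ("seq", project(k, stmt[1]), project(k, stmt[2]))
    if kind == "ifdef":                          # include s only if k |= theta
        theta, s = stmt[1], stmt[2]
        return project(k, s) if holds(k, theta) else ("skip",)
    raise ValueError(kind)

# Deriving the variant for k = {B: True, SIZE: 1} from one #if of SIMPLE:
k = {"B": True, "SIZE": 1}
s = ("ifdef", lambda conf: conf["SIZE"] <= 3, ("assign", "y", "y+1"))
print(project(k, s))                             # ('assign', 'y', 'y+1')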
The tuple-based lifted domain AK contains one copy of A for each configuration of K. For example, consider the tuple in Fig. 1.
Lifted Abstract Operations. Given a tuple (lifted domain element) a ∈ AK , the
projection πk selects the k-th component of a. All abstract lifted operations are
defined by lifting the abstract operations of the domain A configuration-wise.
5 Since k ∈ K is a valuation function, either k |= θ or k ⊭ θ holds, for any θ.
Lifted Transfer Functions. We now define lifted transfer functions for tests,
forward assignments (ASSIGN), and #if-s (IFDEF). There are two types of
tests: expression-based tests, denoted FILTER, that occur in while-s and
if-s, and feature-based tests, denoted FEAT-FILTER, that occur in #if-s. Each
lifted transfer function takes as input a tuple from AK representing the invariant
before evaluating the statement (resp., expression) to handle, and returns a tuple
representing the invariant after evaluating the given statement (resp., expression).
The lifted transfer function for statement s is denoted [[s]](a). FILTER and
ASSIGN are defined by applying FILTERA and ASSIGNA independently on each
component of the input tuple a. FEAT-FILTER keeps those components k of the
input tuple a that satisfy θ, otherwise it replaces the other components with ⊥A .
IFDEF captures the effect of analyzing the statement s in the components k of
a that satisfy θ, otherwise it is an identity for the other components.
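A sketch of these tuple-wise transfer functions (ours, in Python; a lifted element is modelled as a dictionary from configurations to properties, BOT stands for ⊥A, theta is a predicate on configurations, and transfer_s stands for the single-program transfer function of s):

# Sketch (ours) of FEAT-FILTER and IFDEF on tuple-based lifted elements.
BOT = "bottom"   # stands for the bottom element of A

def feat_filter(a, theta):
    # Keep components whose configuration satisfies theta; others become BOT.
    return {k: (p if theta(k) else BOT) for k, p in a.items()}

def ifdef(a, theta, transfer_s):
    # Analyze s in the components satisfying theta; identity elsewhere.
    return {k: (transfer_s(p) if theta(k) else p) for k, p in a.items()}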
Lifted Analysis. Lifted abstract operators and transfer functions of the lifted
analysis domain AK are combined to analyze program families. Initially,
we build a tuple ain where all components are set to ⊤A for the first program
location, and tuples where all components are set to ⊥A for all other locations.
The analysis properties are propagated forward from the first program location
towards the final location taking assignments, #if-s, and tests into account with
join and widening around while-s. The soundness of the lifted analysis based on
AK follows immediately from the soundness of all abstract operators and transfer
functions of A (proved in [22]).
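Schematically, the treatment of a while loop looks as follows (our sketch; cond_filter, body_transfer, join, widen, and leq stand for abstract operations of the lifted domain, not for the tool's actual API):

# Sketch (ours) of forward propagation around a while loop with widening.
def analyze_while(cond_filter, body_transfer, a_in, join, widen, leq):
    inv = a_in
    while True:
        once_more = join(a_in, body_transfer(cond_filter(inv)))
        nxt = widen(inv, once_more)
        if leq(nxt, inv):            # post-fixpoint reached: inv is stable
            return inv
        inv = nxt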
Abstract domain for decision nodes. We define the family of abstract domains for
linear constraints CD , which are parameterized by any of the numerical property
domains D (intervals I, octagons O, polyhedra P). We use CI = {±Ai ≥ β |
Ai ∈ F, β ∈ Z} to denote the set of interval constraints, CO = {±Ai ± Aj ≥ β |
Ai, Aj ∈ F, β ∈ Z} to denote the set of octagonal constraints, and CP = {α1A1 +
. . . + αkAk + β ≥ 0 | A1, . . . , Ak ∈ F, α1, . . . , αk, β ∈ Z, gcd(|α1|, . . . , |αk|, |β|) = 1}
to denote the set of polyhedral constraints. We have CI ⊆ CO ⊆ CP.
The set CD of linear constraints over features F is constructed from the
underlying numerical property domain ⟨D, ⊑D⟩ via a Galois connection between
⟨P(CD), ⊑⟩ and ⟨D, ⊑D⟩, where P(CD) is the power set of CD. The abstraction
function αCD : P(CD) → D maps a set of interval (resp., octagonal, polyhedral)
constraints to an interval (resp., an octagon, a polyhedron) representing their
conjunction; the concretization function γCD : D → P(CD) maps an interval
(resp., an octagon, a polyhedron) to the set of interval (resp., octagonal,
polyhedral) constraints it represents. We have γCD(⊤D) = ∅ and γCD(⊥D) = {⊥CD},
where ⊥CD is an unsatisfiable constraint.
The domain of decision nodes is CD. We assume F = {A1, . . . , Ak} to be a finite
and totally ordered set of features, such that the ordering is A1 > A2 > . . . > Ak.
We impose a total order <CD on CD, namely the lexicographic order on the
coefficients α1, . . . , αk and the constant αk+1 of the linear constraints. A
constraint-based decision tree t ∈ T(CD, A) is then either a leaf node ⟪a⟫ with
a ∈ A, or a decision node [[c : tl, tr]], where c ∈ CD is the constraint labelling the
node and tl, tr are the left and right subtrees. The decision nodes recursively
partition the space of configurations (those that satisfy the encountered
constraints), and the leaf nodes represent the analysis properties for the
corresponding configurations.
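The order <CD can be realized directly on coefficient vectors; a minimal sketch (ours), where a constraint α1·A1 + . . . + αk·Ak + β ≥ 0 is represented by the tuple (α1, . . . , αk, β):

# Sketch (ours): <_CD as lexicographic comparison of coefficient tuples.
def less_cd(c1, c2):
    return c1 < c2          # Python compares tuples lexicographically

assert less_cd((1, 0, -3), (1, 1, -2))  # A1 - 3 >= 0  <_CD  A1 + A2 - 2 >= 0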
Example 2. The following two constraint-based decision trees t1 and t2 have
decision nodes labelled with Interval linear constraints over the numerical feature
SIZE with domain {1, 2, 3, 4}, whereas leaf nodes are Interval properties:

t1 = [[SIZE ≥ 4 : ⟪[y ≥ 2]⟫, ⟪[y = 0]⟫]],  t2 = [[SIZE ≥ 2 : ⟪[y ≥ 0]⟫, ⟪[y ≤ 0]⟫]]
Algorithm 1: UNIFICATION(t1 , t2 , C)
1 if isLeaf(t1 ) ∧ isLeaf(t2 ) then return (t1 , t2 );
2 if isLeaf(t1 ) ∨ (isNode(t1 ) ∧ isNode(t2 ) ∧ t2 .c <CD t1 .c) then
3 if isRedundant(t2 .c, C) then return UNIFICATION(t1 , t2 .l, C);
4 if isRedundant(¬t2 .c, C) then return UNIFICATION(t1 , t2 .r, C);
5 (l1 , l2 ) = UNIFICATION(t1 , t2 .l, C ∪ {t2 .c});
6 (r1 , r2 ) = UNIFICATION(t1 , t2 .r, C ∪ {¬t2 .c});
7 return ([[t2 .c : l1 , r1 ]], [[t2 .c : l2 , r2 ]]);
8 if isLeaf(t2 ) ∨ (isNode(t1 ) ∧ isNode(t2 ) ∧ t1 .c <CD t2 .c) then
9 if isRedundant(t1 .c, C) then return UNIFICATION(t1 .l, t2 , C);
10 if isRedundant(¬t1 .c, C) then return UNIFICATION(t1 .r, t2 , C);
11 (l1 , l2 ) = UNIFICATION(t1 .l, t2 , C ∪ {t1 .c});
12 (r1 , r2 ) = UNIFICATION(t1 .r, t2 , C ∪ {¬t1 .c});
13 return ([[t1 .c : l1 , r1 ]], [[t1 .c : l2 , r2 ]]);
14 else
15 if isRedundant(t1 .c, C) then return UNIFICATION(t1 .l, t2 .l, C);
16 if isRedundant(¬t1 .c, C) then return UNIFICATION(t1 .r, t2 .r, C);
17 (l1 , l2 ) = UNIFICATION(t1 .l, t2 .l, C ∪ {t1 .c});
18 (r1 , r2 ) = UNIFICATION(t1 .r, t2 .r, C ∪ {¬t1 .c});
19 return ([[t1 .c : l1 , r1 ]], [[t1 .c : l2 , r2 ]]);
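For concreteness, Algorithm 1 transliterates directly onto the Node/Leaf classes from the earlier sketch (again our illustrative Python; less implements <CD, neg negates a constraint, and redundant(c, C) is an assumed decision procedure checking that the constraint set C already implies c):

# Transliteration (ours) of Algorithm 1 (UNIFICATION).
def unification(t1, t2, C):
    if isinstance(t1, Leaf) and isinstance(t2, Leaf):
        return t1, t2
    if isinstance(t1, Leaf) or (isinstance(t2, Node)
                                and less(t2.constraint, t1.constraint)):
        c = t2.constraint                # add c on top of t1 as well
        if redundant(c, C):      return unification(t1, t2.left, C)
        if redundant(neg(c), C): return unification(t1, t2.right, C)
        l1, l2 = unification(t1, t2.left, C | {c})
        r1, r2 = unification(t1, t2.right, C | {neg(c)})
        return Node(c, l1, r1), Node(c, l2, r2)
    if isinstance(t2, Leaf) or less(t1.constraint, t2.constraint):
        c = t1.constraint                # symmetric case
        if redundant(c, C):      return unification(t1.left, t2, C)
        if redundant(neg(c), C): return unification(t1.right, t2, C)
        l1, l2 = unification(t1.left, t2, C | {c})
        r1, r2 = unification(t1.right, t2, C | {neg(c)})
        return Node(c, l1, r1), Node(c, l2, r2)
    c = t1.constraint                    # equal root constraints: descend both
    if redundant(c, C):      return unification(t1.left, t2.left, C)
    if redundant(neg(c), C): return unification(t1.right, t2.right, C)
    l1, l2 = unification(t1.left, t2.left, C | {c})
    r1, r2 = unification(t1.right, t2.right, C | {neg(c)})
    return Node(c, l1, r1), Node(c, l2, r2)

After unification, both trees test the same constraints in the same order, so join, meet, and widening can proceed leaf-wise.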
IFDEFT(t, #if (θ) s) = [[s]]T(FEAT-FILTERT(t, θ)) ⊔T FEAT-FILTERT(t, ¬θ)

where [[s]]T(t) denotes the transfer function in T(CD, A) for statement s.
After applying transfer functions, the obtained decision trees may contain
some redundancy that can be exploited to further compress them. Function
COMPRESST (t, C), described by Algorithm 5, is applied to decision trees t in order
to compress (reduce) their representation. We use five different optimizations.
First, if constraints on a path to some leaf are unsatisfiable, we eliminate that
leaf node (Lines 9,10). Second, if a decision node contains two same subtrees,
then we keep only one subtree and we also eliminate the decision node (Lines
11–13). Third, if a decision node contains a left leaf and a right subtree, such that
its left leaf is the same as the left leaf of its right subtree and the constraint in
the decision node is less than or equal to the constraint in the root of its right
subtree, then we can eliminate the decision node and its left leaf (Lines 14,15). A
similar rule exists when a decision node has a left subtree and a right leaf (Lines 16,17).
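As an illustration, the second optimization can be sketched as follows (our Python, on the Node/Leaf classes above; same_tree is an assumed structural-equality helper):

# Sketch (ours): drop decision nodes whose two subtrees are equal.
def compress_equal_subtrees(t):
    if isinstance(t, Leaf):
        return t
    l = compress_equal_subtrees(t.left)
    r = compress_equal_subtrees(t.right)
    if same_tree(l, r):              # the decision is irrelevant here
        return l
    return Node(t.constraint, l, r)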
Lifted analysis. The abstract operations and transfer functions of T(CD , A) can
be used to define the lifted analysis for program families. Tree tin at the initial
Algorithm 4: RESTRICT(t, C, J)
1 if isEmpty(J) then
2 if isLeaf(t) then return t;
3 if isRedundant(t.c, C) then return RESTRICT(t.l, C, J);
4 if isRedundant(¬t.c, C) then return RESTRICT(t.r, C, J);
5 l = RESTRICT(t.l, C ∪ {t.c}, J) ;
6 r = RESTRICT(t.r, C ∪ {¬t.c}, J) ;
7 return ([[t.c : l, r]]);
8 else
9 j = min<CD (J) ;
10 if isLeaf(t) ∨ (isNode(t) ∧ j <CD t.c) then
11 if isRedundant(j, C) then return RESTRICT(t, C, J\{j});
12 if isRedundant(¬j, C) then return ⊥A ;
13 if j =CD t.c then (if bj then t = t.l; else t = t.r) ;
14 if bj then return ([[j : RESTRICT(t, C ∪ {j}, J\{j}), ⊥A ]]) ;
15 else return ([[j : ⊥A , RESTRICT(t, C ∪ {j}, J\{j})]]) ;
16 else
17 if isRedundant(t.c, C) then return RESTRICT(t.l, C, J);
18 if isRedundant(¬t.c, C) then return RESTRICT(t.r, C, J);
19 l = RESTRICT(t.l, C ∪ {t.c}, J) ;
20 r = RESTRICT(t.r, C ∪ {¬t.c}, J) ;
21 return ([[t.c : l, r]]);
location has only one leaf node ⊤A and decision nodes that define the set K. Note
that if K ≡ true, then tin = ⟪⊤A⟫. In this way, we collect the possible invariants in
the form of decision trees at all program locations.
We establish correctness of the lifted analysis based on T(CD, A) by showing
that it produces results identical to those of the tuple-based domain AK. Let [[s]]T
and [[s]] denote the transfer functions of statement s in T(CD, A) and AK,
respectively. Recall that ain has all components set to ⊤A, and so γT(tin) = γ(ain).

Theorem 1. γT([[s]]T(tin)) = γ([[s]](ain)).
Proof. The proof is by induction on the structure of s. We consider the most
interesting case, #if (θ) s #endif: transfer functions for #if are identical in
both lifted domains, so we only need to show that FEAT-FILTER(a, θ) and
FEAT-FILTERT(t, θ) coincide, which is shown by induction on θ [13].
Example 5. Let us consider the code base of a program family P given in Fig. 4.
It contains only one numerical feature SIZE with domain N. The decision tree
inferred at the final location 4 is depicted in Fig. 5. It uses the Interval domain
for both decision and leaf nodes. Note that the constraint (SIZE < 3) does
not explicitly appear in the code base, but we obtain it in the decision tree
representation. This shows that the partitioning of the configuration space K
induced by decision trees is semantics-based rather than syntax-based.

1 int x := 0;
2 #if (SIZE ≤ 4) x := x+1; #else x := x-1; #endif
3 #if (SIZE==3 || SIZE==4) x := x-2; #endif
4

Fig. 4: Code base for program family P.
Fig. 5: Decision tree at location 4 of P: [[SIZE < 3 : ⟪[x = 1]⟫, ⟪[x = −1]⟫]].
Example 6. Let us consider the code base of a program family P given in Fig. 6.
It contains one numerical feature A with domain [1, 4] and a non-linear feature
expression A ∗ A < 9. At program location 2, FEAT-FILTERT(⟪x = 0⟫, A ∗ A < 9)
returns an over-approximating tree ⟪x = 0⟫, whereas FEAT-FILTERT(⟪x = 0⟫,
¬(A ∗ A < 9)) returns [[A ≥ 3 : ⟪x = 0⟫, ⟪⊥I⟫]]. In effect, we obtain an
over-approximating result at the final program location 3, as shown in Fig. 7.
The precise result at program location 3, which could be obtained with numerical
domains that can handle non-linear constraints, is given in Fig. 8. We observe
that when ¬(A ≤ 2), we obtain an over-approximating analysis result
(−1 ≤ x ≤ 1 instead of x = −1) due to the over-approximation of the non-linear
feature expression in the numerical domains we use.

1 int x := 0;
2 #if (A ∗ A < 9) x := x+1; …

Fig. 6: Code base for program family P.
Figs. 7 and 8: decision trees at location 3 (the over-approximating and the precise
analysis result, respectively), both rooted at the constraint A ≤ 2.
6 Evaluation
Implementation. We have developed a prototype lifted static analyzer, called
SPLNum²Analyzer, that uses the lifted abstract domains of tuples AK and
decision trees T(CD, A). The abstract domains A for encoding properties of tuple
components and leaf nodes, as well as the abstract domain D for encoding linear
constraints over numerical features, are instantiated with the interval, octagon,
and polyhedra domains. Their abstract operations and transfer functions are
provided by the APRON library [19]. Our proof-of-concept implementation is
written in OCaml and consists of around 6K lines of code. The current front-end
of the tool accepts programs written in a (subset of) C with #if directives, but
without struct and union types. It currently provides only limited support
for arrays, pointers, and recursion. The only basic data type is mathematical
integers. SPLNum²Analyzer automatically infers numerical invariants in all
program locations corresponding to all variants in the given family. We use
delayed widening and narrowing [7,24] to improve the precision of while-s.
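Delayed widening is the standard trick of joining the first few loop iterates and only switching to widening afterwards; a minimal sketch (ours; DELAY is a hypothetical tuning parameter, and join and widen stand for the domain's operations):

# Sketch (ours) of delayed widening at a loop head.
DELAY = 3                       # hypothetical number of delayed iterations

def widen_delayed(old, new, iteration, join, widen):
    return join(old, new) if iteration < DELAY else widen(old, new)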
Table 1: Performance results for lifted static analyses based on decision trees vs.
tuples (which are used as baseline). All times are in seconds.
Benchmark | folder | |F| | |K| | LOC | AT(I): Time, Impr. | AT(O): Time, Impr. | AT(P): Time, Impr.
half_2.c invgen 2 36 60 0.010 2.4× 0.017 3.5× 0.022 4.6×
heapsort.c invgen 2 36 60 0.036 2.2× 0.226 1.1× 0.191 2.0×
seq.c invgen 3 125 40 0.039 9.3× 0.460 4.3× 0.164 11×
eq1.c loops 2 36 20 0.015 3.4× 0.049 3.1× 0.052 4×
eq2.c loops 2 25 20 0.013 1.9× 0.047 1.3× 0.040 1.9×
sum01*.c loops 2 25 20 0.016 1.7× 0.086 1.5× 0.062 2.2×
hhk2008.c lit 3 216 30 0.023 10× 0.153 4.5× 0.074 12.5×
gsv2008.c lit 2 25 25 0.013 1.5× 0.035 1.2× 0.037 2×
gcnr2008.c lit 2 25 30 0.021 2× 0.070 2.1× 0.102 2.6×
Toulouse*.c crafted 3 125 75 0.043 6.1× 0.259 2.4× 0.175 7.6×
Mysore.c crafted 3 125 35 0.019 3.7× 0.090 1.1× 0.056 5.4×
copyfd.c BusyBox 1 16 84 0.013 3.9× 0.041 6.2× 0.054 5.2×
real_path.c BusyBox 2 128 45 0.023 14× 0.077 28× 0.085 32×
For each benchmark, we list the file name (Benchmark), the category (folder), the
number of features and configurations (|F|, |K|), and lines of code (LOC).
Performance Results. Table 1 shows the results of analyzing our benchmark files
by using different versions of our lifted static analyses based on decision trees
and on tuples. For each version of decision tree-based lifted analysis, there are
two columns. In the first column, Time, we report the running time in seconds
to analyze the given benchmark using the corresponding version of lifted analysis
based on decision trees. In the second column, Impr., we report the speed up
factor for each version of lifted analysis based on decision trees relative to the
corresponding baseline lifted analysis based on tuples (AT (I) vs. AΠ (I), AT (O)
vs. AΠ(O), and AT(P) vs. AΠ(P)). The performance results confirm that sharing
is indeed effective, especially for large values of |K|. On our benchmarks,
it translates into speed-ups (AT(−) vs. AΠ(−)) that range from 1.1 to 4.6
times when |K| < 100, and from 3.7 to 32 times when |K| > 100.
value in the range from 2, when A1 and A2 are assigned 0, down to 0 when
A2 ≥ 1. The analysis results at location 4 of test3() obtained using AΠ(P) and
AT(P) are shown in Fig. 9 and Fig. 10, respectively. AΠ(P) uses tuples with 9
interval properties (components), while AT(P) uses 3 interval properties (leaves).

Fig. 9: AΠ(P) results at location 4 of test3():
(A1=0∧A2=0 ↦ [i = 2], A1=0∧A2=1 ↦ [i = 0], A1=0∧A2=2 ↦ [i = 0],
 A1=1∧A2=0 ↦ [i = 1], A1=1∧A2=1 ↦ [i = 0], A1=1∧A2=2 ↦ [i = 0],
 A1=2∧A2=0 ↦ [i = 1], A1=2∧A2=1 ↦ [i = 0], A1=2∧A2=2 ↦ [i = 0]).

Fig. 10: AT(P) results at location 4 of test3():
[[A2 = 0 : [[A1 = 0 : ⟪[i = 2]⟫, ⟪[i = 1]⟫]], ⟪[i = 0]⟫]].
7 Related Work
Decision-tree abstract domains have been successfully used in the field of abstract
interpretation recently [18,9,4,26]. Decision trees have been applied for the
disjunctive refinement of the Interval domain [18]. That is, each element of the new
domain is a propositional formula over interval linear constraints. Segmented
decision tree abstract domains have also been defined [9,4] to enable path-dependent
static analysis. Their elements contain decision nodes that are determined either by
values of program variables [9] or by the branch (if) conditions [4], whereas the
leaf nodes are numerical properties. Urban and Miné [26] use decision tree-based
abstract domains to prove program termination. Decision nodes are labelled
with linear constraints that split the memory space and leaf nodes contain affine
ranking functions for proving program termination.
Recently, two main styles of static analysis have been a topic of considerable
research in the SPL community: a dataflow analysis from the monotone framework
developed by Kildall [21] that is algorithmically defined on syntactic CFGs, and an
abstract interpretation-based static analysis developed by Cousot and Cousot [7]
that is more general and semantically defined. Brabrand et al. [3] lift a dataflow
analysis from the monotone framework, resulting in a tuple-based lifted dataflow
analysis. Another efficient implementation of the lifted dataflow analysis from the
monotone framework is based on using variational data structures [27]. Midtgaard
et al. [22] have proposed a formal methodology for systematic derivation of tuple-
based lifted static analyses in the abstract interpretation framework. A more
efficient lifted static analysis by abstract interpretation obtained by improving
representation via BDD domains is given in [11]. Another approach to speed up
lifted analyses is by using so-called variability abstractions [14,15], which are
used to derive abstract lifted analyses. They tame the combinatorial explosion
of the number of configurations and reduce it to something more tractable by
manipulating the configuration space. The work [5] presents a model checking
technique to analyze probabilistic program families.
8 Conclusion
References
1. Sven Apel, Hendrik Speidel, Philipp Wendler, Alexander von Rhein, and Dirk
Beyer. Detection of feature interactions using feature-aware verification. In 26th
IEEE/ACM International Conference on Automated Software Engineering (ASE
2011), pages 372–375, 2011.
2. Sven Apel, Alexander von Rhein, Philipp Wendler, Armin Größlinger, and Dirk
Beyer. Strategies for product-line verification: case studies and experiments. In
35th Intern. Conference on Software Engineering, ICSE ’13, pages 482–491, 2013.
3. Claus Brabrand, Márcio Ribeiro, Társis Tolêdo, Johnni Winther, and Paulo Borba.
Intraprocedural dataflow analysis for software product lines. T. Aspect-Oriented
Software Development, 10:73–108, 2013.
4. Junjie Chen and Patrick Cousot. A binary decision tree abstract domain functor.
In Static Analysis - 22nd International Symposium, SAS 2015, Proceedings, volume
9291 of LNCS, pages 36–53. Springer, 2015.
5. Philipp Chrszon, Clemens Dubslaff, Sascha Klüppelholz, and Christel Baier. Profeat:
feature-oriented engineering for family-based probabilistic model checking. Formal
Aspects Comput., 30(1):45–75, 2018.
6. Paul Clements and Linda Northrop. Software Product Lines: Practices and Patterns.
Addison-Wesley, 2001.
7. Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model
for static analysis of programs by construction or approximation of fixpoints. In
Conference Record of the Fourth ACM Symposium on Principles of Programming
Languages, pages 238–252. ACM, 1977.
8. Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné,
David Monniaux, and Xavier Rival. The ASTRÉE analyzer. In Programming Languages
and Systems, 14th European Symposium on Programming, ESOP 2005, Proceedings,
volume 3444 of LNCS, pages 21–30. Springer, 2005.
9. Patrick Cousot, Radhia Cousot, and Laurent Mauborgne. A scalable segmented
decision tree abstract domain. In Time for Verification, Essays in Memory of Amir
Pnueli, volume 6200 of LNCS, pages 72–95. Springer, 2010.
10. Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear restraints
among variables of a program. In Conference Record of the Fifth Annual ACM
Symposium on Principles of Programming Languages (POPL’78), pages 84–96.
ACM Press, 1978.
11. Aleksandar S. Dimovski. Lifted static analysis using a binary decision diagram
abstract domain. In Proceedings of the 18th ACM SIGPLAN International Confer-
ence on Generative Programming: Concepts and Experiences, GPCE 2019, pages
102–114. ACM, 2019.
12. Aleksandar S. Dimovski. On calculating assertion probabilities for program families.
Prilozi Contributions, Sec. Nat. Math. Biotech. Sci, MASA, 41(1):13–23, 2020.
13. Aleksandar S. Dimovski, Sven Apel, and Axel Legay. A decision tree lifted domain
for analyzing program families with numerical features (extended version). CoRR,
abs/2012.05863, 2020.
14. Aleksandar S. Dimovski, Claus Brabrand, and Andrzej Wasowski. Variability
abstractions: Trading precision for speed in family-based analyses. In 29th European
Conference on Object-Oriented Programming, ECOOP 2015, volume 37 of LIPIcs,
pages 247–270. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2015.
15. Aleksandar S. Dimovski, Claus Brabrand, and Andrzej Wasowski. Finding suitable
variability abstractions for lifted analysis. Formal Aspects Comput., 31(2):231–259,
2019.
16. Aleksandar S. Dimovski and Axel Legay. Computing program reliability using
forward-backward precondition analysis and model counting. In Fundamental
Approaches to Software Engineering - 23rd International Conference, FASE 2020,
Proceedings, volume 12076 of LNCS, pages 182–202. Springer, 2020.
17. Philippe Granger. Static analysis of arithmetical congruences. International Journal
of Computer Mathematics, 30(3-4):165–190, 1989.
18. Arie Gurfinkel and Sagar Chaki. Boxes: A symbolic abstract domain of boxes. In
Static Analysis - 17th International Symposium, SAS 2010. Proceedings, volume
6337 of LNCS, pages 287–303. Springer, 2010.
19. Bertrand Jeannet and Antoine Miné. Apron: A library of numerical abstract domains
for static analysis. In Computer Aided Verification, 21st Intern. Conference, CAV
2009. Proceedings, volume 5643 of LNCS, pages 661–667. Springer, 2009.
20. Christian Kästner. Virtual Separation of Concerns: Toward Preprocessors 2.0. PhD
thesis, University of Magdeburg, Germany, May 2010.
21. Gary A. Kildall. A unified approach to global program optimization. In Con-
ference Record of the ACM Symposium on Principles of Programming Languages,
(POPL’73), pages 194–206, 1973.
22. Jan Midtgaard, Aleksandar S. Dimovski, Claus Brabrand, and Andrzej Wasowski.
Systematic derivation of correct variability-aware program analyses. Sci. Comput.
Program., 105:145–170, 2015.
23. Antoine Miné. The octagon abstract domain. Higher-Order and Symbolic Compu-
tation, 19(1):31–100, 2006.
24. Antoine Miné. Tutorial on static inference of numeric invariants by abstract
interpretation. Foundations and Trends in Programming Languages, 4(3-4):120–372,
2017.
25. Daniel-Jesus Munoz, Jeho Oh, Mónica Pinto, Lidia Fuentes, and Don S. Batory.
Uniform random sampling product configurations of feature models that have
numerical features. In Proceedings of the 23rd International Systems and Software
Product Line Conference, SPLC 2019, Volume A, pages 39:1–39:13. ACM, 2019.
26. Caterina Urban and Antoine Miné. A decision tree abstract domain for proving
conditional termination. In Static Analysis - 21st International Symposium, SAS
2014. Proceedings, volume 8723 of LNCS, pages 302–318. Springer, 2014.
27. Alexander von Rhein, Jörg Liebig, Andreas Janker, Christian Kästner, and Sven
Apel. Variability-aware static analysis at scale: An empirical study. ACM Trans.
Softw. Eng. Methodol., 27(4):18:1–18:33, 2018.
Finding a Universal Execution Strategy for
Model Transformation Networks
1 Introduction
When modelling systems, one is often confronted with the task of model consis-
tency: Since model-driven development aims at separating concerns by tailoring
models to the needs of the people working on the system, there are typically
different models, each one capturing the parts of the system that are relevant to
the model’s target audience. All those models taken together should describe a
coherent system and not contain contradictory information. We say that the mod-
els should be consistent. Automatic detection and resolution of inconsistencies
are, however, still poorly addressed in current development processes [12].
There are different means of maintaining consistency. A popular one is to define
incremental model transformations, which update models based on information
that was changed in one of them. While there has been significant research
on model transformations themselves, particularly on binary transformations,
maintaining consistency of multiple models is less researched [2]. There are
approaches for multiary model transformations which can transform between
multiple models by means of a single transformation. Nevertheless, one will likely
This work was supported by funding of the Helmholtz Association (HGF) through
the Competence Center for Applied Security Technology (KASTEL).
2 Problem Statement
In this section, we will further motivate our research by giving an example and
clarifying its context. We provide a formalisation for transformation networks
and execution strategies to generate a common understanding and formal basis
for transformation network orchestration, constituting contribution C1 .
2.1 Example

[Fig. 1: The transformation network of the example, connecting the models used
by developers, architects, and UX designers: UML, Java, UI, OpenAPI, and PCM.]

Developers
program the software in Java. These two models overlap: Although they cannot
be derived completely from each other, the implementation should follow the
architecture and architects want to see how code changes affect the architecture.
UX designers develop the UI for the software. Their designs overlap with
the UML model, because, first, the software’s requirements mandate certain
properties of the UI, and, second, the architecture may restrict which information
can be shown at which point in the interface. The UI design also overlaps with
the code, since static parts of the UI can be derived from the UI model. Ideally,
changes in the UI code can even be propagated back into the UI model.
The developers use OpenAPI™ [32] to exchange specifications of HTTP APIs.
These specifications overlap with the parsing and serialisation code. Architects
want to analyse how their architecture choices influence performance, using the
Palladio Component Model (PCM) [24]. The architecture specification used in
the PCM overlaps with the one defined in UML. Additionally, the PCM model
contains information about performance properties and the deployment structure,
which can partially be derived from the code.
These relations can be encoded in transformations to avoid re-specifying
similar information, such as the architecture in PCM and UML, to derive
information, like appropriate Java stubs from OpenAPI specifications, and to
preserve information consistency. Figure 1 shows the resulting transformation
network. In this paper, we will find an execution strategy for such transformations,
which is needed to correctly propagate changes from one model to the others.
2.2 Context
We discuss model transformation networks in a specific usage context. We assume
that different roles are involved in a development project, each using some
models to describe their view of the system. The models are kept consistent
by model transformations. For the sake of simplicity, we only discuss binary
transformations between two models. To foster independent specification and reuse
of transformations, we assume that they are not tailor-made, but may be general-
purpose. As a consequence, we cannot assume that the models or transformations
are or can be aligned, for example, to ensure that their execution in a specific
order always results in consistent models. Neither can we assume that the network
has a certain topology. We do, however, assume that all transformations are in
accordance to a well-defined overall notion of consistency (reaching a consistent
state would be impossible otherwise). This means that all requirements we pose
on the transformations must only concern a transformation itself. A requirement
like “no transformation overwrites the result of another” would not fit our context.
We require that transformations are synchronising [4], i.e., that they can deal
with the situation that both of their models have been changed. This is essential
to find an execution strategy: When propagating changes in a transformation
network that contains cycles, it will inevitably happen that both models that are
connected by a transformation will be changed. In addition, the well-researched
bidirectional transformations only change one of the models [28] and could in
such a situation be forced to overwrite changes to yield a consistent result. This
assumption also enables concurrent modifications by different project members.
2.3 Formalisation
We are not concerned with how models are structured, so we simply resort to
defining a universe M that contains all models. First, we define the kind of
transformations that we use:
Definition 1. A synchronising binary transformation (syncx) t is a function
that updates two models:
t : (M × M) → (M × M)
The image of a syncx consists of fixed points:

∀a ∈ M, ∀b ∈ M : t(t(a, b)) = t(a, b)
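To illustrate the definition, here is a minimal sketch (ours, in Python; the model universe and the consistency notion are placeholder choices, not taken from the paper):

# Sketch (ours) of Definition 1: a syncx whose results are fixed points.
from typing import Callable, Tuple

Model = int                                  # placeholder model universe
Syncx = Callable[[Model, Model], Tuple[Model, Model]]

def equalize(a: Model, b: Model) -> Tuple[Model, Model]:
    m = max(a, b)                            # consistency here: equal values
    return (m, m)

assert equalize(*equalize(3, 7)) == equalize(3, 7)   # fixed-point law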
An execution strategy will not always be able to find a consistent new model
assignment (i.e., there will be some N, M such that S(N, M ) = ⊥). First, there
may not be a consistent model assignment at all (i.e., RN = ∅). Second, there may
be a consistent model assignment but no execution order of the transformations
that yields that assignment [30, 16]. We call such inputs “unresolvable” [30].
Conversely, if there is an execution order of the transformations that yields a
consistent model assignment, we call the inputs “resolvable”.
An execution strategy may even fail for resolvable inputs: The execution
strategy may not “find” a consistent model assignment, even though it is reachable.
For example, the strategy may abort before having executed the transformations
often enough, or finding the assignment might require an order of execution
which the strategy does not consider. We call such a strategy “conservative”:
The higher the probability that an execution strategy yields a result for
resolvable inputs (we also say the lower its “level of conservativeness”), the more
useful the strategy will be. It is, however, also desirable that the strategy is
predictable, meaning that one can determine beforehand for which inputs the
strategy will succeed. For example, it would be useful to know whether a strategy
yields a result for a given network for any resolvable model assignment. Informally
speaking, we would like to have an “easy-to-check” criterion for transformation
networks determining whether this is the case. An even better criterion could be
applied to a single syncx, such that the strategy can resolve all inputs with a
network of syncx that fulfil the criterion. This would be ideal for the motivated
context of independently developing and freely combining syncx to a network.
To summarise, we aim to find a correct, hippocratic execution strategy that is
able to keep models consistent via transformation networks. The strategy should
succeed for realistic inputs with a high probability. Additionally, we aim to find
criteria that determine the cases in which the strategy will succeed.
3 Related Work
Approaches for restoring model consistency have been subject to intensive research,
surveyed by Macedo et al. [21]. Model transformations are a well-researched option,
and several tools and languages have been developed to support them [27, 18, 25].
Research has, however, mainly focused on consistency between two models, which
also concerns theoretical properties like termination as one of the properties
that we investigate for the execution of transformation networks [7]. Maintaining
consistency between more than two models has recently gained more attention,
especially in terms of a dedicated Dagstuhl seminar [2]. The central approaches
of multiary transformations and networks of binary transformations can be
distinguished. In Section 1, we have discussed that multiary transformations are
complex to specify, whereas networks of binary transformations have limited
expressiveness [30], which does, however, not seem to be practically relevant [2].
[Fig. 2. Example yielding inconsistent models after executing each transformation
once; numbers in italics indicate the order in which changes are performed. The
figure relates a UML interface ExampleService (+ getExamples()), the Java
interface ExampleService { public List<Example> getExamples(); }, the generated
class ExampleServer implements ExampleService { ... }, and the OpenAPI
endpoint GET /example.]
Execution Strategies: Di Rocco et al. [3] describe a simple strategy for or-
chestrating transformations, but make strong assumptions requiring that each of
them is only applied once. Stevens [30] proposes a strategy that also executes each
transformation only once in one direction. It includes a notion of authoritative
models, which are not allowed to be changed, and does not consider synchronising
transformations. Likewise, Stevens [29] proposes to find an orientation model
defining in which direction transformations are executed. If, however, several
transformations modify the same model, the approach leaves it to the developer
to determine an execution order after which all consistency relations hold. Such
strategies are only correct if the network is a tree, or if no transformations interfere
with each other. We present a simple scenario in which this is already too limiting
in Section 4.1. We overcome this limitation by executing transformations more
than once and thereby letting them “negotiate” a result even if they interfere,
which yields a universal execution strategy for arbitrary network topologies.
4 Design Space
We use the example of Section 2.1, and focus on the UML, Java and OpenAPI
models to consider the scenario visualised in Figure 2: An architect creates a new
UML interface and applies an execution strategy that executes every transforma-
tion once. First, the UML-to-Java syncx creates an appropriate interface in Java.
The OpenAPI-to-Java syncx recognises that the interface should be exposed
via an HTTP API and creates a matching endpoint in the OpenAPI model.
Additionally, it creates a stub implementation with parsing and serialisation code
in Java. The stub implementation classes can, however, not be propagated back
to UML, because the UML-to-Java syncx has already been executed.
We see that if we limit the number of executions to one per transformation,
transformations cannot propagate back the changes that other transformations
have made. However, in the context described in Section 2.2, it is necessary that
transformations are able to “react” to the changes made by other transformations.
This offers, for instance, separation of concerns: The logic for a certain aspect of
consistency can be put in only one transformation and other transformations will
propagate it throughout the network. Without such a mechanism, all aspects of
consistency would need to be implemented in all transformations. This would
cause duplication of logic and reduce reusability of transformations, which would
be impractical and contradicts our assumption of independent development. If
we added the logic for creating implementations of relevant Java interfaces to
the UML-to-Java syncx, then it would implicitly assume the presence of the
Java-to-OpenAPI syncx. It could, thus, not be easily reused in networks where
the Java-to-OpenAPI syncx is not used.
We can generalise the previous example. Let the model universe be the natural
numbers: M = ℕ₀. Let further, for any 1 ≤ j ≤ n, the syncx i_j be defined as

$$i_j : (a, b) \mapsto \begin{cases} (m+1,\, m+1) & \text{if } m = j \\ (m,\, m) & \text{else} \end{cases} \qquad \text{with } m := \max\{a, b\}$$

i_j sets both models to the higher number of the two, except if that number is j;
then i_j increments the result by one. This is an abstraction of syncx “reacting”
to each other: the i_j seek to set all models to the same value, except that after
i_{j−1} was executed, i_j changes its behaviour and increments the value by one.
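To make this abstraction concrete, the following minimal Python sketch (ours,
not part of the paper) implements the syncx family i_j and runs each
transformation of a small chain exactly once:

def make_syncx(j):
    """Syncx i_j: set both models to max(a, b), but increment when that max equals j."""
    def i_j(a, b):
        m = max(a, b)
        return (m + 1, m + 1) if m == j else (m, m)
    return i_j

# A chain of three models with syncx i_1 and i_2 between neighbours.
models = [1, 0, 0]
syncx = {(0, 1): make_syncx(1), (1, 2): make_syncx(2)}

# Executing each syncx once, left to right, does not suffice:
for (a, b), t in syncx.items():
    models[a], models[b] = t(models[a], models[b])
print(models)  # [2, 3, 3] -- i_2 reacted to i_1's result, so i_1 is inconsistent again

Executing i_1 once more settles all models at 3; the following construction
scales this repeated “negotiation” up to n required executions.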
We now construct the transformation network N_n for n = 2k, k ∈ ℕ₊ (see
Figure 3), with n indicating the number of syncx within the network, and examine
how many executions it requires:

$$T_n = (i, i+1) \mapsto \begin{cases} i_{2i} & \text{if } i \leq \frac{n}{2} \\ i_{2i-n-1} & \text{else} \end{cases}$$

$$N_n = \big(([1, n+1],\ \{(i, i+1) \mid i \in [1, n]\}),\ T_n\big)$$
Lemma 1. i_n must be executed at least n times to resolve N_n with the initial
model assignment

$$M_1 : i \mapsto \begin{cases} 1 & \text{if } i = 1 \\ 0 & \text{else} \end{cases}$$
Theorem 1. For any execution strategy that uses O(1) executions of each trans-
formation, there are inputs that the execution strategy cannot resolve.
and b (with −1, 0 and 1 indicating the head movements “left”, “stay” and “right”).
We define T_tm with w|_{p←r} := w[0 .. p−1] · r · w[p+1 .. |w|−1] such that:

$$\forall (a, b) \in E : T_{tm}(a,b)\big(\alpha =: (t_a, w_a, p_a),\ \beta =: (t_b, w_b, p_b)\big) = \begin{cases} (\alpha,\ (t_a+1,\ w_a|_{p_a \leftarrow r},\ p_a+d)) & \text{if } t_a > t_b \wedge \exists\, (w_a[p_a], d, r) \in Tr(a, b) \\ ((t_b+1,\ w_b|_{p_b \leftarrow r},\ p_b+d),\ \beta) & \text{if } t_a < t_b \wedge \exists\, (w_b[p_b], d, r) \in Tr(b, a) \\ (\alpha,\ \beta) & \text{else} \end{cases}$$
1. There is exactly one v ∈ V such that the model M_i(v) =: (t, x, p) has the
highest timestamp t of all models in Im(M_i).
2. There is at most one edge (a, b) ∈ E whose transformation is inconsistent, i.e.,
(M_i(a), M_i(b)) ∉ R_{T_tm(a,b)}. This follows from the definitions of tm and the
last executed transformation. Additionally, a = v or b = v, because otherwise
there would have been two transformations to which models in Im(M_{i−1}) are
inconsistent. We assume without loss of generality a = v.
3. If (a, b) exists, then m := M_{i+1}(b) will contain the same tape content and the
same tape position as would result if tm was executed one step from state v
with tape content x and tape position p. Additionally, m will be the model
with the highest timestamp of all models in Im(M_{i+1}).
4. (a, b) does not exist if, and only if, tm would halt in state v with tape content
x and tape position p.
Proof. It follows from Lemma 2 that deciding whether S terminates would decide
the halting problem for a universal Turing machine.
Even worse, this construction makes it unlikely that we will find a practicable
criterion that ensures the success of an execution strategy, as motivated in
Section 2.4. Because we want the criterion to apply to a single syncx, it would
need to restrict the syncx so much that building a network simulating Turing
machines out of such syncx becomes impossible. But since the definition of the
syncx in Im(T_tm) is structurally simple, it seems unlikely that a syncx fulfilling
the hypothetical criterion would still be apt for most practical use cases.
We could avoid undecidability if we restricted the models’ size. The models
could then no longer store an unbounded tape and, thus, only simulate space-
restricted Turing machines. There is, however, no reasonable bound on the
necessary model size to which they could be limited. In consequence, determining
a universal space bound for models would be an arbitrary and thus impractical restriction.
Finally, one could question whether it is relevant if an execution strategy can
be guaranteed to terminate. Execution strategies will be used to tell users whether
changes they made can be incorporated into the other models automatically.
In consequence, users should get a reliable and timely response. We might
compare this situation to merging changes in version control systems. There,
users also want a reliable and timely response on whether their changes could be
incorporated automatically, or whether they need to resolve conflicts manually.
5 Proposed Strategy
but the last model assignment is only consistent with some of them. There would
be no clear pattern and few clues for users as to where to start investigating the
failure's cause. To improve explainability, the authors thus propose the following
principle for an execution order:
Since a syncx can change both models, executing it may result in models that
are inconsistent with the syncx that have been executed previously. Following
Principle 1, these inconsistencies should be addressed first. In effect, a strategy
applying the principle will maintain a subnetwork of syncx with a consistent model
assignment and try to expand the subnetwork transformation by transformation.
To exemplify how Principle 1 provides explainability, suppose that an execution
strategy applying that principle fails after having executed the set of syncx E ⊆ T.
Let t ∈ E be the last syncx that was executed for the first time. The strategy can
then inform users that integrating t into the subnetwork induced by E failed.
Furthermore, it can inform users that a result that is consistent with the syncx
in E \ {t} exists. By that, users gain valuable information for handling the error:
First, when trying to understand the error, they can ignore any syncx that is
not in E. Second, some aspect of consistency that is present in the consistency
relation realised by t, but absent in the consistency relations realised by the syncx
in E \ {t}, hinders the strategy from creating a consistent result. Third, when
users try to find a consistent model assignment manually, they can start with the
consistent result that exists for E \ {t} instead of having to start from scratch.
Fig. 4. Exemplary execution of the explanatory strategy for a change in the topmost
model, depicting the iterations (horizontal) and recursion steps (vertical).
We call this procedure the “explanatory strategy”. At a high level, it acts like this: Given a changed model assignment, the
strategy picks the next candidate syncx to execute. After executing the candidate,
the strategy calls itself on the subnetwork formed by the already executed syncx.
By that, it propagates the changes of the last execution throughout the sub-
network and ensures that they are consistent with the executed syncx. Finally,
the strategy executes the initial candidate again to ensure that the changes added
during the subnetwork propagation are consistent with the candidate. If that
repeated execution of the candidate generates new changes in any model that
is kept consistent by an already executed syncx, the execution fails, because
the candidate does not fulfil the definition of being N-converging, as we will
see in the following. In that case, the procedure returns the already executed
syncx, together with the changes that restored consistency to them, in order to
support users in examining the reasons for the failure. If the models
are consistent with the candidate, the strategy picks the next one. In effect,
the strategy realises Principle 1 in a recursive fashion and ensures that each
permutation of all yet executed syncx is executed at every recursion level.
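The following Python sketch (ours; state-based for brevity, whereas the actual
algorithm is change-based) illustrates this recursive structure:

def explanatory_strategy(network, models):
    """Sketch of the explanatory strategy (simplified: re-executes unconditionally).

    network: list of ((a, b), syncx) pairs; models: mutable dict node -> state.
    """
    executed = []
    for edge, t in network:
        a, b = edge
        models[a], models[b] = t(models[a], models[b])  # execute the candidate
        explanatory_strategy(executed, models)          # propagate through subnetwork
        # Re-execute the candidate; if it changes a model adjacent to an
        # already executed syncx, the candidate is not N-converging: fail.
        na, nb = t(models[a], models[b])
        touched = {n for (e, _) in executed for n in e}
        if (na != models[a] and a in touched) or (nb != models[b] and b in touched):
            raise RuntimeError(f"cannot integrate {edge} with {[e for e, _ in executed]}")
        models[a], models[b] = na, nb
        executed.append((edge, t))
    return models

Note the two executions of each candidate per loop iteration, once before and
once after the recursive propagation; this is what the 2m term in the complexity
proof below counts.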
Figure 4 depicts an exemplary execution of the strategy for a network with
four models and four transformations. We assume that after an initially consistent
state of the models, the topmost one was modified. We can see that each recursion
only treats the subnetwork of previously executed transformations. Hence, the
network gets smaller at each recursion level.
Unlike the formalisation in Section 2.3, the presented algorithm is based on
changes instead of model states. Changes contain information that cannot be
recovered by comparing model states [6]. Thus in practice, we want to support
change-based execution. The algorithm also uses changes to determine potential
candidates for the next transformation to execute: It only picks candidates that
are adjacent to a model that was changed. The input changes describe all changes
that occurred since the last model assignment M that was known to be consistent.
The procedure returns accumulatedChanges that, when applied to M, yield a
new model assignment M′. For our formalisation, M′ is the algorithm's output.
Proof. Because all called functions terminate, only the loop (Line 5) and the
recursive call in Line 8 can lead to non-termination. Let m denote the number
of edges of network. The set executed is initialised to be empty (Line 2) and
grows by one element in every iteration of the loop. The loop is executed no more
than m times, because after m iterations there is no transformation that is not
in executed and, thus, the loop condition cannot be fulfilled.
The recursive call receives a network that is smaller than network in terms of
edges, because it does not contain the current candidate. If network is empty,
then the algorithm will not enter the loop and not make a recursive call. Hence,
the recursion stack never grows deeper than m.
Proof. Let T (m) denote the number of syncx executions the algorithm invokes
for a network with m edges. The set executed is initialised to be empty and
grows by one syncx every loop iteration (Line 13). It follows that the recursive
call in Line 8 receives a network that is one syncx larger each time. Thus, we find
$$T(0) = 0, \qquad T(m) = 2m + \sum_{i=0}^{m-1} T(i) = 2 + 2\,T(m-1) = 2\,(2^m - 1) \in O(2^m)$$
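A quick numeric check of this closed form (our sketch):

def T(m):
    # T(m) = 2m + sum of T(i) for i < m, as in the recurrence above
    return 0 if m == 0 else 2 * m + sum(T(i) for i in range(m))

assert all(T(m) == 2 * (2 ** m - 1) for m in range(12))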
Next, we show that the strategy fulfils the fundamental Requirements 1 and 2
regarding correctness and hippocraticness, which we defined in Section 2.4.
Proof. Assume the contrary, i.e., that the strategy produces a model assignment
M for network N such that M ∉ R_N. That means that there is an edge (a, b) ∈ E
such that (M(a), M(b)) ∉ R_t, where t := T(a, b). We distinguish these cases:
1. t was never executed. Then accumulatedChanges never contained any change
adjacent to a or b (Line 5). Since the initial changes were relative to a
consistent model assignment, we know that (M(a), M(b)) ∈ R_t.
2. t was executed and no other transformation adjacent to a or b was executed
afterwards. Then (M(a), M(b)) ∈ R_t per definition.
3. t was executed and another transformation u adjacent to a or b was executed
afterwards. Because u was executed after t, t was in executed when u was the
candidate. So t's last execution was in the recursion after u's first execution
in Line 6. Afterwards, u was only executed in Line 9. Had u changed
M(a) or M(b), the strategy would have raised a failure. Hence, M(a) and
M(b) are the same as after the execution of t, and (M(a), M(b)) ∈ R_t.
All cases lead to a contradiction.
Proof. The strategy only produces changes by executing syncx, which, per defini-
tion, only generate changes if the models are not in their consistency relations.
Finally, we verify that we have indeed realised Principle 1 and that the
strategy does not fail for a network N of only N-converging transformations.
Proof. After the recursive call in Line 8, the current model assignment is consistent
with all executed syncx (Theorem 5) and no changes to models adjacent to an
executed syncx are allowed.
Proof. First, we note that when calling the algorithm on a network with m trans-
formations, the first m − 1 iterations of the loop act identically to executing the
algorithm on a network without the last candidate. Second, we note that the sec-
ond part of the loop condition, “accumulatedChanges.adjacentTo (candidate)”
(Line 5), does not change the algorithm’s result apart from controlling the order
in which the syncx are executed. If any syncx was never executed because of
this condition, then executing it would not have changed any model. Hence, we
assume w.l.o.g. that all syncx in network will get executed.
Now we show the following, stronger statement by induction over the number
m of edges in network: “After running the explanatory strategy, the sequence
of executed syncx contains each permutation of those syncx (not necessarily
continuously)”. Since the transformations are network-converging and because
of our first note above, proving this statement shows that the condition leading
to a failure (Line 10) will never evaluate to true. The statement is trivially true
for m = 1. Assume that the statement is true for all networks of size 1 ≤ n < m
but not true for a network of size m. That means that after executing the last
iteration of the loop, there is an order o of the m syncx in network in which they
have not been executed yet. Let t be the candidate of the last iteration. Let j be
the index of t in o. Per induction assumption, the order o[1] . . . o[j −1] has been
executed in the previous iterations of the loop. Afterwards, t was executed in
Line 6. Per induction assumption, the order o[j +1] . . . o[m] has been executed in
the recursive call (Line 8) of the last iteration. This happened after Line 6. Hence,
the transformations have been executed in the order o. This is a contradiction.
The explanatory strategy only guarantees to produce a consistent model
assignment if all syncx are N-converging. We can, unfortunately, not provide an
approach to achieve N-convergence by construction or to determine N-convergence.
We have, however, also discussed that every universal execution strategy needs to
operate conservatively and thus fails in certain cases. Thus, even if a network N
contains syncx that are not N-converging, the explanatory strategy still operates
conservatively and at least fails based on a sensible and well-defined
property. In addition, the exponential worst-case performance of the strategy is
no limitation, because it only represents a bound to ensure termination. In
cases in which the strategy terminates, we expect the repeated execution of each
syncx to perform only few changes in reaction to the changes made by other syncx,
as otherwise they would be unlikely to be N-converging. The interested reader can try
out the explanatory strategy using the previously mentioned simulator [11].
In its current formulation, the explanatory strategy does not prevent the
syncx from overwriting the initial user changes. This seems inappropriate, as
user changes should usually not be reverted. Other authors address this issue by
forbidding changes to models that have been edited by users [3, 30, 29], called
“authoritative models”. There are, however, practical use cases where such changes
should be allowed—the example in Section 4.1 is one of them. An option would
be to let the strategy fail as soon as a syncx execution overwrites a user change.
6 Conclusion
In this paper, we have discussed influencing factors for designing a universal exe-
cution strategy for model transformation networks. Such a strategy orchestrates
transformations to create a consistent set of models. It involves determining
an order to execute the transformations in, and a bound for the number of
executions. We have proven that every universal execution strategy that always
terminates needs to be conservative, i.e., it will fail for certain cases in which an
execution order of transformations that yields a consistent solution exists. We
have argued that providing explainability in cases where an execution strategy
fails should be a central design goal. As a result, we have proposed the explanatory
strategy, which is proven correct and terminates for every input. Additionally, it
improves explainability of failures and has a well-defined bound for the number
of transformation executions to ensure a reasonable level of conservativeness.
We have formalised our findings on execution bounds and the behaviour of
the proposed execution strategy to prove the insights and expected properties of
the strategy. In consequence, this paper provides fundamental knowledge about
the design space and relevant design goals of transformation network execution
strategies. While the statements on correctness and well-definedness are proven,
those on the usefulness of the strategy were derived by argumentation. To improve
evidence of the results, the authors plan to apply the strategy to realistic use
cases, involving larger networks of more complex transformations.
Furthermore, the authors want to examine how the strategy can be further op-
timised: It might, e.g., be improved by backtracking and trying further candidate
transformations, or by selecting the next candidate more carefully. Since early
executed transformations will be executed most often, starting with those that
are least likely to cause conflicts might be beneficial. Finally, this paper assumes
transformations to be binary. Since the presented strategy does not require this,
future research could investigate transferability to multiary transformations.
References
1. Anjorin, A., Rose, S., Deckwerth, F., and Schürr, A.: “Efficient Model Synchro-
nization with View Triple Graph Grammars”. In: Modelling Foundations and
Applications, pp. 1–17. Springer International Publishing (2014)
2. Cleve, A., Kindler, E., Stevens, P., and Zaytsev, V.: “Multidirectional Transforma-
tions and Synchronisations (Dagstuhl Seminar 18491)”. Dagstuhl Reports 8(12),
1–48 (2019)
3. Di Rocco, J., Di Ruscio, D., Heinz, M., Iovino, L., Lämmel, R., and Pierantonio, A.:
“Consistency Recovery in Interactive Modeling”. In: 3rd International Workshop on
Executable Modeling co-Located with ACM/IEEE 20th International Conference
on Model Driven Engineering Languages and Systems. Vol-2019, pp. 116–122.
CEUR-WS.org (2017)
4. Diskin, Z., Gholizadeh, H., Wider, A., and Czarnecki, K.: “A Three-Dimensional Tax-
onomy for Bidirectional Model Synchronization”. Journal of Systems and Software
111, 298–322 (2016)
5. Diskin, Z., König, H., and Lawford, M.: “Multiple Model Synchronization with
Multiary Delta Lenses”. In: Fundamental Approaches to Software Engineering,
pp. 21–37. Springer International Publishing (2018)
6. Diskin, Z., Xiong, Y., Czarnecki, K., Ehrig, H., Hermann, F., and Orejas, F.: “From
State- to Delta-Based Bidirectional Model Transformations: The Symmetric Case”.
In: Model Driven Engineering Languages and Systems, pp. 304–318. Springer Berlin
Heidelberg (2011)
7. Ehrig, H., Ehrig, K., Lara, J. de, Taentzer, G., Varró, D., and Varró-Gyapay, S.:
“Termination Criteria for Model Transformation”. In: Fundamental Approaches to
Software Engineering, pp. 49–63. Springer Berlin Heidelberg (2005)
8. Etien, A., Aranega, V., Blanc, X., and Paige, R.F.: “Chaining Model Transforma-
tions”. In: First Workshop on the Analysis of Model Transformations, pp. 9–14.
ACM (2012)
9. Etien, A., Muller, A., Legrand, T., and Blanc, X.: “Combining Independent Model
Transformations”. In: 2010 ACM Symposium on Applied Computing, pp. 2237–2243.
ACM (2010)
10. Gleitze, J.: GitHub: Transformation Network Simulator, (2021). https://github.
com/jGleitz/transformationnetwork-simulator (visited on 01/14/2021)
11. Gleitze, J.: Transformation Network Simulator, (2021). https://jgleitz.github.io/
transformationnetwork-simulator (visited on 01/14/2021)
12. Guissouma, H., Klare, H., Sax, E., and Burger, E.: “An Empirical Study on the
Current and Future Challenges of Automotive Software Release and Configuration
Management”. In: 2018 44th Euromicro Conference on Software Engineering and
Advanced Applications, pp. 298–305. IEEE (2018)
13. Klare, H.: “Multi-model Consistency Preservation”. In: 21st ACM/IEEE Interna-
tional Conference on Model Driven Engineering Languages and Systems: Companion
Proceedings, pp. 156–161. ACM (2018)
14. Klare, H., and Gleitze, J.: “Commonalities for Preserving Consistency of Mul-
tiple Models”. In: 22nd ACM/IEEE International Conference on Model Driven
Engineering Languages and Systems Companion, pp. 371–378. IEEE (2019)
15. Klare, H., Kramer, M.E., Langhammer, M., Werle, D., Burger, E., and Reussner, R.:
“Enabling consistency in view-based system development – The Vitruvius approach”.
Journal of Systems and Software 171 (2020)
16. Klare, H., Syma, T., Burger, E., and Reussner, R.: “A Categorization of Interoper-
ability Issues in Networks of Transformations”. In: 12th International Conference
on Model Transformations. Journal of Object Technology (2019)
17. Königs, A., and Schürr, A.: “MDI: A Rule-based Multi-document and Tool Integra-
tion Approach”. Software and Systems Modeling 5(4), 349–368 (2006)
18. Kusel, A., Etzlstorfer, J., Kapsammer, E., Langer, P., Retschitzegger, W., Schoen-
boeck, J., Schwinger, W., and Wimmer, M.: “A Survey on Incremental Model
Transformation Approaches”. In: Workshop on Models and Evolution co-located
with ACM/IEEE 16th International Conference on Model Driven Engineering
Languages and Systems. Vol-1090, pp. 4–13. CEUR-WS.org (2013)
19. Lúcio, L., Mustafiz, S., Denil, J., Vangheluwe, H., and Jukss, M.: “FTG+PM:
An Integrated Framework for Investigating Model Transformation Chains”. In:
SDL 2013: Model-Driven Dependability Engineering, pp. 182–202. Springer Berlin
Heidelberg (2013)
20. Macedo, N., Cunha, A., and Pacheco, H.: “Towards a Framework for Multi-
Directional Model Transformations”. In: 3rd International Workshop on Bidirec-
tional Transformations. Vol-1133. CEUR-WS.org (2014)
21. Macedo, N., Jorge, T., and Cunha, A.: “A Feature-Based Classification of Model
Repair Approaches”. IEEE Transactions on Software Engineering 43(7), 615–640 (2017)
22. Object Management Group (OMG): “Meta Object Facility (MOF) 2.0—Query/
View/Transformation Specification”, Version 1.3 (2016)
23. Pilgrim, J. von, Vanhooff, B., Schulz-Gerlach, I., and Berbers, Y.: “Constructing and
Visualizing Transformation Chains”. In: Model Driven Architecture – Foundations
and Applications, pp. 17–32. Springer Berlin Heidelberg (2008)
24. Reussner, R.H., Becker, S., Happe, J., Heinrich, R., Koziolek, A., Koziolek, H.,
Kramer, M., and Krogmann, K.: “Modeling and Simulating Software Architectures
– the Palladio Approach”. MIT Press (2016)
25. Samimi-Dehkordi, L., Zamani, B., and Kolahdouz-Rahimi, S.: “Bidirectional Model
Transformation Approaches – A Comparative Study”. In: 6th International Confer-
ence on Computer and Knowledge Engineering, pp. 314–320. IEEE (2016)
26. Schürr, A.: “Specification of graph translators with triple graph grammars”. In:
Graph-Theoretic Concepts in Computer Science, pp. 151–163. Springer Berlin
Heidelberg (1995)
27. Stevens, P.: “A Landscape of Bidirectional Model Transformations”. In: Generative
and Transformational Techniques in Software Engineering II, pp. 408–424. Springer
Berlin Heidelberg (2008)
28. Stevens, P.: “Bidirectional Model Transformations in QVT: Semantic Issues and
Open Questions”. Software and Systems Modeling 9(1), 7 (2010)
29. Stevens, P.: “Connecting software build with maintaining consistency between
models: towards sound, optimal, and flexible building from megamodels”. Software
and Systems Modeling 19(4), 935–958 (2020)
30. Stevens, P.: “Maintaining consistency in networks of models: bidirectional transfor-
mations in the large”. Software and Systems Modeling 19(1), 39–65 (2020)
31. Stünkel, P., König, H., Lamo, Y., and Rutle, A.: “Multimodel Correspondence
through Inter-Model Constraints”. In: 2nd International Conference on Art, Science,
and Engineering of Programming Companion, pp. 9–17. ACM (2018)
32. The Linux Foundation: OpenAPI Initiative, (2021). https://www.openapis.org/
(visited on 01/14/2021)
33. Trollmann, F., and Albayrak, S.: “Extending Model Synchronization Results from
Triple Graph Grammars to Multiple Models”. In: Theory and Practice of Model
Transformations, pp. 91–106. Springer International Publishing (2016)
34. Trollmann, F., and Albayrak, S.: “Extending Model to Model Transformation
Results from Triple Graph Grammars to Multiple Models”. In: Theory and Practice
of Model Transformations, pp. 214–229. Springer International Publishing (2015)
35. Vanhooff, B., Ayed, D., Van Baelen, S., Joosen, W., and Berbers, Y.: “UniTI: A
Unified Transformation Infrastructure”. In: Model Driven Engineering Languages
and Systems, pp. 31–45. Springer Berlin Heidelberg (2007)
36. Wagelaar, D., Tisi, M., Cabot, J., and Jouault, F.: “Towards a General Composition
Semantics for Rule-Based Model Transformation”. In: Model Driven Engineering
Languages and Systems, pp. 623–637. Springer Berlin Heidelberg (2011)
37. Xiong, Y., Song, H., Hu, Z., and Takeichi, M.: “Synchronizing Concurrent Model
Updates Based on Bidirectional Transformation”. Software and Systems Modeling
12(1), 89–104 (2013)
Image Sources
paintingred: “Default Avatar Headshot Icons”, found on Vecteezy.
https://www.vecteezy.com/vector-art/141712-default-avatar-headshot-icons.
Vecteezy Free License.
Object Management Group: UML logo.
https://www.uml.org/index.htm.
Trademark.
Palladio logo.
https://sdqweb.ipd.kit.edu/wiki/File:Palladio-Logo-stilisiert-vektor.pdf.
Authorized use.
The Linux Foundation: OpenAPI™ logo.
https://github.com/OAI/OpenAPI-Style-Guide/blob/master/graphics/
vector/OpenAPI_Logo_Black.svg. Trademark.
Freepik: “Computer”.
https://www.flaticon.com/free-icon/computer_1077701.
Flaticon Basic License.
CoVEGI: Cooperative Verification via
Externally Generated Invariants
Jan Haltermann and Heike Wehrheim
1 Introduction
Recent years have seen a major progress in software verification as for instance
witnessed by the annual competition on software verification SV-COMP [2]. This
success is on the one hand due to advances in SAT and SMT solving and on the
other hand due to novel verification methods like interpolation in model check-
ing [36], automata-based software verification [31] or property directed reacha-
bility [16]. Still, automatic verification remains a complex and error-prone task.
In particular, it is often the case that one tool can verify a particular class
of programs, but fails to verify other classes (or even gives incorrect answers),
whereas the situation is reversed for another tool. Moreover, to keep their tools
up to date with novel techniques, tool developers keep re-implementing such
techniques within their own frameworks.
(One of the authors was partially supported by the German Research Foundation
(DFG) under contract WE2290/13-1.)
An approach for changing this unsatisfactory situation is cooperative veri-
fication (for an overview see [13]). Cooperative verification builds on the idea
of letting tools (and thus techniques) cooperate on verification tasks, thereby
leveraging the tool’s individual strengths. In particular, cooperative verification
aims at black box combinations of tools, using existing tools off-the-shelf without
re-implementation. While this sounds like a natural idea, its realization poses a
number of challenges, the major one being the exchange and usage of analysis in-
formation. For cooperation, tools are required to produce (partial) results which
other tools can understand and employ in their verification run. With conditional
model checking [7], the first proposal of an exchange format for verification re-
sults was made. A conditional model checker outputs its (potentially partial)
result in the form of a condition which can be read by other conditional model
checkers in order to complete the verification task. Since verification tools nor-
mally do not understand conditions, reducers [23,9] have been proposed to bring
conditions back into a form understandable by verifiers, namely into (residual)
programs describing the so far unverified program part. This allows the result
of a conditional model checker to be made usable by arbitrary other verifiers.
A second type of existing result usage is the validation of tools' results [4,34],
similar to proof-carrying code [37]. Both of these types are sequential forms
of cooperation: a first verifier starts and a second verifier continues, either by
completing or by validating a first result.
In this paper, we propose CoVEGI, a cooperation framework which comple-
ments these existing approaches by a new type of cooperation. Conceptually,
this framework (depicted in Figure 1) consists of a master verifier and a number
of helper invariant generators. The master verifier has the overall control on the
verification process and can delegate tasks to helpers as well as continue its own
verification process with (partial) results provided by helpers. The helpers run
in parallel as black boxes without cooperation. The task to be delegated is an in-
tegral part of software verification, namely invariant generation. The framework
allows cooperation via outsourcing the task of invariant generation, leveraging
the strength of specialized invariant generation tools.
Like for other types of cooperation, the question of the exchange format for
results comes up. Here, we have chosen correctness witnesses [3] for this purpose.
Correctness witnesses are employed in witness validation and certify a verifier’s
result stating the correctness of a program. These witnesses are particularly well
suited for our intended usage, because their format is standardized and a number
of verifiers already produce correctness witnesses. To account for the incorporation
of helper verifiers that do not produce witnesses, our framework also foresees the
inclusion of adapters transforming invariants into correctness witnesses. We pro-
vide an implementation of two such adapters. Witnesses are then injected into
the verification run of the master. For stating the task to be solved by invariant
(Fig. 1: The CoVEGI framework. The master verifier receives the program and the
property and outputs the result; a WitnessInjector feeds witnesses into it.
Several helper invariant generators run in parallel, each wrapped by a mapper
for its input and a witness adapter for its output.)
2 Fundamentals
We aim at the cooperative verification of programs written in GNU C, focusing
on the validation of safety properties. To be able to define safety properties, a
program is first represented as a control-flow automaton (CFA).
¹ https://llvm.org/docs/LangRef.html
A CFA (or program) C violates a safety property P = (ℓ, φ) when the program
reaches location ℓ in a state which does not satisfy φ. More formally, P is
violated by C if there is some path π ∈ paths(C),
$\pi = (\sigma_0, l_0) \xrightarrow{g_0} (\sigma_1, l_1) \cdots \xrightarrow{g_{n-1}} (\sigma_n, l_n)$,
and some i, 0 ≤ i ≤ n, such that l_i = ℓ and σ_i(φ) = false.
² In our formalization, we use integer variables only; the implementation covers C programs.
³ https://github.com/sosy-lab/sv-benchmarks
(a) C code example:

 1  int main() {
 2    unsigned int n = nondet();
 3    unsigned int x = n, y = 0;
 4    while (x > 0) {
 5      x--;
 6      y++; }
 7    // Safety property
 8    if (!(n == y)) {
 9      Error: return 1; }
10    return 0; }

(b) The corresponding CFA and (c) a correctness witness are shown as graphs in the original figure.

Fig. 2: An example program, its control flow automaton and one witness
(1) general information like the program associated with the witness, and (2)
a GraphML representation of the witness automaton. More information and a
formal specification of correctness witnesses can be found in [3].
In Figure 2(c), we see a correctness witness for our example program. State
q3 is reached by transitions labelled 3,enterLoopHead or 6,enterLoopHead and
thus corresponds to the loop head at program location 4. Associated with this
state is the invariant n = x + y.
3 Concept
The most important component of the framework is the master verifier, which
we build out of an existing verifier. The master is responsible for coordinating
the verification process and can, if needed, request support from the second type
of components, the helpers, in the form of invariants as described by correctness
witnesses. Hence, the master is also steering the cooperation.
In the following, we explain the two sorts of main components in more detail:
Master Verifier A master verifier gets as input the program C as CFA and a
safety property P . It computes as output a boolean answer b, stating whether
the property holds, and possibly (but not necessarily) provides an overall
witness ω. To be able to process the provided support in form of invariants
stored inside of correctness witnesses, a master is required to implement an
internal function called injectWitness. The function loads a witness, extracts
the invariants present in it and injects them into the analysis of the master
verifier. The witness injection can either happen before (re-)starting the
analysis or during runtime.
Helper Invariant Generator A helper invariant generator gets as input the
program C as CFA and a safety property P. It computes as output a set of
invariants, stored in a verification witness ω′. The generated invariants are
neither required to be helpful for the master verifier nor to be correct. Thus,
helper invariant generators are also allowed to generate trivial invariants or
invariant candidates which might turn out to be wrong.
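A minimal sketch of these two component interfaces, in Python (our reading of
the description; inject_witness mirrors the paper's injectWitness, the remaining
names are ours):

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Witness:
    """A correctness witness: invariants attached to program locations."""
    invariants: List[Tuple[int, str]]  # (location, invariant expression)

class MasterVerifier:
    def verify(self, cfa, prop) -> Tuple[bool, Optional[Witness]]:
        """Return the verdict b and possibly an overall witness."""
        raise NotImplementedError

    def inject_witness(self, witness: Witness) -> None:
        """Extract the invariants from the witness and inject them into the
        analysis, either before (re-)starting it or during runtime."""
        raise NotImplementedError

class HelperInvariantGenerator:
    def generate(self, cfa, prop) -> Witness:
        """Return candidate invariants as a witness; they may be trivial or
        even wrong -- the master validates them before use."""
        raise NotImplementedError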
We can neither expect existing verification tools which we wish to use as helpers
to be able to work on CFAs, nor to understand the safety property or to produce
witnesses. Hence, we foresee two further sorts of components in our framework:
Mapper A mapper transforms the safety property specification inside the pro-
gram into the desired input format of the helper. A mapper basically con-
ducts some simple syntactic code replacements. For instance, for our running
example some helpers might instead require the safety property to be written
as assert(n==y); or as if(!(n==y)) {verifier error();}.
Adapter An adapter generates a correctness witness out of the computed loop
invariants of a helper. Furthermore, some helper invariant generators work
on intermediate representations (IR) of the C-language (e.g. LLVM) or inter-
mediate verification languages (e.g. Boogie). Then, the computed invariants
(formulated in terms of IR-variables) first of all need to be translated back
to the namespace of the C-program. An adapter for LLVM is explained in
more detail in Section 3.4.
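For illustration, a mapper can be as simple as a textual rewrite of the property
encoding. A sketch (ours), assuming the error-label encoding of the running
example:

import re

def map_error_label_to_assert(c_source: str) -> str:
    """Rewrite 'if (!(cond)) { Error: return 1; }' into 'assert(cond);' for
    helpers expecting assert-style properties (simplified: assumes the
    single-statement pattern of the running example)."""
    pattern = r"if\s*\(\s*!\((.*?)\)\s*\)\s*\{\s*Error:\s*return\s+1;\s*\}"
    return re.sub(pattern, r"assert(\1);", c_source)

# Example: lines 8-9 of the program in Fig. 2(a)
print(map_error_label_to_assert("if (!(n == y)) { Error: return 1; }"))
# -> assert(n == y);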
Algorithm 1 CoVEGI-algorithm
Input:  C – CFA, P – safety property, M – master,
        Helpers – set of helpers, conf – configuration
Output: ω – witness, b – result
 1: M.start(C, P, conf.timerM);
 2: wait until (M.requestsForHelp ∨ M.hasSolution());
 3: if (M.hasSolution()) then
 4:     return M.getSolution();
 5: for each H ∈ Helpers do parallel                  ▷ run helpers in parallel
 6:     H.start(C, P, conf.timeoutH);
 7:     wait until (H.timedout() ∨ H.hasSolution() ∨ H.stopped());
 8:     if (H.hasSolution() ∧ nonTrivial(H.getSolution())) then
 9:         witnesses := witnesses ∪ H.getSolution();
10:         if (conf.termAfterFirstInv) then
11:             for each H′ ∈ Helpers \ {H} do parallel
12:                 H′.stop();                        ▷ stop other helpers
13: if (M.hasSolution()) then
14:     return M.getSolution();
15: if (witnesses ≠ ∅) then                           ▷ invariants found
16:     if (conf.restartMaster) then
17:         M.stop();
18:     M.inject(witnesses);                          ▷ inject witnesses into master
19:     if (conf.restartMaster) then
20:         M.start(C, P, ∞);
21: join(M);                                          ▷ wait for M to finish
22: return M.getSolution();
Timeouts Finally, similar to the master, we can set a specific timeout for the
helpers which fixes how long they are allowed to try to generate invariants.
The timeout option is called timeoutH.
witness set (line 9). Depending on the option termAfterFirstInv, either all but the
first finished helper are stopped, or the strategy waits until all helpers have either
computed a solution or run into their timeout. If invariants (witnesses) have been computed,
these are injected into the master (line 18). If the restartMaster option is set,
the master needs to be stopped before injection and restarted afterwards. Then
the master continues and completes its verification (without any further request
for help) and the result is finally returned.
Predicate Abstraction. The analysis follows a CEGAR (counterexample-guided
abstraction refinement) scheme [20] with lazy abstraction [33] and Craig interpolation [32].
Witness Injection: The predicate abstraction maintains, for each abstract state,
one set of available predicates (called precision) and one set of valid predicates.
Witness injection is realized by extracting all predicates and the corresponding
locations from the invariants. If these predicates contain conjunctions of clauses,
these are furthermore split up and inserted individually. Splitting predicates
increases the performance, since SMT solvers perform better on many small
predicates than on few larger ones⁵. These predicates are added to
the precision of abstract states corresponding to the locations specified in the
witness. Thereby, the predicates are used during the next abstraction performed
by the analysis. The abstraction function itself guarantees that only predicates
from the candidate set being valid at the current location are used. Thus, in-
valid invariants are ignored. This procedure can also be used when restarting
predicate abstraction, by adding the predicates from the witness to the initial
precision of the abstract states corresponding to the locations specified in the
witness (which is empty otherwise).
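A sketch of this injection step (ours; predicates are plain strings here, whereas
the analysis handles SMT formulas, and the extra conjunct x >= 0 is an assumed
example):

def inject_into_precision(invariants, precision):
    """Split each invariant into its conjuncts and add them to the precision
    of the abstract states at the given location (simplified)."""
    for location, invariant in invariants:
        for clause in invariant.split("&&"):  # split conjunctions
            precision.setdefault(location, set()).add(clause.strip())
    return precision

# Invariant from the witness of Fig. 2, with an assumed extra conjunct:
print(inject_into_precision([(4, "n == x + y && x >= 0")], {}))
# {4: {'n == x + y', 'x >= 0'}} (set order may vary)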
k-Induction. The basic idea of k-induction [25] is to generalize bounded
model checking (BMC) [14] via induction. After proving k-bounded program ex-
ecutions safe using BMC, a generalization is aimed for. To this end, it generates
auxiliary invariants that are continuously refined using a CEGAR-based analy-
sis [5]. These invariants are combined with the information generated by BMC
and generalized to a safety proof by successfully conducting an induction step.
Witness Injection: For both cases, adding invariants into a running analysis or
adding before restarting, we make use of the same idea: Whenever a witness is
made available to the analysis, the encoded predicates and the program loca-
tions are added as candidates to the set of auxiliary invariants, generated by
the analysis. New elements in this set are periodically checked for validity by k-
induction. Thereby, valid externally generated invariants are conjoined with the
predicates stored in the analysis' abstract states corresponding to the invariant's
location. Invalid invariants are thus ignored.
entry location (the first) and a single exit location (in general the last location of
the block). To construct a witness containing the invariants, we need to translate
them and find the matching C-code location for the basic block. For both, we
use the LLVM-IR equipped with debug information, obtained by compiling with
the flag -g. Thereby, we obtain the IR-code fragment of the program
in Figure 2(a), shown in simplified form and containing the most important
debug information as comments. The example contains two basic blocks, entry
and bb.
 1 entry:
 2   v1 = bitcast i32 (...)* @nondet to i32 ()*   ; n
 3   v2 = icmp eq i32 v1, 0
 4   br i1 v2, label %error, label %_bb
 5
 6 _bb:
 7   v3 = phi i32 [0, %entry], [v6, %_bb]         ; y
 8   v4 = phi i32 [v1, %entry], [v5, %_bb]        ; x
 9   v5 = add i32 v4, -1
10   v6 = add i32 v3, 1
11   v7 = icmp eq i32 v5, 0
12   br i1 v7, label %error, label %_bb           ; line 4
The helper invariant generator computes the invariant v1 − v4 − v3 = 0 for
the example and associates it with the basic block _bb. First, we need to
transform the variables from the IR to C-variables occurring in the program.
In this example we can use the debug information, as shown in comments in
the code. In general, a more sophisticated procedure is needed since LLVM-IR
uses a three address code. Therein, complex expressions are split into several
statements using intermediate variables which are resolved to C-expressions.
Afterwards, the transformed invariant needs to be associated with the correct
location in the C-code. We analyze the LLVM-IR program structure to map the
basic blocks back to C-locations. In the example, the block _bb is identified as
being the loop of the program; thus the invariant is mapped to the loop head.
For this, we employed some basic functions provided by PHASAR [41] in our
adapter. Finally, we construct the CFA of the C-program, store the invariants
at the nodes and convert the equipped CFA to a verification witness.
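A sketch of the variable-renaming step of such an adapter (ours; the real adapter
resolves three-address code via PHASAR, here we merely substitute names from a
debug-information map):

import re

def rename_ir_variables(invariant: str, debug_map: dict) -> str:
    """Replace IR value names by the C variable names recorded in the debug
    information (sketch; assumes the invariant is a plain string)."""
    return re.sub(r"\bv\d+\b", lambda m: debug_map.get(m.group(), m.group()),
                  invariant)

# Debug info of the example: v1 -> n, v4 -> x, v3 -> y
print(rename_ir_variables("v1 - v4 - v3 == 0", {"v1": "n", "v4": "x", "v3": "y"}))
# -> n - x - y == 0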
4 Evaluation
In the following, we evaluate different instantiations of CoVEGI. We focus on
both effectiveness and efficiency, generally aiming at checking whether the use
of CoVEGI can increase the number of correctly solved verification tasks within
the same resource limits. A more detailed evaluation of CoVEGI can be found
in an extended pre-print [29].
RQ1. Can collective invariant generation increase the effectiveness of the master
verifier? Evaluation plan: We let the framework run with a single invariant
generator and compare the results to a standalone run of the master verifier.
RQ2. Does cooperation impact the overall efficiency of the verification? Eval-
uation plan: We compare the run time of CoVEGI with one helper against
the two master verifiers running standalone.
RQ3. Does it pay off to run two invariant generators in parallel? Evaluation
plan: We let the framework run with two invariant generators and compare
the results to a run, where only a single invariant generator is used.
Table 3: Comparison of the two master verifiers running standalone and using a
single helper.

Tool-                  k-induction               predicate abstr.
Combination        alone  +SH  +UA  +VA      alone  +SH  +UA  +VA
correct overall      146  148  158  163        116  122  132  125
correct true         102  104  114  119         78   84   94   87
correct false         44   44   44   44         38   38   38   38
additional true        -   +3  +13  +19          -   +6  +16   +9
additional false       -    0    0    0          -    0    0    0
uniquely solved        1    0    8   15          0    0    6    3
container of the latest version⁸. All three helper invariant generators are used in
their default configuration.
During the evaluation, we used the following default configuration for our
framework: We set termAfterFirstInv and restartMaster to true, the timerM
to 50 s⁹, and the timeoutH to 300 s. In general, we will use the abbreviations
SH for SeaHorn, UA for UltimateAutomizer and VA for VeriAbs.
Verification Tasks. The verification tasks used are taken from the set of
SV-COMP 2020 benchmarks10 . As we are interested in finding suitable loop
invariants, we selected all tasks from the category ReachSafety-Loops. To obtain
a broader distribution of tasks, we randomly selected 55 additional tasks
from the categories ProductLines, Recursive, Sequentialized, ECA, Floats and
Heap, yielding 342 tasks in total.
Computing Resources. We conducted the evaluation on three virtual ma-
chines, each having an Intel Xeon E5-2695 v4 CPU with eight cores and a fre-
quency of 2.10 GHz and 16GB memory, running an Ubuntu 18.04 LTS with
Linux Kernel 4.15. We ran our experiments using the same setting as in
SV-COMP, giving each task 15 minutes of CPU time on 8 cores and 15 GB of
memory. We employed BenchExec to guarantee these resource limitations [12].
Availability. Our tool and all experimental data are available¹¹.
On our data set, the total number of correctly solved tasks using CoVEGI
increases by 12% for k-induction and 14% for predicate abstraction as master.
(Quantile plots: CPU time (s) over the n-th fastest correct result, comparing
kInd standalone against kInd-SH, kInd-UA and kInd-VA, and pred standalone
against pred-SH, pred-UA and pred-VA.)
(a) CoVEGI using k-induction as master
(b) CoVEGI using predicate abstraction as master
help in these cases. We see some tasks for which helping increased the runtime,
but also some for which it decreased it. In most of the cases, the CPU time used
by CoVEGI is not significantly higher.
Finally, we compare the average CPU time needed to correctly solve a task.
Table 4 shows the average time needed for all tasks and – in brackets – for the
correctly solved tasks only. We observe that the runtime increases when only
looking at correctly solved tasks (in particular for VeriAbs); when considering
all tasks, however, the CPU time even decreases. The latter effect is due to the
number of timeouts of the master decreasing when cooperating with helpers.
Concluding, we can make the following observation.
(Table 4: Total CPU time, with a scatter plot for CoVEGI with kInd-VeriAbs (s); contents not recoverable.)
On our dataset, CoVEGI can increase the total number of correctly solved
tasks using UA and VA in parallel; in general waiting for the other tool to
also finish its computation does not pay off.
5 Related work
In this paper, we presented a framework for cooperative verification via collec-
tive invariant generation. The idea of collaboration for verification by combin-
ing known techniques has been widely employed before. For instance, there are
combinations of verification with testing approaches [21,22,26,18,19,24] and with
approaches for invariant generation [40,27,39,15,17]. The latter combinations are
conducted in a white-box manner with strong coupling between the components,
making the addition of a new approach a challenging task. Our framework con-
ceptually decouples the invariant generation from the verification, making it
more flexible. In addition, using a black box integration with defined exchange
formats allows us to easily exchange or integrate new approaches.
There are also existing concepts for collaboration between different tech-
niques in a black-box manner. Conditional model checking is a technique for
sequentially composing different model checkers, sharing information between
the tools in form of conditions [7]. Beyer and Jakobs developed a concept for
combining model checking with testing [8]. Although both approaches enable
cooperation, neither combines a verification tool with tools for invariant generation.
We next shortly discuss three approaches which are conceptually closer to
our framework. Frama-C is a framework for code analysis, aiming for analyzing
industrial size code [35]. The framework contains different plugins, each imple-
menting a verification or testing technique. The plugins can exchange information
in the form of ACSL source code annotations. Within Frama-C, the analyzers
can collaborate by being composed either sequentially or in parallel. For this, par-
tial results produced by an analysis can be completed by a second one or several
6 Conclusion
In this paper, we have presented a novel form of black box cooperation for
software verification via externally generated invariants. Within the configurable
framework named CoVEGI, the so-called master verifier, which steers the
verification process, is able to delegate the task of invariant generation to one
or several helper invariant generators.
We implemented CoVEGI within the CPAchecker framework using k-
induction and predicate abstraction as master analysis supported by three exist-
ing helpers SeaHorn, UltimateAutomizer and VeriAbs. Our evaluation on
a set of SV-COMP verification tasks shows that CoVEGI increases the number
of correctly solved tasks without increasing the overall verification time. The
best combination of helpers, UltimateAutomizer and VeriAbs in parallel,
yields an increase of 12% for k-induction and 17% for predicate abstraction.
Next, we plan to enhance the cooperation by analyzing the behavior of the
master in order to identify an optimal point to request help. Moreover, we plan
to extend CoVEGI to additionally take error traces found by the helpers into
account. Finally, we intend to investigate whether selecting helpers based on
the given verification task is beneficial.
References
1. Afzal, M., Asia, A., Chauhan, A., Chimdyalwar, B., Darke, P., Datar, A., Kumar,
S., Venkatesh, R.: VeriAbs: Verification by abstraction and test generation. In:
ASE. pp. 1138–1141. IEEE (2019). https://doi.org/10.1109/ASE.2019.00121
2. Beyer, D.: Software verification with validation of results - (report on SV-COMP
2017). In: Legay, A., Margaria, T. (eds.) TACAS. LNCS, vol. 10206, pp. 331–349.
Springer, Berlin, Heidelberg (2017). https://doi.org/10.1007/978-3-662-54580-5 20
3. Beyer, D., Dangl, M., Dietsch, D., Heizmann, M.: Correctness witnesses: ex-
changing verification results between verifiers. In: Zimmermann, T., Cleland-
Huang, J., Su, Z. (eds.) FSE. pp. 326–337. ACM, New York, NY, USA (2016).
https://doi.org/10.1145/2950290.2950351
4. Beyer, D., Dangl, M., Dietsch, D., Heizmann, M., Stahlbauer, A.: Witness valida-
tion and stepwise testification across software verifiers. In: Nitto, E.D., Harman,
M., Heymans, P. (eds.) ESEC/FSE. pp. 721–733. ACM, New York, NY, USA
(2015). https://doi.org/10.1145/2786805.2786867
5. Beyer, D., Dangl, M., Wendler, P.: Boosting k-induction with continuously-refined
invariants. In: Kroening, D., Pasareanu, C.S. (eds.) CAV. LNCS, vol. 9206, pp.
622–640. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4 42
6. Beyer, D., Gulwani, S., Schmidt, D.A.: Combining model checking and data-flow
analysis. In: Clarke, E.M., Henzinger, T.A., Veith, H., Bloem, R. (eds.) Handbook
of Model Checking, pp. 493–540. Springer (2018). https://doi.org/10.1007/978-3-
319-10575-8 16
7. Beyer, D., Henzinger, T.A., Keremoglu, M.E., Wendler, P.: Conditional
model checking: a technique to pass information between verifiers. In:
Tracz, W., Robillard, M.P., Bultan, T. (eds.) FSE. p. 57. ACM (2012).
https://doi.org/10.1145/2393596.2393664
8. Beyer, D., Jakobs, M.: Coveritest: Cooperative verifier-based testing. In: Hähnle,
R., van der Aalst, W.M.P. (eds.) FASE. LNCS, vol. 11424, pp. 389–408. Springer
(2019). https://doi.org/10.1007/978-3-030-16722-6 23
9. Beyer, D., Jakobs, M., Lemberger, T., Wehrheim, H.: Reducer-based
construction of conditional verifiers. In: Chaudron, M., Crnkovic, I.,
Chechik, M., Harman, M. (eds.) ICSE. pp. 1182–1193. ACM (2018).
https://doi.org/10.1145/3180155.3180259
10. Beyer, D., Keremoglu, M.E.: CPAchecker: A tool for configurable software verifica-
tion. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV. LNCS, vol. 6806, pp. 184–190.
Springer, Berlin, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22110-1 16
11. Beyer, D., Keremoglu, M.E., Wendler, P.: Predicate abstraction with adjustable-
block encoding. In: Bloem, R., Sharygina, N. (eds.) FMCAD. pp. 189–197. IEEE,
Washington, DC, USA (2010), http://ieeexplore.ieee.org/document/5770949/
12. Beyer, D., Löwe, S., Wendler, P.: Reliable benchmarking: requirements
and solutions. Int. J. Softw. Tools Technol. Transf. 21(1), 1–29 (2019).
https://doi.org/10.1007/s10009-017-0469-y
13. Beyer, D., Wehrheim, H.: Verification artifacts in cooperative verification: survey
and unifying component framework. In: Margaria, T., Steffen, B. (eds.) ISoLA.
LNCS, vol. 12476, pp. 143–167. Springer (2020). https://doi.org/10.1007/978-3-
030-61362-4 8
14. Biere, A., Cimatti, A., Clarke, E.M., Strichman, O., Zhu, Y.: Bounded model check-
ing. Advances in Computers 58, 117–148 (2003). https://doi.org/10.1016/S0065-
2458(03)58003-2
15. Blanchet, B., Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A.,
Monniaux, D., Rival, X.: A static analyzer for large safety-critical soft-
ware. In: Cytron, R., Gupta, R. (eds.) PLDI. pp. 196–207. ACM (2003).
https://doi.org/10.1145/781131.781153
16. Bradley, A.R.: Sat-based model checking without unrolling. In: Jhala, R.,
Schmidt, D.A. (eds.) VMCAI. LNCS, vol. 6538, pp. 70–87. Springer (2011).
https://doi.org/10.1007/978-3-642-18275-4 7
17. Brain, M., Joshi, S., Kroening, D., Schrammel, P.: Safety verification and refutation
by k-invariants and k-induction. In: Blazy, S., Jensen, T.P. (eds.) SAS. LNCS,
vol. 9291, pp. 145–161. Springer (2015). https://doi.org/10.1007/978-3-662-48288-9_9
18. Christakis, M., Müller, P., Wüstholz, V.: Collaborative verification and test-
ing with explicit assumptions. In: Giannakopoulou, D., Méry, D. (eds.)
FM. LNCS, vol. 7436, pp. 132–146. Springer, Berlin, Heidelberg (2012).
https://doi.org/10.1007/978-3-642-32759-9 13
19. Christakis, M., Müller, P., Wüstholz, V.: Guiding dynamic symbolic exe-
cution toward unverified program executions. In: Dillon, L.K., Visser, W.,
Williams, L. (eds.) ICSE. pp. 144–155. ACM, New York, NY, USA (2016).
https://doi.org/10.1145/2884781.2884843
20. Clarke, E.M., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided ab-
straction refinement for symbolic model checking. J. ACM 50(5), 752–794 (2003).
https://doi.org/10.1145/876638.876643
21. Csallner, C., Smaragdakis, Y.: Check ’n’ crash: combining static checking and
testing. In: Roman, G., Griswold, W.G., Nuseibeh, B. (eds.) ICSE. pp. 422–431.
ACM, New York, NY, USA (2005). https://doi.org/10.1145/1062455.1062533
22. Csallner, C., Smaragdakis, Y., Xie, T.: DSD-Crasher: A hybrid analysis tool for bug
finding. TOSEM 17(2), 8:1–8:37 (2008). https://doi.org/10.1145/1348250.1348254
23. Czech, M., Jakobs, M., Wehrheim, H.: Just test what you cannot verify! In: Egyed,
A., Schaefer, I. (eds.) FASE. LNCS, vol. 9033, pp. 100–114. Springer, Berlin, Hei-
delberg (2015). https://doi.org/10.1007/978-3-662-46675-9 7
24. Daca, P., Gupta, A., Henzinger, T.A.: Abstraction-driven concolic testing. In: Job-
stmann, B., Leino, K.R.M. (eds.) VMCAI. LNCS, vol. 9583, pp. 328–347. Springer,
Berlin, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49122-5 16
25. Donaldson, A.F., Haller, L., Kroening, D., Rümmer, P.: Software verification us-
ing k-induction. In: Yahav, E. (ed.) SAS. LNCS, vol. 6887, pp. 351–368. Springer
(2011). https://doi.org/10.1007/978-3-642-23702-7 26
26. Ge, X., Taneja, K., Xie, T., Tillmann, N.: Dyta: dynamic symbolic execu-
tion guided with static verification results. In: Taylor, R.N., Gall, H.C., Med-
vidovic, N. (eds.) ICSE. pp. 992–994. ACM, New York, NY, USA (2011).
https://doi.org/10.1145/1985793.1985971
27. Gupta, A., Rybalchenko, A.: Invgen: An efficient invariant generator. In: Bouaj-
jani, A., Maler, O. (eds.) CAV. LNCS, vol. 5643, pp. 634–640. Springer (2009).
https://doi.org/10.1007/978-3-642-02658-4 48
28. Gurfinkel, A., Kahsai, T., Komuravelli, A., Navas, J.A.: The seahorn verification
framework. In: Kroening, D., Pasareanu, C.S. (eds.) CAV. LNCS, vol. 9206, pp.
343–361. Springer (2015). https://doi.org/10.1007/978-3-319-21690-4 20
29. Haltermann, J., Wehrheim, H.: Cooperative Verification via Collective Invariant
Generation. arXiv e-prints arXiv:2008.04551 (2020), https://arxiv.org/abs/2008.
04551
128 J. Haltermann and H. Wehrheim
30. Heizmann, M., Chen, Y., Dietsch, D., Greitschus, M., Hoenicke, J., Li, Y., Nutz,
A., Musa, B., Schilling, C., Schindler, T., Podelski, A.: Ultimate automizer and
the search for perfect interpolants - (competition contribution). In: Beyer, D.,
Huisman, M. (eds.) TACAS. LNCS, vol. 10806, pp. 447–451. Springer (2018).
https://doi.org/10.1007/978-3-319-89963-3 30
31. Heizmann, M., Hoenicke, J., Podelski, A.: Software model checking for people who
love automata. In: Sharygina, N., Veith, H. (eds.) CAV. LNCS, vol. 8044, pp.
36–52. Springer (2013). https://doi.org/10.1007/978-3-642-39799-8 2
32. Henzinger, T.A., Jhala, R., Majumdar, R., McMillan, K.L.: Abstractions from
proofs. In: Jones, N.D., Leroy, X. (eds.) POPL. pp. 232–244. ACM, New York,
NY, USA (2004). https://doi.org/10.1145/964001.964021
33. Henzinger, T.A., Jhala, R., Majumdar, R., Sutre, G.: Lazy abstraction. In: Launch-
bury, J., Mitchell, J.C. (eds.) POPL. pp. 58–70. ACM, New York, NY, USA (2002).
https://doi.org/10.1145/503272.503279
34. Jakobs, M., Wehrheim, H.: Certification for configurable program analysis. In:
Rungta, N., Tkachuk, O. (eds.) SPIN. pp. 30–39. LNCS, ACM, New York, NY,
USA (2014). https://doi.org/10.1145/2632362.2632372
35. Kirchner, F., Kosmatov, N., Prevosto, V., Signoles, J., Yakobowski, B.: Frama-
c: A software analysis perspective. Formal Asp. Comput. 27(3), 573–609 (2015).
https://doi.org/10.1007/s00165-014-0326-7
36. McMillan, K.L.: Interpolation and model checking. In: Clarke, E.M., Henzinger,
T.A., Veith, H., Bloem, R. (eds.) Handbook of Model Checking, pp. 421–446.
Springer (2018). https://doi.org/10.1007/978-3-319-10575-8 14
37. Necula, G.C.: Proof-carrying code. In: Lee, P., Henglein, F., Jones, N.D.
(eds.) POPL. pp. 106–119. ACM Press, New York, NY, USA (1997).
https://doi.org/10.1145/263699.263712
38. Pauck, F., Wehrheim, H.: Together strong: cooperative Android app analysis. In:
Dumas, M., Pfahl, D., Apel, S., Russo, A. (eds.) ASE. pp. 374–384. ACM (2019).
https://doi.org/10.1145/3338906.3338915
39. Rocha, W., Rocha, H., Ismail, H., Cordeiro, L.C., Fischer, B.: Depthk: A k-
induction verifier based on invariant inference for C programs - (competition contri-
bution). In: Legay, A., Margaria, T. (eds.) TACAS. LNCS, vol. 10206, pp. 360–364
(2017). https://doi.org/10.1007/978-3-662-54580-5 23
40. Sankaranarayanan, S., Sipma, H.B., Manna, Z.: Scalable analysis of linear systems
using mathematical programming. In: Cousot, R. (ed.) VMCAI. LNCS, vol. 3385,
pp. 25–41. Springer (2005). https://doi.org/10.1007/978-3-540-30579-8 2
41. Schubert, P.D., Hermann, B., Bodden, E.: Phasar: An inter-procedural static anal-
ysis framework for C/C++. In: Vojnar, T., Zhang, L. (eds.) TACAS. LNCS,
vol. 11428, pp. 393–410. Springer (2019). https://doi.org/10.1007/978-3-030-17465-
1 22
CoVEGI: Cooperative Verification via Externally Generated Invariants 129
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
Engineering Secure Self-Adaptive Systems
with Bayesian Games
1 Introduction
A self-adaptive system is designed to be capable of modifying its structure and
behavior at run time in response to changes in its environment and the system
itself (e.g., variability in system performance, deployment cost, internal faults,
and system availability) [9,12]. One of the major challenges in self-adaptive
systems is managing uncertainty; i.e., the system should be capable of making
appropriate planning decisions despite limited observations about its environment.
Achieving security in the presence of uncertainty is particularly challenging due to
the adversarial nature of the environment [17,13]: (1) to avoid detection, a typical
attacker may attempt to remain hidden while carrying out its actions, and so
accurately estimating its objectives and capabilities can be difficult, and (2) the
attacker actively attempts to cause as much harm as possible to the system, and
so a typical “average case” analysis may not be appropriate for making optimal
defensive decisions [28].
© The Author(s) 2021
E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 130–151, 2021.
https://doi.org/10.1007/978-3-030-71500-7 7
2 Background
2.1 Running Example
collected from the web server, however, cannot fully demonstrate its compromise, e.g., due to the deficiencies of scanning tools; it only indicates compromise with some uncertainty. As shown in the figure, Server2 is potentially compromised with a probability of 20%, while Server3 has a higher probability of 50%. These two servers, if compromised in reality, might perform harmful actions controlled by the attackers to achieve their objectives, reducing the system reward. Here we assume a simple malicious strategy of discarding all distributed user requests. The attacker's reward is defined as the system's loss, i.e., the maximum reward the system could achieve minus the reward obtained under attack, leading to a zero-sum game.
where $a_{-i} = [a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n]$, $\theta_{-i} = [\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_n]$, $s^*_{-i} = [s^*_1, \ldots, s^*_{i-1}, s^*_{i+1}, \ldots, s^*_n]$, $S(\theta_i)$ is the set of all possible strategies for agent $i$ under $\theta_i$, and $\rho(\theta_{-i} \mid \theta_i)$ is the conditional probability representing player $i$'s belief about the other players' types given its own type $\theta_i$.
A Bayesian Nash equilibrium is a set of strategies, one for each type of each player, such that each strategy is a best response maximizing that player's expected payoff given the other players' equilibrium strategies. In a Nash equilibrium, no player can improve its payoff by unilaterally modifying its strategy while the actions of the rest are fixed [25,21].
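To make the best-response computation concrete, the following is a minimal sketch; the action names, type space, belief, and payoff numbers are our own illustrative assumptions, not values taken from the running example.

```python
def best_response(actions, types, rho, payoff):
    # expected payoff of each action against the belief over types
    expected = {a: sum(rho[t] * payoff[(a, t)] for t in types)
                for a in actions}
    return max(expected, key=expected.get), expected

actions = ["to_Server1", "to_Server2"]           # hypothetical actions
types = ["normal", "malicious"]                  # types of the opponent
rho = {"normal": 0.8, "malicious": 0.2}          # belief about the type
payoff = {("to_Server1", "normal"): 50, ("to_Server1", "malicious"): 50,
          ("to_Server2", "normal"): 60, ("to_Server2", "malicious"): -40}

print(best_response(actions, types, rho, payoff))
# ('to_Server1', {'to_Server1': 50.0, 'to_Server2': 40.0}):
# hedging against the malicious type outweighs the higher normal payoff
```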
Based on the input from the Monitor, potential attacks are identified. The Analyzer performs analysis and further checks whether certain components are attacked, and with what probabilities; potential deviating malicious actions are identified; and the rewards for the attack are estimated, based on knowledge about component vulnerabilities and system objectives. Such attack probabilities can be estimated by a statistical combination of all feasible scenarios along with expert judgment [16,24]. A typical example is that both Server2 and Server3 are analyzed to be compromised and discarding user requests with a certain probability, reducing the system utility.
Planner. The Planner generates a workflow of adaptation actions aiming to counteract violations of system goals, or to better achieve those goals. The workflow consists of one or a set of actions, obtained by automatically solving the multi-player Bayesian game constructed from the potential attacks reported by the Analyzer, the architectural model of the managed subsystem, and the system objectives, as elaborated in Section 4. For each security situation, the Planner generates an equilibrium, if one exists, as the adaptation responding to unexpected attacks, or prompts for a change in the design of the system if the violation cannot be handled. For the Znn.com website under security attacks, feasible actions include distributing a larger percentage of user requests to the normal server while decreasing the percentage sent to those with a high probability of compromise, as well as adjusting the fidelity level of the servers.
Executor. During execution, the strategies from the adaptation equilibrium are enacted on the managed subsystem through actuators. A typical example is setting, in the LoadBalancer, the distribution percentage of user requests for each server.
In the next part, we focus on the planning activity with Bayesian game theory. We assume that adequate monitoring is in place, along with sufficient analysis methods for potential attacks and their uncertainties based on observations and historical information, as well as an execution environment through which the selected adaptation strategies are enacted.
– C is a set of components;
– A is a set of joint actions A = A1 × ... × An , where Ai denotes a finite set
of actions available to component i;
– Q is a set of quality attributes that the system is interested in; for each Qx, a subset of components SubCx ⊆ C could contribute to this quality attribute;
Each component tries to react in a way that maximizes the system utility, essentially acting like a rational player in game theory. Naturally, a system under normal operation can be viewed as a cooperative game dealing with how coalitions interact. Each component is denoted as an independent player, and these interacting components/players form a coalition. For instance, in the running example, the LoadBalancer and the three servers collaborate to achieve the goals together, i.e., maximizing the system reward composed of revenue and penalty. Specifically, the LoadBalancer should assign more user requests to the servers with low computation usage, like the waiting queue in a bank, while each server should adjust its fidelity level according to its current load. A high load may lead a server to provide text-only content to decrease cost, while a server with low usage can provide media content to promote revenue.
Modeling Utility as Payoffs. The payoff among the players is allocated from the utility of the quality attributes. It is straightforward for developers to design a system-level payoff function (e.g., the revenue and penalty in Section 2.1). However, due to the different roles of the components and the complex relationships between them, it is complicated, and sometimes intractable, to manually design an appropriate component-level payoff function. To solve this problem, we use the Shapley Value Method, a solution concept for fairly distributing both gains and costs among several players working in a coalition, proportionally to their marginal contributions [37,36], to automatically decompose the system-level utility into component-level payoffs. The Shapley Value Method applies primarily in situations where the contributions of the players are unequal, but the players cooperate with each other to obtain the payoff. Given the component set $C$ and a system-level utility function $v$, the payoff for a component $i$ is:
$$\phi_i(C, v) = \sum_{C' \subseteq C \setminus \{i\}} \frac{|C'|!\,(|C| - |C'| - 1)!}{|C|!}\,\left[v(C' \cup \{i\}) - v(C')\right] \qquad (2)$$

where $|C|$ is the number of components in the set; $C \setminus \{i\}$ is the set $C$ excluding component $i$; and $v(C')$ is the expected system-level utility when the system consists only of the component set $C'$.
The following is a typical example of system utility allocation with the Shapley Value Method for the Znn website. To simplify the illustration, we consider the situation where Server2 and Server3 are indeed compromised, the LoadBalancer chooses the strategy of equally distributing user requests to Server1 and Server2 (i.e., the requests distributed to Server1, Server2 and Server3 are 50, 50 and 0, respectively), and Server1 selects the text-only mode. Besides, the total number of unprocessed requests in this setting is 100, which is assumed to be the full load of a server serving only text, with $R_M = 1.6$, $R_T = 1$, $T = 50$, and $K = 25$ in Eq. (1). The computation capacity of a unit of text and media is 1 and 1.4 (i.e., $C_T$ and $C_M$), respectively. Thus, the system utility in this situation is $U_{system} = 50$ (i.e., $50 \times 1 - (50 \times 1 - 50)^2/25$, with the remaining 50 requests discarded by malicious Server2). The cooperative player set consisting of the LoadBalancer and Server1 shares this utility, while Server2 and Server3 fight on behalf of the attacker's interests, and are thus neither considered in the coalition nor allocated payoff from the system utility.

Based on Eq. (2), we need the following two coalitions for the Shapley value calculation: (1) if the coalition contains only the LoadBalancer without Server1, the system utility $U_{LoadBalancer}$ is 0, since no requests are processed by Server1, nor by malicious Server2; (2) if the coalition contains only Server1 without the LoadBalancer distributing user requests, the requests are randomly passed among the three servers (i.e., the requests distributed to Server1, Server2 and Server3 are 34, 33 and 33, respectively), and the system utility for this coalition $U_{Server1}$ is 34 (i.e., $34 \times 1 - 0$), because malicious Server2 and Server3 do not return any feedback. As a result, $\phi_{LoadBalancer}(C, v) = \frac{1}{2}(U_{system} - U_{Server1} + U_{LoadBalancer}) = 8$ and $\phi_{Server1}(C, v) = \frac{1}{2}(U_{system} - U_{LoadBalancer} + U_{Server1}) = 42$. Therefore, the payoffs to players LoadBalancer and Server1 are 8 and 42, respectively. Meanwhile, the attacker's utility, i.e., the difference between the system utility and the highest utility the system could achieve without attacks (equally distributing user requests to the three servers, with each server choosing multi-media mode, yielding $160 = 100 \times 1.6 - 0$), is divided equally between the two malicious players. In other words, Server2 and Server3 are each allocated a payoff of $55 = (160 - 50)/2$. Following the aforementioned allocation process, each player obtains a unique payoff under different attack situations and strategies from the Shapley Value Method, based on their roles contributing to marginal system utility.
Component-based Attacks. A system under security attacks is also defined as a tuple $SAS = \langle C, A, Q, ATT \rangle$. Instead of modeling an attacker or several attackers with possibly complex behaviors over different parts of the system, we model the on-going attacks $ATT$ that the system is enduring at the component level, since the vulnerabilities of the components, as well as their potential behavior deviations, are comparatively easy to observe. $ATT$ can be obtained by synthesizing the information from the Monitor and the Analyzer, as described in Section 3.
Translation into a Bayesian Game. With the definition of the system at the component level and the definition of the attacks $ATT$, a system under security attacks is converted into a non-cooperative Bayesian game through the following steps:
Fig. 3: Results for Znn Website: (a) percentage of user requests to Server1 ; (b)
percentage of user requests to Server2 ; (c) strategies for Server1 ; (d) system
utility with game theory approach; (e) delta utility between Bayesian game theory
approach and probabilistic model checking approach.
which means that they are very likely compromised. Therefore, the LoadBalancer distributes as many user requests as possible to Server1, and Server1 thus chooses to provide text-only content to avoid overloading. Otherwise, Server1 can provide multimedia content under lighter load to promote user satisfaction and higher revenue.
Figure 3 (d) illustrates the maximum utility the system can achieve under various attack situations. In particular, we observe that the utility reaches around 160 when all three servers are cooperative, and decreases progressively as the malicious probabilities of Server2 and Server3 increase. This is consistent with the fact that the system utility deteriorates under security attacks. To compare the system utility obtained with game theory against existing methods, we adopt probabilistic model checking [29] as the comparison standard, formally modeling the running example and synthesizing the adaptation strategy that maximizes the expected utility by reasoning about reward-based properties [11,7,32]. Figure 3 (e) presents the delta between the two approaches (i.e., the system utility with the game theory approach minus the utility with the probabilistic model checking approach). Without security attacks, the adaptation decisions generated by the two approaches achieve the same utility. However, with increasing malicious probabilities of Server2 and Server3, the game theory approach outperforms, providing a better response that compensates for the utility loss due to the security attack; the average delta is 10.54, i.e., an improvement of about 15 percent, with the game theory approach achieving an average utility of 80.39.
5 Evaluation – Routing Games
To evaluate our approach and assess its applicability, we consider a case study on an interdomain routing application. We first define the game (Section 5.1) and propose a dynamic programming algorithm that solves for the equilibrium by decomposing the problem into smaller, tractable sub-games (Section 5.2). The results are presented (Section 5.3) with a sensitivity analysis, illustrating how the system can choose a robust strategy effective for a range of threat landscapes, and a utility analysis quantifying the defender's utility with the Bayesian game compared to a greedy solution within the security context.
A routing system is usually composed of smaller networks called nodes, as shown in Figure 4. Since not all nodes are directly connected, packets often have to traverse several nodes, and the task of ensuring connectivity between nodes is called interdomain routing [30,31]. Each node could be owned by an economic entity (Microsoft, AT&T, etc.) and might be compromised by the attacker at any time. It is therefore natural to consider interdomain routing from a game-theoretic point of view. Specifically, game players are source nodes located on a network, aiming to send a package (i.e., starting at N1) to a unique destination node (i.e., N5). The interaction between players is dynamic and complex (asynchronous, sequential, and based on partial information), and the best strategy for each player as the adaptation response is updated as needed.

Fig. 4: Routing Scenario (a seven-node network N1-N7, with destination N5).
– The player set for the game is $C = \{N1, N2, \ldots, N7\}$. The set of components affected by the attack includes N2 and N4, i.e., $C_{att} = \{N2, N4\}$;
– The action set for all players, including the malicious ones controlled by attacks, is delivering the package to one of their neighboring nodes;
– The set of types for each potentially attacked component node includes "normal" and "malicious" (i.e., $\theta_{N2} \in \{normal, malicious\}$, $\theta_{N4} \in \{normal, malicious\}$);
– The payoff for all the normal players is allocated from the system utility with the Shapley Value Method (i.e., $U_{system} \div |normal\ players|$, allocated equally in this case, since none of the nodes in this network is a cut vertex, so all have the same importance). For example, each node is awarded 8/7 if none of them is attacked. The utility for the ongoing attacks on the two components is the utility loss relative to the system's best response without attack, rendering a zero-sum game;
– The probability distribution for both components N2 and N4 could be, e.g., a 50%/50% split (i.e., $\rho_{N2,N4}(normal, malicious) = (0.5, 0.5)$). A minimal encoding of these game elements is sketched below.
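The sketch below uses our own illustrative class and field names (not from the paper's implementation) to make these game elements concrete.

```python
from dataclasses import dataclass, field

@dataclass
class RoutingBayesianGame:
    players: tuple = ("N1", "N2", "N3", "N4", "N5", "N6", "N7")
    attacked: tuple = ("N2", "N4")          # C_att
    types: dict = field(default_factory=lambda: {
        "N2": ("normal", "malicious"),
        "N4": ("normal", "malicious")})
    belief: dict = field(default_factory=lambda: {  # rho per node
        "N2": {"normal": 0.5, "malicious": 0.5},
        "N4": {"normal": 0.5, "malicious": 0.5}})

    def actions(self, n, adj):
        # every player, malicious or not, delivers to a neighbouring node
        return [("deliver", m) for m in adj[n]]

game = RoutingBayesianGame()
print(game.belief["N2"])  # {'normal': 0.5, 'malicious': 0.5}
```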
dynamic programming, the algorithm uses a set subG to store the nodes which have been processed with their best reactive strategy. subG is initialized as the empty set (line 1), and node d is added (line 2), since d does not need a strategy to transmit the package. The algorithm starts by iterating over all the nodes at distance disValue (line 5), initialized to 1 (line 3). For example, N2, N4 and N7 qualify in the first iteration. Each node is checked for whether it is potentially attacked (i.e., uncertain(n) in line 6). Uncertain nodes (e.g., N2 and N4) might affect the strategy of their predecessor nodes (line 7) (e.g., N1 and N3), which are therefore added to todoS (line 8), to be processed later to update their strategies in light of the neighboring uncertainty. A typical example is that N3 might trade off the delivery between N4 and N6: even though N4 is on the shortest path from N3 to N5, it could, when controlled by the attack, deliberately send the package back. If a node is not in todoS (line 11), it is directly added to subG (line 12), as the best strategy for such a benign node is to pass the package down to its adjacent node along the shortest path. In this routing scenario, N2, N4 and N7 are added to subG, as their strategies in equilibrium under the normal type are easily determined.
After iterating over all the nodes at distance disValue = 1, each node in todoS (line 15) is checked for whether it satisfies the condition (line 16) that all its neighboring nodes (i.e., i ∈ adj(n)) closer to the destination (i.e., dis(i, d) == dis(n, d) − 1) have been solved with their best strategies (i.e., are in subG), so that a sub-game can be built. As the example shows, though both N1 and N3 are predecessors of an uncertain node, their strategy updates are postponed: N6 is not in subG yet, which blocks the sub-game generation for N3, in turn delaying the sub-game construction for N1.
An exemplified sub-game construction (line 17), starting from N3 and built once all conditions are satisfied, is illustrated in Fig. 5.

Fig. 5: Sub-Game for N3 (nature first draws the types of N2 and N4 according to the 50%/50% split; N3 then chooses between delivering to N6 and to N4; the leaves are labelled with the players' payoffs).

The stochastic behavior of the potentially compromised nodes can be modeled by introducing a nature (or chance) player, who moves according to the probability distribution (e.g., the 50%/50% split), randomly determining whether the attacks on N2 and N4 are successful. Then, N3 can choose an action passing the package to one of its adjacent nodes, i.e., N6 or N4. Here, N3 is a normal node aware that the package was transmitted from N1, so it is not necessary to consider a rollback to N1. The game ends after N3's action, as we can prune the following branches: 1) to N6, the remaining route sequence is N7 and N5 by default, as their best strategies have already been solved (i.e., N6 delivers the package to N7, which in turn forwards it to N5); 2) to N4, with N4 forwarding the package to N5 if it is normal, while sending it back to N3 if it is of malicious type. When the game terminates, each player gets a unique payoff following the different branches. As
shown in the leftmost rectangle, all the players (including N2 and N4, as they are benign collaborating nodes there) equally share the system utility value 6, resulting from 3 hops from N3 to N5 plus the shortest path from N1 to N3. On the rightmost branch, however, only the five players excluding N2 and N4 are allocated the system utility 4. This utility results from 6 hops: N3 decides to deliver the package to N4, nature problematically chooses the malicious type for N4, and N4 sends the package back to N3 to maximize the attacker's utility. Once N3 receives the package back from N4, it redelivers it to N6, because N3, as a good player, does not repeatedly send it back. To this end, N2 and N4 are uniformly allocated, as their payoff, the delta (i.e., 4) between the utility the system obtained (i.e., 4) and the maximum utility the system could obtain (i.e., 8). The payoffs of the remaining branches can be calculated accordingly.
After that, a pure Nash equilibrium is generated by solving this sub-game (line 18) with the Gambit software tools [35], and the best strategy for the node is updated according to the equilibrium. By solving the sub-game for N3, the strategy for N3 in the equilibrium is to deliver the package to N6, as the potential detriment of the delivery delay via N4 due to attacks is greater than the comparative advantage of the shortest path. The node with the solved strategy is then removed from todoS (line 19) and absorbed into subG (line 21). Once all the nodes at distance disValue from the destination have been iterated and all the nodes in todoS satisfying the conditions have been computed for their best strategy, the algorithm increments the value of disValue by one unit (line 23) and continues, until the starting point s is in the set subG (line 24).
Fig. 6: Results for interdomain route example: (a) Expected route in equilibrium;
(b) System utility with game theory approach; (c) Delta between system utility
from game theory approach and utility from greedy algorithm.
Figure 6 (a) presents the results of the strategy selection (i.e., the expected package sequence) over two dimensions corresponding to the malicious probabilities of N2 and N4, respectively. Red triangle points denote that the strategy for N1 is to pass to N2, covering a range of Probability_N2 of roughly [0, 0.50]. This is because when the chance of N2 being under attack is less than 0.50, N1 should pass the package to N2, since N2 is on the shortest path to the destination; otherwise, N1 delivers the package to N3. Similarly, when the malicious probability of N4 is less than 0.35, the strategy for N3 reaching equilibrium is to deliver the package to N4 (i.e., blue square points), since the benefits of a short delivery time outweigh the potential detriment. For the remaining situations, denoted by the black circle points, N1 passes the package to N3, which in turn forwards it to N6.

Figure 6 (b) describes the utility the system can obtain under the attacked components' equilibrium strategies. As expected, when Probability_N2 is greater than 50% and Probability_N4 greater than 35% (i.e., the black circle points in Figure 6 (a)), the utility the system can gain is 6, as there are 4 hops in the expected sequence N1, N3, N6, N7, N5. This plot also shows that the system
6 Related Work
Self-adaptive systems under security attacks need to make adaptation decisions
as a response to a detected threat or to deviations from security goals and require-
ments [18]. Lorenzoli et al. [34] proposed a technique that observes values at relevant program points and identifies the execution contexts leading to a software failure, so that mechanisms can be enabled to prevent future occurrences of failures of the same type. Bailey et al. [4] generated Role Based Access Control
(RBAC) models to provide assurances for adaptations against insider threats.
The RBAC technique was also applied to the cloud computing environment to provide appropriate security services according to the security level and dynamic changes of the common resources [44]. Tsigkanos et al. [41] explored the use of Bigraphical
Reactive Systems to perform speculative threat analysis through model checking.
Burmester et al. [5] described a threat model to incorporate typical characteristics
of systems, such as survivability to abnormal behavior and possibility to recover
after critically vulnerable states are reached. Dimkov et al. [14] discussed insider
threats that span the physical, cyber and social domains, and presented a framework, Portunes, integrating all three security domains to describe attacks. Nashif et
al. [2] presented a multi-level intrusion detection system to detect network attacks
within three levels of granularity, proactively protecting against them by employing a fusion decision algorithm. Although there are many different ways of dealing with security attacks in self-adaptive systems, it is notable that the application of game theory, with its ability to model the adversarial nature of security attacks and to design reliable defenses with proven mathematics, has not gained the attention it deserves.
Different sorts of games have been employed to study the actions of the defender and the attacker. van Dijk et al. [42] presented a two-player game that reasons about security scenarios in which an attacker may periodically gain full control of an asset, with each side having uncertainty about the other's actions and trying to maintain control as much as possible. An extension by Farhang et al. [19] explicitly modeled the information gains for attackers as they control assets, improving the attacker's capability. Based on these works, Kinneer et al. [28] additionally considered multiple attacker types with different goals and capabilities using a Bayesian game.
Instead of modeling the attackers as independent players, our work models the attacks at the component level, focusing on modeling the defender at the architecture level along with possible deviations of component behaviors. Cámara et al. [6,8] adopted a game-theoretic perspective and modeled the system as a turn-based stochastic multi-player game between different players, where players can either cooperate to achieve the same goal or compete to achieve their own goals. In addition, Glazier et al. [23] used a game-based approach to automatically reason about and synthesize strategies for a meta-manager by explicitly considering alternate potential future states, thus improving the performance of a collection of autonomic systems against a defined quality objective. Some of these existing works do concern competitive behaviors in a system, where some components cannot be controlled and may even behave according to goals conflicting with those of other components in the system. None of them, however, to the best of our knowledge, has proposed to model the Bayesian game at an architecture/component level, capturing multiple attacks as variant component types, together with the uncertainty due to unsuccessful compromise.
Game theory is also increasingly applied to network security. Frigault et al. [20] measured network security in a dynamic environment with a dynamic Bayesian network-based model incorporating temporal factors. Kamhoua et al. [26] developed a packet forwarding game model under imperfect private monitoring. Their equilibria rely on the probability of cooperation after observing a defection, similar to our routing games in the evaluation. However, they looked at this problem from the perspective of network nodes, without considering the situation of being attacked, or how to allocate rewards from the system utility to multiple components from the architecture perspective, as illustrated in this work.
Acknowledgements
References
14. Trajce Dimkov, Wolter Pieters, and Pieter H. Hartel. Portunes: Representing attack
scenarios spanning through the physical, digital and social domain. In Automated
Reasoning for Security Protocol Analysis and Issues in the Theory of Security -
Joint Workshop, ARSPA-WITS 2010, Paphos, Cyprus, March 27-28, 2010. Revised
Selected Papers, pages 112–129, 2010.
15. Cuong T. Do, Nguyen H. Tran, Choong Seon Hong, Charles A. Kamhoua, Kevin A.
Kwiat, Erik Blasch, Shaolei Ren, Niki Pissinou, and Sundaraja Sitharama Iyengar.
Game theory for cyber security and privacy. ACM Comput. Surv., 50(2):30:1–30:37,
2017.
16. Dmitry Dudorov, David Stupples, and Martin Newby. Probability analysis of cyber
attack paths against business and commercial enterprise systems. In 2013 European
Intelligence and Security Informatics Conference, Uppsala, Sweden, August 12-14,
2013, pages 38–44, 2013.
17. Ahmed M. Elkhodary and Jon Whittle. A survey of approaches to adaptive
application security. In 2007 ICSE Workshop on Software Engineering for Adaptive
and Self-Managing Systems, SEAMS 2007, Minneapolis Minnesota, USA, May
20-26, 2007, page 16, 2007.
18. Mahsa Emami-Taba. A game-theoretic decision-making framework for engineering
self-protecting software systems. In Proceedings of the 39th International Conference
on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 -
Companion Volume, pages 449–452, 2017.
19. Sadegh Farhang and Jens Grossklags. Flipleakage: A game-theoretic approach
to protect against stealthy attackers in the presence of information leakage. In
Decision and Game Theory for Security - 7th International Conference, GameSec
2016, New York, NY, USA, November 2-4, 2016, Proceedings, pages 195–214, 2016.
20. Marcel Frigault, Lingyu Wang, Anoop Singhal, and Sushil Jajodia. Measuring
network security using dynamic bayesian network. In Proceedings of the 4th ACM
Workshop on Quality of Protection, QoP 2008, Alexandria, VA, USA, October 27,
2008, pages 23–30, 2008.
21. Drew Fudenberg and Jean Tirole. Game Theory. MIT press, 1991.
22. David Garlan, Robert T. Monroe, and David Wile. Acme: an architecture descrip-
tion interchange language. In Proceedings of the 1997 conference of the Centre
for Advanced Studies on Collaborative Research, November 10-13, 1997, Toronto,
Ontario, Canada, page 7, 1997.
23. Thomas J. Glazier and David Garlan. An automated approach to management
of a collection of autonomic systems. In IEEE 4th International Workshops on
Foundations and Applications of Self* Systems, FAS*W@SASO/ICCAC 2019,
Umea, Sweden, June 16-20, 2019, pages 110–115, 2019.
24. M. Hajizadeh, T. V. Phan, and T. Bauschert. Probability analysis of successful
cyber attacks in sdn-based networks. In 2018 IEEE Conference on Network Function
Virtualization and Software Defined Networks (NFV-SDN), pages 1–6, 2018.
25. John C Harsanyi. Games with incomplete information played by bayesian players,
i-iii. Management Science, 50(12):1804–1817, 2004.
26. Charles A. Kamhoua, Niki Pissinou, Alan Busovaca, and Kia Makki. Belief-
free equilibrium of packet forwarding game in ad hoc networks under imperfect
monitoring. In 29th International Performance Computing and Communications
Conference, IPCCC 2010, 9-11 December 2010, Albuquerque, NM, USA, pages
315–324, 2010.
27. Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. IEEE
Computer, 36(1):41–50, 2003.
28. Cody Kinneer, Ryan Wagner, Fei Fang, Claire Le Goues, and David Garlan. Model-
ing observability in adaptive systems to defend against advanced persistent threats.
In Proceedings of the 17th ACM-IEEE International Conference on Formal Methods
and Models for System Design, MEMOCODE 2019, La Jolla, CA, USA, October
9-11, 2019, pages 10:1–10:11, 2019.
29. Marta Kwiatkowska, Gethin Norman, and David Parker. Probabilistic Model Check-
ing: Advances and Applications, pages 73–121. Springer International Publishing,
Cham, 2018.
30. Hagay Levin, Michael Schapira, and Aviv Zohar. Interdomain routing and games.
In Proceedings of the 40th Annual ACM Symposium on Theory of Computing,
Victoria, British Columbia, Canada, May 17-20, 2008, pages 57–66, 2008.
31. Hagay Levin, Michael Schapira, and Aviv Zohar. Interdomain routing and games.
SIAM J. Comput., 40(6):1892–1912, 2011.
32. Nianyu Li, Sridhar Adepu, Eunsuk Kang, and David Garlan. Explanations for
human-on-the-loop: A probabilistic model checking approach. In Proceedings of
the 15th International Symposium on Software Engineering for Adaptive and Self-
managing Systems (SEAMS), 2020. To appear.
33. Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen.
Stronger semantics for low-latency geo-replicated storage. In Proceedings of the 10th
USENIX Symposium on Networked Systems Design and Implementation, NSDI
2013, Lombard, IL, USA, April 2-5, 2013, pages 313–328, 2013.
34. Davide Lorenzoli, Leonardo Mariani, and Mauro Pezzè. Towards self-protecting
enterprise applications. In ISSRE 2007, The 18th IEEE International Symposium
on Software Reliability, Trollhättan, Sweden, 5-9 November 2007, pages 39–48, 2007.
35. Richard D. McKelvey, Andrew M. McLennan, and Theodore L. Turocy. Gambit: Software tools for game theory, version 16.0.1, 2018-02. http://www.gambit-project.org.
36. Martin J. Osborne and Ariel Rubinstein. A course in game theory. MIT Press
Books, 1, 1994.
37. Lloyd S Shapley. A value for n-person games. In Contributions to the Theory of
Games, vol. 2, 1953.
38. Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-
theoretic, and logical foundations. Cambridge University Press, 2008.
39. Roykrong Sukkerd, Reid Simmons, and David Garlan. Tradeoff-focused contrastive explanation for MDP planning, 2020.
40. Milind Tambe. Security and Game Theory - Algorithms, Deployed Systems, Lessons
Learned. Cambridge University Press, 2012.
41. Christos Tsigkanos, Liliana Pasquale, Carlo Ghezzi, and Bashar Nuseibeh. On the
interplay between cyber and physical spaces for adaptive security. IEEE Trans.
Dependable Secur. Comput., 15(3):466–480, 2018.
42. Marten van Dijk, Ari Juels, Alina Oprea, and Ronald L. Rivest. Flipit: The game
of ”stealthy takeover”. J. Cryptology, 26(4):655–713, 2013.
43. Danny Weyns, M. Usman Iftikhar, and Joakim Söderlund. Do external feedback
loops improve the design of self-adaptive systems? a controlled experiment. In Pro-
ceedings of the 8th International Symposium on Software Engineering for Adaptive
and Self-Managing Systems, SEAMS 2013, San Francisco, CA, USA, May 20-21,
2013, pages 3–12, 2013.
44. Youngmin Jung and Mokdong Chung. Adaptive security management model in
the cloud computing environment. In 2010 The 12th International Conference on
Advanced Communication Technology (ICACT), volume 2, pages 1664–1669, 2010.
An Abstract Contract Theory for
Programs with Procedures
1 Introduction
Contracts. Loosely speaking, a contract for a software or system component is
a means of specifying that the component obliges itself to guarantee a certain
behaviour or result, provided that the user (or client) of the component obliges
itself to fulfil certain constraints on how it interacts with the component.
This work has been funded by the Swedish Governmental Agency for Innovation
Systems (VINNOVA) under the AVerT project 2018-02727.
c The Author(s) 2021
E. Guerra and M. Stoelinga (Eds.): FASE 2021, LNCS 12649, pp. 152–171, 2021.
https://doi.org/10.1007/978-3-030-71500-7_8
One of the earliest inspirations for the notion of software contracts came
from the works of Floyd [10] and Hoare [15]. One outcome of this was Hoare
logic, which is a way of assigning meaning to sequential programs axiomati-
cally, through so-called Hoare triples. A Hoare triple {P }S{Q} consists of two
assertions P and Q over the program variables, called the pre-condition and
post-condition, respectively, and a program S. The triple states that if the pre-
condition P holds prior to executing S, then, if execution of S terminates, the
post-condition Q will hold upon termination. With the help of additional, so-
called logical variables, one can specify, with a Hoare triple, the desired relation-
ship between the final values of certain variables (such as the return value of a
procedure) and the initial values of certain other variables (such as the formal
parameters of the procedure).
This style of specifying contracts has been advocated by Meyer [18], together
with the design methodology Design-by-Contract. A central characteristic of this
methodology is that it is well-suited for independent implementation and verifi-
cation, where software components are developed independently from each other,
based solely on the contracts, and without any knowledge of the implementation
details of the other components.
Contract Theories. Since then, many other contract theories have emerged, such
as Rely/Guarantee reasoning [16,22] and a number of Assume/Guarantee con-
tract theories [4,6]. A contract theory typically formalises the notion of contract,
and develops a number of operations on contracts that support typical design
steps. This in turn has led to a few developments of contract meta-theories (e.g. [5,2,8]), which aim at unifying these, in many cases incompatible, contract theories. The most comprehensive, and well-known, of these is presented in Ben-
veniste et al. [5], and is concerned specifically with the design of cyber-physical
systems. Here, all properties are derived from a most abstract notion of a con-
tract. The meta-theory focuses on the notion of contract refinement, and the
operations of contract conjunction and composition. The intention behind re-
finement and composition is to support a top-down design flow, where contracts
are decomposed iteratively into sub-contracts; the task is then to show that the
composition of the sub-contracts refines the original contract. These operations
are meant to enable independent development and reuse of components. In ad-
dition, the operation of conjunction is intended to allow the superimposition of
contracts over the same component, when they concern different aspects of its
behaviour. This also enables component reuse, by allowing contracts to reveal
only the behaviour relevant to the different use cases.
Related Work. Software contracts and operations on contracts have long been
an area of intensive research, as evidenced, e.g., by [1]. We briefly mention some
works related to our theory, in addition to the already mentioned ones.
Reasoning from multiple Hoare triples is studied in [21], in the context of un-
available source code, where new properties cannot be derived by re-verification.
In particular, it is found that two Hoare-style rules, the standard rule of conse-
quence and a generalised normalisation rule, are sufficient to infer, from a set of
existing contracts for a procedure, any contract that is semantically entailed.
Often-changing source code is a problem for contract-based reasoning and
contract reuse. In [13], abstract method calls are introduced to alleviate this
problem. Fully abstract contracts are then introduced in [7], allowing reasoning
about software to be decoupled from contract applicability checks, in a way that
not all verification effort is invalidated by changes in a specification.
The relation between behavioural specifications and assume/guarantee-style
contracts for modal transition systems is studied in [2], which shows how to build
a contract framework from any specification theory supporting composition and
refinement. This work is built on in [9], where a formal contract framework based
Structure. The paper is organised as follows. Section 2 recalls the concept of con-
tract based design and the contract meta-theory considered in the present paper.
In Section 3 we present a denotational semantics for programs with procedures,
including a semantics for contracts for use in procedure-modular verification.
Next, Section 4 presents our abstract contract theory for sequential programs
with procedures. Then, we show in Section 5 that our contract theory fulfils the
axioms of the meta-theory, while in Section 6 we show how the specification of
contracts of procedures in Hoare logic and their procedure-modular verification
can be cast in the framework of our abstract contract theory. We conclude with
Section 7.
We consider the meta-theory described in [5]. The stated purpose of the meta-
theory has been to distil the notion of a contract to its essence, so that it
can be used in system design methodologies without ambiguities. In particu-
lar, the meta-theory has been developed to give support for design-chain man-
agement, and to allow component reuse and independent development. It has
been shown that a number of concrete contract theories instantiate it, including
assume/guarantee-contracts, synchronous Moore interfaces, and interface theo-
ries. To our knowledge, this is the only meta-theory of its purpose and scope.
We now present the formal definitions of the concepts defined in the meta-
theory, and the properties that they entail. The meta-theory is defined only in
terms of semantics, and it is up to particular concrete instantiations to provide
a syntax.
1. Refinement. When $C_1 \preceq C_2$, every implementation of $C_1$ is also an implementation of $C_2$.
2. Shared refinement. Any contract refining $C_1 \wedge C_2$ also refines $C_1$ and $C_2$. Any implementation of $C_1 \wedge C_2$ is a shared implementation of $C_1$ and $C_2$. Any environment for $C_1$ and $C_2$ is an environment for $C_1 \wedge C_2$.
3. Independent implementability. Compatible contracts can be independently implemented.
4. Independent refinement. For all contracts $C_i$ and $C'_i$, $i \in I$: if $C_i$, $i \in I$, are compatible and $C'_i \preceq C_i$, $i \in I$, hold, then $C'_i$, $i \in I$, are compatible and $\bigotimes_{i \in I} C'_i \preceq \bigotimes_{i \in I} C_i$.
5. Commutativity, sub-associativity. For any finite set of contracts $C_i$, $i = 1, \ldots, n$: $C_1 \otimes C_2 = C_2 \otimes C_1$, and $\bigotimes_{1 \le i \le n} C_i \preceq (\bigotimes_{1 \le i < n} C_i) \otimes C_n$ holds.
6. Sub-distributivity. The following holds, if all contract compositions in the formula are well defined: $((C_{11} \wedge C_{21}) \otimes (C_{12} \wedge C_{22})) \preceq ((C_{11} \otimes C_{12}) \wedge (C_{21} \otimes C_{22}))$.
Next, we summarise Hoare logic and contracts, and provide a semantic justifi-
cation of procedure-modular verification, also based on denotational semantics.
where S ranges over statements, a over arithmetic expressions, and b over Boolean
expressions.
To define the denotational semantics of the language, we define the set State
of program states. A state s ∈ State is a mapping from the program variables
to, for simplicity, the set of integers.
The denotation of a statement $S$, denoted $[\![S]\!]$, is typically given as a partial function $State \to State$ such that $[\![S]\!](s) = s'$ whenever executing statement $S$ from the initial state $s$ terminates in state $s'$. In case executing $S$ from $s$ does not terminate, the value of $[\![S]\!](s)$ is undefined. The definition of $[\![S]\!]$ proceeds by induction on the structure of $S$. For example, the meaning of sequential composition of statements is usually captured with relation composition, as given by the equation $[\![S_1; S_2]\!] \stackrel{\text{def}}{=} [\![S_1]\!] \circ [\![S_2]\!]$. For the treatment of the remaining statements of the language, the reader is referred to [23,19].

The definition of denotation captures through its type (as a partial function) that the execution of statements is deterministic. For non-deterministic programs, the type of denotations is relaxed to $[\![S]\!] \subseteq State \times State$; then, $(s, s') \in [\![S]\!]$ captures that there is an execution of $S$ starting in $s$ that terminates in $s'$. For technical reasons that will become clear below, we shall use this latter denotation type in our treatment.
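As a small illustration of this relational domain, the following sketch (with our own toy encoding of states as sorted tuples of variable bindings) represents denotations as sets of state pairs and implements sequential composition as relation composition.

```python
def state(**vs):
    # canonical, hashable state: sorted tuple of (variable, value) pairs
    return tuple(sorted(vs.items()))

def compose(d1, d2):
    # relation composition: (s, s'') iff (s, s') in d1 and (s', s'') in d2
    return frozenset((s, s2) for (s, s1) in d1 for (t, s2) in d2 if s1 == t)

# [[x := x + 1]] restricted to a tiny finite domain where x is 0 or 1:
inc = frozenset({(state(x=0), state(x=1)), (state(x=1), state(x=2))})
print(compose(inc, inc))
# frozenset({((('x', 0),), (('x', 2),))}): the composed statement
# x := x+1; x := x+1 maps x=0 to x=2 (x=1 has no successor here)
```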
Note that we could alternatively have chosen State+ as the denotational do-
main, and most results would still hold in the context of finite-trace semantics.
However, we chose to develop the theory with a focus on Hoare-logic and de-
ductive verification. In fact, the domain State × State can be seen as a special
case of finite traces. In future work, we will also investigate concrete contract
languages based on this semantics, and extend the theory for that context.
Procedures and Procedure Calls. To extend the language and its denotational
semantics with procedures and procedure calls, we follow again the approach
of [23], but adapt it to an “open” setting, where some called procedures might not
be declared. We consider programs in the context of a finite set P of procedure
names (of some larger, “closed” program), and a set of procedure declarations of
Hoare Logic. The basic judgement of Hoare logic [15] is the Hoare triple, written
{P }S{Q}, where P and Q are assertions over the program state, and S is a
program statement. The Hoare triple signifies that if the statement S is executed
from a state that satisfies P (called the pre-condition), and if this execution
terminates, then the final state of the execution will satisfy Q (called the post-
condition). Additionally, so-called logical variables can be used within a Hoare
triple, to specify the desired relationship between the values of variables after
execution and the values of variables before execution. The values of the program
variables are defined by the notion of state; to give a meaning to the logical
variables we shall use interpretations I. We shall write s |=I P to signify that
the assertion P is true w.r.t. state s and interpretation I. The formal validity of
a Hoare triple is denoted by |=par {P }S{Q}, where the subscript signifies that
validity is in terms of partial correctness, where termination of the execution
of S is not required.
An example of a Hoare triple, stating the desired behaviour of procedure odd from Listing 1.1, is shown below, where we use the logical variable $n_0$ to capture the value of $n$ prior to execution of odd:

$$\{n = n_0 \wedge n \geq 0\}\ \mathit{odd}\ \{(\mathit{even}(n_0) \Rightarrow r = 0) \wedge (\mathit{odd}(n_0) \Rightarrow r = 1)\} \qquad (1)$$

The proof system of Hoare logic includes, among others, the rule of sequential composition:

$$\frac{\{P\}\,S_1\,\{R\} \qquad \{R\}\,S_2\,\{Q\}}{\{P\}\,S_1;\,S_2\,\{Q\}}$$

which essentially states that if executing S1 from any state satisfying P termi-
nates (if at all) in some state satisfying R, and executing S2 from any state
satisfying R terminates (if at all) in some state satisfying Q, then it is the case
that executing the composition S1 ; S2 from any state satisfying P terminates
(if at all) in some state satisfying Q. The proof system is sound and relatively
complete w.r.t. the denotational semantics of the programming language (see,
e.g., [23,19]).
Hoare Logic Contracts. One can view a Hoare triple {P }S{Q} as a contract
C = (P, Q) imposed on the program S. In many contexts it is meaningful to
separate the contract from the program; for instance, if the program is yet to
be implemented. In our earlier work [12], we gave such contracts a denotational
semantics as follows:
$$[\![C]\!] \stackrel{\text{def}}{=} \{(s, s') \mid \forall I.\ (s \models_I P \Rightarrow s' \models_I Q)\} \qquad (2)$$
The rationale behind this definition is the following desirable property: a program meets a contract whenever its denotation is subsumed by the denotation of the contract, i.e., $S \models_{par} C$ if and only if $[\![S]\!] \subseteq [\![C]\!]$.

For example, for the contract $C_{odd}$ induced by (1), we have that $(s, s') \in [\![C_{odd}]\!]$ if and only if either $s(n) < 0$, or else $s'(r) = 0$ if $s(n)$ is even and $s'(r) = 1$ if $s(n)$ is odd. The denotation of $C_{even}$ is analogous.
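Continuing the toy encoding above (reusing state), the contract check itself is just set inclusion; the two-pair program denotation below is an assumed miniature of procedure odd over a two-value domain for n, not the paper's actual semantics of Listing 1.1.

```python
# A toy denotation of an odd implementation: it sets r = n mod 2
# and leaves n unchanged, on the states n = 0 and n = 1.
prog_odd = frozenset({(state(n=0, r=0), state(n=0, r=0)),
                      (state(n=1, r=0), state(n=1, r=1))})

def in_c_odd(pair):
    # membership in [[C_odd]] as characterized in the text:
    # either the precondition n >= 0 fails, or r equals n mod 2
    s, s2 = dict(pair[0]), dict(pair[1])
    return s["n"] < 0 or s2["r"] == s["n"] % 2

print(all(in_c_odd(p) for p in prog_odd))  # True: [[S]] is within [[C_odd]]
```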
4.1 Components
In the context of a concrete programming language, we view a component as a
module, consisting of a collection of procedures that are provided by the module.
The module may call required procedures that are external to the module. The
way the provided procedures transform the program state upon a call depends
on how the required procedures transform the state. We take this observation
as the basis of our abstract setting, in which state transformers are modelled
as denotations (i.e., as binary relations over states). A component will thus be
simply a mapping from denotations of the required procedures to denotations of
the provided ones, both captured through the notion of procedure environments.
The contract theory is abstract, in that it is not defined for a particular
programming language, and may be instantiated with any procedural language.
As with the meta-theory, the abstract contract theory is also defined only on the
semantic level.
Recall the notions and notation from Section 3.1. A component interface $I = (P^-, P^+)$ is a pair of disjoint, finite sets of procedure names, of the required and the provided ones, respectively.

Definition 1 (Component). A component $m$ with interface $I_m = (P^-_m, P^+_m)$ is a mapping $m : Env_{P^-_m} \to Env_{P^+_m}$.
In other words, when a component is closed, i.e., does not depend on any external procedures, the provided environment is constant.
The composition $m_1 \times m_2$ of two composable components $m_1$ and $m_2$ has the interface:

$$P^+_{m_1 \times m_2} \stackrel{\text{def}}{=} P^+_{m_1} \cup P^+_{m_2} \qquad\qquad P^-_{m_1 \times m_2} \stackrel{\text{def}}{=} (P^-_{m_1} \cup P^-_{m_2}) \setminus (P^+_{m_1} \cup P^+_{m_2})$$

and is defined as:

$$m_1 \times m_2 \stackrel{\text{def}}{=} \lambda \rho^-_{m_1 \times m_2} \in Env_{P^-_{m_1 \times m_2}}.\ \mu\rho.\ \chi^+_{m_1 \times m_2}(\rho)$$

where $\chi^+_{m_1 \times m_2} : Env_{P^+_{m_1 \times m_2}} \to Env_{P^+_{m_1 \times m_2}}$ is defined, in the context of a given $\rho^-_{m_1 \times m_2} \in Env_{P^-_{m_1 \times m_2}}$, as follows. Let $\rho^+_{m_1 \times m_2} \in Env_{P^+_{m_1 \times m_2}}$, and let $\rho^-_{m_1} \in Env_{P^-_{m_1}}$ be the environment defined by:

$$\rho^-_{m_1}(p) \stackrel{\text{def}}{=} \begin{cases} \rho^+_{m_1 \times m_2}(p) & \text{if } p \in P^-_{m_1} \cap P^+_{m_2} \\ \rho^-_{m_1 \times m_2}(p) & \text{if } p \in P^-_{m_1} \setminus P^+_{m_2} \end{cases}$$

and let $\rho^-_{m_2} \in Env_{P^-_{m_2}}$ be defined symmetrically. We then define:

$$\chi^+_{m_1 \times m_2}(\rho^+_{m_1 \times m_2})(p) \stackrel{\text{def}}{=} \begin{cases} m_1(\rho^-_{m_1})(p) & \text{if } p \in P^+_{m_1} \\ m_2(\rho^-_{m_2})(p) & \text{if } p \in P^+_{m_2} \end{cases}$$
The reason for not requiring the interfaces to be equal is that we aim at a subset
relation between components implementing a contract and those implementing
a refinement of said contract, in the meta-theory instantiation.
For a mapping $h : A \to B$ and a set $A' \subseteq A$, let $h|_{A'}$ denote, as usual, the restriction of $h$ to $A'$.
Definition 6 (Contract environment). A component $m$ is an environment for contract $c$ iff, for any implementation $m'$ of $c$, $m$ and $m'$ are composable, and $\forall \rho^-_{m \times m'} \in Env_{P^-_{m \times m'}}.\ (m \times m')(\rho^-_{m \times m'})|_{P^+_c} \sqsubseteq \rho^+_c$, where $\sqsubseteq$ denotes pointwise inclusion of environments.
where $\sqcup$ and $\sqcap$ are the lub and glb operations of the lattice, respectively. This definition is consistent with the intention that any contract that refines $c_1 \wedge c_2$ should also refine $c_1$ and $c_2$ individually. The interface of $c_1 \wedge c_2$ is then $I_{c_1 \wedge c_2} = (P^-_{c_1} \cup P^-_{c_2},\ P^+_{c_1} \cap P^+_{c_2})$. Note that while this is the interface in general, conjunction of contracts is typically used to merge different viewpoints of the same component, in which case $I_{c_1} = I_{c_2} = I_{c_1 \wedge c_2}$.
Definition 9 (Contract composability). Two contracts $c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ and $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$ with interfaces $I_{c_1} = (P^-_{c_1}, P^+_{c_1})$ and $I_{c_2} = (P^-_{c_2}, P^+_{c_2})$ are composable if: (i) $P^+_{c_1} \cap P^+_{c_2} = \emptyset$, (ii) $\forall p \in P^-_{c_1} \cap P^+_{c_2}.\ \rho^+_{c_2}(p) \subseteq \rho^-_{c_1}(p)$, and (iii) $\forall p \in P^-_{c_2} \cap P^+_{c_1}.\ \rho^+_{c_1}(p) \subseteq \rho^-_{c_2}(p)$.
The conditions for composability ensure that the mutual guarantees of the two
contracts meet each other’s assumptions.
Definition 10 (Contract composition). The composition of two composable contracts $c_1 = (\rho^-_{c_1}, \rho^+_{c_1})$ and $c_2 = (\rho^-_{c_2}, \rho^+_{c_2})$, with interfaces $I_{c_1} = (P^-_{c_1}, P^+_{c_1})$ and $I_{c_2} = (P^-_{c_2}, P^+_{c_2})$, respectively, is the contract $c_1 \otimes c_2 \stackrel{\text{def}}{=} (\rho^-_{c_1 \otimes c_2},\ \rho^+_{c_1} \sqcup \rho^+_{c_2})$, where:

$$\rho^-_{c_1 \otimes c_2} \stackrel{\text{def}}{=} (\rho^-_{c_1} \sqcap \rho^-_{c_2})|_{(P^-_{c_1} \cup P^-_{c_2}) \setminus (P^+_{c_1} \cup P^+_{c_2})}$$

The interface of $c_1 \otimes c_2$ is $I_{c_1 \otimes c_2} = ((P^-_{c_1} \cup P^-_{c_2}) \setminus (P^+_{c_1} \cup P^+_{c_2}),\ P^+_{c_1} \cup P^+_{c_2})$.
Theorem 2. For any composable contracts c1 and c2 , and any implementations
m1 |= c1 and m2 |= c2 , m1 and m2 are composable, and c1 ⊗ c2 is the least
contract (w.r.t. refinement order) for which the following properties hold:
(i) m1 × m2 |= c1 ⊗ c2 ,
(ii) if m is an environment to c1 ⊗ c2 , then m1 × m is an environment to c2 ,
(iii) if m is an environment to c1 ⊗ c2 , then m × m2 is an environment to c1 .
5 Connection to Meta-Theory
In this section we show that the abstract contract theory presented in Section 4
instantiates the meta-theory described in Section 2.2.
In our instantiation of the meta-theory, we consider as the abstract compo-
nent universe M the same universe of components M as defined in Section 4.1.
To distinguish the contracts of the meta-theory from those of the abstract the-
ory, we shall always denote the former by C and the latter by c. Recall that a
contract C is a pair (E, M ), where E, M ⊆ M. The formal connection between
the two notions is established with the following definition.
Definition 11 (Induced contract). Let $c$ be a denotational contract. It induces the contract $C_c \overset{\text{def}}{=} (E_c, M_c)$, where $E_c \overset{\text{def}}{=} \{m \in \mathcal{M} \mid m \text{ is an environment for } c\}$ and $M_c \overset{\text{def}}{=} \{m \in \mathcal{M} \mid m \models c\}$.
Since contract implementation requires that the implementing component's provided functions are a subset of the contract's provided functions, every component $m$ such that $P^+_m \cap P^+_c = \emptyset$ is composable with every component in $M_c$.
The definitions of implementation, refinement and conjunction of denota-
tional contracts make this straightforward definition of induced contracts possi-
ble, so that it directly results in refinement as set membership and conjunction
as lub w.r.t. the refinement order.
Theorem 3. The contract theory of Section 4 instantiates the meta-theory of
Benveniste et al. [5], in the sense that composition of components is associative
and commutative, and for any two contracts c1 and c2 :
(i) c1 ⪯ c2 iff Cc1 refines Cc2 according to the definition of the meta-theory,
(ii) Cc1∧c2 is the conjunction of Cc1 and Cc2 as defined in the meta-theory, and
(iii) Cc1⊗c2 is the composition of Cc1 and Cc2 as defined in the meta-theory.
The proof is straightforward, since many definitions of the contract theory are
deliberately similar to their counterparts in the meta-theory.
Let us now return to our example from Section 3. When applying Contract Based Design, contracts at the more abstract level will be decomposed into contracts at the more concrete level. So, for our example, we might have at the top level a contract $c = (\rho^-_c, \rho^+_c)$ with interface $(\emptyset, \{even, odd\})$, where $\rho^-_c = \emptyset$, and where $\rho^+_c \in Env_{P^+_c}$ maps $even$ to the set of pairs $(s, s')$ such that whenever $s(n)$ is non-negative and even, then $s'(r) = 1$, and when $s(n)$ is non-negative and odd, then $s'(r) = 0$, and maps $odd$ in a dual manner. This contract could then be decomposed into two contracts $c_{even}$ and $c_{odd}$, so that $\rho^+_{c_{even}}(even) \overset{\text{def}}{=} \rho^+_c(even)$ and $\rho^-_{c_{even}}(odd) \overset{\text{def}}{=} \rho^+_c(odd)$, and $c_{odd}$ is analogous. Then, we would have $c_{even} \otimes c_{odd} \preceq c$, and for any two components $m_{even}$ and $m_{odd}$ such that $m_{even} \models c_{even}$ and $m_{odd} \models c_{odd}$, it would hold that $m_{even} \times m_{odd} \models c$.
6 Connection to Programs with Procedures
In this section we discuss how our abstract contract theory from Section 4 relates to programs with procedures as presented in Section 3.1, and how it relates to Hoare logic and procedure-modular verification as presented in Section 3.2.
First, we define how to abstract the denotational notion of procedures into components in the abstract theory, based on the function $\xi$ from Section 3.1.
Definition 12 (Procedure set abstraction). A set of procedures $P^+ \subseteq P$ is abstracted into the component $m : Env_{P^-_m} \to Env_{P^+_m}$, where $P^-_m \overset{\text{def}}{=} P \setminus P^+_m$ and $P^+_m \overset{\text{def}}{=} P^+$, so that $\forall \rho^-_m \in Env_{P^-_m}.\ \forall p \in P^+_m.\ m(\rho^-_m)(p) \overset{\text{def}}{=} [\![S_p]\!]\,\rho^-_m$.
As the next result shows, procedure set abstraction and component compo-
sition commute. Together with commutativity and associativity of component
composition, this means that the initial grouping of procedures into components
is irrelevant, and that one can start with abstracting each individual procedure
into a component.
Theorem 4. For any two disjoint sets of procedures P1+ and P2+ , abstracted
individually into components m1 and m2 , respectively, and P1+ ∪ P2+ abstracted
into component m, it holds that m1 × m2 = m.
We now define how to abstract Hoare logic contracts into denotational contracts, in terms of the contract environment $\rho_c$ defined in Section 3.2.
Definition 13 (Contract abstraction). The Hoare logic contract of a procedure $p$ is abstracted into the denotational contract $c_p$ with $P^+_{c_p} \overset{\text{def}}{=} \{p\}$ and $P^-_{c_p} \overset{\text{def}}{=} P^-$, so that $\rho^+_{c_p}(p) \overset{\text{def}}{=} \rho_c(p)$, and $\forall p' \in P^-.\ \rho^-_{c_p}(p') \overset{\text{def}}{=} \rho_c(p')$.
The result follows mainly from Definitions 12 and 13, and the denotational se-
mantics given in Section 3.
Returning to the example from Sections 3 and 5, we can abstract the procedure set $\{even\}$ into a component $m_{even}$, with interface $(\{odd\}, \{even\})$, which would be a function $Env_{\{odd\}} \to Env_{\{even\}}$, with $\forall \rho^- \in Env_{\{odd\}}.\ m_{even}(\rho^-)(even) = [\![S_{even}]\!]\,\rho^-$. The denotational contracts $c_{even}$ and $c_{odd}$ resulting from the decomposition shown in Section 5 would be exactly the abstraction of the Hoare logic contracts $C_{even}$ and $C_{odd}$ shown in Section 3.2. They would both be part of the contract environment used in procedure-modular verification, for example when verifying that $S_{even} \models_{crpar} C_{even}$, which would entail $m_{even} \models c_{even}$. Thus, by applying standard procedure-modular verification at the source code level, we prove the top-level contract $c$ proposed in Section 5.
7 Conclusion
References
1. Abadi, M., Lamport, L.: Composing specifications. ACM Trans. Program. Lang.
Syst. 15(1), 73–132 (Jan 1993). https://doi.org/10.1145/151646.151649
2. Bauer, S., David, A., Hennicker, R., Larsen, K., Legay, A., Nyman, U., Wąsowski, A.: Moving from specifications to contracts in component-based design. In: Fundamental Approaches to Software Engineering. pp. 43–58 (2012). https://doi.org/10.1007/978-3-642-28872-2_3
3. Bekić, H.: Definable operations in general algebras, and the theory of automata and flowcharts. In: Programming Languages and Their Definition - Hans Bekić (1936-1982). Lecture Notes in Computer Science, vol. 177, pp. 30–55. Springer (1984). https://doi.org/10.1007/BFb0048939
4. Benveniste, A., Caillaud, B., Ferrari, A., Mangeruca, L., Passerone, R., Sofronis, C.: Multiple viewpoint contract-based specification and design. In: Formal Methods for Components and Objects. vol. 5382, pp. 200–225 (2007). https://doi.org/10.1007/978-3-540-92188-2_9
5. Benveniste, A., Caillaud, B., Nickovic, D., Passerone, R., Raclet, J.B.,
Reinkemeier, P., Sangiovanni-Vincentelli, A., Damm, W., Henzinger, T.A.,
Larsen, K.G.: Contracts for System Design, vol. 12. Now Publishers (2018).
https://doi.org/10.1561/1000000053
6. Benvenuti, L., Ferrari, A., Mangeruca, L., Mazzi, E., Passerone, R., Sofronis, C.: A contract-based formalism for the specification of heterogeneous systems. In: 2008 Forum on Specification, Verification and Design Languages. pp. 142–147 (2008). https://doi.org/10.1109/FDL.2008.4641436
7. Bubel, R., Hähnle, R., Pelevina, M.: Fully abstract operation contracts. In: Lever-
aging Applications of Formal Methods, Verification and Validation. Specialized
Techniques and Applications. pp. 120–134 (2014). https://doi.org/10.1007/978-3-
662-45231-8_9
8. Chen, T., Chilton, C., Jonsson, B., Kwiatkowska, M.: A compositional specifica-
tion theory for component behaviours. In: Programming Languages and Systems.
pp. 148–168. Springer Berlin Heidelberg (2012). https://doi.org/10.1007/978-3-
642-28869-2_8
9. Cimatti, A., Tonetta, S.: Contracts-refinement proof system for component-
based embedded systems. Science of Computer Programming 97 (2015).
https://doi.org/10.1016/j.scico.2014.06.011
10. Floyd, R.W.: Assigning meanings to programs. Mathematical aspects of computer
science 19, 19–32 (1967). https://doi.org/10.1007/978-94-011-1793-7_4
11. Gurov, D., Lidström, C., Nyberg, M., Westman, J.: Deductive functional verifi-
cation of safety-critical embedded c-code: An experience report. In: Proceedings
of FMICS-AVoCS 2017. Lecture Notes in Computer Science, vol. 10471, pp. 3–18.
Springer (2017). https://doi.org/10.1007/978-3-319-67113-0_1
12. Gurov, D., Westman, J.: A Hoare Logic Contract Theory: An Exercise in Deno-
tational Semantics, pp. 119–127. Springer International Publishing, Cham (2018).
https://doi.org/10.1007/978-3-319-98047-8_8
13. Hähnle, R., Schaefer, I., Bubel, R.: Reuse in software verification by abstract
method calls. In: Automated Deduction – CADE-24. vol. 7898, pp. 300–314 (06
2013). https://doi.org/10.1007/978-3-642-38574-2_21
14. Hatcliff, J., Leavens, G.T., Leino, K.R.M., Müller, P., Parkinson, M.: Behav-
ioral interface specification languages. ACM Comput. Surv. 44(3) (Jun 2012).
https://doi.org/10.1145/2187671.2187678
An Abstract Contract Theory for Programs with Procedures 171
15. Hoare, C.A.R.: An axiomatic basis for computer programming. Commun. ACM
12(10), 576–580 (Oct 1969). https://doi.org/10.1145/363235.363259
16. Jones, C.: Specification and design of (parallel) programs. In: Proceedings of IFIP'83. vol. 83, pp. 321–332 (1983)
17. Lidström, C., Gurov, D.: An abstract contract theory for programs with procedures
(full version). CoRR abs/2101.06087 (2021), https://arxiv.org/abs/2101.06087
18. Meyer, B.: Applying "design by contract". IEEE Computer 25(10), 40–51 (1992).
https://doi.org/10.1109/2.161279
19. Nielson, H.R., Nielson, F.: Semantics with Applications: An Appetizer. Springer-
Verlag, Berlin, Heidelberg (2007). https://doi.org/10.1007/978-1-84628-692-6
20. Nyberg, M., Westman, J., Gurov, D.: Formally proving compositionality in
industrial systems with informal specifications. In: Margaria, T., Steffen, B.
(eds.) Leveraging Applications of Formal Methods, Verification and Valida-
tion: Applications. pp. 348–365. Springer International Publishing, Cham (2020).
https://doi.org/10.1007/978-3-030-61467-6_22
21. Owe, O., Ramezanifarkhani, T., Fazeldehkordi, E.: Hoare-style reasoning from mul-
tiple contracts. In: Integrated Formal Methods - 13th International Conference.
Lecture Notes in Computer Science, vol. 10510, pp. 263–278. Springer (2017).
https://doi.org/10.1007/978-3-319-66845-1_17
22. van Staden, S.: On rely-guarantee reasoning. In: Mathematics of Program
Construction. pp. 30–49. Springer International Publishing, Cham (2015).
https://doi.org/10.1007/978-3-319-19797-5_2
23. Winskel, G.: The Formal Semantics of Programming Languages:
An Introduction. MIT Press, Cambridge, MA, USA (1993).
https://doi.org/10.7551/mitpress/3054.001.0001
Paracosm: A Test Framework for Autonomous
Driving Simulations
1 Introduction
Fig. 1: The parts of a Paracosm test: a Paracosm program describing the world (road segments, a test vehicle and a pedestrian, each with a visual model, a physical model, and a controller or behavior, plus monitors such as a collision monitor), a test input generator, and the system under test (SUT) connected to the simulation.
A promising approach is to test autonomous systems in virtual simulation environments [21, 26, 53, 61, 68, 72].
Simulation reduces the cost per test, and more importantly, gives precise control
over all aspects of the environment, so as to test corner cases.
A major limitation of current tools is the lack of customizability: they either
provide a GUI-based interface to design an environment piece-by-piece, or focus
on bespoke pre-made environments. This makes the setup of varied scenarios
difficult and time consuming. Though exploiting parametricity in simulation is
useful and effective [10,23,31,67], the cost of environment setup, and navigating
large parameter spaces, is quite high [31]. Prior works have used bespoke en-
vironments with limited parametricity. More recently, programmatic interfaces
have been proposed [27] to make such test procedures more systematic. However,
the simulated environments are largely still fixed, with no dynamic behavior.
In this work, we present Paracosm, a programmatic interface that enables
the design of parameterized environments and test cases. Test parameters control
the environment and the behaviors of the actors involved. Paracosm supports
various test input generation strategies, and we provide a notion of coverage for
these. Rather than computing coverage over intrinsic properties of the system
under test (which is not yet understood for neural networks [39]), our coverage
criteria are defined over the space of test parameters. Figure 1 depicts the various parts
of a Paracosm test. A Paracosm program represents a family of tests, where
each instantiation of the program’s parameters is a concrete test case.
Paracosm is based on a synchronous reactive programming model [13, 35,
40,70]. Components, such as road segments or cars, receive streams of inputs and
produce streams of outputs over time. In addition, components have graphical
assets to describe their appearance for an underlying visual rendering engine and
physical properties for an underlying physics simulator. For example, a vehicle
in Paracosm not only has code that reads in sensor feeds and outputs steering
angle or braking, but also has a textured mesh representing its shape, position
and orientation in 3D space, and a physics model for its dynamical behavior. A
Paracosm configuration consists of a composition of several components. Us-
ing a set of system-defined components (road segments, cars, pedestrians, etc.)
combined using expressive operations from the underlying reactive programming
model, users can set up complex temporally varying driving scenarios. For ex-
ample, one can build an urban road network with intersections, pedestrians and
vehicular traffic, and parameterize both environment conditions (lighting, fog)
and behaviors (when a pedestrian crosses a street).
Streams in the world description can be left “open” and, during testing,
Paracosm automatically generates sequences of values for these streams. We use
a coverage strategy based on k-wise combinatorial coverage [14, 38] for discrete
variables and dispersion for continuous variables. Intuitively, k-wise coverage
ensures that, for a programmer-specified parameter k, all possible combinations
of values of any k discrete parameters are covered by tests. Low dispersion [57]
ensures that there are no “large empty holes” left in the continuous parameter
space. Paracosm uses an automatic test generation strategy that offers high
coverage based on random sampling over discrete parameters and deterministic
quasi-Monte Carlo methods for continuous parameters [49, 57].
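To illustrate (a sketch under our own naming, not Paracosm's implementation), the following Python fragment pairs a Halton sequence for continuous parameters with uniform random choices for discrete ones:

import random

def halton(index, base):
    # index-th element of the van der Corput sequence in the given base;
    # coordinates of a Halton point use distinct prime bases.
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def generate_tests(n, continuous, discrete, seed=0):
    # continuous: list of (lo, hi) intervals; discrete: list of value lists.
    rng = random.Random(seed)
    primes = [2, 3, 5, 7, 11, 13][:len(continuous)]
    tests = []
    for i in range(1, n + 1):
        cont = [lo + halton(i, b) * (hi - lo)
                for (lo, hi), b in zip(continuous, primes)]
        disc = [rng.choice(vals) for vals in discrete]
        tests.append((cont, disc))
    return tests

# e.g. light in [0.2, 1.0] (continuous) and nlanes in {2, 4, 6} (discrete):
samples = generate_tests(100, [(0.2, 1.0)], [[2, 4, 6]])

The Halton points spread deterministically over the continuous box (low dispersion), while random sampling of the discrete values yields k-wise coverage with high probability, as discussed below.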
Like many of the projects referenced before, our implementation performs
simulations inside a game engine. However, Paracosm configurations can also
be output to the OpenDRIVE format [7] for use with other simulators, which is
more in line with the current industry standard. We demonstrate through various
case studies how Paracosm can be an effective testing framework for both
qualitative properties (crash) and quantitative properties (distance maintained
while following a car, or image misclassification).
Our main contributions are the following: (I) We present a programmable
and expressive framework for programmatically modeling complex and parame-
terized scenarios to test autonomous driving systems. Using Paracosm, one can
specify the environment's layout and the behaviors of actors, and expose parameters
to a systematic testing infrastructure. (II) We define a notion of test coverage
based on combinatorial k-wise coverage in discrete space and low dispersion in
continuous space. We show a test generation strategy based on fuzzing that the-
oretically guarantees good coverage. (III) We demonstrate empirically that our
system is able to express complex scenarios and automatically test autonomous
driving agents and find incorrect behaviors or degraded performance.
An autonomous driving system must be tested under many different driving scenarios, including different road networks, weather and light conditions, and other car and pedestrian traffic. We show how Paracosm enables writing such tests, as well as generating test inputs automatically.
A test configuration consists of a composition of reactive objects. The follow-
ing is an outline of a test configuration in Paracosm, in which the autonomous
vehicle drives on a road with a pedestrian wanting to cross. We have simplified
the API syntax for the sake of clarity and omit the enclosing Test class. In the
code segments, we use ‘:’ for named arguments.
// Test parameters
light = VarInterval(0.2, 1.0)  // value in [0.2, 1.0]
nlanes = VarEnum({2, 4, 6})    // value is 2, 4 or 6
// Description of environment
w = World(light: light, fog: 0)
// Create a road segment
r = StraightRoadSegment(len: 100, nlanes: nlanes)
// The autonomous vehicle controlled by the SUT
v = AutonomousVehicle(start: ..., model: ..., controller: ...)
// Some other actor(s)
p = Pedestrian(start: ..., model: ..., ...)
// Monitor to check some property
c = CollisionMonitor(v)
// Place elements in the world
run_test(env: {w, r, v, p}, test_params: {light, nlanes},
         monitors: {c}, iterations: 100)
Fig. 2: Camera views rendered for different values of the test parameters nlanes (2, 4) and light (0.2, 0.9).
Test Parameters. Using test variables, we can pass general but constrained
streams of values into objects [59]. Our automatic test generator can
then pick values for these variables, thereby leading to different test cases (see
Figure 2). There are two types of parameters: continuous (VarInterval) and dis-
crete (VarEnum). In the example presented, light (light intensity) is a continuous
test parameter and nlanes (number of lanes) is discrete.
It may seem surprising that we model static scene components such as roads
as reactive objects. This serves two purposes. First, we can treat the number of
lanes in a road segment as a constant input stream that is set by the test case,
allowing parameterized test cases. Second, certain features of static objects can
also change over time. For example, the coefficient of friction on a road segment
may depend on the weather condition, which can be a function of time.
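For instance, a road segment's friction coefficient can be modeled as a stream derived from a weather stream; the sketch below is illustrative Python (the names and coefficient values are ours, not Paracosm's API):

# Illustrative sketch: a "static" road segment exposing a time-varying
# friction stream derived from a weather stream (one value per tick).
def friction_stream(weather_stream):
    table = {"dry": 0.9, "wet": 0.5, "icy": 0.1}  # assumed coefficients
    for weather in weather_stream:
        yield table[weather]

# e.g. list(friction_stream(["dry", "dry", "wet"])) == [0.9, 0.9, 0.5]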
Other Actors. A test often involves many actors such as pedestrians, and other
(non-test) vehicles. Apart from the standard geometric (optionally physical)
properties, these can also have some pre-programmed behavior. Behaviors can
either be only dependent on the starting position (say, a car driving straight
on the same lane), or be dynamic and reactive, depending on test parameters
and behaviors of other actors. In general, the reactive nature of objects enables
complex scenarios to be built. For example, here, we specify a simple behavior of a pedestrian crossing a road (a sketch follows below): the pedestrian starts crossing the road when a car is a certain distance away. In the code segments below, we use '_' as shorthand for a lambda expression, i.e., "f(_)" is the same as "x => f(x)".
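As a rough sketch of such a behavior in plain Python (not the Paracosm API; the class and parameter names are ours), the pedestrian monitors the car's position each tick and begins walking once it comes within a trigger distance:

class CrossingPedestrian:
    # Illustrative sketch (not the Paracosm API): the pedestrian starts
    # crossing once the approaching car is within trigger_dist meters.
    def __init__(self, road_pos, trigger_dist, walking_speed):
        self.road_pos = road_pos          # position along the road (m)
        self.lateral = 0.0                # progress across the road (m)
        self.trigger_dist = trigger_dist  # e.g. a test parameter
        self.walking_speed = walking_speed
        self.crossing = False

    def on_tick(self, car_road_pos, dt):
        # Called once per simulation tick with the car's road position.
        if (not self.crossing
                and self.road_pos - car_road_pos <= self.trigger_dist):
            self.crossing = True
        if self.crossing:
            self.lateral += self.walking_speed * dt

In Paracosm itself, both the trigger distance and the walking speed can be exposed as test parameters, which is exactly what the jaywalking pedestrian case study in Section 4.2 does.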
Discrete streams take their value over a finite, discrete range; for example, the color of a car, the number
of lanes on a road segment, or the position of the next pedestrian (left/right) are
discrete streams. Continuous streams take their values in a continuous (bounded)
interval. For example, the fog density or the speed of a vehicle are examples of
continuous streams.
The goal of our default test generator is to maximize $(k, \epsilon)$-coverage for a programmer-specified number of test iterations or ticks.
k-Wise Covering Family. One can use explicit construction results from combinatorial testing to generate k-wise covering families [14]. However, a simple way to generate such families with high probability is random testing. The proof is by the probabilistic method [4] (see also [44]). Let $A$ be a set of $2^k(k \log N - \log \delta)$ uniformly randomly generated $\{0,1\}^N$ vectors. Then $A$ is a $k$-wise covering family with probability at least $1 - \delta$.
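For intuition, the bound is easy to evaluate; the sketch below (our arithmetic, assuming natural logarithms) computes the number of random vectors it prescribes:

from math import ceil, log

def kwise_sample_size(n_params, k, delta):
    # Random {0,1}^N vectors sufficing for a k-wise covering family
    # with probability at least 1 - delta (bound from the text).
    return ceil(2**k * (k * log(n_params) - log(delta)))

# For N = 5 boolean-encoded parameters, k = 3, delta = 0.01:
print(kwise_sample_size(5, 3, 0.01))  # -> 76

Notably, the required number of samples grows only logarithmically in the number of parameters N.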
Cost Functions and Local Search. In many situations, testers want to optimize
parameter values for a specific function. A simple example of this is finding
higher-speed collisions, which intuitively, can be found in the vicinity of test pa-
rameters that already result in high-speed collisions. Another, slightly different
case is (greybox) fuzzing [5, 55], for example, finding new collisions using small
mutations on parameter values that result in the vehicle narrowly avoiding a col-
lision. Our test generator supports such quantitative objectives and local search.
A quantitative monitor evaluates a cost function on a run of a test case. Our test
generation tool generates an initial, randomly chosen, set of test inputs. Then,
it considers the scores returned by the Monitor on these samples, and performs
a local search on samples with the highest/lowest scores to find local optima of
the cost function.
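The following Python sketch mirrors this generate-then-refine loop under stated assumptions: run_and_score stands in for executing a test and reading the monitor's score, and the local search is a simple clamped Gaussian mutation rather than Paracosm's exact simulated-annealing procedure.

import random

def optimize_tests(run_and_score, sample, n_initial=85, n_top=5,
                   n_mutate=3, sigma=0.05, seed=0):
    # run_and_score: parameter vector (values in [0, 1]) -> monitor score,
    # where higher scores mark runs closer to (or at) failure.
    # sample: () -> a fresh random or quasi-random parameter vector.
    rng = random.Random(seed)
    scored = [(run_and_score(t), t)
              for t in (sample() for _ in range(n_initial))]
    scored.sort(key=lambda st: st[0], reverse=True)

    results = list(scored)
    for score, params in scored[:n_top]:
        best_score, best = score, params
        for _ in range(n_mutate):
            # Clamped Gaussian perturbation of the current best point.
            cand = [min(1.0, max(0.0, x + rng.gauss(0.0, sigma)))
                    for x in best]
            s = run_and_score(cand)
            results.append((s, cand))
            if s > best_score:
                best_score, best = s, cand
    return sorted(results, key=lambda st: st[0], reverse=True)

The 85/5/3 split matches the configuration used in the jaywalking pedestrian evaluation below.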
4.1 Implementation
Paracosm uses the Unity game engine [69] to render visuals, do runtime checks
and simulate physics (via PhysX [16]). Reactive objects are built on top of UniRx
[36], an implementation of the popular Reactive Extensions framework [56]. The
game engine manages geometric transformations of 3D objects and offers easy
to use abstractions for generating realistic simulations. Encoding behaviors and
monitors, management of 3D geometry and dynamic checks are implemented
using the game engine interface. The project code is available at: https://gitlab.
mpi-sws.org/mathur/paracosm.
A simulation in Paracosm proceeds as follows. A test configuration is specified as a subclass of the EnvironmentProgramBaseClass. Tests are run by invoking
the run_test method, which receives as input the reactive objects that should
be instantiated in the world as well as additional parameters relating to the test.
The run_test method runs the tests by first initializing and placing the reactive
objects in the scene using their 3D mesh (if they have one) and then invoking a
reactive engine to start the simulation. The system under test is run in a sepa-
rate process and connects to the simulation. The simulation then proceeds until the simulation completion criterion is met (a time-out or some monitor event).
Output to Standardized Testing Formats. There have been recent efforts to cre-
ate standardized descriptions of tests in the automotive industry. The most
relevant formats are OpenDRIVE [7] and OpenSCENARIO (only recently
finalized) [8]. OpenDRIVE describes road structures, and OpenSCENARIO
describes actors and their behavior. Paracosm currently supports outputs to
OpenDRIVE. Due to the static nature of the specification format, a different
file is generated for each test iteration/configuration.
4.2 Evaluation
Table 1: An overview of our case studies. Note that even though the Adaptive Cruise Control study has 2 discrete parameters, we calculate k-wise coverage for 3, as the 2 parameters require 3 bits for representation.

            | Road segmentation          | Jaywalking pedestrian            | Adaptive Cruise Control
SUT         | VGGNet CNN [62]            | NVIDIA CNN [12]                  | NVIDIA CNN [12]
Training    | 191 images                 | 403 image & car control samples  | 1034 image & car control samples
Test params | 3 discrete                 | 2 continuous                     | 3 continuous & 2 discrete
Test iters  | 100                        | 100, 15 s timeout                | 100, 15 s timeout
Monitor     | Ground truth               | Scored Collision                 | Collision & Distance
Coverage    | k = 3 with probability ∼ 1 | ε = 0.041                        | ε = 0.043, k = 3 with probability ∼ 1
Fig. 3: Example results from the road segmentation case study. Pixels with a green mask are segmented by the SUT as road. (a) A good test with all parameter values the same as the training set (true positive: 89%, false positive: 0%). (b) A bad test with all parameter values different from the training set (true positive: 9%, false positive: 1%).
how Paracosm can help uncover problematic scenarios. A summary of the case
studies presented here is available in Table 1. In our Technical Report [43], we
present more case studies, specifically experiments on many pre-trained neu-
ral networks, busy urban environments and studies exploiting specific testing
features of Paracosm.
Table 2: Summary of results of the road segmentation case study. Each combi-
nation of parameter values is presented separately, with the parameter values
used for training in bold. We report the SUT’s average true positive rate (% of
pixels corresponding to the road that are correctly classified) and false positive
rate (% of pixels that are not road, but incorrectly classified as road).
# lanes # cars Lighting # test iters True positive (%) False positive (%)
2 5 Noon 12 70% 5.1%
2 5 Evening 14 53.4% 22.4%
2 0 Evening 12 51.4% 18.9%
2 0 Noon 12 71.3% 6%
4 5 Evening 10 60.4% 7.1%
4 5 Noon 16 68.5% 20.2%
4 0 Evening 13 51.5% 7.1%
4 0 Noon 11 83.3% 21%
The SUT is trained on 191 images captured in the test environment with fixed parameters (2 lanes, 5 cars, and Noon lighting), recorded at the rate of one image every 1/10th second, while
manually driving the vehicle around (using a keyboard). We test on 100 images
generated using Paracosm’s default test generation strategy (uniform random
sampling for discrete parameters). Table 2 summarizes the test results. The SUT is observed to perform poorly on tests with parameter values far from the training set. As depicted in Figure 3, this happens because varying test parameters can
drastically change the scene.
Jaywalking Pedestrian. In the training data, the pedestrian crosses the road at various time delays, but always at a fixed walking speed (1 m/s). In
order to evaluate RQ 2 completely, we evaluate the default coverage maximizing
sampling approach, as well as explore two quantitative objectives: first, maxi-
mizing the collision speed, and second, finding new failing cases around samples
that almost fail. For the default approach, the CollisionMonitor as presented
in Section 2 is used. For the first quantitative objective, this CollisionMonitor’s
code is prepended with the following calculation:
// Score is speed of car at time of collision
coll_speed = v.speed.CombineLatest(v.collider, (s, c) => s)
                    .First()
The score coll_speed is used by the test generator for optimization. For the sec-
ond quantitative objective, the CollisionMonitor is modified to give high scores
to tests where the distance between the autonomous vehicle and pedestrian is
very small:
CollisionMonitor(AutonomousVehicle v, Pedestrian p)
    extends Monitor {
  minDist = v.pos.Zip(p.pos).Map(1/abs(_ - _)).Min()
  coll_score = v.collider.Map(0)
  // Score is either 0 (collision) or 1/minDist
  score = coll_score.DefaultIfEmpty(minDist)
  assert(v.collider.IsEmpty())
}
We evaluate the following test input generation strategies: (i) Random sampling, (ii) Halton sampling, (iii) Random or Halton sampling with local search
for the two quantitative objectives. We run 100 iterations of each strategy with
a 15 second timeout. For random or Halton sampling, we sample 100 times. For
the quantitative objectives, we first generate 85 random or Halton samples, then
choose the top 5 scores, and finally run 3 simulated annealing iterations on each
of these 5 configurations. Table 3 presents results from the various test input gen-
eration strategies. Clearly, Halton sampling offers the lowest dispersion (highest
coverage) over the parameter space. This can also be visually confirmed from
the plot of test parameters (Figure 4). There are no big gaps in the parameter
space. Moreover, we find that test strategies optimizing for the first objective
are successful in finding more collisions with higher speeds. As these techniques
perform simulated annealing repetitions on top of already failing tests, they also
find more failing tests overall. Finally, test strategies using the second objective
are also successful in finding more (newer) failure cases than simple Random or
Halton sampling.
Adaptive Cruise Control. We now create and test in an environment with our
test vehicle following a car (lead car) on the same lane. The lead car’s behav-
ior is programmed to drive on the same lane as the test vehicle, with a certain
maximum speed. This is a very typical driving scenario that engineers test their
implementations on. We use 5 test parameters: the initial lead of the lead car to
Fig. 4: A comparison of the various test generation strategies for the jaywalking pedestrian case study: (a) Random sampling (no opt.), (b) Random + opt. / maximizing collision, (c) Random + opt. / almost failing, (d) Halton sampling (no opt.), (e) Halton + opt. / maximizing collision, (f) Halton + opt. / almost failing. The X-axis is the walking speed of the pedestrian (2 to 10 m/s). The Y-axis is the distance from the car when the pedestrian starts crossing (30 to 60 m). Passing tests are labelled with a green dot. Failing tests (tests with a collision) are marked with a red cross.
the test vehicle ([8 m, 40 m]), the lead car's maximum speed ([3 m/s, 8 m/s]), density of fog³ in the environment ([0, 1]), number of lanes on the road ({2, 4}), and color of the lead car ({Black, Red, Yellow, Blue}). We use both CollisionMonitor and DistanceMonitor⁴, as presented in Section 2. A test passes if there is no collision and the autonomous vehicle moves at least 5 m during the simulation duration (15 s).
We use Paracosm’s default test generation strategy, i.e., Halton sampling
for continuous parameters and Random sampling for discrete parameters (no
optimization or fuzzing). The SUT is the same CNN as in the previous case
study. It is trained on 1034 training samples, which are obtained by manually
driving behind a red lead car on the same lane of a 2-lane road with the same
maximum velocity (5.5 m/s) and no fog.
The results of this case study are presented in Table 4. Looking at the dis-
crete parameters, the number of lanes does not seem to contribute towards a risk
of collision. Surprisingly, though the training only involves a Red lead car, the
results appear to be the best for a Blue lead car. Moving on to the continuous
³ 0 denotes no fog and 1 denotes very dense fog (exponential squared scale).
⁴ The monitor additionally calculates the mean distance of the test vehicle to the lead car during the test, which is used for later analysis.
Fig. 5: Continuous test parameters of the Adaptive Cruise Control study plotted against each other: (a) initial offset (X-axis) vs. max. speed (Y-axis), (b) initial offset (X-axis) vs. fog density (Y-axis), (c) max. speed (X-axis) vs. fog density (Y-axis). The initial offset of the lead car ranges over 8 to 40 m, the lead car's maximum speed over 3 to 8 m/s, and the fog density over 0 to 1. Green dots, red crosses, and blue triangles denote passing tests, collisions, and inactivity, respectively.
Table 4: Parameterized test on Adaptive Cruise Control, separated for each value of the discrete parameters, and for low and high values of the continuous parameters. A test passes if there are no collisions and no inactivity (the overall distance moved by the test vehicle is more than 5 m). The average offset (in m) maintained by the test vehicle to the lead car (for passing tests) is also presented.

           |      Discrete parameters            |           Continuous parameters
           | Num. lanes | Lead car color         | Initial offset (m) | Speed (m/s)  | Fog density
           | 2    4     | Black Red  Yellow Blue | < 24   ≥ 24        | < 5.5  ≥ 5.5 | < 0.5  ≥ 0.5
Test iters | 54   46    | 24    22   27     27   | 51     49          | 52     48    | 51     49
Collisions | 7    7     | 3     3    6      2    | 6      8           | 8      6     | 12     0
Inactivity | 12   4     | 4     4    6      2    | 9      7           | 9      7     | 1      15
Offset (m) | 42.4 43.4  | 46.5  48.1 39.6   39.1 | 33.7   52.7        | 38.4   47.4  | 36.5   49.8
parameters, the fog density appears to have the most significant impact on test
failures (collision or vehicle inactivity). In the presence of dense fog, the SUT
behaves pessimistically and does not accelerate much (thereby causing a failure
due to inactivity). These are all interesting and useful metrics about the perfor-
mance of our SUT. Plots of the results projected on to continuous parameters
are presented in Figure 5.
RQ 2: Our test input generation strategies succeed in covering the parameter space well and, at the same time, in identifying problematic cases in all three case studies. The jaywalking pedestrian study demonstrates that optimization
and local search are possible on top of these strategies, and are quite effective
in finding the relevant scenarios. The adaptive cruise control study tests over 5
parameters, which is more than most related works, and even guarantees good
coverage of this parameter space. Therefore, it is amply clear that Paracosm’s
test input generation methods are useful and effective.
RQ 3: The road segmentation case study uses a well-performing neural network
for object segmentation, and we are able to detect degraded performance for
automatically generated test inputs. Whereas this study focuses on static image classification, the next two, i.e., the jaywalking pedestrian and the adaptive cruise control studies, uncover poor performance on simulated driving, using a
popular neural network architecture for self driving cars. Therefore, we can safely
conclude that Paracosm can find bugs in various different kinds of systems
related to autonomous driving.
5 Related Work
Structural coverage criteria for neural networks may not be an effective strategy for finding errors in classification [39]. We use coverage metrics on the test space, rather than the structure of the neural network. Alternately,
there are recent techniques to verify controllers implemented as neural networks
through constraint solving or abstract interpretation [24, 30, 34, 58, 71]. While
these tools do not focus on the problem of autonomous driving, their underlying
techniques can be combined in the test generation phase for Paracosm.
References
1. Abbas, H., O’Kelly, M., Rodionova, A., Mangharam, R.: Safe at any speed: A
simulation-based test harness for autonomous vehicles. In: 7th Workshop on De-
sign, Modeling and Evaluation of Cyber Physical Systems (CyPhy17) (October
2017)
2. Abdessalem, R.B., Nejati, S., Briand, L.C., Stifter, T.: Testing vision-
based control systems using learnable evolutionary algorithms. In: Pro-
ceedings of the 40th International Conference on Software Engineering. p.
1016–1026. ICSE ’18, Association for Computing Machinery, New York, NY,
USA (2018). https://doi.org/10.1145/3180155.3180160, https://doi.org/10.1145/
3180155.3180160
3. Alexander, R., Hawkins, H., Rae, A.: Situation coverage – a coverage criterion for testing autonomous robots (2015)
4. Alon, N., Spencer, J.H.: The Probabilistic Method. Wiley-Interscience series in
discrete mathematics and optimization, Wiley (2004)
5. American Fuzzy Lop: Technical "whitepaper" for afl-fuzz, http://lcamtuf.coredump.cx/afl/technical_details.txt, accessed: 2019-08-23
6. Annpureddy, Y., Liu, C., Fainekos, G.E., Sankaranarayanan, S.: S-TaLiRo: A tool
for temporal logic falsification for hybrid systems. In: TACAS 11. Lecture Notes
in Computer Science, vol. 6605, pp. 254–257. Springer (2011)
7. Association for Advancement of international Standardization of Automation and
Measuring Systems (ASAM): Opendrive (2018), http://www.opendrive.org/index.
html, accessed: 2019-08-21
8. Association for Advancement of international Standardization of Automation and
Measuring Systems (ASAM): Openscenario (2018), http://www.opendrive.org/
index.html, accessed: 2019-08-21
9. Beck, K.L.: Test Driven Development: By Example. Addison-Wesley Professional
(2002)
10. Ben Abdessalem, R., Nejati, S., Briand, L.C., Stifter, T.: Testing advanced driver
assistance systems using multi-objective search and neural networks. In: Proceed-
ings of the 31st IEEE/ACM International Conference on Automated Software En-
gineering (ASE). pp. 63–74 (2016)
11. Bhagoji, A.N., He, W., Li, B., Song, D.: Exploring the space of black-box attacks
on deep neural networks. CoRR abs/1712.09491 (2017), http://arxiv.org/abs/
1712.09491
12. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P.,
Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for
self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
13. Caspi, P., Pilaud, D., Halbwachs, N., Plaice, J.: Lustre: A declarative language
for programming synchronous systems. In: Conference Record of the Fourteenth
Annual ACM Symposium on Principles of Programming Languages, Munich, Ger-
many, January 21-23, 1987. pp. 178–188 (1987)
14. Colbourn, C.J.: Combinatorial aspects of covering arrays. Le Matematiche 59(1,2),
125–172 (2004), https://lematematiche.dmi.unict.it/index.php/lematematiche/
article/view/166
15. comma.ai: openpilot: open source driving agent (2016), https://github.com/
commaai/openpilot, accessed: 2018-11-13
16. NVIDIA Corporation: PhysX (2008), https://developer.nvidia.com/gameworks-physx-overview, accessed: 2018-11-13
Paracosm: A Test Framework for Autonomous Driving Simulations 191
17. Deshmukh, J., Jin, X., Kapinski, J., Maler, O.: Stochastic local search for falsifi-
cation of hybrid systems. In: ATVA. pp. 500–517. Springer (2015)
18. Deshmukh, J.V., Donzé, A., Ghosh, S., Jin, X., Juniwal, G., Seshia, S.A.: Ro-
bust online monitoring of signal temporal logic. Formal Methods in System Design
51(1), 5–30 (2017). https://doi.org/10.1007/s10703-017-0286-7, https://doi.org/
10.1007/s10703-017-0286-7
19. Deshmukh, J.V., Horvat, M., Jin, X., Majumdar, R., Prabhu, V.S.: Testing cyber-
physical systems through bayesian optimization. ACM Trans. Embedded Com-
put. Syst. 16(5), 170:1–170:18 (2017). https://doi.org/10.1145/3126521, https:
//doi.org/10.1145/3126521
20. Donzé, A.: Breach, A Toolbox for Verification and Parameter Synthesis of Hybrid
Systems, pp. 167–170. Springer (2010)
21. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open
urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot
Learning. pp. 1–16 (2017)
22. Dreossi, T., Donzé, A., Seshia, S.A.: Compositional falsification of cyber-physical
systems with machine learning components. In: NASA Formal Methods - 9th Inter-
national Symposium, NFM 2017. Lecture Notes in Computer Science, vol. 10227,
pp. 357–372. Springer (2017)
23. Dreossi, T., Jha, S., Seshia, S.A.: Semantic adversarial deep learning. In: Computer Aided Verification (CAV 2018). Lecture Notes in Computer Science, vol. 10981, pp. 3–26. Springer (2018). https://doi.org/10.1007/978-3-319-96145-3_1
24. Dutta, S., Chen, X., Jha, S., Sankaranarayanan, S., Tiwari, A.: Sherlock - A tool for
verification of neural network feedback systems: demo abstract. In: Ozay, N., Prab-
hakar, P. (eds.) Proceedings of the 22nd ACM International Conference on Hybrid
Systems: Computation and Control, HSCC 2019, Montreal, QC, Canada, April
16-18, 2019. pp. 262–263. ACM (2019). https://doi.org/10.1145/3302504.3313351,
https://doi.org/10.1145/3302504.3313351
25. Fainekos, G.: Automotive control design bug-finding with the S-TaLiRo tool. In:
ACC 2015. p. 4096 (2015)
26. Foundation, O.S.R.: Vehicle simulation in gazebo, http://gazebosim.org/blog/
vehicle%20simulation, accessed: 2019-08-23
27. Fremont, D.J., Dreossi, T., Ghosh, S., Yue, X., Sangiovanni-Vincentelli, A.L., Se-
shia, S.A.: Scenic: A language for scenario specification and scene generation.
In: Proceedings of the 40th ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation. pp. 63–78. PLDI 2019, ACM, New York,
NY, USA (2019). https://doi.org/10.1145/3314221.3314633, http://doi.acm.org/
10.1145/3314221.3314633
28. Gambi, A., Huynh, T., Fraser, G.: Generating effective test cases for self-
driving cars from police reports. In: Dumas, M., Pfahl, D., Apel, S., Russo, A.
(eds.) Proceedings of the ACM Joint Meeting on European Software Engineer-
ing Conference and Symposium on the Foundations of Software Engineering,
ESEC/SIGSOFT FSE 2019, Tallinn, Estonia, August 26-30, 2019. pp. 257–267.
ACM (2019). https://doi.org/10.1145/3338906.3338942, https://doi.org/10.1145/
3338906.3338942
29. Gambi, A., Müller, M., Fraser, G.: Automatically testing self-driving cars with
search-based procedural content generation. In: Zhang, D., Møller, A. (eds.) Pro-
ceedings of the 28th ACM SIGSOFT International Symposium on Software Test-
ing and Analysis, ISSTA 2019, Beijing, China, July 15-19, 2019. pp. 318–328.
ACM (2019). https://doi.org/10.1145/3293882.3330566, https://doi.org/10.1145/
3293882.3330566
192 R. Majumdar et al.
30. Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev,
M.T.: AI2: safety and robustness certification of neural networks with abstract
interpretation. In: 2018 IEEE Symposium on Security and Privacy, S&P 2018. pp.
3–18. IEEE (2018)
31. Gladisch, C., Heinz, T., Heinzemann, C., Oehlerking, J., von Vietinghoff, A.,
Pfitzer, T.: Experience paper: Search-based testing in automated driving control
applications. In: Proceedings of the 34th IEEE/ACM International Conference on
Automated Software Engineering (ASE). pp. 26–37 (2019)
32. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial
examples. CoRR abs/1412.6572 (2014), http://arxiv.org/abs/1412.6572
33. Ho, H., Ouaknine, J., Worrell, J.: Online monitoring of metric temporal logic. In:
Runtime Verification RV 2014. Lecture Notes in Computer Science, vol. 8734, pp.
178–192. Springer (2014)
34. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural
networks. In: Majumdar, R., Kuncak, V. (eds.) Computer Aided Verification -
29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017,
Proceedings, Part I. Lecture Notes in Computer Science, vol. 10426, pp. 3–29.
Springer (2017). https://doi.org/10.1007/978-3-319-63387-9_1
35. Hudak, P., Courtney, A., Nilsson, H., Peterson, J.: Arrows, robots, and
functional reactive programming. In: Advanced Functional Programming, 4th
International School, AFP 2002, Oxford, UK, August 19-24, 2002, Re-
vised Lectures. Lecture Notes in Computer Science, vol. 2638, pp. 159–187.
Springer (2002). https://doi.org/10.1007/978-3-540-44833-4_6
36. Kawai, Y.: Unirx: Reactive extensions for unity (2014), https://github.com/
neuecc/UniRx, accessed: 2018-11-13
37. Kim, J., Feldt, R., Yoo, S.: Guiding deep learning system testing using sur-
prise adequacy. In: Proceedings of the 41st International Conference on Soft-
ware Engineering. pp. 1039–1049. ICSE ’19, IEEE Press, Piscataway, NJ, USA
(2019). https://doi.org/10.1109/ICSE.2019.00108, https://doi.org/10.1109/ICSE.
2019.00108
38. Kuhn, D.R., Kacker, R.N., Lei, Y.: Combinatorial testing. In: Laplante, P.A. (ed.)
Encyclopedia of Software Engineering, pp. 1–12. CRC Press (Nov 2010)
39. Li, Z., Ma, X., Xu, C., Cao, C.: Structural coverage criteria for neural networks
could be misleading. In: Sarma, A., Murta, L. (eds.) Proceedings of the 41st Inter-
national Conference on Software Engineering: New Ideas and Emerging Results,
ICSE (NIER) 2019, Montreal, QC, Canada, May 29-31, 2019. pp. 89–92. IEEE /
ACM (2019), https://dl.acm.org/citation.cfm?id=3339171
40. Liberty, J., Betts, P.: Programming Reactive Extensions and LINQ. Apress (2011)
41. Ma, L., Juefei-Xu, F., Zhang, F., Sun, J., Xue, M., Li, B., Chen, C., Su, T., Li, L.,
Liu, Y., Zhao, J., Wang, Y.: Deepgauge: Multi-granularity testing criteria for deep
learning systems. In: Proceedings of the 33rd ACM/IEEE International Conference
on Automated Software Engineering. pp. 120–131. ASE 2018, ACM, New York,
NY, USA (2018). https://doi.org/10.1145/3238147.3238202, http://doi.acm.org/
10.1145/3238147.3238202
42. Mackinnon, T., Freeman, S., Craig, P.: Endo-testing: Unit testing with mock ob-
jects. In: eXtreme Programming and Flexible Processes in Software Engineering -
XP2000 (2000)
Paracosm: A Test Framework for Autonomous Driving Simulations 193
43. Majumdar, R., Mathur, A.S., Pirron, M., Stegner, L., Zufferey, D.: Para-
cosm: A language and tool for testing autonomous driving systems. CoRR
abs/1902.01084 (2019), http://arxiv.org/abs/1902.01084
44. Majumdar, R., Niksic, F.: Why is random testing effective for partition tolerance
bugs? PACMPL 2(POPL), 46:1–46:24 (2018)
45. Majzik, I., Semeráth, O., Hajdu, C., Marussy, K., Szatmári, Z., Micskei, Z., Vörös,
A., Babikian, A.A., Varró, D.: Towards system-level testing with coverage guar-
antees for autonomous vehicles. In: 2019 ACM/IEEE 22nd International Confer-
ence on Model Driven Engineering Languages and Systems (MODELS). pp. 89–94
(2019). https://doi.org/10.1109/MODELS.2019.00-12
46. Mirman, M., Gehr, T., Vechev, M.: Differentiable abstract interpretation
for provably robust neural networks. In: International Conference on Ma-
chine Learning (ICML) (2018), https://www.icml.cc/Conferences/2018/Schedule?
showEvent=2477
47. Mockito: Tasty mocking framework for unit tests in java, http://site.mockito.org,
accessed: 2019-08-23
48. National Transportation Safety Board: Collision between vehicle controlled by
developmental automated driving system and pedestrian, tempe, arizona, march
18, 2018. Highway Accident Report NTSB/HAR-19/03, National Transportation
Safety Board (November 2019)
49. Niederreiter, H.: Random number generation and quasi-Monte Carlo methods.
SIAM (1992)
50. O’Kelly, M., Sinha, A., Namkoong, H., Tedrake, R., Duchi, J.C.: Scalable end-
to-end autonomous vehicle testing via rare-event simulation. Advances in Neural
Information Processing Systems 31, 9827–9838 (2018)
51. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Prac-
tical black-box attacks against machine learning. In: Proceedings of the 2017 ACM
on Asia Conference on Computer and Communications Security - ASIA CCS 17.
ACM (2017). https://doi.org/10.1145/3052973.3053009, https://doi.org/10.1145/
3052973.3053009
52. Pei, K., Cao, Y., Yang, J., Jana, S.: Deepxplore: Automated whitebox test-
ing of deep learning systems. In: Proceedings of the 26th Symposium on Op-
erating Systems Principles, Shanghai, China, October 28-31, 2017. pp. 1–18.
ACM (2017). https://doi.org/10.1145/3132747.3132785, https://doi.org/10.1145/
3132747.3132785
53. Pomerleau, D.: ALVINN: An autonomous land vehicle in a neural network. In:
NIPS 88: Neural Information Processing Systems (1988)
54. Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R.,
Ng, A.: Ros: an open-source robot operating system. In: ICRA workshop on open
source software (2009)
55. Rawat, S., Jain, V., Kumar, A.J.S., Cojocar, L., Giuffrida, C., Bos, H.: Vuzzer:
Application-aware evolutionary fuzzing. In: NDSS (2017)
56. ReactiveX: Reactivex, http://reactivex.io/, accessed: 2019-08-23
57. Rote, G., Tichy, R.: Quasi-Monte-Carlo methods and the dispersion of point se-
quences. Mathematical and Computer Modelling 23, 9–23 (1996)
58. Ruan, W., Huang, X., Kwiatkowska, M.: Reachability analysis of deep neu-
ral networks with provable guarantees. In: Lang, J. (ed.) Proceedings of the
Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI
2018, July 13-19, 2018, Stockholm, Sweden. pp. 2651–2659. ijcai.org (2018).
https://doi.org/10.24963/ijcai.2018/368, https://doi.org/10.24963/ijcai.2018/368
194 R. Majumdar et al.
59. Samimi, H., Hicks, R., Fogel, A., Millstein, T.: Declarative mocking. In: ISSTA
2013. pp. 246–256. ACM (2013)
60. Sankaranarayanan, S., Fainekos, G.: Falsification of temporal properties of hybrid
systems using the cross-entropy method. In: HSCC 12. pp. 125–134. ACM (2012)
61. Shah, S., Dey, D., Lovett, C., Kapoor, A.: Airsim: High-fidelity visual and physical
simulation for autonomous vehicles. In: Field and Service Robotics (2017), https:
//arxiv.org/abs/1705.05065
62. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2014)
63. Stewart, L., Musa, M., Croce, N.: Look no hands: self-driving vehi-
cles’ public trust problem (2019), https://www.weforum.org/agenda/2019/08/
self-driving-vehicles-public-trust/, accessed: 2021-01-18
64. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J.,
Fergus, R.: Intriguing properties of neural networks. CoRR abs/1312.6199 (2013)
65. Tian, Y., Pei, K., Jana, S., Ray, B.: Deeptest: Automated testing of deep-neural-
network-driven autonomous cars. In: Proceedings of the 40th International Con-
ference on Software Engineering. pp. 303–314. ACM (2018)
66. Tuncali, C.E., Fainekos, G., Prokhorov, D., Ito, H., Kapinski, J.: Requirements-
driven test generation for autonomous vehicles with machine learning components.
arXiv preprint arXiv:1908.01094 (2019)
67. Tuncali, C.E., Fainekos, G.E., Ito, H., Kapinski, J.: Sim-atav: Simulation-based
adversarial testing framework for autonomous vehicles. In: Proceedings of the
21st International Conference on Hybrid Systems: Computation and Control
(part of CPS Week), HSCC 2018, Porto, Portugal, April 11-13, 2018. pp. 283–
284. ACM (2018). https://doi.org/10.1145/3178126.3187004, http://doi.acm.org/
10.1145/3178126.3187004
68. Udacity: Self-driving car simulator, https://github.com/udacity/
self-driving-car-sim, accessed: 2019-08-23
69. Unity3D: Unity game engine, https://unity3d.com/, accessed: 2019-08-23
70. Wan, Z., Hudak, P.: Functional reactive programming from first principles. In:
Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI), Vancouver, British Columbia, Canada, June
18-21, 2000. pp. 242–252. ACM (2000). https://doi.org/10.1145/349299.349331,
https://doi.org/10.1145/349299.349331
71. Wicker, M., Huang, X., Kwiatkowska, M.: Feature-guided black-box safety test-
ing of deep neural networks. In: Beyer, D., Huisman, M. (eds.) Tools and Algo-
rithms for the Construction and Analysis of Systems - 24th International Con-
ference, TACAS 2018, Held as Part of the European Joint Conferences on The-
ory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20,
2018, Proceedings, Part I. Lecture Notes in Computer Science, vol. 10805, pp.
408–426. Springer (2018). https://doi.org/10.1007/978-3-319-89960-2_22
72. Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., Sumner, A.:
TORCS, The Open Racing Car Simulator. http://www.torcs.org (2014)
Compositional Analysis of Probabilistic Timed
Graph Transformation Systems
1 Introduction
Fig. 1: Occurrence of a single FT with border and core in an LST (left) and five occurrences of three FTs in an LST overlapping in their borders (right).
The model checking approach is then applied to the smaller PTGTSs constructed for the FTs, employing its reduction to probabilistic timed automata (PTA), instead of applying it directly to the PTGTS modeling the large-scale system.
To illustrate our approach, we consider a running example in which we
model shuttles driving on tracks of an LST and for which we verify that shut-
tles never collide and are unlikely to execute emergency brakes. In our evalu-
ation, we apply an implementation of our approach to the running example.
The idea to decompose a system into subsystems, or to compose it from subsystems, for the purpose of analysis has been studied intensively [25], but our suggested compositional approach has distinguishing characteristics. Firstly, the vast majority of approaches (such as process algebras or similar models) assume that the modeling formalism supports composition/decomposition as a first-class concept, so that compositional analysis techniques are directly applicable, since the subsystem models cover all possible behaviors in all contexts. In contrast, we do not rely on a built-in decomposition operator but rather allow for a flexible derivation of an LST decomposition in terms of FTs, overlappings, and a suitable overapproximation on the border, none of which are predefined by the modeling formalism.
Secondly, several approaches rely on a protocol-like specification of how
the decomposed subsystems interact, while in our approach the overapprox-
imation is derived systematically from the PTGTS model that does not nec-
essarily provide such a protocol-like specification already. The compositional
analysis approach for graph transformation systems (GTSs) from [24, 11] de-
fines explicit interfaces, which are used to consider whether the behavior of
two independent graphs glued via these interfaces (requiring that local tran-
sitions are compatible) cover jointly all global transitions. Moreover, in further
approaches, protocols for the roles of collaborations and ports of components
have been assumed. For example, in [14], the idea to overapproximate the
environment and border is explored for timed automata with explicit mod-
els of the roles in form of protocol automata. This idea has been combined
with dynamic collaborations in [12, 13] captured by timed GTSs (TGTSs) and
their analysis via inductive invariant checking [3, 4]. Later on, this approach
has been extended to role, component, and collaboration behavior, which is
captured by TGTSs and hybrid GTSs in [5] and [2], respectively. However, as
opposed to the presented approach, in all these cases an explicit concept of
interface is assumed to separate parts that are analyzed in isolation.
This paper is structured as follows. In section 2, we introduce our running example from the domain of cyber-physical systems. In section 3, we recapitulate the necessary preliminaries related to PTA and PTGTSs, also presenting the modeling of our running example. In section 4, we discuss the decomposition of static substructures of large-scale systems. In section 5, we present our decomposition-based approach, which allows us to split the model checking problem into more manageable parts. In section 6, we present an evaluation of the conceptual results for our running example. Finally, in section 7, we close the paper with a conclusion and an outlook on planned future work.
2 Running Example
We now informally introduce a scenario (based on the RailCab project [23]) of
autonomous shuttles driving on an LST, which serves as a running example in
the remainder of this paper. Based on this introduction, we will discuss how
we model this shuttle scenario as a PTGTS in the next section.
In the considered shuttle scenario, a track topology containing a large number of tracks of approximately equal length is given. Tracks are connected to adjacent tracks via directed connections, building track sequences in this manner. Two track sequences can be joined together (i.e., can end up in a common track with two predecessors), leading to a join fragment topology (see FT8 in Figure 4a), or can split up from a common track (i.e., a common track then has two successor tracks), leading to a fork fragment topology (see FT7 in Figure 4a). Moreover, depots may have a directed connection to a track
allowing shuttles to enter or exit the track topology. Shuttles, which are al-
ways located on a single track, may be in mode DRIVE, STOP, or BRAKE.
Being in mode DRIVE, shuttles drive to the next track (respecting the direc-
tion of the connection between the tracks) with a certain velocity, which may
be slow ([3, 4] time units per track) or fast ([2, 3] time units per track). Regu-
larly, shuttles change into mode STOP, which allows them to avoid coming too
close to other shuttles. Moreover, shuttles should slow down before entering
a track with a construction site on it. However, shuttles noticing the construc-
tion site too late have to execute an emergency brake thereby changing into the
mode BRAKE. To reduce the likelihood of such emergency brakes, yellow traf-
fic lights are installed a few tracks ahead of such construction sites to indicate
to shuttles that they should slow down. After construction sites, green traffic
lights may be installed permitting shuttles to increase their velocity. However,
we also consider failures on demand, where a traffic light that is passed by a shuttle is not recognized or, for some other reason, not appropriately taken into account by the shuttle. We assume a failure probability of 10−6 for this case, since such a failure requires not only the visual observation by the train driver to fail but also the backup system.
In our running example, static elements are the tracks, depots, installed
traffic lights, and construction sites as well as connections between these el-
ements. The PTGTS modeling the behavior of the described scenario never
changes this underlying LST. Complementarily, the dynamic elements are shuttles, their attributes, their connections to tracks of the LST, as well as the attributes of traffic lights. Note that we later use a grammar to generate admissible LSTs.
For the considered shuttle scenario, we are interested in various properties. Firstly, we need to verify that the behavior of the system never gets stuck in a state where no steps (discrete steps of, e.g., driving shuttles, or timed steps) are enabled. Secondly, we need to verify whether the rules have been constructed in a way ensuring the absence of collisions between shuttles (i.e., two shuttles should never be on a common track). Thirdly, emergency brakes should be improbable at the local level for a single shuttle but also at the global level for the entire LST and its potentially large number of shuttles.
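In terms of the probabilistic reachability queries analyzed later (a sketch in our own notation; the atomic propositions APcollision and APbraked are introduced in section 3), the latter two properties amount to

$$\mathrm{Pr}^{\max}\big(\Diamond\, AP_{\mathit{collision}}\big) = 0 \qquad \text{and} \qquad \mathrm{Pr}^{\max}\big(\Diamond\, AP_{\mathit{braked}}\big) \le \varepsilon$$

for a small bound ε.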
(f) The rule SetSlow: a shuttle may successfully decrease its velocity by setting its time
per track to [3, 4] (where only the lower end of the interval is stored in the graph)
with probability 1 − 10−6 or may fail to decrease its velocity with probability 10−6 .
Setting the active attribute to ⊥ ensures that the rule cannot be applied twice.
(g) The rule ConstructionSiteBrake: a shuttle with high velocity ([2, 3] time units per
track where only the lower end of the interval is stored in the graph) needs to execute
an emergency brake to ensure that the track with a construction site on it is not
entered at too high a velocity.
Fig. 2: Details for our running example, DPO diagram, and PTA example.
(a) The rule Drive: a shuttle may drive to the next track, where the application condition is used to rule out situations in which the next track has a construction site on it or in which the considered shuttle would come too close to another shuttle.
(b) The rule DriveEnterFast: adaptation of the rule Drive for the case that a new shuttle enters the current fragment topology with a high velocity from a context track belonging to another fragment topology (the similar rule for a shuttle with a low velocity is omitted here for brevity).
(c) The rule DriveExit1: adaptation of the rule Drive for the case that a shuttle drives
onto the last track of the current fragment topology.
(d) The rule DriveExit2: adaptation of the rule Drive for the case that a shuttle exits the
current fragment topology towards a track belonging to another fragment topology.
Fig. 3: The rule Drive and the three adapted rules DriveEnterFast, DriveExit1, and DriveExit2 for fragment topologies, where parts of the application condition of the rule Drive are omitted due to the overlapping specification of the running example.
3 Preliminaries
We now briefly introduce the subsequently required details for graph trans-
formation systems (GTSs) [10], probabilistic timed automata (PTA) [17], and
probabilistic timed graph transformation systems (PTGTSs) [18, 19] in our no-
tation. Along this presentation, we also discuss the modeling details for our
running example from the previous section.
We employ type graphs (cf. [10]) such as the type graph TG from Figure 2a
for our running example. A type graph describes the set of all admissible
(typed attributed) graphs by mentioning the allowed types of nodes, edges,
and attributes. We assume typed attributed graphs in which attributes are
specified using a many sorted first-order attribute logic as proposed in [21]
(the attribute constraint ⊥ (false) in TG means that the type graph does not
restrict attribute values). This approach to attribution has been used to capture
constraints on attributes in graph conditions in [27] and to describe attribute
modifications in [22, 28].
Graph transformation is then performed by applying a graph transformation rule (short: rule) ρ = (ℓ : K → L, r : K → R) consisting of two monomorphisms (i.e., all components of the morphisms are injective). The rule specifies that the graph elements in L − ℓ(K) are to be deleted, the graph elements in K are to be preserved, and the graph elements in R − r(K) are to be added during graph transformation. Such a rule is applied to a graph G for a given match m : L → G, resulting in a graph G′′ by constructing the double pushout (DPO) diagram (see Figure 2c), where the first and the second pushout squares describe the removal and the addition of graph elements specified in the rule, respectively. Moreover, a rule may additionally contain an application condition φ (denoted by ρ = (ℓ, r, φ)) to rule out certain matches, specifying e.g. graph elements that may not be connected to graph elements matched by m.
For further details on the graph transformation approach, we refer to [10].
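For orientation, the DPO diagram of such a rule application (cf. Figure 2c) can be sketched as follows; this is a reconstruction of the standard construction in this paper's primed notation, not a verbatim copy of the figure:

$$\begin{array}{ccccc}
L & \xleftarrow{\;\ell\;} & K & \xrightarrow{\;r\;} & R\\
{\scriptstyle m}\downarrow & & \downarrow & & \downarrow\\
G & \xleftarrow{\;\hat{\ell}\;} & G' & \xrightarrow{\;\hat{r}\;} & G''
\end{array}$$

Both squares are pushouts: the left square removes the elements of L − ℓ(K) from G, and the right square adds the elements of R − r(K), yielding the result graph G′′.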
PTA [17] combine the use of clocks to capture real-time phenomena and
probabilism to approximate/describe the likelihood of outcomes of certain
steps. A PTA such as the one in Figure 2d consists of (a) a set of locations with a distinguished initial location such as ℓ0 , (b) a set of clocks such as c0 (which are
initially set to 0), (c) an assignment of a set of atomic propositions (APs) such as
{done} to each location (for subsequent analysis of e.g. reachability properties),
(d) an assignment of constraints on its clocks to each location as invariants such
as c0 ≤ 3, and (e) a set of probabilistic timed edges each consisting of (e1) a
single source location, (e2) at least one target location, (e3) a clock constraint
such as c0 ≥ 2 specifying as a guard when the edge is enabled based on the
current values of the clocks, (e4) for each target location a probability such
as 0.5 that this target is reached (the sum of all the probabilities for the target
locations of the edge must add up to 1 as a probability distribution is required),
and (e5) for each target location a set of clocks such as {c0 } to be reset to 0
when that target location is reached.
States of a PTA are given by pairs (ℓ, v) where ℓ is a location and v is the valuation mapping each clock of the PTA to a real number. Nondeterminism arises in PTA since a step for advancing time as well as multiple steps applying edges may be enabled in a single state. The logic PTCTL [17] then allows specifying properties such as “what is the worst-case probability that the PTA reaches a location labeled with the AP done within 5 time units”, which can be analyzed by the PRISM model checker [16]. For the example PTA from Figure 2d, the given condition is satisfied with probability 0.75 since the nondeterminism of the PTA would be resolved (by a so-called adversary) such that the PTA first takes a step to ℓ1 without letting time pass and then performs the probabilistic step (up to two times, after waiting for not longer than 2 time units each) until it reaches the location ℓ2 labeled with the AP done (the probabilistic step cannot be taken a third time due to the requirement of at most 5 time units in the quoted property above).
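As a quick sanity check (a worked derivation from the probabilities named above): the adversary gets at most two attempts at the probabilistic step within the 5-time-unit budget, each reaching ℓ2 with probability 0.5, hence

$$0.5 + 0.5 \cdot 0.5 = 0.75.$$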
PTGTSs have been introduced in [18, 19] as a probabilistic real-time ex-
tension of GTSs. It has been shown that PTGTSs can be translated to PTA
and, hence, PTGTSs can be understood as a high-level language for PTA as
discussed below in more detail and can be analyzed using PRISM as well.
Similarly to PTA, a PTGTS state is given by a pair ( G, v) of a graph and a
clock valuation. The initial state is given by a distinguished initial graph and a
valuation setting all clocks to 0. In our running example, each attribute of type
clockDrive of a Track node (cf. Figure 2a) represents one clock. Invariants and
APs are specified for PTGTSs by means of graph conditions as in Figure 2b and
Figure 2e, respectively, for our running example. We use the single invariant
INVdriving requiring that shuttles in mode DRIVE cannot be on a track longer
than the value of their minDur (minimal duration) attribute plus 1. Moreover,
we consider three APs to specify properties that we want to analyze later on.
The AP APunexpectedVelocity is used to detect graphs in which a shuttle does not
have an expected velocity of [2, 3] or [3, 4] time units per track where only the
lower end of the interval is stored in the graph in the minDur attribute. The
AP APcollision is used to detect graphs in which two shuttles are on a common
track to capture their collision. Finally, the AP APbraked is used to detect graphs
in which a shuttle has just executed an emergency brake.
PTGT rules of a PTGTS then correspond to edges of a PTA and contain
(a) a left-hand side graph L, (b) an attribute constraint on the clock attributes
contained in L to capture a guard, (c) a natural number describing a priority
where higher numbers denote higher priorities, and (d) a nonempty set of tu-
ples of the form (ℓ : K → L, r : K → R, φ, C, p) where (ℓ, r, φ) is an underlying GT rule with application condition φ1 , C is a set of clock attributes contained in L to be reset, and p is a real-valued probability from [0, 1] where the probabilities of all such tuples must add up to 1. See Figure 2f, Figure 2g, and Figure 3a for three PTGT rules SetSlow, ConstructionSiteBrake, and Drive from our running example, where the last two PTGT rules have a unique underlying GT rule with probability 1 and where the first PTGT rule has a higher priority as well as two underlying GT rules with probabilities 10−6 and 1 − 10−6 . For the PTGT rules ConstructionSiteBrake and Drive, we depict the graphs L, K, and R in an integrated form where added elements are annotated with ⊕.
1 The underlying GT rule may not delete or add clock attributes.
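Schematically, using the tuple notation just introduced (a sketch only; the concrete graphs, guard, and clock resets are those depicted in Figure 2f), the PTGT rule SetSlow has the shape

$$\mathit{SetSlow} = \big(L,\ \mathit{guard},\ \mathit{prio},\ \{(\ell_1, r_1, \varphi_1, C_1,\, 1-10^{-6}),\ (\ell_2, r_2, \varphi_2, C_2,\, 10^{-6})\}\big),$$

where the first tuple captures the successful decrease of the velocity and the second the failure on demand.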
[FT4: track sequence with one yellow traffic light (Y) ahead of a construction site (CS); FT5: track sequence with two yellow traffic lights ahead of a construction site (diagrams omitted)]
(a) FTs for our running example where the red arrows indicate points for topology
(de)composition.
(b) The rule Merge, which overlaps two FT instances following the overlapping specification (diagram omitted).
(c) Example for a topology composition via matches m1 , m2 , m3 (diagram omitted).
(d) Correspondence of the graph transformation based steps between the large-scale system S0 and one of its fragment systems Si , which preserve the respective static structure given by G and Fi .
Fig. 4: FTs for our running example, rule Merge, example for topology com-
position, and correspondence between steps in the large-scale system and a
fragment system.
To provide a better intuition for this definition, we now present the decompo-
sition of the LST considered for our running example.
In general, we consider two use cases: (a) a given PTGTS with underlying LST is to be analyzed and (b) LSTs are to be constructed based on the selected and analyzed FTs. Both use cases are supported but require a different handling. For use case (a), the LST must be parsed w.r.t. the given FTs and the overlapping specification to obtain a decomposition of the LST. Efficient parsing algorithms have been devised for the special case of hyperedge replacement (HR) grammars (which require that nodes are not deleted) in [8, 6, 7]. A suitable graph transformation based grammar for our running example with 25 rules is given in [20, Appendix]. For use case (b), in which we need to construct some LST, we may employ node-deleting rules. For our running example, consider the rule Merge from Figure 4b that can be used to iteratively overlap two FTs starting with a disjoint union of copies of FTs. The rule Merge overlaps two instances of three successive Track nodes following the overlapping specification, where the application condition ensures that the rule is applied at entry and exit points, also excluding (via ¬φFTi ) the possibility that the six matched Track nodes belong to an instance of FTi .
2 For each FT from Figure 4a, this constraint can be formalized as a graph condition.
5 Overapproximation of Behavior
The decompositions of LSTs introduced in the previous section are now used
as a foundation to establish a behavioral relationship between a given PTGTS S0 and n PTGTSs Si that operate on the instances of FTs that are embedded into the LST of S0 according to the given LST decomposition.
For this purpose, we extend the structural embeddings given by the α
monomorphisms from FTs to the LST in Definition 1 to embeddings of the
entire graph (including the static but also the dynamic parts) of a state of
some Si called fragment topology state (FTS) into the entire graph of a state of
S0 called large-scale state (LSS). Consider the left middle square in Figure 4d
where the embedding αi together with the FT and LST embeddings κi and κ
is complemented with an embedding ei of the FTS Fi into the LSS G. Note
that ei must be an extension of αi in the sense that the square commutes (i.e.,
κ ◦ αi = ei ◦ κi is required). Also, ei ◦ κi must satisfy the constraint φi of the FT
used for Si .
To simplify our presentation, we assume that the PTGTS S0 (as in our running example) only employs APs of the form ∃( f : ∅ → P, true), invariants of the form ¬∃( f : ∅ → P, true), and application conditions in PTGT rules that are conjunctions of graph conditions of the form ¬∃( f : ∅ → P, true) for some graph P. This restriction simplifies the identification of parts of FTSs and LSSs that are considered for an evaluation of such graph conditions.
As a next step, we present a decomposition relation, which establishes a
relationship between S0 and the PTGTSs Si in terms of embedding monomor-
phisms κ, αi , ei , and κi for all reachable states of S0 . Moreover, the decom-
position relation requires that (a) the timed and discrete steps of S0 can be
mimicked by each affected Si and (b) that discrete steps performed by some
PTGTS Si in isolation on a part of the LST where the FT Fi does not overlap
with the FT Fj of another PTGTS Sj with i ≠ j can be mimicked by S0 . That is,
the decomposition relation is a simulation for the steps performed by S0 and a
bisimulation on those steps that are performed in isolation by a single PTGTS
Si . Also, to allow deriving results for S0 from a model checking based analysis
of the PTGTSs Si , we require a set of APs A that is part of the APs of S0 and
of each Si . Based on this set A, the decomposition relation also requires that
only those FTSs and LSSs are related that satisfy the same sets of APs in A. For
our running example, this set will contain all three APs of S0 (see Figure 2e).
Finally, we require that the initial states of S0 and the n PTGTSs Si are covered
by the decomposition relation.
Definition 2 (Decomposition Relation). Given
– (PTGTS for large-scale system) S0 is a PTGTS with initial LSS s0 = (G0 , v0 ) where the LST is identified via a monomorphism κ0 into G0 (and preserved by all steps of the PTGTS),
– (PTGTSs for FTs) for each 1 ≤ i ≤ n: Si is a PTGTS with initial FTS s0,i = (F0,i , v0,i ) where the underlying FT is identified via a monomorphism κi into F0,i (and preserved by all steps of the PTGTS),
• Since the step of Si preserves the FT, there are unique κi′ and the required κi′′ (identifying the FT in Fi′ and Fi′′, respectively) such that ℓ̂i ∘ κi′ = κi and κi′′ = r̂i ∘ κi′.
• The step of Si must allow for ei′ : Fi′ → G′ and ei′′ : Fi′′ → G′′ such that ℓ̂ ∘ ei′ = ei ∘ ℓ̂i and r̂ ∘ ei′ = ei′′ ∘ r̂i .
7. (simulation of structural steps of Si on its core by S0 ) if
– ((G, v), κ, w) ∈ S where κ identifies the LST in G,
– ((Fi , vi ), Fi , φi , αi , κi , ei ) ∈ w,
– Si performs the structural step from (Fi , vi ) to (Fi′′, vi′′) using an underlying GT rule ρi = (ℓi : Ki → Li , ri : Ki → Ri , φac,i ) given in Figure 4d where, since the step of Si preserves the FT, there are unique κi′ and κi′′ (identifying the FT in Fi′ and Fi′′) such that ℓ̂i ∘ κi′ = κi and κi′′ = r̂i ∘ κi′,
– ei (mi (Li )) does not overlap with any ej (Fj ) for i ≠ j, then
– there is some ((G′′, v′′), κ′′, w′′) ∈ S for some G′′, v′′, κ′′ (identifying the LST in G′′), and w′′ as follows.
• There must be a step of S0 as given in Figure 4d from G to G′′ for some underlying rule ρ = (ℓ : K → L, r : K → R, φac ) with the same probability and priority as ρi .
• Since the step of S0 preserves the LST, there are unique κ′ and the required κ′′ (identifying the LST in G′ and G′′) such that ℓ̂ ∘ κ′ = κ and κ′′ = r̂ ∘ κ′.
• The step of S0 must allow for ei′ : Fi′ → G′ and ei′′ : Fi′′ → G′′ such that ℓ̂ ∘ ei′ = ei ∘ ℓ̂i and r̂ ∘ ei′ = ei′′ ∘ r̂i .
• Finally, w′′ is obtained from w by only adapting the above chosen tuple ((Fi , vi ), Fi , φi , αi , κi , ei ) into the tuple ((Fi′′, vi′′), Fi , φi , αi , κi′′, ei′′).
We now state that decomposition relations allow for the simulation of each
path of the PTGTS S0 by the PTGTSs Si .
Lemma 1 (Existence of Simulating Paths). If S is a decomposition relation be-
tween S0 and (S1 , . . . , Sn ), and π is a path of length m in S0 from the initial state to
a state sm , then, for each 1 ≤ i ≤ n, there is a path πi of Si (of length k i ≤ m) ending
in a state si,ki such that (sm , κ, w) ∈ S for some κ and w where the ith element of w is
of the form (si,ki , Fi , φi , αi , κi , ei ). Moreover, the probability of each such path πi is at
least as high as the probability of the path π. See [20] for the proof.
We now state that a PTGTS satisfies a safety property given by an AP, when
safety w.r.t. this AP can be established for each Si .
Theorem 1 (Safety Verification). If S is a decomposition relation between S0 and (S1 , . . . , Sn ) w.r.t. A and ap ∈ A, then S0 is safe w.r.t. the occurrence of an ap-labeled graph when (for each 1 ≤ i ≤ n) Si is safe w.r.t. the occurrence of an ap-labeled graph. Moreover, the probability of an occurrence of an ap-labeled graph from some state s in S0 is at most the probability of an occurrence of an ap-labeled graph from some S-related state si in Si . See [20] for the proof.
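In symbols (our shorthand, not the formal statement from [20]; ◇ ap denotes the occurrence of an ap-labeled graph):

$$\mathrm{Pr}^{\max}_{S_0}\big(\Diamond\, ap \mid s\big) \;\le\; \mathrm{Pr}^{\max}_{S_i}\big(\Diamond\, ap \mid s_i\big) \quad \text{for } S\text{-related states } s \text{ and } s_i.$$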
We now apply the proposed methodology of establishing a behavioral rela-
tionship between the PTGTS S0 and the PTGTSs Si to our running example.
For this purpose, we now describe how the FTS of each Si is embedded into
the LSS of S0 and, based on this embedding, how the Si is derived from S0 .
For our running example, we now describe the construction of a suitable de-
composition relation relying on the LST decomposition introduced before.
3 Here, we rely on the constraints on the eight FTs (cf. Example 1) requiring that no graph of the large-scale system S0 is labeled with the AP APunexpectedVelocity .
6 Evaluation
To analyze the eight PTGTSs constructed for our running example in section 5 (see Table 1 for the results), we have employed the methodology from [19], generating the state spaces for these PTGTSs without timed steps and then generating the corresponding PTA from these state spaces. We then restricted
these PTA to timed automata (TA) essentially removing the information on
probabilities, applied UPPAAL [15] to determine the edges of the TA that can
never be applied due to unsatisfiable guards, and removed the correspond-
ing edges from the previously generated PTA. The entire analysis using our
prototypical implementation required less than three days on a machine using
up to 250 GB memory where the state space generation required most of the
time. However, there is a vast potential for optimizations regarding memory
consumption (by only storing subsequently relevant information on states and
steps) and runtime (by facilitating concurrency during state space generation).
Firstly, using UPPAAL, we have verified that each of the eight TA (hence,
also the eight PTA) has no reachable deadlock (where also timed steps are
disabled). Hence, we obtain that the PTGTS S0 also does not contain this par-
ticular modeling error since, using the decomposition relation, we also obtain
that every deadlock reachable in S0 can be reached analogously in each Si .
Secondly, we have observed that the obtained PTA do not label any lo-
cation with APunexpectedVelocity or APcollision . For APunexpectedVelocity this means that
the additional rules such as DriveEnterFast and DriveEnterSlow for overapprox-
imating the steps of entering shuttles entirely cover all possible velocities of
shuttles. For APcollision this means that Corollary 1 implies that the PTGTS S0
with an LST constructed in the described way from the eight FTs is safe w.r.t.
the occurrence of collisions.
Thirdly, to verify that yellow traffic lights suitably slow down the shuttles before construction sites, we have identified locations ℓi in the resulting PTA
that are labeled with APbraked (occurring only in FT4 and FT5). In each case, we
were able to track using a custom analysis algorithm (since the PRISM model
checker was too slow for the large PTA at hand) the shuttle backwards over
all possible paths leading to such a location ℓi up to the step where the shuttle
entered the FT. We then determined the maximal probability of any such path
obtaining a worst-case emergency brake probability of 10−6 and 10−12 for any
entering shuttle in FT4 and FT5, respectively. On the one hand, FT5 is thereby
verified to be quantitatively more desirable compared to FT4. On the other
hand, Corollary 1 implies that installations of yellow traffic lights as in FT4
and FT5 suitably decrease the likelihood of emergency brakes also for S0 .
However, the probabilities that some shuttle executes an emergency brake in a given time span in FT4/FT5 (obtained by combining the maximal throughput of shuttles for FT4/FT5 with the worst-case probability obtained for FT4/FT5) can be expected to be overly coarse upper bounds when the maximal throughput is not reached in the real system.
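A plausible reading of the two bounds above (assuming, in line with the failure-on-demand model, that an emergency brake in FT5 requires both yellow traffic lights to fail independently):

$$P^{\mathrm{FT4}}_{\mathrm{brake}} \le 10^{-6}, \qquad P^{\mathrm{FT5}}_{\mathrm{brake}} \le 10^{-6} \cdot 10^{-6} = 10^{-12}.$$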
References
[1] Paolo Baldan, Andrea Corradini, and Barbara König. “Static Analysis
of Distributed Systems with Mobility Specified by Graph Grammars—A
Case Study”. In: Proc. of Int. Conf. on Integrated Design & Process Technol-
ogy. Ed. by Ehrig, Krämer, et al. SDPS, 2002.
[2] Basil Becker. “Architectural modelling and verification of open service-
oriented systems of systems”. PhD thesis. Hasso-Plattner-Institut für
Softwaresystemtechnik, Universität Potsdam, 2014. url: http://opus.kobv.de/ubp/volltexte/2014/7015/.
[3] Basil Becker, Dirk Beyer, Holger Giese, Florian Klein, and Daniela Schill-
ing. “Symbolic invariant verification for systems with dynamic struc-
tural adaptation”. In: 28th International Conference on Software Engineering
(ICSE 2006), Shanghai, China, May 20-28, 2006. Ed. by Leon J. Osterweil,
H. Dieter Rombach, and Mary Lou Soffa. ACM, 2006, pp. 72–81. doi:
10.1145/1134285.1134297.
[4] Basil Becker and Holger Giese. “On Safe Service-Oriented Real-Time Co-
ordination for Autonomous Vehicles”. In: 11th IEEE International Sympo-
sium on Object-Oriented Real-Time Distributed Computing (ISORC 2008), 5-7
May 2008, Orlando, Florida, USA. IEEE Computer Society, 2008, pp. 203–
210. doi: 10.1109/ISORC.2008.13.
[5] Basil Becker, Holger Giese, and Stefan Neumann. Correct dynamic service-
oriented architectures : modeling and compositional verification with dynamic
collaborations. Tech. rep. 29. Hasso Plattner Institute at the University of
Potsdam, 2009.
[6] Frank Drewes, Berthold Hoffmann, and Mark Minas. “Formalization
and correctness of predictive shift-reduce parsers for graph grammars
based on hyperedge replacement”. In: J. Log. Algebraic Methods Program.
104 (2019), pp. 303–341. doi: 10.1016/j.jlamp.2018.12.006.
[7] Frank Drewes, Berthold Hoffmann, and Mark Minas. “Graph Parsing as
Graph Transformation - Correctness of Predictive Top-Down Parsers”.
In: Graph Transformation - 13th International Conference, ICGT 2020, Held
as Part of STAF 2020, Bergen, Norway, June 25-26, 2020, Proceedings. Ed. by
Fabio Gadducci and Timo Kehrer. Vol. 12150. Lecture Notes in Computer
Science. Springer, 2020, pp. 221–238. doi: 10.1007/978-3-030-51372-6_13.
[8] Frank Drewes, Berthold Hoffmann, and Mark Minas. “Predictive Top-
Down Parsing for Hyperedge Replacement Grammars”. In: Graph Trans-
formation - 8th International Conference, ICGT 2015, Held as Part of STAF
2015, L’Aquila, Italy, July 21-23, 2015. Proceedings. Ed. by Francesco Parisi-Presicce and Bernhard Westfechtel. Vol. 9151. Lecture Notes in Com-
puter Science. Springer, 2015, pp. 19–34. doi: 10.1007/978-3-319-21145-
9_2.
[9] Johannes Dyck. “Verification of Graph Transformation Systems with k-
Inductive Invariants”. PhD thesis. University of Potsdam, Hasso Plattner
Institute, Potsdam, Germany, 2020. doi: 10.25932/publishup-44274.
[10] Hartmut Ehrig, Karsten Ehrig, Ulrike Prange, and Gabriele Taentzer.
Fundamentals of Algebraic Graph Transformation. Springer-Verlag, 2006.
[11] Amir Hossein Ghamarian and Arend Rensink. “Generalised Composi-
tionality in Graph Transformation”. In: Graph Transformations - 6th Inter-
national Conference, ICGT 2012, Bremen, Germany, September 24-29, 2012.
Proceedings. Ed. by Hartmut Ehrig, Gregor Engels, Hans-Jörg Kreowski,
and Grzegorz Rozenberg. Vol. 7562. Lecture Notes in Computer Science.
Springer, 2012, pp. 234–248. doi: 10.1007/978-3-642-33654-6_16.
[12] Holger Giese. “Modeling and Verification of Cooperative Self-adaptive
Mechatronic Systems”. In: Reliable Systems on Unreliable Networked Plat-
forms - 12th Monterey Workshop 2005, Laguna Beach, CA, USA, September
22-24, 2005. Revised Selected Papers. Ed. by Fabrice Kordon and Janos Szti-
panovits. Vol. 4322. Lecture Notes in Computer Science. Springer, 2005,
pp. 258–280. doi: 10.1007/978-3-540-71156-8_14.
[13] Holger Giese and Wilhelm Schäfer. “Model-Driven Development of Safe
Self-optimizing Mechatronic Systems with MechatronicUML”. In: Assur-
ances for Self-Adaptive Systems - Principles, Models, and Techniques. Ed. by
Javier Cámara, Rogério de Lemos, Carlo Ghezzi, and Antónia Lopes.
Vol. 7740. Lecture Notes in Computer Science. Springer, 2013, pp. 152–
186. doi: 10.1007/978-3-642-36249-1_6.
[14] Holger Giese, Matthias Tichy, Sven Burmester, and Stephan Flake. “To-
wards the compositional verification of real-time UML designs”. In: Pro-
ceedings of the 11th ACM SIGSOFT Symposium on Foundations of Software
Engineering 2003 held jointly with 9th European Software Engineering Confer-
ence, ESEC/FSE 2003, Helsinki, Finland, September 1-5, 2003. Ed. by Jukka
Paakki and Paola Inverardi. ACM, 2003, pp. 38–47. doi: 10.1145/940071.
940078.
[15] Eun-Young Kang, Dongrui Mu, and Li Huang. “Probabilistic Verification
of Timing Constraints in Automotive Systems Using UPPAAL-SMC”.
In: Integrated Formal Methods - 14th International Conference, IFM 2018,
Maynooth, Ireland, September 5-7, 2018, Proceedings. Ed. by Carlo A. Fu-
ria and Kirsten Winter. Vol. 11023. Lecture Notes in Computer Science.
Springer, 2018, pp. 236–254. doi: 10.1007/978-3-319-98938-9_14.
[16] Marta Z. Kwiatkowska, Gethin Norman, and David Parker. “PRISM 4.0:
Verification of Probabilistic Real-Time Systems”. In: Computer Aided Ver-
ification - 23rd International Conference, CAV 2011, Snowbird, UT, USA,
July 14-20, 2011. Proceedings. Ed. by Ganesh Gopalakrishnan and Shaz
Qadeer. Vol. 6806. Lecture Notes in Computer Science. Springer, 2011,
pp. 585–591. isbn: 978-3-642-22109-5. doi: 10.1007/978-3-642-22110-1_47.
[17] Marta Z. Kwiatkowska, Gethin Norman, Jeremy Sproston, and Fuzhi
Wang. “Symbolic Model Checking for Probabilistic Timed Automata”.
In: Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant
Systems, Joint International Conferences on Formal Modelling and Analysis of
Timed Systems, FORMATS 2004 and Formal Techniques in Real-Time and
Fault-Tolerant Systems, FTRTFT 2004, Grenoble, France, September 22-24,
2004, Proceedings. Ed. by Yassine Lakhnech and Sergio Yovine. Vol. 3253.
Lecture Notes in Computer Science. Springer, 2004, pp. 293–308. isbn:
3-540-23167-6. doi: 10.1007/978-3-540-30206-3_21.
[18] Maria Maximova, Holger Giese, and Christian Krause. “Probabilistic
timed graph transformation systems”. In: Graph Transformation - 10th
International Conference, ICGT 2017, Held as Part of STAF 2017, Marburg,
Germany, July 18-19, 2017, Proceedings. Ed. by Juan de Lara and Detlef
Plump. Vol. 10373. Lecture Notes in Computer Science. Springer, 2017,
pp. 159–175. isbn: 978-3-319-61469-4. doi: 10.1007/978-3-319-61470-0_10.
[19] Maria Maximova, Holger Giese, and Christian Krause. “Probabilistic
timed graph transformation systems”. In: J. Log. Algebr. Meth. Program.
101 (2018), pp. 110–131. doi: 10.1016/j.jlamp.2018.09.003.
[20] Maria Maximova, Sven Schneider, and Holger Giese. Compositional Anal-
ysis of Probabilistic Timed Graph Transformation Systems. Tech. rep. 133.
Potsdam, Germany: Hasso Plattner Institute at the University of Pots-
dam, 2021.
[21] Fernando Orejas. “Symbolic graphs for attributed graph constraints”. In:
J. Symb. Comput. 46.3 (2011), pp. 294–315. doi: 10.1016/j.jsc.2010.09.009.
[22] Fernando Orejas and Leen Lambers. “Lazy Graph Transformation”. In:
Fundam. Inform. 118.1-2 (2012), pp. 65–96. doi: 10.3233/FI-2012-706.
[23] RailCab Project. url: https://www.hni.uni-paderborn.de/cim/projekte/
railcab.
[24] Arend Rensink. “Compositionality in Graph Transformation”. In: Au-
tomata, Languages and Programming, 37th International Colloquium, ICALP
2010, July 6-10, Bordeaux, France, 2010, Proceedings, Part II. Ed. by Sam-
son Abramsky, Cyril Gavoille, Claude Kirchner, Friedhelm Meyer auf
der Heide, and Paul G. Spirakis. Vol. 6199. Lecture Notes in Computer
Science. Springer, 2010, pp. 309–320. doi: 10.1007/978-3-642-14162-1_26.
[25] Willem P. de Roever, Hans Langmaack, and Amir Pnueli, eds. Composi-
tionality: The Significant Difference, International Symposium, COMPOS’97,
Bad Malente, Germany, September 8-12, 1997. Revised Lectures. Vol. 1536.
Lecture Notes in Computer Science. Springer, 1998. isbn: 3-540-65493-3.
doi: 10.1007/3-540-49213-5.
[26] Sven Schneider, Johannes Dyck, and Holger Giese. “Formal Verification
of Invariants for Attributed Graph Transformation Systems Based on
Nested Attributed Graph Conditions”. In: Graph Transformation - 13th
International Conference, ICGT 2020, Held as Part of STAF 2020, Bergen,
Norway, June 25-26, 2020, Proceedings. Ed. by Fabio Gadducci and Timo
Kehrer. Vol. 12150. Lecture Notes in Computer Science. Springer, 2020,
pp. 257–275. doi: 10.1007/978-3-030-51372-6_15.
[27] Sven Schneider, Leen Lambers, and Fernando Orejas. “Automated rea-
soning for attributed graph properties”. In: STTT 20.6 (2018), pp. 705–
737. doi: 10.1007/s10009-018-0496-3.
[28] Sven Schneider, Maria Maximova, Lucas Sakizloglou, and Holger Giese.
“Formal Testing of Timed Graph Transformation Systems using Metric
Temporal Graph Logic”. In: STTT (2019). Accepted.
[29] Sven Schneider, Lucas Sakizloglou, Maria Maximova, and Holger Giese.
“Optimistic and Pessimistic On-the-fly Analysis for Metric Temporal
Graph Logic”. In: Graph Transformation - 13th International Conference,
ICGT 2020, Held as Part of STAF 2020, Bergen, Norway, June 25-26, 2020,
Proceedings. Ed. by Fabio Gadducci and Timo Kehrer. Vol. 12150. Lecture
Notes in Computer Science. Springer, 2020, pp. 276–294. doi: 10.1007/
978-3-030-51372-6_16.
Efficient Bounded Model Checking
of Heap-Manipulating Programs
using Tight Field Bounds
1 Introduction
SAT-based bounded model checking [7] is an automated software analysis technique, consisting of appropriately encoding a program as a propositional formula in such a way that its satisfying valuations correspond to program defects, such as assertion violations.
Nicolás Rosner was affiliated with the University of Buenos Aires, Buenos Aires,
Argentina at the time of contribution to this work.
A tool based on bounded model checking over SAT is CBMC [20]. It supports
all of ANSI-C, including programs handling pointers and pointer arithmetic. The
tool is able to exhaustively explore many user-bounded program executions re-
sulting from various sources of non-determinism, including scheduling decisions
and the assignment of values to program variables. To achieve this, CBMC pro-
vides statements to produce non-deterministic values for certain variables, forc-
ing the model checker to consider all possible values for these variables during
verification. These statements enable program verification on all legal inputs, by
assigning these inputs values within their corresponding (legal) domains. While
this mechanism is effective for the verification of programs manipulating basic data types and simple structured types, it is not available for the generation of pointers. This issue forces the user to provide an ad-hoc environment to
verify programs handling dynamic data structures. In fact, a typical, convenient
mechanism to verify programs handling heap-allocated linked structures using
CBMC and similar tools, is to non-deterministically build such structures using
insertion routines [19, 22, 11].
The aforementioned approach, while effective, has its scalability tied to how
complex the insertion routines are, and how many of these are actually needed.
Indeed, there are many linked structures whose domain of valid structures can-
not be built only via insertion operations (e.g., red-black trees and node caching
linked lists require insertions as well as removals, in order to reach all bounded
valid structures). In this paper, we study an alternative technique for verifying,
using CBMC, programs handling heap-allocated linked structures. The approach
essentially consists of building a pool of objects with nondeterministically ini-
tialized fields, which are then used for nondeterministically building structures.
The rapid explosion in the number of generated linked structures is tamed by exploiting precomputed bounds for fields, which disregard values deemed invalid by the structure’s assumed properties, such as datatype invariants and routine preconditions. This leaves us with the additional problem of precomputing these bounds,
a computationally costly task on its own. We then present a novel algorithm for
these precomputations, based on incremental SAT solving, making the whole
process fully automated.
/* Fig. 1: verification harness for avl_remove; t denotes the AVL tree
   under test (its declaration is omitted in this excerpt). */
avl_init(t);
int size = nondet_int();                  /* nondeterministic number of insertions */
__CPROVER_assume(size >= 0 && size <= MAX_SIZE);
for (int i = 0; i < size; i++) {
    int value = nondet_int();             /* nondeterministic value to insert */
    __CPROVER_assume(value >= MIN_VAL && value < MAX_VAL);
    avl_insert(t, value);
}
int r_value = nondet_int();               /* nondeterministic value to remove */
__CPROVER_assume(r_value >= MIN_VAL && r_value < MAX_VAL);
avl_remove(t, r_value);
__CPROVER_assert(avl_repok(t), "AVL invariant preserved");
2 A Motivating Example
Let us start by describing a particular verification scenario that will serve the purpose of motivating our approach. Suppose that we have an implementation of dictionaries, based on AVL trees; furthermore, we would like to verify that the remove operation on this structure preserves the structure’s invariant, i.e., after a removal is performed, the resulting structure is still a valid AVL tree (acyclic, with every node having at most one parent, sorted, and balanced). Moreover, let us assume that, besides operation avl_remove, we have AVL’s avl_init, avl_insert and avl_repok, the latter being a routine that checks whether a given structure satisfies the AVL invariant, as described above. In
order to perform the desired verification, we can proceed by building the program
shown in Figure 1. Notice how this program:
– employs CBMC primitives to nondeterministically decide how many values, and which values, to insert in/remove from the tree (appropriately constrained by constants MAX_SIZE, MIN_VAL and MAX_VAL),
– uses an AVL insertion routine to produce the insertions, and
– uses an avl_repok routine, which checks the AVL invariant on the linked
structure rooted at t.
When running CBMC on this program, if loops are unwound enough and
no violation of the assertion is obtained, then we have verified that, within the
provided bounds, remove indeed preserves the invariant.
The above traditional approach to verifying linked structures using CBMC
and similar tools [19, 22, 11] has its efficiency tied to how complex the involved
routines are, in particular the insertion routine(s) (the avl_remove routine, being
verified, cannot be avoided).
For analysis techniques that must consider all possible state configurations
that satisfy some given property, we may reduce this relational semantics by
considering tight field bounds. Intuitively, for a field f and a property α, its
tight field bound on α is the union of f ’s representation across all program
states that satisfy α. Tight field bounds have been used to reduce the number
of variables and clauses in propositional representations of relational heap en-
codings for Java automated analyses [14, 13, 2], and in symbolic execution based
model checking to prune parts of the symbolic execution search tree constraining
nondeterministic options [15, 26] (see section 6 for a more detailed description
of these previous applications). Tight field bounds are computed from assumed
properties, and can be employed to restrict structures in states that are assumed
to satisfy such properties, i.e. precondition states. In our case, we will use the
invariant of the structure, as opposed to stronger preconditions, so that these
can be reused across several routines of the same structure.
Definition 1. Let f be a field of structure T1 with type T2 . Let i and j be the
scopes for types T1 and T2 , respectively. Let A = {a1 , . . . , ai } be the identifiers for
data objects of type T1 , and let B = {b1 , . . . , bj } be the identifiers for data objects
of type T2 . Given an identifier k, ok denotes the corresponding data object. The
tight field bound for field f is the smallest relation Uf ⊆ A × (B + Null) satisfying: ⟨x, y⟩ ∈ Uf iff there exists a valid heap instance I in which ox->f = oy .
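For example (an illustration consistent with the later discussion of Figure 4, assuming the invariant α forbids cycles through f), self-references can never occur in a tight field bound:

$$\langle x, x \rangle \notin U_f \quad \text{for all } x \in A.$$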
By scope we mean the limit in the number of objects, ranges for numerical
types, and maximum depth in loop unwinding, as in [17, 12]. An important
assumption we make for analysis is that structure invariants do not refer to
the specific heap addresses of data objects, and in particular that these do not
[Fig. 3: two binary trees BT0 and BT1, each rooted at N0 over nodes N0–N5 (diagrams omitted)]
/* Inside nondet_T(): each pool node's field f is assigned
   nondeterministically to NULL or to one of the pool nodes. */
ts_nodes[0]->f = NULL;
if (nondet_bool()) ts_nodes[0]->f = tt_nodes[0];
else if (nondet_bool()) ts_nodes[0]->f = tt_nodes[1];
...
ts_nodes[1]->f = NULL;
if (nondet_bool()) ts_nodes[1]->f = tt_nodes[0];
else if (nondet_bool()) ts_nodes[1]->f = tt_nodes[1];
...
Finally, nondet_T() ends by returning either NULL or t_nodes[0] (no other non-null node is necessary, due to symmetry breaking). Using nondet_T(), we build the following verification harness for p:
T x = nondet_T();               /* pool-based nondeterministic structure */
__CPROVER_assume(repok(x));     /* keep only inputs satisfying the invariant */
p(x);                           /* routine under analysis */
__CPROVER_assert(repok(x), "repok preserved");
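For concreteness, the following is a minimal sketch of what a complete nondet_T() might look like for a pool of two nodes with a single pointer field f. The pool layout, the node type, and the POOL_SIZE constant are our assumptions for illustration; nondet_bool() is the CBMC primitive used above.

#include <stdlib.h>

struct node { struct node *f; };   /* hypothetical structure with one field */

_Bool nondet_bool(void);           /* CBMC nondeterministic choice */

#define POOL_SIZE 2
static struct node *t_nodes[POOL_SIZE];

struct node *nondet_T(void) {
    /* allocate the pool of nodes */
    for (int i = 0; i < POOL_SIZE; i++)
        t_nodes[i] = malloc(sizeof(struct node));
    /* nondeterministically wire each node's field f to NULL or a pool node */
    for (int i = 0; i < POOL_SIZE; i++) {
        t_nodes[i]->f = NULL;
        if (nondet_bool()) t_nodes[i]->f = t_nodes[0];
        else if (nondet_bool()) t_nodes[i]->f = t_nodes[1];
    }
    /* returning NULL or the first node suffices, due to symmetry breaking */
    return nondet_bool() ? NULL : t_nodes[0];
}

With tight field bounds, the if-chains shrink: assignments whose corresponding pair is absent from the precomputed bound are simply not generated.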
To illustrate the benefits of using tight field bounds in this setting, compare
the two (semantically equivalent) nondet_avl() methods in Figure 4 for build-
ing AVLs with size at most 4. At the left of Figure 4, we show the code for
the approach that considers all the feasible assignments to nodes’ fields within
the scope (many assignments not displayed due to the lack of space). With
precomputed tight field bounds we can discard a significant number of these
assignments, that are not allowed due to the bounds, as shown at the right of
Figure 4. Notice that, among many others, all self-loops in nodes are discarded
by the bounds.
For the rest of this section we assume a fixed structure T, with fields f1 , . . . , fm
and representation invariant repok, and a fixed scope k. Tight field bounds for
T can be automatically computed from assumed properties such as invariants
and preconditions. These properties must be expressed in a language amenable
to automated analysis, reducible to SAT-based analysis in our case. We employ
the automated translation of the definition of T and its repok to a propositional
formula implemented in the TACO tool [14, 13]. We also assume a symmetry
breaking predicate is created by this translation, forcing canonical orderings
of heap nodes in structures (see [14, 13] for a careful description of how these
symmetry-breaking predicates are automatically built). We discuss below the
Fig. 4: Building AVLs with size at most 4. Left: all feasible assignments to nodes’ fields. Right: only assignments deemed feasible by tight field bounds.
parts of the translation that are important for the understanding of our approach,
and refer the reader to the literature for additional details [14, 13].
Let f be a field of T with type T’. Let A = {a1 , . . . , ak } and B = {b1 , . . . , bk } be the identifiers for data objects of type T and T’ within scope k, respectively. This bounded field is then a relation f ⊆ A × (B + null). The propositional encoding of f consists of boolean variables fi,j , 0 ≤ i, j < k, such that fi,j = True in an instance I if and only if the value of f for object ai is equal to object bj (i.e. ai->f = bj ) in I (the original translation has variables representing ai->f = null; we omit these here to simplify the presentation).
As an example, Figure 5 below shows the propositional variables representing all the feasible values of binary trees’ left and right fields for scope 6, in tabular form. In the tables, object identifiers are named Ni (0 ≤ i < 6), variables li,j (0 ≤ i, j < 6) denote Ni->left = Nj (similarly, ri,j denote Ni->right = Nj ).
Fig. 5: Propositional encodings of binary trees’ left and right fields for a scope of 6.
In this way, the binary tree at the left of Figure 3, whose relational representation is given in equation 1, is defined exactly by setting the corresponding variables to true (and all the remaining variables to false).
In case the satisfiability verdict is true, the valuation returned by the solver corresponds to a valid (in the sense that it satisfies the invariant) memory heap, containing the pair ⟨ai , bj ⟩ in the relational representation of f . Also, from the valu-
ation we can retrieve for each field f all the (true) variables that represent pairs
of objects related by f in that particular heap.
The formula above can be used to compute tight bounds, determining which variables fi,j are infeasible (and hence the corresponding pairs in the fields’ semantics) in states that satisfy the invariant. In [14], the infeasible variables
are determined using a top-down algorithm. In the algorithm therein, the field
semantics is initially set, for a field of type B declared in structure A, to A × (B ∪ {null}). From this fully populated initial semantics, each pair is checked for
feasibility. Pairs found to be infeasible are removed from the bound. Adopting
this top-down approach for computing tight field bounds leads to feasibility
checks (a large number of these) that are independent from one another, thus
making it amenable to distributed processing. Moreover, a pair can be removed
from the bound as soon as it is deemed infeasible, which can be exploited to
compute tight field bounds “non-exhaustively”, e.g., dedicating a certain time
to the computation of tight field bounds, and taking the obtained tight field
bound for improving SAT analysis, regardless of whether the tight bound is the
tightest (it converged to removing all infeasible pairs) or not. The latter can be
achieved thanks to the fact that, in the top-down approach, intermediate bounds
are also tight bounds [14, 13]. As each SAT query in this top-down approach
is independent from the rest, the algorithm does not exploit the incremental
capabilities of modern SAT solvers.
Let us present our approach to compute tight field bounds. As opposed to
the technique in [14], our algorithm operates in a bottom-up fashion. In our
presentation below, we assume a propEncoding method that takes the repok,
a symmetry breaking predicate sbpred, and the scopes scope, and returns an
encoding object. Its getPropositionalFormula method creates and returns a
CNF propositional formula, encoding the repok and sbpred for the given scope.
Also, the encoding’s getVars(f) method returns all the propositional variables
in the encoding of field f (see Figure 5). The algorithm uses an incremental SAT
solver, represented by a module solver, with the following routines:
– load: receives as argument a propositional formula in CNF and loads it into
the solver.
– addClause: (incrementally) adds a clause to the current formula in the solver
for future solving invocations.
– solve: calls the SAT-solving procedure, deciding whether the formula cur-
rently loaded in the solver is satisfiable (SAT) or not.
– getModel: if the formula is satisfiable, it returns the valuation produced by
the SAT-solver. The truth value of a variable v in the model can be retrieved
by invoking getValue(v).
The pseudocode of our algorithm is shown in Figure 6. Line 3 builds a proposi-
tional encoding using the repok, the symmetry breaking predicate sbpred and
the scopes. The CNF propositional formula produced by the encoding object
is then loaded into the solver in Line 4. Lines 5-7 initialize sets vars_f1 , . . . , vars_fm , each containing all the propositional variables in the encoding of the corresponding fields f1 , . . . , fm . As opposed to the top-down algorithm proposed in [14], which initialized fields’ semantics as binary relations containing all pairs, the bottom-up algorithm starts with empty sets feasible_f1 , . . . , feasible_fm (lines 8-10). feasible_f1 , . . . , feasible_fm are used by the algorithm to store partial bounds for the corresponding fields f1 , . . . , fm , and will be iteratively extended with the true variables in instances returned by the SAT solver.
A crucial step in our algorithm is performed at line 12, where the current formula loaded in the SAT solver is extended, exploiting incremental SAT solving, with a clause requiring that at least one variable not yet added to the feasible sets be true in the next satisfying valuation (each iteration thus makes progress).
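Since Figure 6 itself is not reproduced here, the following C-style sketch captures the loop it describes, using the solver and encoding routines named above; all types and exact signatures are hypothetical and only illustrate the control flow.

/* Bottom-up tight field bound computation (sketch of Alg. 6).
   vars[i] holds all propositional variables of field fi;
   feasible[i] is the partial bound, initially empty. */
void bottom_up(Encoding enc, VarSet vars[], VarSet feasible[], int m) {
    solver_load(encoding_cnf(enc));             /* CNF of repok /\ sbpred */
    for (int i = 0; i < m; i++)
        varset_clear(&feasible[i]);             /* start from empty bounds */
    for (;;) {
        /* progress clause: the next model must set at least one variable
           not yet known to be feasible */
        Clause c = clause_new();
        for (int i = 0; i < m; i++)
            for (int j = 0; j < varset_size(&vars[i]); j++) {
                Var v = varset_get(&vars[i], j);
                if (!varset_contains(&feasible[i], v))
                    clause_add(c, v);
            }
        solver_add_clause(c);
        if (solver_solve() != SAT)              /* unsat: bounds are tight */
            return;
        Model mod = solver_get_model();         /* a valid canonical heap */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < varset_size(&vars[i]); j++) {
                Var v = varset_get(&vars[i], j);
                if (model_value(mod, v))        /* pair realized in this heap */
                    varset_add(&feasible[i], v);
            }
    }
}

Each iteration exploits incremental solving: the CNF stays loaded and only the new progress clause is added before re-solving.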
Proof. Termination easily follows from the following two facts: (i) for given bounds on data domains of the structure under analysis, limited by the scopes, the number of pairs that can be added to a field bound is finite; and (ii) each while-loop iteration either adds at least one extra pair to the bounds, or otherwise returns unsat, in which case the loop terminates.
To prove that the algorithm yields tight field bounds, we proceed as follows.
Notice that at each iteration, and for any field fi , the bound associated to field fi (feasible_fi ) is a subset of the corresponding tight bound, i.e., contains
only feasible variables: the initial bound (∅) is obviously a subset of the tight bound, and bounds are extended only by adding variables extracted from valid
structures (i.e., each loop iteration produces a valid expansion). An inductive
argument allows us to conclude that, on termination, the bound associated to
field fi (feasible_fi ) is a subset of the tight bound. We will now show that
feasible_fi is the tight field bound. Let us suppose that, once the algorithm terminates, bound feasible_fi is not tight, i.e., there exists a feasible variable vw,z that does not belong to feasible_fi . Then, there must exist a canonical (i.e., satisfying symmetry breaking) instance I of repok within scopes, in which ow->fi = oz . Therefore, I satisfies repok, sbpred, and vw,z = True, contradicting the fact that the algorithm had terminated. Therefore, all variables excluded from feasible_fi are infeasible, making this bound tight.
As opposed to the top-down algorithm for tight bounds introduced in [14, 13], Algorithm 6 only provides useful information once it terminates – intermediate bounds cannot be used to improve analysis. Moreover, whereas the top-down approach lends itself well to parallelization (as we mentioned before, it implies a large number of independent SAT queries that can be solved in a distributed manner), it is not obvious how one would reasonably distribute our new bottom-up algorithm.
5 Evaluation
Our first experimental evaluation assesses the impact of tight field bounds in
verification of code handling linked structures using CBMC. The evaluation is
based on a benchmark of collection implementations, previously used for tight
field bounds computation in [14, 13], composed of data structures with increas-
ingly complex invariants:
– an implementation of sequences based on singly linked lists (LList);
– a List implementation (from Apache Commons Collections), based on circu-
lar doubly-linked lists (AList);
– a List implementation (from Apache Commons Collections), based on node
caching linked lists (CList);
– a Set implementation (from java.util) based on red-black trees (TSet);
– an implementation of AVL trees obtained from the case study used in [4]
(AVL); and
– an implementation of binomial heaps used as part of a benchmark in [28]
(BHeap).
Experiments in this section were run on workstations with an Intel Core i7-4790 processor (8 MB cache, 3.6 GHz, turbo up to 4 GHz) and 16 GB of RAM, running GNU/Linux. The incremental SAT solver used was Minisat 2.2.0. We denote by OOM that the 16 GB of memory were exhausted, and by OOM+ that the 16 GB were exhausted while CBMC was preprocessing; in this latter case no numbers of clauses or variables were produced by CBMC. The timeout was set for these experiments to 1 hour.
Table 1 reports, for the most relevant routines of each of the data structures in
our benchmark, the verification running times with the underlying decision pro-
cedure running times discriminated in seconds, as well as the number of clauses
and variables (expressed in thousands) in the CNF formulas corresponding to
each of the verification tasks, for several scopes (S). Since we checked whether
the routines preserved the corresponding structure’s invariant, we did not con-
sider for the experiments those routines that did not modify the structure (these
trivially preserve the invariant). We assessed three different approaches:
– Build*: use of verification harnesses based on insertion routines (see Fig. 1),
– Gen&Filter (generate and filter): non-deterministic generation of data struc-
tures without tight field bounds (as illustrated in Fig. 4), using a traditional
symmetry breaking algorithm to discard isomorphic structures [14] (we do
not discuss this here due to space reasons),
– TFB: our introduced approach, which incorporates tight field bounds into the
previous to discard irrelevant non-deterministic assignments of field values
(as illustrated in Fig. 4).
Some remarks on the results are in order. Table 1 shows that in all analyzed
routines, the TFB approach allowed us to analyze larger scopes for which the
other input generation techniques exhausted the allotted time or memory. TFB
was able to analyze larger scopes than Gen&Filter in 7 out of 12 cases (remark-
ably, by at least 6 in AList, at least 3 in CList and at least 2 in AVL), and in
8 out of 12 cases with respect to Build* (by at least 4 in all 8 cases). Routine
extractMin in structure BHeap is particularly interesting: it contains a bug first
found in [14] that can only be exhibited by an input with at least 13 nodes.
Gray cells mark experiments in which the bug was detected by CBMC. Notice
in particular that Build* does not scale well enough to find this bug.
Our second evaluation is devoted to tight field bounds computation, in com-
parison with the top-down approach presented in [14]. We re-ran the TACO
experiments as reported in [13] on the same hardware we used for our own
experiments for a fair comparison. Original scripts and configurations were pre-
served. All distributed experiments were run on a cluster of 9 PCs (one being the
master) of the same characteristics as described above. Each distributed exper-
iment was run 3 times; the reported timing is the average thereof. All times are
given in wall-clock seconds. A timeout (TO) is set at 18,000 seconds (5 hours) for tight bounds computation. Our bottom-up tight field bounds technique is
non-parallel, and was run on a single workstation. Table 2 summarizes the re-
sults of our experiments regarding tight bounds computation. We compared the
running times of computing tight field bounds using the distributed technique
from [14] and our non-parallel presented algorithm, for scopes 10, 12, 15, 17 and
20, reporting the following:
– TACO(||): The parallel wall-clock time required to compute tight field bounds
with TACO, the tool subsuming the top-down tight bounds approach [14, 13].
– TACO(s): The sequentialized TACO time, i.e., the sum of the times of all the Minisat invocations performed by the TACO distributed algorithm.
– BU: The time the bottom-up algorithm (Alg. 6) requires to compute tight
field bounds.
– Speedup(||): The speed-up achieved by BU when compared to the distributed
TACO time reported as TACO(||).
– Speedup(s): The speed-up achieved by BU when compared to the sequen-
tialized TACO time reported as TACO(s).
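Spelled out, the two speed-up columns are simply ratios of the times listed above (our transcription):

\mathrm{Speedup}(\|) = \frac{\mathrm{TACO}(\|)}{\mathrm{BU}} \qquad\qquad \mathrm{Speedup}(s) = \frac{\mathrm{TACO}(s)}{\mathrm{BU}}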
In comparison with the distributed approach of [14], the speed-ups obtained by Alg. 6 are in general very good. In particular, in all experiments but AVL with scope 20, the running time of our sequential bottom-up approach (BU) is already below the wall-clock time of (parallel) TACO. For AVL trees with scope 20, the only experiment where BU performed slower than TACO, the achieved speed-up is 0.6X. This means that running BU on a single workstation does not
even take twice as long as running TACO(||) on 32 processors (4 cores in 8 slave
machines used for distributed computation). Second, it is worth noticing that
structures with strong invariants (e.g., BHeap) intuitively lead to “small” tight
field bounds; a bottom-up approach then, as we explained earlier, is particularly
well suited for tight bounds computation for these structures, since the process
Table 1: Dynamic data structure verification in CBMC: TFB versus Build* and Gen&Filter. Verification and solving times in seconds, clauses and variables in thousands.
cases. Third, some structures with relatively weak invariants also had good run-
ning times (AList, in particular), when compared to other case studies. Although
the invariants in these cases are weaker, which intuitively would lead to more
expensive tight bounds computations, these structures have fewer fields, so the
state space to be covered to compute tight bounds is significantly smaller than
that of more complex structures.
All the experiments in this section can be reproduced following the instruc-
tions available at [1].
6 Related Work
Automated analysis of code handling dynamic data structures has been the fo-
cus of various lines of research, including separation logic based approaches [5],
approaches based on combinations of testing and static analysis [22], various
forms of model checking including explicit state model checking [27], symbolic
execution based model checking [23] and SAT-based verification [14, 13]. The
approach that we refer to as Build*, producing nondeterministic structures by
using insertion routines, has been used in some of these approaches, including
[22, 11]. The “generate & filter” mechanism, on the other hand, is more often
employed in modular (assume-guarantee) verification. In particular, the lazy initialization approach, whose symmetry breaking we borrowed for “generate & filter” in this paper, is used in [19], among others. However, in SAT-based bounded
model checking, with tools such as [20], “generate & filter” is not reported as
an analysis option for dynamic data structures. Tight bounds have previously been used to improve test generation and bounded verification for JML-annotated Java programs [14, 13]. The setting is however
different from that of CBMC, due to the relational program (and heap state) semantics, which enabled them to exploit tight bounds directly at the propositional
encoding level. Tight bounds have also been used for improving symbolic exe-
cution based model checking [15, 26]. Again, the context is different: these approaches, which essentially “walk” the code (either concretely or symbolically), can exploit tight bounds more deeply [26], obtaining greater benefits.
We have also reported a novel technique to compute tight bounds. This algorithm is inspired by the work of [24] on black-box test input generation
using SAT. Our work is also closely related to [14, 13]. The approach to com-
pute tight field bounds presented in [14, 13] as part of the TACO tool performs
a very large number of independent SAT queries to compute bounds, and thus
requires a cluster of workstations to do so effectively (we compared with this
approach in the paper). Another approach to compute tight field bounds is presented in [25], but it requires structure specifications to be provided in a separation logic flavor [21].
7 Conclusions
We have investigated the use of tight field bounds in the context of SAT-based
bounded model checking, more concretely, in (assume-guarantee) verification of
C code, using CBMC. We showed that, in this context, and in particular in
the verification of programs dealing with linked structures, an approach based
on nondeterministically generating structures, and then “filtering out” ill-formed
ones, can be more efficient than the more traditional approach of repeatedly using
data structure builders, especially when tight bounds are exploited. We have
performed a number of experiments that confirm that this alternative approach
allows CBMC to consider larger input sizes as well as to detect bugs that could
not be detected without using bounds.
Since the approach depends on precomputing tight field bounds, we have also
studied this problem, providing a novel algorithm for tight field bound compu-
tation. Tight field bounds have proved useful for a number of different analyses,
but computing them is costly, and previous field bound computation approaches
that performed reasonably did so at the expense of relying on a cluster of work-
stations to perform the task, or were only applicable to a limited set of class
invariants, expressible in separation logic. Thus, while tight field bounds proved to have a deep impact on the previously mentioned automated software analysis techniques, their use has been severely hindered by the need for a cluster
of computers for their effective computation, or the availability of specifications
in separation logic. The algorithm presented in this article allows one to compute
tight field bounds on a single workstation more efficiently than the distributed approach on a cluster of 8 quad-core machines, and therefore makes tight field bounds computation both practical and worthwhile as part of the above-mentioned
analyses.
References
1. Website and replication package for Efficient Bounded Model Check-
ing of Heap-Manipulating Programs using Tight Field Bounds.
https://sites.google.com/view/bmc-bounds.
2. Pablo Abad, Nazareno Aguirre, Valeria S. Bengolea, Daniel Alfredo Ciolek,
Marcelo F. Frias, Juan P. Galeotti, Tom Maibaum, Mariano M. Moscato, Nicolás
Rosner, and Ignacio Vissani. Improving test generation under rich contracts by
tight bounds and incremental SAT solving. In Sixth IEEE International Confer-
ence on Software Testing, Verification and Validation, ICST 2013, Luxembourg,
Luxembourg, March 18-22, 2013, pages 21–30. IEEE Computer Society, 2013.
3. Saswat Anand, Corina S. Pasareanu, and Willem Visser. JPF-SE: A symbolic exe-
cution extension to java pathfinder. In Orna Grumberg and Michael Huth, editors,
Tools and Algorithms for the Construction and Analysis of Systems, 13th Interna-
tional Conference, TACAS 2007, Held as Part of the Joint European Conferences
on Theory and Practice of Software, ETAPS 2007 Braga, Portugal, March 24 -
April 1, 2007, Proceedings, volume 4424 of Lecture Notes in Computer Science,
pages 134–138. Springer, 2007.
4. Jason Belt, Robby, and Xianghua Deng. Sireum/topi LDP: a lightweight semi-
decision procedure for optimizing symbolic execution-based analyses. In Hans van
Vliet and Valérie Issarny, editors, Proceedings of the 7th joint meeting of the Euro-
pean Software Engineering Conference and the ACM SIGSOFT International Sym-
posium on Foundations of Software Engineering, 2009, Amsterdam, The Nether-
lands, August 24-28, 2009, pages 355–364. ACM, 2009.
5. Josh Berdine, Cristiano Calcagno, Byron Cook, Dino Distefano, Peter W. O’Hearn,
Thomas Wies, and Hongseok Yang. Shape analysis for composite data struc-
tures. In Werner Damm and Holger Hermanns, editors, Computer Aided Verifica-
tion, 19th International Conference, CAV 2007, Berlin, Germany, July 3-7, 2007,
Proceedings, volume 4590 of Lecture Notes in Computer Science, pages 178–192.
Springer, 2007.
6. Chandrasekhar Boyapati, Sarfraz Khurshid, and Darko Marinov. Korat: automated
testing based on java predicates. In Phyllis G. Frankl, editor, Proceedings of the
International Symposium on Software Testing and Analysis, ISSTA 2002, Roma,
Italy, July 22-24, 2002, pages 123–133. ACM, 2002.
7. Edmund M. Clarke, Armin Biere, Richard Raimi, and Yunshan Zhu. Bounded
model checking using satisfiability solving. Formal Methods Syst. Des., 19(1):7–34,
2001.
8. Edmund M. Clarke, Daniel Kroening, and Flavio Lerda. A tool for checking
ANSI-C programs. In Kurt Jensen and Andreas Podelski, editors, Tools and Algo-
rithms for the Construction and Analysis of Systems, 10th International Confer-
ence, TACAS 2004, Held as Part of the Joint European Conferences on Theory and
Practice of Software, ETAPS 2004, Barcelona, Spain, March 29 - April 2, 2004,
Proceedings, volume 2988 of Lecture Notes in Computer Science, pages 168–176.
Springer, 2004.
9. Greg Dennis, Felix Sheng-Ho Chang, and Daniel Jackson. Modular verification
of code with SAT. In Lori L. Pollock and Mauro Pezzè, editors, Proceedings of
the ACM/SIGSOFT International Symposium on Software Testing and Analysis,
ISSTA 2006, Portland, Maine, USA, July 17-20, 2006, pages 109–120. ACM, 2006.
10. Niklas Eén and Niklas Sörensson. An extensible sat-solver. In Enrico Giunchiglia
and Armando Tacchella, editors, Theory and Applications of Satisfiability Testing,
6th International Conference, SAT 2003. Santa Margherita Ligure, Italy, May 5-8,
2003 Selected Revised Papers, volume 2919 of Lecture Notes in Computer Science,
pages 502–518. Springer, 2003.
11. Stephan Falke, Florian Merz, and Carsten Sinz. LLBMC: improved bounded model
checking of C programs using LLVM - (competition contribution). In Nir Piter-
man and Scott A. Smolka, editors, Tools and Algorithms for the Construction and
Analysis of Systems - 19th International Conference, TACAS 2013, Held as Part
of the European Joint Conferences on Theory and Practice of Software, ETAPS
2013, Rome, Italy, March 16-24, 2013. Proceedings, volume 7795 of Lecture Notes
in Computer Science, pages 623–626. Springer, 2013.
12. Marcelo F. Frias, Juan P. Galeotti, Carlos López Pombo, and Nazareno Aguirre.
Dynalloy: upgrading alloy with actions. In Gruia-Catalin Roman, William G.
Griswold, and Bashar Nuseibeh, editors, 27th International Conference on Software
Engineering (ICSE 2005), 15-21 May 2005, St. Louis, Missouri, USA, pages 442–
451. ACM, 2005.
13. Juan P. Galeotti, Nicolás Rosner, Carlos Gustavo López Pombo, and Marcelo F.
Frias. TACO: efficient sat-based bounded verification using symmetry breaking
and tight bounds. IEEE Trans. Software Eng., 39(9):1283–1307, 2013.
14. Juan P. Galeotti, Nicolás Rosner, Carlos López Pombo, and Marcelo F. Frias.
Analysis of invariants for efficient bounded verification. In Paolo Tonella and
Alessandro Orso, editors, Proceedings of the Nineteenth International Symposium
on Software Testing and Analysis, ISSTA 2010, Trento, Italy, July 12-16, 2010,
pages 25–36. ACM, 2010.
15. Jaco Geldenhuys, Nazareno Aguirre, Marcelo F. Frias, and Willem Visser. Bounded
lazy initialization. In Guillaume Brat, Neha Rungta, and Arnaud Venet, editors,
NASA Formal Methods, 5th International Symposium, NFM 2013, Moffett Field,
CA, USA, May 14-16, 2013. Proceedings, volume 7871 of Lecture Notes in Com-
puter Science, pages 229–243. Springer, 2013.
16. John N. Hooker. Solving the incremental satisfiability problem. J. Log. Program.,
15(1&2):177–186, 1993.
17. Daniel Jackson. Software Abstractions - Logic, Language, and Analysis. MIT Press,
2006.
18. Daniel Jackson and Mandana Vaziri. Finding bugs with a constraint solver. In
Debra J. Richardson and Mary Jean Harold, editors, Proceedings of the Interna-
tional Symposium on Software Testing and Analysis, ISSTA 2000, Portland, OR,
USA, August 21-24, 2000, pages 14–25. ACM, 2000.
19. Sarfraz Khurshid, Corina S. Pasareanu, and Willem Visser. Generalized symbolic
execution for model checking and testing. In Hubert Garavel and John Hatcliff,
editors, Tools and Algorithms for the Construction and Analysis of Systems, 9th
International Conference, TACAS 2003, Held as Part of the Joint European Con-
ferences on Theory and Practice of Software, ETAPS 2003, Warsaw, Poland, April
7-11, 2003, Proceedings, volume 2619 of Lecture Notes in Computer Science, pages
553–568. Springer, 2003.
20. Daniel Kroening and Michael Tautschnig. CBMC - C bounded model checker -
(competition contribution). In Erika Ábrahám and Klaus Havelund, editors, Tools
and Algorithms for the Construction and Analysis of Systems - 20th International
Conference, TACAS 2014, Held as Part of the European Joint Conferences on
Theory and Practice of Software, ETAPS 2014, Grenoble, France, April 5-13, 2014.
Proceedings, volume 8413 of Lecture Notes in Computer Science, pages 389–391.
Springer, 2014.
21. Huu Hai Nguyen, Cristina David, Shengchao Qin, and Wei-Ngan Chin. Automated
verification of shape and size properties via separation logic. In Byron Cook and
Andreas Podelski, editors, Verification, Model Checking, and Abstract Interpreta-
tion, 8th International Conference, VMCAI 2007, Nice, France, January 14-16,
2007, Proceedings, volume 4349 of Lecture Notes in Computer Science, pages 251–
266. Springer, 2007.
22. Aditya V. Nori, Sriram K. Rajamani, SaiDeep Tetali, and Aditya V. Thakur. The
Yogi project: Software property checking via static analysis and testing. In Stefan
Kowalewski and Anna Philippou, editors, Tools and Algorithms for the Construc-
tion and Analysis of Systems, 15th International Conference, TACAS 2009, Held
as Part of the Joint European Conferences on Theory and Practice of Software,
ETAPS 2009, York, UK, March 22-29, 2009. Proceedings, volume 5505 of Lecture
Notes in Computer Science, pages 178–181. Springer, 2009.
23. Corina S. Pasareanu, Willem Visser, David H. Bushnell, Jaco Geldenhuys, Peter C.
Mehlitz, and Neha Rungta. Symbolic pathfinder: integrating symbolic execution
with model checking for java bytecode analysis. Autom. Softw. Eng., 20(3):391–
425, 2013.
24. Pablo Ponzio, Nazareno Aguirre, Marcelo F. Frias, and Willem Visser. Field-
exhaustive testing. In Thomas Zimmermann, Jane Cleland-Huang, and Zhendong
Su, editors, Proceedings of the 24th ACM SIGSOFT International Symposium on
Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November
13-18, 2016, pages 908–919. ACM, 2016.
25. Pablo Ponzio, Nicolás Rosner, Nazareno Aguirre, and Marcelo F. Frias. Efficient
tight field bounds computation based on shape predicates. In Cliff B. Jones, Pekka
Pihlajasaari, and Jun Sun, editors, FM 2014: Formal Methods - 19th Interna-
tional Symposium, Singapore, May 12-16, 2014. Proceedings, volume 8442 of Lec-
ture Notes in Computer Science, pages 531–546. Springer, 2014.
26. Nicolás Rosner, Jaco Geldenhuys, Nazareno Aguirre, Willem Visser, and
Marcelo F. Frias. BLISS: improved symbolic execution by bounded lazy initializa-
tion with SAT support. IEEE Trans. Software Eng., 41(7):639–660, 2015.
27. Willem Visser and Peter C. Mehlitz. Model checking programs with java
pathfinder. In Patrice Godefroid, editor, Model Checking Software, 12th Interna-
tional SPIN Workshop, San Francisco, CA, USA, August 22-24, 2005, Proceedings,
volume 3639 of Lecture Notes in Computer Science, page 27. Springer, 2005.
28. Willem Visser, Corina S. Pasareanu, and Radek Pelánek. Test input generation for
java containers using state matching. In Lori L. Pollock and Mauro Pezzè, editors,
Proceedings of the ACM/SIGSOFT International Symposium on Software Testing
and Analysis, ISSTA 2006, Portland, Maine, USA, July 17-20, 2006, pages 37–48.
ACM, 2006.
Effects of Program Representation on Pointer
Analyses — An Empirical Study
Abstract. Static analysis frameworks, such as Soot and Wala, are used by researchers to prototype and compare program analyses. These frameworks vary in their heap abstraction, modeling of library classes, and underlying intermediate program representation (IR). Often, these variations pose
a threat to the validity of the results as the implications of comparing
the same analysis implementation in different frameworks are still un-
explored. Earlier studies have focused on the precision, soundness, and
recall of the algorithms implemented in these frameworks; however, little
to no work has been done to evaluate the effects of program represen-
tation. In this work, we fill this gap and study the impact of program
representation on pointer analysis. Unfortunately, existing metrics are
insufficient for such a comparison due to their inability to isolate each
aspect of the program representation. Therefore, we define two novel
metrics that measure these analyses’ precision after isolating the influ-
ence of class-hierarchy and intermediate representation. Our results es-
tablish that the minor differences in the class hierarchy and IR do not
impact program analysis significantly. Besides, they reveal the sources of
unsoundness that aid researchers in developing program analysis.
1 Introduction
– We defined two metrics for evaluating each aspect in isolation, one for mod-
eling of library classes, the other for IR.
– We evaluated the differences in library modeling and found that these have
little influence on program analyses. Additionally, we discovered sources of
unsoundness in these frameworks.
– We evaluated the precision for different IRs and found that they have no
impact on the precision of virtual method call elimination.
– We empirically found differences in heap abstractions even for analyses claim-
ing the same levels of context-sensitivity regarding the types of heap objects.
In summary, our empirical study dispels the threats to the validity of the results of existing works posed by these framework differences. It also uncovers novel sources of unsoundness and imprecision in existing frameworks, yielding suggestions that users and developers of these frameworks could incorporate into their analyses. Although we focus on pointer analysis in this paper, our results are, in principle, generalizable to many other static analyses, as the findings presented here also hold for them. We have made the artifacts available at https://github.com/jpksh90/pointeval to facilitate reproduction.
The goal of pointer analysis is to determine which objects a variable may refer (point) to at runtime. A points-to set is a static approximation of this: it maps variables to objects allocated on the heap (heap objects). More precisely, if V is the set of variables in a program and H is the set of heap objects, then points-to : V → P(H), and points-to(v) returns the set of heap objects in H that may be referred to by v.
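Although the frameworks under study analyze Java, the definition is language-independent; the following minimal C sketch (with a hypothetical nondet_cond standing in for an unknown branch condition) illustrates points-to sets and the average points-to set size used throughout the evaluation:

#include <stdlib.h>

int nondet_cond(void);  /* hypothetical: an unknown branch condition */

void example(void) {
  int *p = malloc(sizeof(int));  /* heap object h1 (allocation site 1) */
  int *q = malloc(sizeof(int));  /* heap object h2 (allocation site 2) */
  if (nondet_cond())
    p = q;                       /* on this path, p aliases q */
  /* A sound may-points-to analysis must account for both paths:
       points-to(p) = {h1, h2}
       points-to(q) = {h2}
     The average points-to set size over V = {p, q} is (2 + 1) / 2 = 1.5,
     which is the precision metric used in the evaluation below. */
  if (p != q) free(p);           /* avoid a double free when p aliases q */
  free(q);
}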
Doop is a framework that exclusively focuses on pointer analysis, defines
the analysis’ inference rules in Datalog [41], and is in active development. It
supports tuning of the analysis to adapt for various factors of precision (and
scalability). Doop leverages the program synthesizer Soufflé [12, 22] to resolve
points-to according to the inference rules and the ground facts, which are derived
directly from the program.
Wala [37] and Soot [28] are general-purpose program analyzers providing
some pre-defined analyses and APIs for the development of custom analyses.
Wala comes with various pre-defined pointer analyses [39], some of which feature
novel optimizations to enhance scalability.
A context-sensitive analysis improves a pointer analysis’ precision by discern-
ing method calls based on their calling contexts. Popular notions of contexts
are based on method callsites [23] (callsite-sensitive), invoking objects (object-
sensitive) [19], or hybrids thereof [13].
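A minimal C sketch of why calling contexts matter; id, a, b, and the callsite labels are illustrative, not from the paper:

#include <stdlib.h>

/* A tiny identity function invoked from two callsites. */
static void *id(void *o) { return o; }

void example(void) {
  int *a = malloc(sizeof(int));  /* heap object h1 */
  int *b = malloc(sizeof(int));  /* heap object h2 */
  int *x = id(a);                /* callsite c1 */
  int *y = id(b);                /* callsite c2 */
  /* Context-insensitive: the single abstraction of id's parameter o
     receives both h1 and h2, so points-to(x) = points-to(y) = {h1, h2}.
     1-callsite-sensitive: id is analyzed once per callsite (contexts c1
     and c2), giving the precise result points-to(x) = {h1} and
     points-to(y) = {h2}. Object-sensitivity instead uses the receiver
     object of a virtual call as the context, playing the analogous role
     in object-oriented code. */
  free(x);
  free(y);
}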
In the sequel, we explain the need for this study by exemplifying the three
factors that influence the results of pointer analyses.
Listing 1.4: Snapshot of pointer analysis results from Doop with different IRs
1 // Variables in main method with **** Wala ****
2 <<main method array>> <Factory: void main(java.lang.String[])>/v1
3 // Variables in main method with **** Soot ****
4 <<main method array>> <Factory: void main(java.lang.String[])>/@parameter0
5 <<main method array>> <Factory: void main(java.lang.String[])>/l0#_0
the number v1. Further method parameters are assigned subsequent variable
numbers, succeeded by local variables. Again, the static method calls to the
method getInstance are translated to invokestatic, where v3 and v6 hold the
(implicitly defined) constant arguments 6 and 7. The objects returned from the
factory method invocations are stored in the variables v5 and v8. Potential
exceptions thrown in the invoked methods are stored in v4 or v7, respectively.
The differences in program representation influence the metrics of pointer analysis: we analyzed Listing 1.1 context-insensitively with Doop, using Jimple and Wala's IR. The results are shown in Listing 1.4: the main method parameter object «main method array» is referred to by one variable in Wala (line 2) but by two variables in Soot (lines 4–5). Even though the average points-to set size is 1 for all variables in Listing 1.4, we found noticeable differences in the average points-to set sizes in other programs' analyses: with Soot's frontend the average points-to set size is 2.07 for 3,328 variables, versus 1.95 for 2,298 variables with Wala's—Jimple again created more variables than Wala. These subtle differences in program representation affect the average points-to set size, and it is unclear whether the two numbers are in fact comparable. In this work, we investigate the impact of IRs on the precision and scalability of the analysis (Section 4.3).
application. They can also remove “irrelevant” classes, favoring scalability over
soundness. Interestingly, we found cases where some frontends do not load all of
the required classes, which induces discrepancies when comparing the analyses.
Consider the program shown in Listing 1.1. To corroborate our intuition,
we analyzed this program context-insensitively with Soot’s and Wala’s front-
ends. Using the former front-end, Doop loads 3,837 classes and computes the
analysis with an average points-to set size of 2.07. With Wala’s front-end, it
loads 19,927 (~5×) classes for analysis with an average points-to set size of
1.95. Further investigating the types of heap objects, we found that Doop with
Wala’s IR contains objects of the class java.security.PrivilegedActionException,
which is absent in the analysis with Soot. Note that our simple program contains
no instance of that type, so it must stem from analyzing libraries. In another
instance, Soot loads the classes from javax.crypto, whereas Wala does not. In this research, we examine this imprecise modeling and its possible implications for precision and soundness (Sections 4.1 and 4.2).
3 Methodology
set size: a lower precision value (i.e., average points-to set size) implies a higher precision of the computed analysis result, as precise analyses aim at excluding allocation sites that are unrealizable at runtime from the points-to sets of variables.
An IR may create many synthetic variables, for example for method parameters or for φ-nodes at the control-flow joins of SSA form. Three-address code re-uses the same variable in the assignments of the if and else blocks of a conditional, whereas SSA-based IRs insert a synthetic variable, defined by a φ-node at the control-flow join, to select one of the distinct variables of the respective blocks (see the sketch below). The presence of synthetic variables in IRs impedes the comparison of different analyses using the average points-to set size, as averages depend on the (unequal) number of variables. Therefore, we devise heuristics to establish comparability of our metrics for different IRs.
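A minimal sketch of this effect, with hypothetical variable names; the source is C, and the two IR renderings are shown as comments:

/* Source program: both branches assign the same variable x. */
int select_value(int c, int a, int b) {
  int x;
  if (c) x = a; else x = b;
  return x;
}

/* A three-address-code IR (such as Jimple) re-uses x in both branches:

       if c goto L1
       x := b
       goto L2
   L1: x := a
   L2: return x

   An SSA-based IR (such as Wala's) defines a fresh variable per
   assignment and inserts a synthetic variable at the join:

       if c goto L1
       x1 := b
       goto L2
   L1: x2 := a
   L2: x3 := phi(x1, x2)
       return x3

   The SSA rendering introduces synthetic variables (x1, x2, x3 instead
   of a single x) at joins; other IR conventions (e.g., Jimple's separate
   @parameter and local copies, visible in Listing 1.4) likewise change
   the variable count, skewing averages taken over the number of
   variables such as the average points-to set size. */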
Another challenge in this work is inferring the impact of each analysis param-
eter on its precision. Computed at the end of the analysis, the average points-to
set size loses information on the contribution of a particular aspect of pointer
analysis. Therefore, we require a fine-grained metric to quantify the precision
for each parameter. We propose two such techniques, one for the class hierarchy
and the other for the intermediate representation.
Class Hierarchy The analysis of the program’s class hierarchy builds the foun-
dation for inferring relevant variables and heap allocations. However, each frame-
work leverages a particular strategy to infer classes that contribute to the pro-
gram’s semantics. Adding irrelevant classes to the class hierarchy may manifest
into a synthetically precise analysis, as these classes add to the total number of
variables (which will all be pointing to an empty set), thus potentially decreasing
the average size of points-to sets. Some of these variables and heap allocations
are not part of the actual code executed at runtime, but rather arise out of an
imperfect model of the program analysis framework’s frontend. Here, we study
the variables and heap objects stemming from the additional classes exclusive
to a framework.
We first instrument the Doop framework to log the class hierarchies and
compare the class hierarchies obtained using Soot and Wala as frontends, which
yields the classes exclusive to each of the frameworks. CH_soot and CH_wala denote the sets of classes in the class hierarchies of Soot and Wala, respectively, and CH_common = CH_soot ∩ CH_wala is the set of classes common to both frameworks. We define the CH-precision as the average points-to set size restricted to variables defined in methods of CH_common (Definition 1):
CP_f = \frac{\sum_{v \in V_f^c} H_f^c(v)}{|V_f^c|}

where V_f^c denotes the variables of framework f defined in methods of CH_common, and H_f^c(v) is the size of the points-to set of v.
If an analysis does not contain any exclusive classes or all of their variables
(and corresponding heap objects) belong to the types present in the set of ex-
clusive classes, CH-precision equals the average points-to set size.
The average devirtualized heap objects H_f^v (reported in Table 5 for the target variables V_{C,f} of the set of virtual method calls C) is defined analogously:

H_f^v = \frac{\sum_{v \in V_{C,f}} |\text{points-to}(v)|}{|C|}
Based on the above discussion, we formulate and answer the following re-
search questions:
RQ1. How does the class hierarchy vary with the benchmarks?
RQ2. How do differences in class hierarchies affect the precision of analyses?
RQ3. How does the choice of IR affect the precision of the analysis?
RQ4. How do the heap abstractions differ between pointer analysis frameworks?
4 Evaluation
We use Doop version 4.20.7-67 and Wala version 1.5.0. For RQ1-RQ3, we
invoked Doop with the following analysis options: 1-call-site-sensitive, 1-object-sensitive, 2-call-site-sensitive+heap, and 2-object-sensitive+heap. Specific options used in our study for each research question are described in their respective sections. We use the DaCapo [2] (version 9.12-bach) benchmarks, a standardized suite of open-source Java applications, for our study.⁵
⁵ Note that Soot and Wala provide options to exclude certain classes from analysis (e.g., to exclude library classes). For a fair comparison, we ignore this feature and compute the whole class hierarchy, including libraries.
Table 1: Difference in classes considered by Soot and Wala. Last two columns
show the extra classes loaded by Soot and Wala respectively.
#classes analyzed Extra classes
Benchmark Wala Soot Soot Wala
Avrora 21,997 9,204 0 12,793
Batik 23,461 10,739 12 12,734
Eclipse 25,718 9,813 62 15,967
H2 21,007 8,042 1 12,966
Jython 23,323 10,411 2 12,914
Lusearch 20,469 4,671 53 15,851
Luindex 20,479 4,681 53 15,851
PMD 21,315 8,517 1 12,799
SunFlow 20,677 7,847 0 12,830
Tradebeans 20,658 3,951 0 16,707
Xalan 22,688 10,164 0 12,524
Study Setup We have used the var-points-to relation, which maps all vari-
ables and context pairs to their resolved pairs of heap-object and context. We
select those variables that originate from classes common to both frameworks
(Section 4.1) and query their points-to information. We then compute the CH-precision based on Definition 1.
Results Table 2 presents the results of the analysis (for one-callsite, one-object,
and two-object context-sensitivity) for the objects and variables belonging to ex-
clusive classes present in Wala (only non-zero values included). Note that the
two-object sensitive analysis did not terminate for Eclipse and Jython, there-
fore, these are not presented in the table. In the one-callsite and one-object analyses, Table 2 shows that six out of eleven benchmarks contain variables that belong to the exclusive class hierarchy. The remaining benchmark applications show no differences in the number of variables and heap objects, despite the presence of additional classes. This demonstrates that the additional classes loaded by these frameworks have no influence on the precision for these benchmarks.
The third and fourth columns of Table 2 list the number of variables (in
principle, variable-context pairs) and heap objects belonging to the set of exclu-
sive classes, respectively. In all analyses, all but one benchmark have a higher
average points-to set size for exclusive variables than the general average. Tradebeans creates only 3 additional heap objects with Wala's frontend; therefore, the analyses are almost identical for both frontends. The average points-to sets for
exclusive classes for bigger benchmarks such as Eclipse and Jython are outliers,
showing very high averages. Still, the contribution of exclusive classes’ heap ob-
jects and variables is negligible compared with the total heap objects of these
benchmarks.
The eighth and ninth columns depict the CH-precision and the original pre-
cision for the analyses. We observe that the CH-precision is slightly lower than
the precision for all benchmarks but tradebeans, which originates from the addi-
Soundness In our observation, the Wala frontend takes the internal Java libraries
into account. We find heap objects belonging to libraries such as sun.nio.fs,
sun.util.resources, sun.security, and sun.nio.cs, which are internal libraries used
by the JVM. Soot, on the other hand, does not model these libraries for analysis.
Comparing the class hierarchies of the analyses using Soot and Wala, we ob-
served that the class hierarchy using Soot as frontend is a subset of Wala’s for all
Table 4: Total (for each framework) and interesting (section 4.3) methods M.
Benchmark 1-CS 1-OS 2-OS
Soot Wala M Soot Wala M Soot Wala M
Avrora 3651 3678 3194 3642 3669 3187 3615 3642 3159
Batik 3407 3415 3006 3398 3406 2999 3285 3293 2895
Eclipse 20339 20281 18723 20261 20204 18655 Timed out
H2 3041 3091 2673 3027 3075 2661 2985 3029 2616
Jython 8482 8531 7672 8447 8494 7643 Timed out
Lusearch 2449 2457 2135 2440 2448 2128 2414 2422 2103
Luindex 3524 3532 3132 3514 3522 3124 3466 3474 3081
PMD 4587 4596 4131 4577 4586 4124 4418 4427 3978
Sunflow 8369 8384 7514 8335 8350 7475 7740 7754 6928
Tradebeans 2442 2406 2083 2433 2397 2076 2407 2371 2051
Xalan 4607 5701 4125 4597 5678 4115 4502 5503 4031
benchmarks except Eclipse. This suggests that analyses with Soot are as sound
as analyses with Wala for all benchmarks except Eclipse. Eclipse is a compelling
case: Its analysis using Soot contains heap objects and variables that belong to
the internal libraries of Eclipse, such as org.eclipse.core.internal.runtime.PerformanceStatsProcessor, while the analyses with Wala do not report these objects. However, results from the analyses with Wala contain heap objects from internal libraries such as sun.util.*, which are not present using Soot. This shows
that the class hierarchy model is unsound in both frontends, as both lack some
of the classes loaded by these benchmark applications at runtime.
Our study reveals that library modeling in both Soot and Wala is unsound
even for (non-native) Java objects, shown by the presence of heap-objects
belonging to the exclusive classes of Soot and Wala.
Results Table 4 reports the number of interesting methods and the total methods resolved using both frontends. Note that the number of interesting methods is identical for both frameworks for the same type of context-sensitivity. The number of reachable methods in each analysis differs, as does the number of distinct method signatures discovered in each framework (columns Soot, Wala in 1-CS, 1-OS, 2-OS⁶). However, deriving a relationship between those is impossible, as
⁶ We excluded 2-CS due to its large file sizes.
Table 5: Results for IR. Third and fifth columns are the number of heap objects.
Fourth and sixth columns are the number of virtual calls. The last two columns list the average devirtualized heap objects (H_f^v) for Soot and Wala, respectively.
Soot Wala H_f^v
Analysis Benchmark Heap Objs. Virt. Calls Heap Objs. Virt. Calls Soot Wala
1 call-site Avrora 7,684 3,499 7,759 3,499 2.20 2.22
sensitive Batik 2,645 1,588 2,702 1,588 1.67 1.70
Eclipse 7.7M 56.8K 7.9M 56.8K 136.33 139.24
H2 1,936 1,434 1,988 1,434 1.35 1.39
Jython 662K 9,286 656K 9,283 71.33 70.67
Lusearch 1,667 1,139 1,674 1,139 1.46 1.47
Luindex 8,090 4,408 8,098 4,408 1.84 1.84
PMD 8,518 3,527 8,708 3,527 2.42 2.47
Sunflow 4,741 2,088 4,627 2,088 2.27 2.22
Tradebeans 1,638 1,114 1,649 1,106 1.47 1.49
Xalan 43K 5,832 55K 5,850 7.45 9.44
1 object Avrora 6,561 3,498 6,563 3,498 1.88 1.88
sensitive Batik 1,673 1,587 1,709 1,587 1.05 1.08
Eclipse 2.9M 56.7K 3.0M 56.8K 51.61 53.53
H2 1,218 1,433 1,258 1,433 0.85 0.88
Jython 3.5K 9,272 3.6K 9,269 386.79 389.20
Lusearch 958 1,138 964 1,138 0.84 0.85
Luindex 4,530 4,407 4,552 4,407 1.03 1.03
PMD 7,369 3,527 7,518 3,527 2.09 2.13
Sunflow 2,978 2,088 2,864 2,088 1.43 1.37
Tradebeans 928 1,113 938 1,105 0.83 0.85
Xalan 99K 5,830 106K 5,810 17.11 18.33
2 object Avrora 8,561 3,459 8,563 3,459 2.47 2.48
sensitive Batik 1,257 1,567 1,275 1,567 0.80 0.81
H2 1,288 1,433 1,307 1,433 0.90 0.91
Luindex 5,210 4,363 5,215 4,363 1.19 1.20
Lusearch 948 1,138 954 1,138 0.83 0.84
PMD 7,271 3,496 7,398 3,496 2.08 2.12
Sunflow 2,342 2,088 2,324 2,088 1.12 1.11
Tradebeans 919 1,113 929 1,105 0.83 0.84
Xalan 214K 5,791 215K 5,771 36.97 37.36
analyses such as one-call-site and one-object are not comparable. In all cases, we
observed that the majority (~90%) of the methods are interesting. Therefore,
we cannot ignore the significance of this aspect.
Interesting methods are difficult to ignore because of their sheer presence in the benchmark applications.
Table 5 presents the differences in the average devirtualized heap objects
for Jimple and Wala IR. Although the number of variables and abstract heap
locations are dependent on the IR, we did not observe many differences between
those when restricting ourselves to target variables of virtual method calls, which
corresponds to our intuition. The differences in the H_f^v values for both IRs
are negligible except for three larger benchmarks, Jython, Eclipse, and Xalan.
Overall, the values from the Soot IR were smaller than those of Wala, implying that devirtualization in Soot is either slightly more precise or slightly less sound than in Wala; however, the differences are minor in the majority of the cases. In conclusion, the choice of IR shows little to no impact on the precision of pointer analysis. In the sequel, we describe one case study where the difference in H_f^v is approximately two, a significant gap compared to the other benchmarks.
Finding 2: The IR has negligible impact on the precision of pointer analysis, at least for the devirtualization client.
Differences in the heap objects For evaluation, we extracted the heap objects created in Wala's and Doop's analyses and observed large differences in their numbers. Intuitively, using the same level of heap-sensitivity (heap cloning) should create the same number of heap objects. However, in certain cases, the number of heap objects in Wala exceeds that in Doop by a factor of ~14 (columns 2 and 3 in Table 7). (Note that Eclipse and Jython are elided, as their analyses did not terminate within the time budget owing to the large file size (~100 GB).) Therefore, the heap abstractions of these analyses are not comparable, although superficially they look similar.
Subtle optimizations also manifest as imprecise heap modeling, even though, at the outset, the abstractions look similar.
To investigate this further, we compared the types of the heap objects. Our study shows that the sets of types are not consistent even when using the same frontend! In many cases, the number of object types analyzed by Wala is approximately four times that in Doop (columns 4 and 5 in Table 7). Differences in the heap abstraction of application-level objects are the reason for this.
Application-level objects Application-level objects are the heap objects created by allocations within the program (rather than within libraries). In three out of eleven benchmarks, we observe that Doop's analysis lacks application-level classes that Wala reports. We found the corresponding allocations upon a manual inspection of the source code. For example, in avrora, the analysis in Wala allocates heap objects of BRNE_builder [8], which are not present in Doop's. Similar cases can be found in PMD and Xalan. However, owing to the limitations of the program representation, we could not determine the precise reason for the unsoundness. Pointer analysis uses an IR based on a control-flow graph (CFG) rather than source code. Being a lower-level representation of the program source code, the IR mangles variable names. Therefore, a one-to-one correspondence between the IR's variables and variables in the source code is not trivial.
Finding 3: Heap modeling is not similar even for allocations within the application scope. Wala handles application-level objects more precisely than Soot in our evaluation.
5 Threats to Validity
Naturally, the technique used relies on the precise handling of reflection calls and other dynamic features of the language, such as dynamic proxies. Beyond that, handling native calls could alleviate the unsoundness of the analyses: analyzing native calls could infer the native objects in the JVM missed by the Soot framework. Here, we have used the TamiFlex framework for handling reflection calls. Other approaches have improved reflection handling [10, 15–18, 25]. To convince ourselves, we experimented with one of the state-of-the-art techniques, i.e., reflection with matching substring resolution [10]. However, we did not find any significant differences in the results. Another limitation of this study is the unsoundness from ignoring native library calls in static analyses; a few of the discovered sources of unsoundness stem from native calls. Recently, Fourtounis et al. [7] proposed a technique for resolving native calls in Java. However, at the time of writing this paper, the technique was not available. Further, our analysis in Section 4.3 is based on test cases, which may not reflect all possible executions of an application.
Our study also involves hours of manual evaluation, which can be subject to bias. To counteract this, we carefully inspected the source code, especially for the sources of unsoundness, and reran the benchmark applications with valid inputs to confirm that the objects are actually allocated at runtime.
6 Related Work
Pointer analysis tools Pointer analysis has garnered significant interest in the last decades, focusing on scalability, precision, and soundness. The Doop system used in this paper results from years of research on declarative-style pointer analysis [1, 3, 10, 24, 26]. Similarly, the Wala framework was the result of an industrial project and, unlike Doop, follows an imperative paradigm. The underlying program representation comes with the many prior assumptions mentioned above. In this work, we study the effects of these assumptions on program analysis.
7 Conclusion
This paper reports the effects of program representation on program analysis. Our metrics make it possible to compare implementations leveraging different frontends. We find that differences in program representation have negligible impact on the precision of the pointer analysis. In addition, we discovered novel sources of unsoundness and imprecision in the program analyses. Our results also demonstrate that the promised heap abstractions are not similar in practice, even though they may appear so from a bird's-eye view. Since pointer analysis builds the foundation of many static analyses, we conjecture that our results generalize to these as well.
References
1. Antoniadis, T., Triantafyllou, K., Smaragdakis, Y.: Porting Doop to Soufflé: A tale of inter-engine portability for Datalog-based analyses. In: Proceedings of the 6th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis. pp. 25–30. SOAP 2017, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3088515.3088522, http://doi.acm.org/10.1145/3088515.3088522
2. Blackburn, S.M., Garner, R., Hoffmann, C., Khang, A.M., McKinley, K.S.,
Bentzur, R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., Hirzel, M.,
Hosking, A., Jump, M., Lee, H., Moss, J.E.B., Phansalkar, A., Stefanović, D.,
VanDrunen, T., von Dincklage, D., Wiedermann, B.: The dacapo benchmarks:
Java benchmarking development and analysis. In: Proceedings of the 21st An-
nual ACM SIGPLAN Conference on Object-oriented Programming Systems,
Languages, and Applications. pp. 169–190. OOPSLA ’06, ACM, New York,
NY, USA (2006). https://doi.org/10.1145/1167473.1167488, http://doi.acm.org/
10.1145/1167473.1167488
3. Bravenboer, M., Smaragdakis, Y.: Strictly declarative specification of so-
phisticated points-to analyses. In: Proceedings of the 24th ACM SIG-
PLAN Conference on Object Oriented Programming Systems Languages
and Applications. pp. 243–262. OOPSLA ’09, ACM, New York, NY, USA
(2009). https://doi.org/10.1145/1640089.1640108, http://doi.acm.org/10.1145/
1640089.1640108
4. Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., Zadeck, F.K.: Ef-
ficiently computing static single assignment form and the control depen-
dence graph. ACM Trans. Program. Lang. Syst. 13(4), 451–490 (Oct 1991).
https://doi.org/10.1145/115372.115320
5. Dietrich, J., Sui, L., Rasheed, S., Tahir, A.: On the construction of soundness ora-
cles. In: Proceedings of the 6th ACM SIGPLAN International Workshop on State
Of the Art in Program Analysis. pp. 37–42. SOAP 2017, Association for Computing
Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3088515.3088520,
https://doi.org/10.1145/3088515.3088520
6. Fourtounis, G., Triantafyllou, L., Smaragdakis, Y.: Identifying java calls in
native code via binary scanning. In: Proceedings of the 29th ACM SIG-
SOFT International Symposium on Software Testing and Analysis. pp. 388–
400. ISSTA 2020, Association for Computing Machinery, New York, NY,
USA (2020). https://doi.org/10.1145/3395363.3397368, https://doi.org/10.1145/
3395363.3397368
7. Fourtounis, G., Triantafyllou, L., Smaragdakis, Y.: Identifying java calls in
native code via binary scanning. In: Proceedings of the 29th ACM SIG-
SOFT International Symposium on Software Testing and Analysis. pp. 388–
400. ISSTA 2020, Association for Computing Machinery, New York, NY,
USA (2020). https://doi.org/10.1145/3395363.3397368, https://doi.org/10.1145/
3395363.3397368
8. GitHub: https://github.com/cmorty/. https://github.com/cmorty/avrora/
blob/222ea1645b67bc40429881526555d19bced4a590/src/avrora/arch/avr/
AVRInstrBuilder.java (August 2020), (Accessed on 05.08.2020)
9. Grech, N., Fourtounis, G., Francalanza, A., Smaragdakis, Y.: Heaps don’t
lie: Countering unsoundness with heap snapshots. Proc. ACM Program. Lang.
1(OOPSLA) (Oct 2017). https://doi.org/10.1145/3133892, https://doi.org/10.
1145/3133892
10. Grech, N., Kastrinis, G., Smaragdakis, Y.: Efficient Reflection String Analy-
sis via Graph Coloring. In: Millstein, T. (ed.) 32nd European Con-
ference on Object-Oriented Programming (ECOOP 2018). Leibniz Inter-
national Proceedings in Informatics (LIPIcs), vol. 109, pp. 26:1–26:25.
23. Sharir, M., Pnueli, A.: Two approaches to interprocedural data flow analysis. New
York Univ. Comput. Sci. Dept., New York, NY (1978), https://cds.cern.ch/record/
120118
24. Smaragdakis, Y., Balatsouras, G.: Pointer analysis. Found. Trends Program. Lang.
2(1), 1–69 (Apr 2015). https://doi.org/10.1561/2500000014, http://dx.doi.org/10.
1561/2500000014
25. Smaragdakis, Y., Balatsouras, G., Kastrinis, G., Bravenboer, M.: More sound
static handling of java reflection. In: Feng, X., Park, S. (eds.) Programming Lan-
guages and Systems - 13th Asian Symposium, APLAS 2015, Pohang, South Korea,
November 30 - December 2, 2015, Proceedings. Lecture Notes in Computer Science,
vol. 9458, pp. 485–503. Springer (2015). https://doi.org/10.1007/978-3-319-26529-
2_26, https://doi.org/10.1007/978-3-319-26529-2_26
26. Smaragdakis, Y., Bravenboer, M., Lhoták, O.: Pick your contexts well: Under-
standing object-sensitivity. In: Proceedings of the 38th Annual ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages. pp. 17–30. POPL
’11, ACM, New York, NY, USA (2011). https://doi.org/10.1145/1926385.1926390,
http://doi.acm.org/10.1145/1926385.1926390
27. Smaragdakis, Y., Kastrinis, G.: Defensive Points-To Analysis: Effective
Soundness via Laziness. In: Millstein, T. (ed.) 32nd European Con-
ference on Object-Oriented Programming (ECOOP 2018). Leibniz Inter-
national Proceedings in Informatics (LIPIcs), vol. 109, pp. 23:1–23:28.
Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2018).
https://doi.org/10.4230/LIPIcs.ECOOP.2018.23, http://drops.dagstuhl.de/opus/
volltexte/2018/9228
28. Soot: Soot - a framework for analyzing and transforming java and android appli-
cations (Jan 2019), http://sable.github.io/soot/
29. Späth, J., Ali, K., Bodden, E.: Ideal: Efficient and precise alias-aware dataflow
analysis. In: 2017 International Conference on Object-Oriented Programming, Lan-
guages and Applications (OOPSLA/SPLASH). ACM Press (Oct 2017), https:
//doi.org/10.1145/3133923
30. Späth, J., Ali, K., Bodden, E.: Context-, flow-, and field-sensitive data-flow
analysis using synchronized pushdown systems. Proc. ACM Program. Lang.
3(POPL), 48:1–48:29 (2019). https://doi.org/10.1145/3290361, https://doi.org/
10.1145/3290361
31. Späth, J., Do, L.N.Q., Ali, K., Bodden, E.: Boomerang: Demand-driven flow- and
context-sensitive pointer analysis for java. In: Krishnamurthi, S., Lerner, B.S. (eds.)
30th European Conference on Object-Oriented Programming, ECOOP 2016, July
18-22, 2016, Rome, Italy. LIPIcs, vol. 56, pp. 22:1–22:26. Schloss Dagstuhl - Leibniz-
Zentrum für Informatik (2016). https://doi.org/10.4230/LIPIcs.ECOOP.2016.22,
https://doi.org/10.4230/LIPIcs.ECOOP.2016.22
32. Sui, L., Dietrich, J., Emery, M., Rasheed, S., Tahir, A.: On the soundness of call
graph construction in the presence of dynamic language features - a benchmark
and tool evaluation. In: Ryu, S. (ed.) Programming Languages and Systems. pp.
69–88. Springer International Publishing, Cham (2018), https://doi.org/10.1007/
978-3-030-02768-1_4
33. Sui, L., Dietrich, J., Tahir, A., Fourtounis, G.: On the recall of static call graph con-
struction in practice. In: Proceedings of the ACM/IEEE 42nd International Confer-
ence on Software Engineering. p. 1049–1060. ICSE ’20, Association for Computing
Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3377811.3380441,
https://doi.org/10.1145/3377811.3380441
34. Tan, T., Li, Y., Xue, J.: Efficient and precise points-to analysis: Modeling the heap
by merging equivalent automata. In: Proceedings of the 38th ACM SIGPLAN
Conference on Programming Language Design and Implementation. pp. 278–291.
PLDI 2017, Association for Computing Machinery, New York, NY, USA (2017).
https://doi.org/10.1145/3062341.3062360
35. Vallée-Rai, R., Co, P., Gagnon, E., Hendren, L., Lam, P., Sundaresan, V.: Soot - a
java bytecode optimization framework. In: Proceedings of the 1999 Conference of
the Centre for Advanced Studies on Collaborative Research. p. 13. CASCON ’99,
IBM Press (1999), https://dl.acm.org/doi/10.5555/781995.782008
36. Vallée-Rai, R., Gagnon, E., Hendren, L., Lam, P., Pominville, P., Sundaresan,
V.: Optimizing java bytecode using the soot framework: Is it feasible? In: Watt,
D.A. (ed.) Compiler Construction. pp. 18–34. Springer Berlin Heidelberg, Berlin,
Heidelberg (2000), https://doi.org/10.1007/3-540-46423-9_2
37. WALA: Watson libraries for program analysis (Jan 2019), http://wala.sourceforge.
net/wiki/index.php/Main_Page
38. Wala: Intermediate representation (IR) (Aug 2020), https://github.com/wala/
WALA/wiki/Intermediate-Representation-(IR)
39. Wala: Pointer analysis (Aug 2020), https://github.com/wala/WALA/wiki/
Pointer-Analysis
40. Wei, F., Roy, S., Ou, X., Robby: Amandroid: A precise and general inter-component
data flow analysis framework for security vetting of android apps. ACM Trans.
Priv. Secur. 21(3) (Apr 2018). https://doi.org/10.1145/3183575, https://doi.org/
10.1145/3183575
41. Wikipedia: Datalog (Jan 2019), https://en.wikipedia.org/wiki/Datalog
Keeping Pace with the History of
Evolving Runtime Models
1 Introduction to InTempo
A (structural) Runtime Model (RTM) provides a snapshot of the constituents of
a system and their state [3]. RTMs are typically employed in the context of Self-
adaptive Systems (SAS) [4], where a feedback loop adapts the system behavior
at runtime in response to external or internal stimuli, the latter represented as
model fragments in the RTM and detected via the execution of model queries.
Encoding an RTM as a graph enables detection via graph queries, which
specify a sought (graph) pattern. Such an encoding conforms to a metamodel
which restricts the structure of model instances and defines types of vertices,
edges, and attributes. Formally, these concepts rely on typed, attributed graph
transformation [6] where graphs are typed over a type graph.
Capturing the history of RTMs, i.e., previous snapshots, may be useful for
a number of aims such as the detection of recurrent behavior or postmortem
analysis [3,8]. However, handling history at runtime poses important challenges
to tool support. Tools are required to enable the specification and timely execu-
tion of queries with temporal requirements, i.e., requirements on the evolution of
patterns over multiple snapshots. Timely execution is crucial for SAS, where a
loop may depend on query results before planning and performing adaptations.
Faced with these challenges, the available tool support is seemingly limited
either by the lack of support for direct specification of temporal requirements
in graph queries [5] or by the on-disk representation of the model [8,11], which introduces an overhead on execution times in runtime settings, e.g., in SAS.
2 RTM_H Analysis
This section presents an exemplary query in ITQL, which it then uses to demonstrate the RTM_H Analysis, and concludes with technical details.
InTempo Query Language (ITQL) Formally, a temporal graph query q is characterized by a (graph) pattern p and an application condition ac, denoted q = (p, ac). A match m corresponds to an occurrence of p in the RTM_H. In order for m to be valid, it must satisfy the ac. ITQL supports the formulation of ac in the Metric Temporal Graph Logic (MTGL) [10], which supports operators such as negation (¬), existential quantification (∃), and conjunction (∧), the metric, i.e., interval-based, temporal operators until (U_I, where I is a time interval over R+_0) and since (S_I), as well as abbreviations such as eventually, i.e., ♦_I ∃n := true U_I ∃n, where n is a graph pattern and true is always satisfied. MTGL also supports the nesting of patterns to bind graph elements in outer conditions and relate them to inner (nested) conditions, i.e., elements common to two patterns n1 and n2 refer to the same element in the RTM_H.
MTGL is able to express real-time properties such as "every patient diagnosed
with sepsis must eventually, within 5 time units, be given the proper drug"
(adjusted from the medical guideline in [14]). In an RTMH of the SHS, In-
Tempo can find violations of the property above by executing the ITQL query
q1 = (n1 , κ), with κ the MTGL formula ¬( ♦[0,5] ∃ n2 ) and n1 , n2 patterns rep-
resenting a sepsis diagnosis and drug administration respectively. The query
searches for matches of n1 in the RTMH that satisfy κ, i.e., for patients that,
although diagnosed with sepsis, did not receive a drug within the designated
time. In InTempo, each match is associated with a temporal validity, i.e., a set
of time intervals for which, based on the overlap among the cts and dts of the
matched elements and the interval for which ac is satisfied, the match is valid.
ITQL also allows for the definition of OCL constraints [12] on sought patterns.

[Fig. 3: Two snapshots of the RTMH of the SHS, derived from the log events
sepsis diagnosis,ts=3,id=1 and drug administration,ts=9,id=1. In G3, the SHS
(cts = 0, dts = ∞) owns a Sensor with id = 1, status = sepsis (cts = 3, dts = ∞);
in G9, it additionally owns a Pump with id = 1, status = drug (cts = 9, dts = ∞).]
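To illustrate the interval computation sketched above, the following hedged Java snippet intersects the [cts, dts] lifespans of the matched elements; InTempo additionally restricts the result by the interval for which ac is satisfied, which this sketch omits.

import java.util.List;

final class TemporalValidity {
    // The lifespan of a matched element is [cts, dts]; dts is infinite for
    // elements that have not (yet) been deleted from the model.
    static double[] intersect(List<double[]> lifespans) {
        double from = 0, to = Double.POSITIVE_INFINITY;
        for (double[] span : lifespans) {
            from = Math.max(from, span[0]);   // latest creation timestamp
            to = Math.min(to, span[1]);       // earliest deletion timestamp
        }
        return from <= to ? new double[] {from, to} : null;  // null: never co-existed
    }
}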
Output The ITQL specification for the query q1 is shown in Figure 4. Performing
RTMH Analysis for the query q1 on the RTMH G9 of Figure 3 returns one match,
since there is indeed no Pump attached to the SHS, i.e., a match for n2, within
five time units after a Sensor was activated, i.e., a match for n1 was found. The
temporal validity interval [3, 4] is returned together with the match. The match,
i.e., violation, is indeed valid only for that interval since after timestamp 4, a
match for n2 starts to exist within five time units of a match for n1. If the API
of InTempo is used, the query returns the match of the n1 pattern, i.e., the EMF
objects, together with the temporal validity. In case InTempo is used via the UI,
it displays a message box in Eclipse with the following message: SHS@0[]
Sensor@3[status=sepsis] [[3,4]]. Note that "@" precedes the cts of an object and
values within square brackets are attributes of the object.

($n1, !(true U[0,5] E $n2))
declarations {
  n1 { shs : SHS
       s : Sensor
       shs -ownedSensors-> s
       [OCL: "s.status = 'sepsis'"] }
  n2 { shs : SHS
       s : Sensor
       u : Pump
       shs -ownedPumps-> u
       shs -ownedSensors-> s
       [OCL: "u.status = 'drug'"] }
}

Fig. 4: Example query in ITQL
Technical Details For the execution of temporal graph queries, InTempo em-
ploys the operationalization framework presented in [15]. The framework sup-
ports the decomposition of a query into a suitable ordering of simpler sub-queries
which is executed bottom-up. The outermost query computes the overall result.
For pattern-matching, InTempo employs the Story Diagram Interpreter from [1]
which uses heuristics shown to reduce the pattern-matching effort. InTempo
provides an Xtext [2] editor for ITQL which supports completion suggestions
for element types and validation of the query syntax.
3 LogAnalysis
This section demonstrates the LogAnalysis operation mode, which assumes that
data from past states have been captured as events in a log. InTempo offers
the capability to process these events and, upon each change, obtain an
updated RTMH which is then used internally to perform RTMH Analysis.
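In outline, such a log-replay loop could look as follows; the Model and Query interfaces are our own placeholders, not InTempo's API.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LogReplay {
    interface Model { void apply(String event); }
    interface Query { String execute(Model model); }

    // Replay logged events (e.g., "sepsis diagnosis,ts=3,id=1") in order and
    // re-run the temporal query against the updated model after each event.
    static void run(Path log, Model model, Query query) throws IOException {
        for (String event : Files.readAllLines(log)) {
            model.apply(event);                        // update the runtime model
            System.out.println(query.execute(model));  // matches + temporal validity
        }
    }
}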
First, the sepsis diagnosis event is processed, which makes the internal RTMH
identical to G3 of Figure 3. The query is executed using RTMH Analysis
and returns a match, i.e., violation, since at that moment a match for n2 does
not exist in the graph. The temporal validity is equal to [3,∞], i.e., the match
is valid from time point 3 onward. Next, the drug administration event is
processed which leads to G9 . The result of RTMH Analysis for G9 is the same
as the result described in Section 2.
Technical Details In LogAnalysis the query execution framework monitors the
RTMH for changes and, upon every change, recomputes the matches. Previous
matches are kept in-between executions and therefore the query is executed
incrementally. Similarly to ITQL, E2P is supported by an Xtext editor that
offers syntax validation and completion suggestions for element types.
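The incremental mode can be pictured with the following sketch (the Match and Query types are hypothetical): instead of recomputing the full match set, matches invalidated by a change are retracted and newly enabled ones are added.

import java.util.List;
import java.util.function.Consumer;

final class IncrementalExecution {
    interface Match { boolean invalidatedBy(String change); }
    interface Query { List<Match> newMatches(String change); }

    // Returns a change listener that updates the stored match set in place.
    static Consumer<String> listener(List<Match> matches, Query query) {
        return change -> {
            matches.removeIf(m -> m.invalidatedBy(change));  // retract stale matches
            matches.addAll(query.newMatches(change));        // add newly enabled ones
        };
    }
}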
References
1. Barkowsky, M., Giese, H.: Hybrid search plan generation for generalized graph
pattern matching. JLAMP 114, 100563 (2020)
2. Bettini, L.: Implementing domain-specific languages with Xtext and Xtend. Packt
Publishing Ltd (2016)
3. Bencomo, N., Goetz, S., Song, H.: Models@run.time: A guided tour of the state
of the art and research challenges. SoSyM 18(5) (2019)
4. Brun, Y., Di Marzo Serugendo, G., Gacek, C., Giese, H., et al.: Software engineering
for self-adaptive systems, pp. 48–70. Springer, Heidelberg (2009)
5. Búr, M., Szilágyi, G., Vörös, A., Varró, D.: Distributed graph queries over mod-
[email protected] for runtime monitoring of cyber-physical systems. STTT 22(1) (2020)
6. Ehrig, H., Prange, U., Taentzer, G.: Fundamental Theory for Typed Attributed
Graph Transformation. ICGT. Springer, Berlin, Heidelberg (2004)
7. Eclipse Foundation: Eclipse modeling framework (EMF) (Aug 2020), https://www.
eclipse.org/modeling/emf/, accessed: 2020-10-11
8. García-Domínguez, A., Bencomo, N., Parra-Ullauri, J.M., García-Paucar, L.H.:
Querying and Annotating Model Histories with Time-Aware Patterns. MODELS.
pp. 194–204 (2019) ACM/IEEE
9. Ghahremani, S., Giese, H., Vogel, T.: Efficient utility-driven self-healing employing
adaptation rules for large dynamic architectures. ICAC (2017)
10. Giese, H., Maximova, M., Sakizloglou, L., Schneider, S.: Metric Temporal Graph
Logic over Typed Attributed Graphs. FASE. Springer (2019)
11. Gómez, A., Cabot, J., Wimmer, M.: TemporalEMF: A temporal metamodeling
framework. ER, vol. 11157, pp. 365–381. Springer (2018)
12. Kleppe, A., Warmer, J.: An introduction to the object constraint language (OCL).
In: TOOLS p. 456 (2000)
13. MDELab: InTempo Homepage, http://www.hpi.uni-potsdam.de/giese/public/
mdelab/mdelab-projects/intempo/, accessed: 2021-01-19
14. Rhodes, A., Evans, L.E., Alhazzani, W., Levy, M.M., et al.: Surviving sepsis cam-
paign: International guidelines for management of sepsis and septic shock: 2016.
Intensive care medicine 43(3), 304–377 (2017)
15. Sakizloglou, L., Ghahremani, S., Barkowsky, M., Giese, H.: A scalable querying
scheme for memory-efficient runtime models with history. MODELS, pp. 175–186.
ACM/IEEE (2020)
SpecTest: Specification-Based Compiler Testing
1 Introduction
Compilers must be thoroughly tested (if not verified) for multiple reasons. First,
compilers are essential for the software ecosystem. Their correctness is a prereq-
uisite for program correctness. That is, a compiler bug might propagate to all pro-
duced programs. Second, compilers are error-prone due to their high complexity.
Their main functionality is to convert source code to executable machine code.
They often provide additional features, like code optimisation or debug utilities.
A variety of compilers has been written for countless languages. Modern compil-
ers like GCC, javac, and LLVM are overwhelmingly complicated (e.g., GCC has
more than 7M lines of code and OpenJDK has more than 11M [20]). Although
some of them have been used for decades, they may still be buggy [54,55].
Recently, there have been numerous efforts on formalising and standard-
ising programming language semantics, such as K-Java [24], C semantics [29],
KJS [47], or KSolidity [34,44], which readily serve as a specification of the respec-
tive compilers. Usually, these executable semantics are accompanied by manually
crafted unit tests. Such tests are however designed to test the semantics rather
than the compliance of the compiler to the language semantics. In this work, we
aim to better utilise these semantics by automatically generating test programs
with a novel coverage criterion that facilitates systematic compiler testing.
Multiple approaches have been recently proposed to test compilers. Most of
them successfully found compiler bugs. For instance, the EMI project discovered
more than 1600 bugs in GCC and LLVM [53]. Another study has revealed bugs
in the Java compiler by comparing different javac and JVM versions [27]. For
the relatively new Solidity (smart contract) language, many crashes were found
through fuzzing [28]. Moreover, bugs in compilers may be exploited by attackers.
For example, prior to version 0.5.0, the Solidity compiler had an uninitialised
storage pointer vulnerability that affected many smart contracts on Ethereum.
A honey pot named OpenAddressLottery was designed to exploit this vulnerability
and steal ether (i.e., digital money in Ethereum). There are hundreds or even
thousands of programming languages according to different sources [30] and
many new ones emerge every year. For example, various new general purpose or
domain-specific languages have been developed recently, such as Rust, Kotlin,
Solidity, and Move.
Compiler testing is an ongoing research field. Next, we briefly review existing
approaches according to how they address the following two problems.
1. The test generation problem: how are test cases (i.e., programs with specific
inputs) selected and generated?
2. The oracle problem: how is a testing result deemed a success or a failure?
Existing compiler testing approaches solve the test generation problem mainly
in two ways: by generating programs according to a grammar that spec-
ifies the syntax of a language [49,31,23], or by mutating existing seed pro-
grams [40,55,41]. For the former, due to a huge search space, additional selection
criteria must be applied to selectively generate test cases for compilers, such as
standard code coverage criteria like statement coverage. For the latter, existing
mutation strategies are often limited by the ‘weak’ oracles (as we will discuss
shortly) employed by the approach, e.g., mutating to introduce ‘dead’ code.
Generally, approaches which generate complicated syntax focus more on parsing
errors rather than errors in the semantics. For the oracle problem, existing propos-
als mainly rely on three kinds of oracles. The first oracle only flags a test failure
if the program is incompilable or leads to crashes [28]. The second oracle flags
a test failure if certain algebraic properties are violated. For instance, the alge-
braic property adopted in the EMI approach [55] is that mutating unreachable
code does not change the execution result. We remark that these two oracles are
‘weak’ as they are unable to detect simple semantic errors such as 3 + 4 = 8. The
third, stronger oracle is one that checks whether the output of a test program is
consistent with a reference, which could be a second compiler (i.e., differential
testing [45]), or an abstract specification like a state machine [35,36]. This oracle
requires a reference, which is not always available. Furthermore, it is limited to
bugs which result in inconsistencies between the compiled program and the ref-
erence. Last but not least, existing approaches do not provide a good adequacy
measurement on the progress of compiler testing. Often measurements, like code
coverage, are used as an indicator, but they have the limitation that they need
access to the compiler code, and achieving full code coverage is challenging.
In this work, we present a novel specification-based testing method called
SpecTest for compiler testing. SpecTest differs from existing approaches in the
following aspects. First, SpecTest is built upon a strong oracle, i.e., an executable
language specification that can predict the expected output of test programs.
This strong oracle enables us to detect semantic errors, i.e., bugs that are related
to the semantics. Such bugs may also originate from the runtime environment.
Hence, SpecTest is not just limited to classical compiler bugs. Second, SpecTest
offers a testing adequacy measurement in terms of semantic coverage and has a
built-in mutation-based test case generation method which aims to achieve high
semantic coverage. The semantic coverage measures the number of language
semantic rules that are covered by existing test cases. The test case generation
method mutates the seed programs accordingly to maximise the coverage of the
language semantics, e.g., by introducing less-tested language features into these
programs. Compared to measuring the code coverage of a compiler, our semantic
coverage has the added value that it does not need access to the compiler code,
and it specifically targets semantic bugs.
Given a language semantics (in the form of a set of small-step operational
semantic rules), SpecTest executes fully automatically. We have implemented
SpecTest for two compilers, i.e., the Java compiler and the Solidity compiler and
tested the language features that are supported by our applied semantics [24,44].
The results of the evaluation were promising. SpecTest successfully increased the
semantic coverage for both compilers and identified many bugs and issues that
helped the compiler and specification developers.
To sum up, we make the following technical contributions.
– We propose a semantic coverage criterion for measuring the adequacy of
compiler testing.
– We introduce a novel compiler testing method that uses an executable lan-
guage specification as an oracle.
– We demonstrate the applicability and generality of SpecTest by applying it
to two compilers.
The paper is structured as follows. Sect. 2 explains our method and discusses
the required components in detail. In Sect. 3, we present our evaluation with
two compilers. Next, we review related work in Sect. 4 and conclude in Sect. 5.
2 Method
In this section, we outline how SpecTest works. In particular, we present its
high-level design, highlight relevant details of its components, and explain the
workflow step by step using an example.
We adapted the pre-existing executable semantics in two aspects, i.e., extending them with proper interface and conversion so that
they work with other components in SpecTest; and introducing a measurement
feature for semantic coverage. For example, we enhanced the coverage engine of
the old K version for K-Java, and we added a visualisation of the covered rules.
Given a test case (in the form of a program with inputs), the executable
semantics is used as follows. First, the test case is executed using the built-in
execution engine of the K framework which fires the SOS rules one by one. The
final variable valuations are captured as the result of the test case. For instance,
for Solidity, we capture all the persistent states in the blockchain network (which
includes addresses, their balances and the values of storage variables). This test-
ing result is turned into an assertion in the test case. The test case with the
assertion is then executed using the compiled program. If the assertion fails
(e.g., the value of at least one variable is different), a bug is revealed.
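For instance, a generated test case with an oracle-derived assertion might look as follows; the program and the expected value are our own illustration, not SpecTest's actual output format.

// A generated test program whose expected output was obtained by first
// executing the same program with the executable K semantics.
public class GeneratedTest {
    public static void main(String[] args) {
        int a = 1;
        int result = a + a++;  // language feature under test
        // Oracle-derived expectation (Java evaluates operands left to right):
        if (result != 2) {
            throw new AssertionError("compiler disagrees with the specification");
        }
    }
}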
Simply applying the above-mentioned steps to test compilers would not be
comprehensive. That is, existing seed programs often use a limited set of common
language features and thus would not be able to test the compiler extensively.
In fact, our experience on testing the Solidity compiler with existing smart con-
tracts suggests that many smart contracts are suspiciously similar. As a result,
the test cases would only exercise a limited set of semantic rules and thus would
miss those bugs in the part of the Solidity compiler that encodes the remaining
semantic rules. While collecting a large set of seed programs would likely be
helpful, the larger problem at stake is whether there could be a certain quanti-
tative measurement of the comprehensiveness of the test cases and whether we
can use that measurement to guide the generation of new test cases. SpecTest's
answer to this question lies in the design of the mutator and the fuzzer.
In SpecTest, we achieve high semantic coverage with the following two syn-
ergistic parts. First, we design and implement a mutator which systematically
introduces less-exercised language features into the test programs automatically.
Second, we design and apply powerful fuzzing techniques to generate program
inputs to exercise all statements including the less-used features in the test pro-
grams. The latter can be achieved with fuzzers optimised for existing code cov-
erage criteria such as branch or statement coverage.
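For reference, the measure can be stated as a simple ratio (our own phrasing of the criterion):

SemCov(T) = |{ r ∈ R : some test in T fires r }| / |R|,

where R is the set of SOS rules of the executable semantics and T the test suite.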
We believe that a comprehensive test suite for a compiler must cover all
relevant aspects of the language semantics, and semantic coverage offers such a
measurement. The above definition simply measures whether a rule is fired or
not. It might be meaningful to further measure the context in which each SOS
rule is fired (as certain bugs might only be triggered when a rule is fired in a
certain context), which we leave as future work.
To achieve high semantic coverage, SpecTest employs a two-part solution.
Given the oracle’s feedback on which SOS-rules are not fired (or least fired), the
language features which are associated with the SOS rules are identified. This is
straightforward as each SOS rule is associated with a specific language construct.
For instance, if the first rule of Fig. 2 is not fired, this would indicate
that our test programs contain no addition between Integer variables. Next, the
mutator takes the information and systematically mutates the seed programs to
introduce these less-tested language constructs.
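A minimal sketch of this feedback step, assuming hypothetical fire-count and strategy-registry inputs, could look as follows.

import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class CoverageFeedback {
    // Rank SOS rules by how often they fired across the current test suite
    // and return the k least-fired ones.
    static List<String> leastFired(Map<String, Integer> fireCounts, int k) {
        return fireCounts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    // Apply, to every seed program, the mutation strategy registered for the
    // language construct behind each cold rule (if one exists).
    static void mutateTowards(List<String> coldRules,
                              Map<String, Consumer<String>> strategyPerRule,
                              List<String> seedPrograms) {
        for (String rule : coldRules) {
            Consumer<String> strategy = strategyPerRule.get(rule);
            if (strategy != null) seedPrograms.forEach(strategy);
        }
    }
}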
The mutator is a code mutation engine which is designed to automatically
mutate a given source program to generate new programs (i.e., test cases for the
compiler). Existing mutation approaches [38,41,55] for compiler testing already
applied mutators to generate test programs, but they mutate based on simple
algebraic rules and are not systematic. For instance, equivalence modulo inputs
(EMI) [41] works by injecting code into seed programs with the aim to achieve
a high difference in the control- and data-flow compared to the original seed
program in order to produce diverse test programs. In comparison, our mutator
is designed to maximise semantic coverage.
Implementing the mutator is not trivial. For SpecTest, the mutators for So-
lidity and Java were implemented based on existing parsers through code instru-
mentation. That is, given a language feature and a source program, the mutator
first parses the source program to build an AST. Afterwards, it identifies poten-
tial locations in the AST for introducing the features. Lastly, it systematically
applies a mutation strategy specifically designed for the language constructs to
inject them at all possible or specific pre-defined locations. Below, we first give
a schematic sketch of this AST-based injection and then introduce three mutation
strategies as examples.
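The sketch below illustrates the injection mechanism; the AstNode interface and the node kind are our own placeholders for whatever the underlying parser provides.

import java.util.List;
import java.util.function.Consumer;

final class ModifierInjection {
    // Hypothetical AST node; the real mutators build on existing parsers.
    interface AstNode {
        String kind();
        List<AstNode> children();
    }

    // Walk the AST and hand every function declaration to the mutation
    // strategy, which injects one or more (dummy) modifiers at that location.
    static void inject(AstNode node, Consumer<AstNode> addModifiers) {
        if ("FunctionDeclaration".equals(node.kind())) {
            addModifiers.accept(node);
        }
        for (AstNode child : node.children()) {
            inject(child, addModifiers);
        }
    }
}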
We investigated features that were specific for Solidity. For example, one mu-
tation introduces modifiers for functions, which define conditions that must hold
when a function is executed. Listing 1.1 shows a smart contract with modifiers
written in the Solidity language. Unlike traditional programs, smart contracts
cannot be modified once they are deployed on the blockchain. As a result, their
correctness is crucial. So is the correctness of the compiler since the compiled
programs are deployed on the blockchain. Furthermore, the Solidity compiler
1  contract AccessRestriction {
2    address public owner = msg.sender;
3    // default modifier:
4    modifier onlyBy(address account) {
5      require(msg.sender == account, "Sender not authorized");
6      _; } // injected modifier:
7    modifier cgskst(address value) {
8      require(value == address(0x0), "");
9      _; } // injected modifier:
10   modifier cbhsmo(address value) {
11     require(value == address(0x0), "");
12     _; } // injected modifier:
13   modifier nlwxmv(address value) {
14     require(value == address(0x0), "");
15     _;
16   } // Make newOwner the contract owner:
17   function changeOwner(address newOwner) public onlyBy(owner)
       cgskst(address(0x0)) cbhsmo(address(0x0)) nlwxmv(address(0x0)) {
18     owner = newOwner;
19 }}

Listing 1.1: Simple modifier example

bibt4QkDIfJ: {
  bsJxhbtSJBu: {
    bHhq23OwDjZ: {
      try {
        bEdqZ33tKi9: {
          bVm9tCxbul4: {
            if (i >= 5) { break; }
            break bEdqZ33tKi9;
          }
        }
      } catch (RuntimeException e) {
        bQ2yucCPLQr: {
          System.out.print("X");
          break bQ2yucCPLQr;
        }
      }
    }
  }
}

Listing 1.2: Labelled block mutation

contract Test {
  function testFunc(int a)
      public pure returns (int) {
    int result = a + a++;
    // produces 3 when a is 1
    return result;
  }
}

Listing 1.3: Simple contract example
has been under rapid development and there are unique language features with
sometimes confusing semantics. Thus, it is a good target for evaluating the ef-
fectiveness of SpecTest. In this example, the modifier onlyBy ensures that the
function changeOwner can only be called when the address of the contract owner is
used. By integrating various dummy modifiers (Lines 7, 10 & 13) into our seed
contracts and by adding them to functions (Line 17), we noticed that an older
version of the Solidity compiler crashed in some cases, when more than a certain
number of modifiers are used. Such a case is difficult to find with normal tests,
since it is rare to use multiple modifiers for a function. Given that a less-fired
SOS rule is concerned with the modifier construct in Solidity, to introduce mod-
ifiers, the mutator scans through the AST for function declarations. For each
function declaration, the mutator randomly adds one or more modifiers.
We also introduced specific mutations for Java. For example, our experiments
showed that semantic rules associated with labels were not fired. Hence, we in-
troduced mutations that target these rules, e.g., a mutation that injects labelled
blocks, which is a special and rarely used feature that allows an immediate exit of
a block with a break statement. This mutation is illustrated in Listing 1.2, where
we injected labelled blocks and breaks (with these labels) into a seed program.
Both for Solidity and Java, we noticed that there are various rules in the K
specifications (i.e., 11 rules for Java and 17 for Solidity) concerning mathematical
expressions that were not covered, e.g., computations with hex-values. In order
to cover these rules and to cover unusual usages in different contexts, we relied
on a random approach in contrast to the other mutations where we injected code
at specific places. We developed mutations that produce a variety of mathemat-
ical expressions combining various language features, like operations containing
variables with different data types, hexadecimal, octal or binary literals, pre-
and postfix increment/decrement (++/--), bitwise and bitshift operators, various
combinations of unary operators and arrays. A simplified example of a muta-
tion produced with this strategy is shown in Listing 1.3. It can be seen that the
increment operator (++) is used in an unusual context within a mathematical
expression. Our experiments showed that the computation produced unexpected
results, i.e., we found an issue with the computation order that caused the in-
crement to be executed first, although it should be executed last [19].
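A generator for such expressions might look like the following sketch; it is illustrative only, and SpecTest's actual generator covers more operators, literal formats, and contexts.

import java.util.Random;

final class MathExprGenerator {
    static final String[] OPS = {"+", "-", "*", "/", "%", "&", "|", "^", "<<", ">>"};
    static final Random RND = new Random();

    // Build a random expression over a variable 'a' that mixes literal formats
    // (hex, octal, decimal) with side-effecting increments, so that rarely-fired
    // arithmetic rules are exercised in unusual contexts.
    static String expr(int depth) {
        if (depth == 0) {
            switch (RND.nextInt(4)) {
                case 0:  return "0x" + Integer.toHexString(RND.nextInt(256));  // hex
                case 1:  return "0" + RND.nextInt(8);                          // octal
                case 2:  return "a++";                                         // postfix increment
                default: return Integer.toString(RND.nextInt(100));
            }
        }
        return "(" + expr(depth - 1) + " " + OPS[RND.nextInt(OPS.length)]
                + " " + expr(depth - 1) + ")";
    }
}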
3 Evaluation
We have implemented SpecTest for two compilers, a compiler for a general pur-
pose language (Java) and one for a new domain-specific language (Solidity).
In the following, we design multiple experiments to systematically answer the
following research questions (RQ).
– RQ1: How effective is our proposed method in finding bugs or inconsisten-
cies? This is important since the primary aim of SpecTest is to provide a
systematic way of generating a test suite for identifying compiler bugs.
– RQ2: What kind of bugs and inconsistencies can be found? To further mo-
tivate the usage of SpecTest, it makes sense to point out what issues can
be found. In particular, we would like to check whether indeed there are
compiler bugs associated with less-fired SOS rules.
– RQ3: To what extent can the coverage of rules within the language specifica-
tion be increased with specific mutations? The semantic rule coverage is one
of the core aspects of SpecTest for finding bugs. Therefore, it is important
to investigate to which extent we can increase this coverage.
– RQ4: How much effort is it to apply SpecTest? When a tester is considering
a testing method, the effort usually plays a big role. To create a good basis
for a decision, we discuss the effort of applying SpecTest to two compilers.
Experiments sometimes were stuck due to out-of-memory exceptions, not enough
space, etc. Unfortunately, we could not fully resolve such issues, because many
mutations inject features with random aspects into the diverse seed programs.
This caused various unpredictable situations, like endless loops or too large data
structures. By adopting our mutator, we greatly reduced the number of such
situations, but we could not remove all rare cases.
RQ1: How effective is our proposed method in finding bugs or inconsistencies?
We discovered issues and bugs both for Solidity and Java. Some of these issues
were not found within the compiler or the runtime environment, but within
the language semantics. Fixing such issues is also essential, since improving the
specification is an important aspect of testing.
In total, we found six issues for the Solidity compiler [19,10]; two were related
to error/warning messages [7,13], and three of the other issues might have the
same cause, i.e., the execution order. For KSolidity, we found eight issues, six
of them were related to unimplemented features. For Java, we found four issues
with the compiler [2,5], two of which were concerned with error messages [6,12],
and we discovered 13 issues with K-Java [14,15,11,9,8,3,1,16] (eight issues or
bugs, one warning-related issue, and four minor issues, like a wrong output
representation [16]). More details about the different types of issues follow below.
Our experiments showed that SpecTest is able to reveal issues, inconsistencies
and bugs. These issues were not only found in the compiler, but also in language
semantics (which are developed independently by other groups with dedicated
effort). One might argue that finding bugs/issues in the language semantics is not
as meaningful as finding bugs in the compiler. We believe that it is also crucial
to ensure the robustness of the semantics since in general the quality of the tests
or specification are essential for the overall robustness of software. SpecTest
was able to find various inconsistencies and bugs in the specifications, which
is important for the specification developers, as well as issues in the compilers.
We have spent effort on confirming our findings and out of the 31 issues, we
submitted 19 to the corresponding git repositories and reported the other issues
to the developers or to a bug reporting system. For 13 issues, we received a
confirmation or the developers mentioned that they will investigate and fix them.
An aspect that might have limited the effectiveness is that we did not fully
apply our method for Java, since we only tested simple seed programs and did not
use fuzzing. We believe that the issues we found still showed that our method
was reasonably effective, even though we only partially applied it. Using the
full extent of SpecTest for Java might require a more powerful specification,
which is a potential topic for future work. Moreover, it should be mentioned that
KSolidity is still being developed and not as stable as the Solidity compiler (or
runtime environment), into which much more development effort has been invested.
This is similar for K-Java, and Java in general is robust due to its maturity.
RQ2: What kind of bugs and inconsistencies can be found? We categorise
our findings into three categories as illustrated in Table 1, i.e., (1) normal issues,
bugs and missing features, (2) issues related to warning or error messages, and
(3) minor inconsistencies or issues, like a small discrepancy in the output, e.g.,
-0e+00.0 instead of -0.0 [16]. Additionally, we differentiate whether the origin
of an issue was the compiler or the specification, as illustrated in Table 1.

Table 1: Found semantics and compiler issues

                                           Solidity  KSolidity  Java  K-Java
Normal issue or bug                            4         8        2      8
Warning or error message related issues        2         -        2      1
Minor issues                                   -         -        -      4
Total                                          6         8        4     13
The most interesting issues that we found were the ones concerning the wrong
computation order in Solidity. The cause of these issues were actual semantic
errors within the compiler. Moreover, we also found various issues with error or
warning messages. Such issues might seem trivial, but it is important to fix them
since meaningless error messages can cause a huge waste of debugging effort. The
bugs we found in the specifications had multiple sources, like the syntax parser,
wrong semantic rules, partially implemented rules, or rules applied in a wrong
context. Although K-Java and KSolidity already had many manual tests, we
showed that SpecTest was able to discover many inconsistencies and bugs. In
the following, we present example issues from the mentioned categories.
Solidity Findings. One of the issues [19] that SpecTest identified was that
wrong results were produced when we tested expressions with different assignment
operators. The behaviour can be observed in the following example, where the
increment operator is applied first, but should be applied last.
int a = 2; a *= 1 + a++; // results in 9 but should be 6
A potential cause might be a wrong computation order. This issue was found
since some SOS rules for assignment operators were uncovered. By creating mu-
tations that target these rules, we could generate expressions like in the example
which led to the discovery of the issue since the oracle predicted a different result.
An inconsistency regarding an error message [13] was revealed when we tested
computations with different data types. As illustrated below, we discovered that
it is possible to add int variables with different bit sizes, but an error is produced
if an int_const is added to an int variable with a smaller bit size.
int8 a = 10; int16 b = 234;
int c = b + a;   // works
int c = 234 + a; // TypeError: Oper. + incompatible with types int_const & int8
In this case, our oracle performed the computation without an error, but the
Solidity compiler produced a type error. For KSolidity, we found an incorrect
overflow behaviour for computations, and that there is no support for numerous
language features, like increment operators.
Additionally, we applied our Solidity truffle tests to the Conflux blockchain
[17], which is a new alternative to Ethereum. It can basically be seen as another
runtime environment for Solidity contracts. With our tests, we were able to
reveal a bug in the testing environment that resulted in incorrect results when
we injected formulas with unary and bitwise operators [4].
Java findings. Our experiments showed that there is an inconsistency [1,2]
when casts from double and long variables to Integers are performed. These casts
are handled differently by Java when an overflow occurs, i.e., in the following code
the results will be the maximum Integer for the double cast and bits will be cut
off for the long cast. In K-Java both casts produce the same result, i.e., bits will
be cut off. Although this behaviour is documented in the language specification
and others have already wondered about this issue, we believe that the approach
of K-Java is more consistent, and we are still waiting for a comment from the
Java team about the motivation to handle these cases differently.
System.out.println(((int) 2147483648L));   // -2147483648
System.out.println(((int) 2147483648.0));  // 2147483647
A problem we found for the Java compiler [6] is a missing error message
when a computation with a long and a double variable is performed. Normally,
an incompatible types error is produced as illustrated in the following code, but
the error does not occur when the same computation is done with an += operator.
long a = 1L + 0.1 * 3L; // produces error: incompatible types: possible lossy
long b = 1L;            // conversion from double to long
b += 0.1 * 3L;          // no error is produced
We discovered that K-Java has an issue with the modulo operator [14]. The
computation is wrong for all negative doubles and floats, i.e., it produces incon-
sistent values compared to Java and compared to the same computation with
Integer values. This is illustrated in the following examples.
System.out.println("-8 % 3   = " + (-8 % 3));        // K-Java and Java return -2
System.out.println("-8.0 % 3.0 = " + (-8.0 % 3));    // K-Java 1.0, Java -2.0
System.out.println("8 % -3   = " + (8 % -3));        // K-Java and Java return 2
System.out.println("8.0 % -3.0 = " + (8.0 % -3.0));  // K-Java -4.0, Java 2.0
arrays, structs, simple transactions, or mathematical expressions, and managed
to increase the coverage. Even with just these features, we found meaningful
bugs. The coverage improvements compared to the original seed programs are
illustrated in Table 3. There were partially implemented features which could
not be fully covered. The coverage of the completed features was considerably
improved.

Table 2: Comparison of the covered rules between the K-Java tests (Default)
and our mutated test cases

File                          Default        Mutants        Difference
                              Char   Line    Char   Line    Char   Line
folding.k                     93.04  93.89   93.04  93.89    -      -
unfolding.k                   91.84  94.55   91.84  94.55    -      -
process-class-decs.k          89.07  92.95   89.07  92.95    -      -
expressions.k                 72.30  78.74   86.58  89.92   14.28  11.18
process-comp-units.k          83.39  86.03   83.39  86.03    -      -
static-init.k                 81.20  82.35   81.20  82.35    -      -
process-class-members.k       80.65  83.53   80.65  83.53    -      -
statements.k                  80.51  82.38   80.51  82.38    -      -
new-instance.k                79.59  82.41   79.59  82.41    -      -
method-invoke.k               79.44  80.74   79.44  80.74    -      -
api-core.k                    61.77  63.37   78.74  81.82   16.97  18.45
var-lookup.k                  77.52  79.41   77.52  79.41    -      -
process-type-names.k          76.56  75.76   76.56  75.76    -      -
expressions-classes.k         73.03  65.00   73.03  65.00    -      -
process-local-classes.k       67.62  72.12   67.62  72.12    -      -
process-anonymous-classes.k   66.79  81.52   66.79  81.52    -      -
arrays.k                      62.07  66.90   62.07  66.90    -      -
api-threads.k                 35.51  39.04   41.43  47.01    5.92   7.97
syntax-conversions.k          40.65  42.42   40.65  42.42    -      -
literals.k                    29.19  34.31   38.73  42.72    9.54   8.40
We have shown that our mutations can increase the rule coverage both for K-
Java and KSolidity. Our close investigation shows that the increase in coverage
requires non-trivial programs (e.g., programs that specifically include missing
language features) which are unlikely to be generated without our mutator. It
is worth mentioning that writing mutations for the uncovered rules led to the
discovery of many issues. Moreover, the mutations that targeted specific semantic
rules or language features could generally increase the coverage instantaneously
with a single test, but we still applied them to all seed programs, and we also used
general mutation operators to produce mutants for many different situations.
RQ4: How much effort is it to apply SpecTest? To answer this question, we
analysed the effort required to apply and implement SpecTest for Java and Solid-
ity. It consists of two parts: the effort of applying SpecTest once it is developed,
and the implementation effort. The latter consists of three parts: the effort
for developing the oracle, the mutator and the fuzzer. The goal of this analysis
is to understand how generalisable SpecTest is to a new programming language.
Applying SpecTest after the implementation has the following timing re-
quirements. Both for Solidity and Java, the mutant generation took only a few
seconds. For Solidity, we set a timeout of 2 min per contract for fuzzing and
it took on average 24 min to finish all 37 contracts. Usually, 40–45 test cases
were created by the fuzzer (normally multiple per contract depending on the
mutation). Most test cases were executed by KSolidity within a minute, but
there were outliers which did not terminate even after hours. Hence, we used a
timeout of 5 min. On average, the testing time of KSolidity was 37 min (when
five runs with different mutations were considered). For Java, we did not apply
a fuzzer due to the simplicity of the seed programs. We executed the 756 test
programs directly with K-Java, which took on average 3 hours and 51 min for
an introduced mutation (for five runs with different mutation types).
It should be noted that this time depends on the availability of existing tools,
like a language parser or fuzzer. For this work, we relied on pre-existing lan-
guage specifications, which helped to reduce the overall effort, but as mentioned
they came with limitations, which caused additional effort. Writing a specifica-
tion for a new programming language is not trivial. Based on past experiences,
we assume that it takes about six to 12 months depending on the complexity of
the language. Given the many recent efforts on developing executable language
semantics, we believe that SpecTest provides a good way to better utilise these
existing specifications for systematic compiler testing.
To summarise, the implementation effort of SpecTest is about two to three
work months mainly for the mutator, if there is an existing specification and a
fuzzer. The application of our method in terms of run time is about a few hours
for a single mutation. Further increasing the number of seed programs, and
performing a reasonable number of mutations increases this time to a couple of
days or weeks, when the tests are only executed on one machine. Even though
this seems like a lot of effort, we believe our method is still worthwhile, since it
will pay off eventually, especially considering all the effort that can be required
for releasing a new compiler version, when serious bugs are discovered. Moreover,
our method can be easily accelerated by distributing it to multiple machines.
As mentioned before, the implementation effort for our method was about
two to three work months. This is about the time that is needed for the mutator
and for other minor tools. It does not include the effort for creation of the
language specification or the fuzzer. There are already many existing fuzzers that
could be adopted for new programming languages, and also numerous language
specifications. We especially want to recommend our method for all languages
with pre-existing specifications (or when similar specifications exist) since then
there is only a small implementation effort, which will soon be mitigated by the
advantages of SpecTest. Even when there are no pre-existing specifications for
a language, we highly recommend to create one and to adopt our method, since
it will save time in the long term.
An effort that should not be underestimated is the time for analysing bugs.
Finding the cause of a bug can be troublesome due to the complexity of the
test cases; it sometimes took us hours or even days. In such cases, it can be
helpful to minimise failing test cases. There are numerous techniques, like delta
debugging [62] or program slicing [58], which can reduce the debugging effort,
and integrating them into SpecTest would be interesting for future work.
A threat to the validity of our evaluation might be that we did not show a
comparison to other compiler testing methods. A comparison might be inter-
esting, but our main goal was to show the general applicability and usefulness
of SpecTest for different compilers. It would not be fair to compare SpecTest
to other testing techniques that focus on different types of bugs, e.g., it might
be much easier to find simple parsing errors caused by unusual characters (with
techniques like fuzzing).
One might argue that the test size we used is too limited, which might be a
potential threat to the validity of our evaluation. It is true that it would make
sense to apply more seed programs and to continue mutating and testing for an
extended period of time. However, due to restrictions of KSolidity and K-Java,
a larger set of seed programs was not supported, and due to a limited time and
computing budget, we did not execute more tests. Nevertheless, we believe that
our test size was reasonable, since it allowed us to reveal various issues and bugs.
Another threat to the validity of our evaluation might be that we should
not have just relied on existing specifications, where we cannot be sure about
their quality. It is true that we might have more confidence in a specification
that we created, but since SpecTest checks the correctness of compilers as well
as specifications, we trust that our specifications were of reasonable quality.
4 Related Work
Compiler testing is a broad research field with a range of techniques that target,
e.g., the test case generation [49,31,23] or the oracle problem [22]. Several surveys
give an overview of these methods [56,26,39,25]. Our study however shows that
existing approaches suffer from two weaknesses. They do not apply a test case
generation that can extensively cover rare language features, and they often
rely on weak or limited test oracles. The test case generation often works with
standard code coverage criteria concerning compiler components. For example,
Zelenov and Zelenova [61] applied a BNF grammar as a model and produced
test cases according to, e.g., code or functional coverage of a syntax analyser. A
method based on the coverage of context-free grammar rules was presented by
Purdom [49], but it only targets the parser of the compiler. Kalinov et al. [35,36]
defined coverage criteria based on a state machine specification. In contrast
to our work, they do not identify rare language features by analysing semantic
rule coverage, and they do not construct their test programs via code mutation.
Various compiler testing methods work without any coverage by just ran-
domly generating test cases according to a grammar, which defines valid pro-
grams [52,60]. There are also techniques that use mutation for producing test
cases [38,41,55]. For example, Le, Sun, and Su [41] produced mutants that should
have the same behaviour as the original programs in order to find cases where the
behaviour diverges. However, in contrast to our work, they are not considering
a semantic coverage for less used language features.
Several attempts have been presented to answer the oracle problem for com-
piler testing. In the simple case of positive/negative testing, an oracle only tells
whether a program is compilable. When a test program is compiled, the result is
checked to see if it matches the expectation of the oracle. A match means a suc-
cessful compilation. Otherwise, there may be a bug. For example, Zelenov and
Zelenova [61] illustrated a specification-based approach for generating positive
and negative tests. Such approaches are limited to testing the syntax parser.
In the line of work on differential testing compilers [45], the oracle is defined
as consistency among two or more compilers for the same language. In this
method, the same test programs are given to multiple compilers and the results
are compared. If there is a difference then a bug in one of the compilers or an
ambiguity in the language is found. There exist different versions of differential
testing as explained by McKeeman [45]. Cross-compiler testing [52] is a technique
that works by contrasting a new compiler against a pre-existing compiler that
has the same specification. When the same test programs are executed with
both compilers, a different result can reveal a fault in the new or pre-existing
compiler. Sometimes this technique is also called randomised differential testing
[60], because the test programs are usually generated randomly, e.g., based on
a grammar. Another differential testing technique is cross-optimisation testing,
where programs compiled with different optimisations implemented for the same
compiler are contrasted to find bugs. Le, Sun, and Su [42] presented such a
technique for stress testing link optimisers. Their method generates random test
programs and injects various function calls into different code regions in order to
increase dependencies between procedures, and it also randomly selects different
optimisation levels to produce challenging tests for the optimiser. Cross-version
or regression testing is another differential testing method that tries to find bugs
by comparing different versions of the same compiler. For example, Sun, Le,
and Su [54] developed Epiphron, a tool that generates random test programs to
find inconsistencies with the debug information, like missing warning messages,
in different versions of the same compiler. Such approaches work only if there
are multiple relatively mature compilers for the same language. In contrast to
these techniques, SpecTest works with a formal language specification which
is especially useful when no compilers could be used as a reference. Moreover,
different compilers or compiler versions for the same language might still suffer
from the same bugs, which is unlikely for an independent specification.
There are approaches that assume the existence of a reference compiler, i.e.,
the oracle is an existing formally proven compiler. For example, Leroy [43] pre-
sented CompCert, a compiler for a subset of C, which was verified with the
proof assistant Coq. However, there are usually no such compilers for a newly
developed language and the existing ones cover only subsets of languages since
formally proving a compiler is extremely challenging.
For metamorphic testing [57], the oracle is defined as certain algebraic prop-
erties of the compiler. For instance, one such property explored in the compiler
testing technique called equivalence modulo inputs (EMI) [40,55] is that a mod-
ification on a program part which is never executed should not alter the result.
Based on this simple oracle, EMI works by randomly pruning dead code (i.e.,
code which is not executed given a certain program input) or by randomly insert-
ing or removing instructions from dead code based on a Markov Chain Monte
Carlo method. Such approaches are limited to identifying bugs which violate the
algebraic properties. Hence, they are not able to find deep semantic errors.
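For illustration, consider the following snippet (our own example, not taken from the EMI tooling). Profiled with input x = 1, the second return is dead for that input, so EMI may prune or rewrite it, and the compiled program must still produce the same output for x = 1:

final class EmiExample {
    static int f(int x) {
        if (x > 0) {
            return x;
        }
        return -x;  // dead for the profiled input x = 1; safe to mutate
    }
}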
The closest related work to SpecTest was proposed by Kalinov et al. [35,36],
where a language specification in the form of abstract state machines and mon-
tages is used as an oracle. With this specification, they compare the expected
output from the specification to that of a compiled program in order to check
whether there are compiler bugs. This approach is limited by the choice of the
specification language and it quickly becomes infeasible, because the computa-
tion time is too high. Moreover, it is not concerned with semantic coverage.
To demonstrate the limitations of the closely related methods, we come back
to the example of Sect. 2, i.e., we discussed a bug with the increment operator
that we discovered during our analysis of the Solidity compiler.
int a = 1; int result = a + a++; // produces 3, but it should be 2
In this example, the compiler had an issue with the computation order, which
resulted in wrong results. Existing approaches, like EMI or differential testing
might be able to detect such issues, but with EMI it is difficult to find mutations
that lead to such cases. The same is true for differential testing and there is also
a high chance that different compiler versions have the same faulty behaviour
for such a case (e.g., all versions of the Solidity compiler had this issue).
5 Conclusion
Acknowledgments
References
24. Bogdanas, D., Rosu, G.: K-Java: A complete semantics of Java. In: Proceedings of
the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Program-
ming Languages, POPL 2015, Mumbai, India, January 15-17, 2015. pp. 445–456.
ACM (2015). https://doi.org/10.1145/2676726.2676982
25. Boujarwah, A.S., Saleh, K.: Compiler test case generation methods: a sur-
vey and assessment. Information & Software Technology 39(9), 617–625 (1997).
https://doi.org/10.1016/S0950-5849(97)00017-7
26. Chen, J., Hu, W., Hao, D., Xiong, Y., Zhang, H., Zhang, L., Xie, B.: An empirical
comparison of compiler testing techniques. In: Proceedings of the 38th International
Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22,
2016. pp. 180–190. ACM (2016). https://doi.org/10.1145/2884781.2884878
27. Chen, Y., Su, T., Su, Z.: Deep differential testing of JVM implementations. In: Pro-
ceedings of the 41st International Conference on Software Engineering, ICSE 2019,
Montreal, QC, Canada, May 25-31, 2019. pp. 1257–1268. IEEE / ACM (2019).
https://doi.org/10.1109/ICSE.2019.00127
28. Cummins, C., Petoumenos, P., Murray, A., Leather, H.: Compiler fuzzing
through deep learning. In: Proceedings of the 27th ACM SIGSOFT Inter-
national Symposium on Software Testing and Analysis, ISSTA 2018, Am-
sterdam, The Netherlands, July 16-21, 2018. pp. 95–105. ACM (2018).
https://doi.org/10.1145/3213846.3213848
29. Ellison, C., Rosu, G.: An executable formal semantics of C with appli-
cations. In: Field, J., Hicks, M. (eds.) Proceedings of the 39th ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL
2012, Philadelphia, Pennsylvania, USA, January 22-28, 2012. pp. 533–544.
ACM (2012). https://doi.org/10.1145/2103656.2103719
30. Fowler, T.: How many computer languages are there? (2020), https://careerkarma.com/blog/how-many-coding-languages-are-there
31. Hanford, K.V.: Automatic generation of test cases. IBM Systems Journal 9(4),
242–257 (1970). https://doi.org/10.1147/sj.94.0242
32. Jackson, D., Damon, C.: Elements of style: Analyzing a software design feature with
a counterexample detector. In: Proceedings of the 1996 International Symposium
on Software Testing and Analysis, ISSTA 1996, San Diego, CA, USA, January
8-10, 1996. pp. 239–249. ACM (1996). https://doi.org/10.1145/229000.226322
33. Jia, Y., Harman, M.: An analysis and survey of the development of
mutation testing. IEEE Trans. Software Eng. 37(5), 649–678 (2011).
https://doi.org/10.1109/TSE.2010.62
34. Jiao, J., Kan, S., Lin, S., Sanan, D., Liu, Y., Sun, J.: Semantic understanding
of smart contracts: Executable operational semantics of solidity. In: 2020 IEEE
Symposium on Security and Privacy, SP 2020, San Francisco, CA, USA, May 18-
20, 2020. IEEE (2020), accepted for publication
35. Kalinov, A., Kossatchev, A., Posypkin, M., Shishkov, V.: Using ASM specifica-
tion for automatic test suite generation for mpC parallel programming language
compiler. Action Semantics AS 2002 p. 99 (2002)
36. Kalinov, A., Kossatchev, A.S., Petrenko, A.K., Posypkin, M., Shishkov,
V.: Coverage-driven automated compiler test suite generation. Electr. Notes
Theor. Comput. Sci. 82(3), 500–514 (2003). https://doi.org/10.1016/S1571-
0661(05)82625-8
37. Klein, C., Clements, J., Dimoulas, C., Eastlund, C., Felleisen, M., Flatt, M., Mc-
Carthy, J.A., Rafkind, J., Tobin-Hochstadt, S., Findler, R.B.: Run your research:
52. Sheridan, F.: Practical testing of a C99 compiler using output comparison. Softw.,
Pract. Exper. 37(14), 1475–1488 (2007). https://doi.org/10.1002/spe.812
53. Su, Z., Sun, C.: EMI-based compiler testing (2018), https://web.cs.ucdavis.edu/~su/emi-project
54. Sun, C., Le, V., Su, Z.: Finding and analyzing compiler warning defects.
In: Proceedings of the 38th International Conference on Software Engineering,
ICSE 2016, Austin, TX, USA, May 14-22, 2016. pp. 203–213. ACM (2016).
https://doi.org/10.1145/2884781.2884879
55. Sun, C., Le, V., Su, Z.: Finding compiler bugs via live code mutation. In: Proceed-
ings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Pro-
gramming, Systems, Languages, and Applications, OOPSLA 2016, part of SPLASH
2016, Amsterdam, The Netherlands, October 30 - November 4, 2016. pp. 849–863.
ACM (2016). https://doi.org/10.1145/2983990.2984038
56. Tang, Y., Ren, Z., Kong, W., Jiang, H.: Compiler testing: A systematic literature
analysis. CoRR abs/1810.02718 (2018), http://arxiv.org/abs/1810.02718
57. Tao, Q., Wu, W., Zhao, C., Shen, W.: An automatic testing approach
for compiler based on metamorphic testing technique. In: 17th Asia Pacific
Software Engineering Conference, APSEC 2010, Sydney, Australia, Novem-
ber 30 - December 3, 2010. pp. 270–279. IEEE Computer Society (2010).
https://doi.org/10.1109/APSEC.2010.39
58. Tip, F.: A survey of program slicing techniques. J. Prog. Lang. 3(3) (1995), http://compscinet.dcs.kcl.ac.uk/JP/jp030301.abs.html
59. Wang, C., Kang, S.: ADFL: an improved algorithm for american fuzzy lop in
fuzz testing. In: Cloud Computing and Security - 4th International Conference,
ICCCS 2018, Haikou, China, June 8-10, 2018, Revised Selected Papers, Part
V. Lecture Notes in Computer Science, vol. 11067, pp. 27–36. Springer (2018).
https://doi.org/10.1007/978-3-030-00018-9_3
60. Yang, X., Chen, Y., Eide, E., Regehr, J.: Finding and understanding bugs in C
compilers. In: Proceedings of the 32nd ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June
4-8, 2011. pp. 283–294. ACM (2011). https://doi.org/10.1145/1993498.1993532
61. Zelenov, S.V., Zelenova, S.A.: Automated generation of positive and negative
tests for parsers. In: Formal Approaches to Software Testing, 5th International
Workshop, FATES 2005, Edinburgh, UK, July 11, 2005, Revised Selected Pa-
pers. Lecture Notes in Computer Science, vol. 3997, pp. 187–202. Springer (2005).
https://doi.org/10.1007/11759744_13
62. Zeller, A., Hildebrandt, R.: Simplifying and isolating failure-inducing input. IEEE
Trans. Software Eng. 28(2), 183–200 (2002). https://doi.org/10.1109/32.988498
PASTA: An Efficient Proactive Adaptation
Approach Based on Statistical Model Checking
for Self-Adaptive Systems
1 Introduction
As the complexity of the environment that affects a system's goal achievement increases, analyzing the environment becomes important for reliable goal achievement. The environment, such as user traffic and outdoor temperatures, can change over time [15,29]. Fully anticipating environmental changes at system design time is challenging and often impossible [6,9]. Systems are therefore required to adapt to such environmental changes at runtime.
4 Illustrative Example
We illustrate PASTA using an adaptive air condition control system as an example. The system monitors indoor and outdoor air conditions, including temperature and humidity, and adaptively controls the indoor condition toward a given target condition. Planning adaptive air condition control as an immediate reaction to the monitored indoor condition can help the system achieve its goal; however, the indoor air conditions may change over time under the influence of the outdoor air conditions, as shown in Fig. 2. If the adaptation plan is made without taking this environmental change into account, the adaptation consequences may differ from the expectations, and a better adaptation tactic may be left unchosen. An air condition control system developed with the PASTA approach forecasts future air condition changes and selects an optimal adaptation tactic whose consequences are verified by SMC at runtime. Throughout this paper, we describe our approach using this example.
(Step 4) Simulations are repeatedly executed to accumulate a statistically sufficient number of samples for the verification of the tactic's performance for the adaptation goal under the expected future environmental change. (Step 5) Based on the accumulated samples, the performance of an adaptation tactic is verified. All adaptation tactics are evaluated repeatedly in the same manner, and the SAS statistically guarantees the effects of its adaptation tactics. (Steps 6 and 7) When all possible adaptation tactics have been evaluated, an optimal adaptation tactic is chosen and executed. This adaptation process is repeated continuously to respond to continuous environmental changes. We describe the PASTA approach in detail based on this adaptation process in the subsequent sections.
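To make the loop concrete, the following is a minimal Python sketch of Steps 4-7 under stated assumptions: the callables simulate, forecast_environment, and goal_utility are hypothetical placeholders for application-specific components, not the actual PASTA API, and the mean utility over samples stands in for the statistical verification step.

import statistics

def plan_adaptation(tactics, simulate, forecast_environment, goal_utility,
                    num_samples=50):
    """Return the tactic whose simulated consequences best satisfy the goal."""
    future_env = forecast_environment()           # expected environmental change
    scores = {}
    for tactic in tactics:                        # evaluate every tactic in turn
        # Step 4: accumulate a statistically sufficient number of samples
        samples = [goal_utility(simulate(tactic, future_env))
                   for _ in range(num_samples)]
        # Step 5: statistically verify the tactic's performance
        scores[tactic] = statistics.mean(samples)
    # Steps 6 and 7: choose the optimal tactic (execution happens elsewhere)
    return max(scores, key=scores.get)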
5.2 Knowledge
The adaptation goal is also specified in the knowledge. Thus, the optimal tactic for the adaptation goals can be selected and executed.
Example. The environmental factors of interest in the adaptive air condition control system are the indoor/outdoor temperature and humidity; therefore, the monitored environment data at a specific time include values of four factors. The simulation models imitate the changes of the indoor temperature and humidity as affected by the outdoor conditions and the air condition control system's control values. The system's possible adaptation tactics are defined by its temperature and humidity control capabilities. For example, the system can increase or decrease the temperature and humidity in 0.1°C and 0.1% increments, up to 5°C and 5%, respectively, in a discrete simulation time unit. The tactic space is the Cartesian product of the possible temperature and humidity controls. The adaptation goal is to bring the indoor temperature and humidity to the user's desired conditions.
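As a minimal illustration (assuming the 0.1-unit increments and the ±5 bounds described above), the tactic space can be enumerated as a Cartesian product:

from itertools import product

# Temperature controls in 0.1°C steps and humidity controls in 0.1% steps,
# each ranging from -5 to +5 per discrete simulation time unit.
temp_controls = [round(0.1 * i, 1) for i in range(-50, 51)]
hum_controls = [round(0.1 * i, 1) for i in range(-50, 51)]
tactic_space = list(product(temp_controls, hum_controls))
print(len(tactic_space))  # 101 x 101 = 10201 candidate tactics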
In addition, to support engineers who develop SASs based on the PASTA approach explained in the previous sections, we implemented a PASTA skeleton based on the reference architecture, with guiding comments, and released the source code in an open-source repository (https://github.com/yongjunshin/PASTA). The skeleton is available in Java and Python. Engineers write application-specific code by following the comments tagged with "todo". The class diagram of the skeleton is presented in Fig. 5. An adaptation is activated by the "adaptManagedSystem" operator. The skeleton eases PASTA implementation by allowing third-party libraries or tools to be used for some components, such as the forecasting engine or the SMC module.
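The sketch below renders the skeleton idea in Python; apart from adaptManagedSystem, the class and method names are our own hypothetical choices, with the application-specific parts marked by "todo" comments as in the released skeleton.

class PastaSkeleton:
    def __init__(self, forecaster, smc):
        self.forecaster = forecaster  # e.g., a third-party forecasting engine
        self.smc = smc                # e.g., a third-party SMC library

    def adaptManagedSystem(self):
        """Entry point that activates one adaptation cycle."""
        env = self.monitor()                        # todo: read sensors
        future_env = self.forecaster.forecast(env)  # forecast the change
        tactic = self.smc.verify_and_select(self.tactics(), future_env)
        self.execute(tactic)                        # todo: actuate the system

    def monitor(self):
        raise NotImplementedError  # todo: application-specific monitoring

    def tactics(self):
        raise NotImplementedError  # todo: application-specific tactic space

    def execute(self, tactic):
        raise NotImplementedError  # todo: application-specific execution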
6 Evaluation
6.1 Research Questions
We demonstrate the feasibility of applying the PASTA approach as an efficient alternative to PMC-based proactive adaptation in SAS development. We address three research questions.
RQ1: (Cost efficiency of PASTA) How fast is PASTA's adaptation planning? PASTA leverages SMC for efficient adaptation verification at runtime. Because almost all existing proactive adaptation approaches utilize PMC for the runtime verification of adaptation tactics, we evaluate whether PASTA is an efficient alternative to PMC-based proactive adaptation approaches. To determine the efficiency of PASTA, we compare the adaptation planning time of PASTA and the PMC-based adaptation, measuring the difference in time consumption between the SMC- and PMC-based approaches on proactive adaptation problems of the same complexity.
RQ2: (Adaptation planning accuracy of PASTA) How accurately does PASTA search for the optimal adaptation tactic? PMC formally examines a probabilistic model and verifies whether it satisfies the given properties; SMC, in contrast, examines the given model through numerous sample simulation results, so it returns statistical evidence about the model's properties and thus has the inevitable limitation that its verification results may be inaccurate owing to the finite number of samples. It is known that SMC can produce results similar to PMC [19,23,34], and for this research question, we compare the proactive adaptation planning results of PASTA with those of the PMC-based approach. We determine how much accuracy is lost in exchange for the cost savings identified in RQ1, as well as whether the loss of accuracy is acceptable.
RQ3: (Adaptation performance of PASTA) How effective is PASTA in achieving the adaptation goal? For this research question, we examine whether the PASTA approach is actually effective in achieving the adaptation goals of SASs. To evaluate the adaptation performance of PASTA, we compare the simulation execution results of four strategies: no adaptation, reactive adaptation, PMC-based proactive adaptation, and PASTA.
RQ1: We measured and compared the time spent on adaptation planning for both case systems using the PASTA and PMC-based approaches. The adaptation planning time includes the modeling or sampling time and the probabilistic or statistical verification time needed to identify the optimal tactic. Figs. 7 and 8 show the evaluation results for each system. The reported planning time is the average over 100 repeated experiments. The adaptation planning time for the PMC-based approach is constant, whereas the time for PASTA increases in proportion to the number of samples used for the SMC, because the time for a single simulation is almost constant. The traffic signal controller could not obtain adaptation planning results using PMC with 2 GB of memory, because its models and tactics are more complex than those of the air condition control system and thus consume more verification resources. Therefore, no PMC-based adaptation planning time is reported for the traffic signal controller; nevertheless, both systems confirmed that PASTA completes adaptation planning much faster than the PMC-based approach. It was also confirmed that the adaptation planning time of PASTA is proportional to the number of samples and to the complexity of the adaptation problem.
RQ2: To confirm the similarity of the optimal tactics found by the PASTA and PMC-based approaches, we compared the optimal tactics returned by the two approaches in the same situation. To quantify the similarity, we defined two criteria: if the two tactics were the same, they were defined as identical, and if they were adjacent in terms of the tactic specifications, they were defined as similar. For example, for the air condition control system, the temperature control tactics +3°C and +3.1°C are adjacent because the temperature control unit is 0.1°C based on the system's capability; the probability that two arbitrary tactics are adjacent is less than 2%. Because the samples used by SMC are randomly generated, we repeated the PASTA experiments 100 times and report the percentage of tactics identical or similar to the tactic returned by the PMC-based approach. Because the traffic signal controller could not find the optimal tactic using PMC, only the experimental results of the air condition controller are shown in Fig. 9. PASTA always found the same or a similar optimal tactic as the PMC-based approach, except when using 10 samples. One limitation of SMC, however, is that no matter how far we increased the number of samples, we could not always obtain the same results as the PMC-based approach, which is considered an oracle; for this case system, PASTA returned results identical to the oracle approximately 50% of the time on average.
RQ3: For RQ1 and RQ2, we showed that PASTA can quickly find a sub-optimal adaptation tactic that is similar to the PMC-based approach's result. For RQ3, we obtained simulation results to compare the adaptation performance of the PASTA approach with non-adaptation, reactive adaptation, and PMC-based proactive adaptation. As shown in Fig. 10, the goal of the air condition control system was to keep the temperature at 25°C, and the proactive adaptation approaches showed better adaptation performance than the other strategies. In addition, the PASTA and PMC-based approaches exhibited similar performance because PASTA consistently made adaptation decisions similar to those of the PMC-based approach. In Fig. 11, the goal of the traffic signal controller was to reduce the number of vehicles waiting at the intersection as much as possible, and proactive adaptation using PASTA showed the best performance. These two results demonstrate that proactive adaptation outperforms reactive adaptation and that PASTA achieves adaptation performance similar to the PMC-based approach at a smaller verification cost.
7 Threats to Validity
One threat is the selection of the SMC algorithm. We selected SMCS to demonstrate the adaptation performance when using the simplest SMC algorithm. SMCS is suitable for explicitly showing how SMC-based adaptation costs are affected by the number of samples, and other SMC algorithms have similar characteristics. To reduce this threat, we also implemented SSP and SPRT and compared them to the PMC-based approach; both showed similar differences in cost, accuracy, and performance. Therefore, in this paper, only SMCS is presented, with a varying number of samples.
Another threat is the implementation of the PMC-based adaptation approach. We implemented the PMC-based approach directly following the paper [26]. This threat is reduced because the authors published the complete structure and code of the PRISM module implementing their approach. We implemented the two case systems according to the PRISM module code shown in that paper. For a fair comparison, environment, system, and adaptation tactic spaces of the same complexity were given to both the PMC-based and the PASTA approach.
8 Conclusion
Acknowledgement
References
1. Aichernig, B.K., Schumi, R.: Statistical model checking meets property-based test-
ing. In: 2017 IEEE International Conference on Software Testing, Verification and
Validation (ICST). pp. 390–400. IEEE (2017)
2. Anaya, I.D.P., Simko, V., Bourcier, J., Plouzeau, N., Jézéquel, J.M.: A prediction-
driven adaptation approach for self-adaptive sensor networks. In: Proceedings of
the 9th International Symposium on Software Engineering for Adaptive and Self-
Managing Systems. pp. 145–154. ACM (2014)
3. Angelopoulos, K., Papadopoulos, A.V., Silva Souza, V.E., Mylopoulos, J.: Model predictive control for software systems with CobRA. In: Proceedings of the 11th International Symposium on Software Engineering for Adaptive and Self-Managing Systems. pp. 35–46. ACM (2016)
4. Boyer, B., Corre, K., Legay, A., Sedwards, S.: Plasma-lab: A flexible, distributable
statistical model checking library. In: International Conference on Quantitative
Evaluation of Systems. pp. 160–164. Springer (2013)
5. Calinescu, R., Ghezzi, C., Kwiatkowska, M., Mirandola, R.: Self-adaptive software
needs quantitative verification at runtime. Communications of the ACM 55(9),
69–77 (2012)
6. Cheng, B.H., de Lemos, R., Giese, H., Inverardi, P., Magee, J., Andersson, J.,
Becker, B., Bencomo, N., Brun, Y., Cukic, B., et al.: Software engineering for self-
adaptive systems: A research roadmap. In: Software engineering for self-adaptive
systems, pp. 1–26. Springer (2009)
7. Dagum, E.B.: The X-11-ARIMA seasonal adjustment method. Statistics Canada, Seasonal Adjustment and Time Series Staff (1980)
8. De Dear, R.J., Brager, G.S.: Thermal comfort in naturally ventilated buildings: revisions to ASHRAE Standard 55. Energy and Buildings 34(6), 549–561 (2002)
9. De Lemos, R., Giese, H., Müller, H.A., Shaw, M., Andersson, J., Litoiu, M.,
Schmerl, B., Tamura, G., Villegas, N.M., Vogel, T., et al.: Software engineering
for self-adaptive systems: A second research roadmap. In: Software Engineering
for Self-Adaptive Systems II, pp. 1–32. Springer (2013)
10. De Matteis, T., Mencagli, G.: Proactive elasticity and energy awareness in data
stream processing. Journal of Systems and Software 127, 302–319 (2017)
11. Elkhodary, A., Esfahani, N., Malek, S.: Fusion: a framework for engineering self-
tuning self-adaptive software systems. In: Proceedings of the eighteenth ACM SIG-
SOFT international symposium on Foundations of software engineering. pp. 7–16.
ACM (2010)
12. Fredericks, E.M., Ramirez, A.J., Cheng, B.H.: Towards run-time testing of dynamic
adaptive systems. In: Proceedings of the 8th International Symposium on Software
Engineering for Adaptive and Self-Managing Systems. pp. 169–174. IEEE Press
(2013)
13. Garlan, D., Cheng, S.W., Huang, A.C., Schmerl, B., Steenkiste, P.: Rainbow:
Architecture-based self-adaptation with reusable infrastructure. Computer 37(10),
46–54 (2004)
14. Gerostathopoulos, I., Skoda, D., Plasil, F., Bures, T., Knauss, A.: Tuning self-
adaptation in cyber-physical systems through architectural homeostasis. Journal
of Systems and Software 148, 37–55 (2019)
15. Giese, H., Bencomo, N., Pasquale, L., Ramirez, A.J., Inverardi, P., Wätzoldt, S.,
Clarke, S.: Living with uncertainty in the age of runtime models. In: Models@ run.
time, pp. 47–100. Springer (2014)
16. Hielscher, J., Kazhamiakin, R., Metzger, A., Pistore, M.: A framework for proactive
self-adaptation of service-based applications based on online testing. In: European
Conference on a Service-Based Internet. pp. 122–133. Springer (2008)
17. Hyndman, R.J., Athanasopoulos, G.: Forecasting: principles and practice. OTexts
(2018)
18. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. Computer 36(1),
41–50 (Jan 2003)
19. Kim, Y., Kim, M., Kim, T.H.: Statistical model checking for safety critical hybrid
systems: An empirical evaluation. In: Haifa Verification Conference. pp. 162–177.
Springer (2012)
20. Krupitzer, C., Pfannemüller, M., Kaddour, J., Becker, C.: Satisfy: Towards a self-
learning analyzer for time series forecasting in self-improving systems. In: 2018
IEEE 3rd International Workshops on Foundations and Applications of Self* Sys-
tems (FAS* W). pp. 182–189. IEEE (2018)
21. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: Verification of probabilistic real-time systems. In: International Conference on Computer Aided Verification. pp. 585–591. Springer (2011)
22. Larsen, K.G., Legay, A.: Statistical model checking past, present, and future. In:
International Symposium On Leveraging Applications of Formal Methods, Verifi-
cation and Validation. pp. 135–142. Springer (2014)
23. Legay, A., Delahaye, B., Bensalem, S.: Statistical model checking: An overview. In:
International conference on runtime verification. pp. 122–135. Springer (2010)
24. Metzger, A.: Towards accurate failure prediction for the proactive adaptation of
service-oriented systems. In: Proceedings of the 8th workshop on Assurances for
self-adaptive systems. pp. 18–23. ACM (2011)
25. Metzger, A., Neubauer, A., Bohn, P., Pohl, K.: Proactive process adaptation using
deep learning ensembles. In: International Conference on Advanced Information
Systems Engineering. pp. 547–562. Springer (2019)
26. Moreno, G.A., Cámara, J., Garlan, D., Schmerl, B.: Proactive self-adaptation under
uncertainty: a probabilistic model checking approach. In: Proceedings of the 2015
10th Joint Meeting on Foundations of Software Engineering. pp. 1–12. ACM (2015)
27. Moreno, G.A., Cámara, J., Garlan, D., Schmerl, B.: Efficient decision-making under
uncertainty for proactive self-adaptation. In: 2016 IEEE International Conference
on Autonomic Computing (ICAC). pp. 147–156. IEEE (2016)
28. Moreno, G.A., Cámara, J., Garlan, D., Schmerl, B.: Flexible and efficient decision-
making for proactive latency-aware self-adaptation. ACM Transactions on Au-
tonomous and Adaptive Systems (TAAS) 13(1), 3 (2018)
29. Shin, Y.J., Baek, Y.M., Jee, E., Bae, D.H.: Data-driven environment modeling for
adaptive system-of-systems. In: Proceedings of the 34th ACM/SIGAPP Sympo-
sium on Applied Computing. pp. 2044–2047 (2019)
30. Spitzer, F.: Principles of random walk, vol. 34. Springer Science & Business Media
(2013)
31. Sykes, D., Corapi, D., Magee, J., Kramer, J., Russo, A., Inoue, K.: Learning revised
models for planning in adaptive systems. In: 2013 35th International Conference
on Software Engineering (ICSE). pp. 63–71. IEEE (2013)
32. Wald, A.: Sequential tests of statistical hypotheses. The annals of mathematical
statistics 16(2), 117–186 (1945)
33. Xu, C., Yang, W., Ma, X., Cao, C.: Environment rematching: toward dependability
improvement for self-adaptive applications. In: Proceedings of the 28th IEEE/ACM
International Conference on Automated Software Engineering. pp. 592–597. IEEE
Press (2013)
34. Younes, H.L.: Verification and planning for stochastic processes with asynchronous
events. Ph.D. thesis, Carnegie Mellon University (2005)
Understanding Local Robustness of Deep Neural
Networks under Natural Variations
1 Introduction
Deep Neural Networks (DNNs) have achieved an unprecedented level of perfor-
mance over the last decade in many sophisticated areas such as image recogni-
tion [38], self-driving cars [5] and playing complex games [65]. These advances
have also motivated companies to adapt their software development flows to incorporate AI components [3]. This trend has, in turn, spawned a new area of research within software engineering addressing the quality assurance of DNN components [11, 20, 32, 36, 40, 42, 55, 57, 73, 74, 91, 92].
Fig. 1: (a)–(d) A well-trained ResNet model [14] misclassifies the rotated variations of a bird image into three different classes ((a) 0°: bird, (b) +6°: airplane, (c) +24°: cat, (d) −9°: dog), though the original un-rotated image is classified correctly. (e)–(h) The same model successfully classifies all the rotated variants ((e) 0°, (f) +6°, (g) +24°, (h) −9°: all bird) of another bird image from the same test set. The sub-captions consist of rotation degrees and the predicted classes.
Notwithstanding the impressive capabilities of DNNs, recent research has shown that DNNs can be easily fooled, i.e., made to mispredict, by small variations of the input data [14, 23, 73]: either by adding a norm-bounded pixel-level perturbation to the original input [9, 23, 71], or by applying natural variants to the inputs, e.g., rotating an image, changing the lighting conditions, or adding fog [14, 52, 55]. The natural variants are especially concerning as they can occur naturally in the field without any active adversary and may lead to serious consequences [73, 92].
While norm-bound perturbation based DNN robustness is relatively well-
studied, our knowledge of DNN robustness under the natural variations is still
limited—we do not know which images are more robust than others, what their
characteristics are, etc. For example, consider Figure 1: although the original
bird image (a) is predicted correctly by a DNN, its rotated variations in images
(b)-(d) are mispredicted to three different classes. This makes the original image
(a) very weak as far as robustness is concerned. In contrast, the bird image
(e) and all its rotated versions (generated by the same degrees of rotation) in
Figure 1:(f)-(h) are correctly classified. Thus, the original image (e) is quite
robust. It is important to distinguish between such robust vs. non-robust images,
as the non-robust ones can induce errors with slight natural variations.
Existing literature, however, focuses on estimating the overall robustness of
DNNs across all the test data [4, 14, 88]. From a traditional software point of
view, this is analogous to estimating how buggy a software is without actually
localizing the bugs. Our current work tries to bridge this gap by localizing the
non-robust points in the input space that pose significant threats to a DNN
model’s robustness. However, unlike traditional software where bug localization
is performed in program space, we identify the non-robust inputs in the data
space. As a DNN is a combination of data and architecture, and the architecture
is largely uninterpretable, we restrict our study of non-robustness to the input
space. To this end, we first quantify the local (per input) robustness property of
a DNN. First, we treat all the natural variants of an input image as its neigh-
bors. Then, for each input data, we consider a population of its neighbors and
measure the fraction of this population classified correctly by the DNN: a high
fraction of correct classifications indicates good robustness (Figure 1:e) and vice
versa (Figure 1:a). We term this measure neighbor accuracy. Using this metric,
we study different local robustness properties of the DNNs and analyze how
the weak, a.k.a. non-robust, points differ characteristically from their robust
counterparts. Given that the number of natural neighbors of an image can be
potentially infinite, first we performed a more controlled analysis by keeping the
natural variants limited to spatially transformed images generated by rotation
and translation, following the previous work [4, 14, 88]. Such controlled exper-
iments help us to explore different robustness properties while systematically
varying transformation parameters.
Our analysis of three well-known object recognition datasets across three popular DNN models, i.e., a total of nine DNN-dataset combinations, reveals several interesting properties of the local robustness of a DNN w.r.t. natural variants:
– The neighbors of a weaker point are not necessarily classified into one single incorrect class. In fact, the weaker the point is, the more diverse its neighbors' (mis)classifications become.
– The weak points are concentrated towards the class decision boundaries of the DNN in the feature space.
Existing studies have proposed different techniques to generate test data inputs
by perturbing input images for a DNN and use them to evaluate the robustness
of the DNN. Depending on how the input image is perturbed, the techniques for
generating DNN test data can be classified into three broad categories:
i) Adversarial inputs are typically generated by norm-based perturbation techniques [9, 23, 39, 46, 53, 85], where some pixels of an input image (I) are perturbed under a norm-based distance (l1, l2, or l∞) such that the distance between the perturbed image and I is ≤ ε, where ε is a small positive value. These adversarial examples are used to expose the security vulnerabilities of DNNs.
ii) Natural variations are generated through a variety of image transfor-
mations, and are used to evaluate the robustness of DNNs under such varia-
tions [13, 14, 73]. Sources of these variations include changes in camera configuration, or variations in background or ambient conditions. The transformations simulating these variations can be spatial, such as rotation, translation, mirroring, shear, and scaling of images, or non-spatial, such as changes in the brightness or contrast of an image. Here we first focus on spatial transformations, as opposed to adversarial ones, for two reasons. First, compared with adversarial examples, which are fairly contrived, spatial transformations are more likely to arise in benign environments. Second, with simple parametric spatial transformations like rotations and translations, it is easier to systematically explore local robustness properties. Later, to emulate more natural variations, we add fog and rain to the images of a self-driving car dataset and evaluate our method's generalizability.
iii) GAN-based image generation techniques use Generative Adversarial Networks (GANs) to synthesize images. A GAN is a class of generative models trained
as a minimax two-player game between a generative model and a discriminative
model [22]. GAN-based image generation has been successfully used to generate
DNN test data instances [92, 93].
Standard Accuracy vs. Robust Accuracy. Standard accuracy measures how accurately an ML model predicts the correct classes of the instances in a given test dataset. Robust accuracy, a.k.a. adversarial accuracy, estimates how accurately an ML model classifies the generated variants [76]. In this paper, we adopt a point-wise robust accuracy measure, neighbor accuracy, to quantify the robustness of a DNN over the neighbors around each data point.
3 Methodology
3.1 Terminology
Neighbor Generation: For the image classification tasks, for each original im-
age point, we generate its neighbors by combining two types of spatial transfor-
mations: rotation and translation. We carefully choose these two types as repre-
sentatives of non-linear and linear spatial transformations, respectively, following
Engstrom et al. [14]. In particular, following them, we generate a neighbor by
randomly rotating the original point by t (∈ [−30, 30]) degrees, shifting it by dx
(about 10% of the original image’s width i.e. ∈ [−3, 3]) pixels horizontally, and
shifting it by dy (about 10% of the original image’s height i.e. ∈ [−3, 3]) pixels
vertically. It should be noted that for image classification it is standard in the
literature [14, 15, 86] to assume that the transformed image has the same label
as the original one. As the transformation parameters are continuous, there can
be infinite neighbors of an original data point. Hence, we sample m neighbors
for each original data point. We explore the impact of m in RQ2.
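A minimal sketch of this sampling procedure, using Pillow as one possible image library (an implementation choice of ours for illustration):

import random
from PIL import Image

def generate_neighbors(img, m):
    """Sample m spatially transformed neighbors of an image."""
    neighbors = []
    for _ in range(m):
        t = random.uniform(-30, 30)   # rotation degrees, t in [-30, 30]
        dx = random.randint(-3, 3)    # horizontal shift, ~10% of width
        dy = random.randint(-3, 3)    # vertical shift, ~10% of height
        neighbors.append(img.rotate(t, translate=(dx, dy)))
    return neighbors  # each neighbor keeps the original image's label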
For the self-driving-car task, where the model predicts the steering angle, for each original image point we generate 50% of the neighbors with a rain effect and the remaining 50% with a fog effect. We adopt a widely used self-driving-car data augmentation package, Automold [60], for adding these effects, randomly varying the degree of the added effect. For the rain effect, we set "rain_type=heavy" and leave everything else at its default. For the fog effect, we leave everything at its default.
Estimating Neighbor Accuracy: To compute the neighbor accuracy of a data point for a given DNN model, we first generate its neighbor samples by applying different transformations: spatial for image classification, and rain or fog for the self-driving-car application. We then feed these generated neighbors into the DNN model and compute the accuracy by comparing the DNN's output with the label of the original data point. For the self-driving-car application, we follow the technique described in DeepTest [73]. More specifically, if the predicted steering angle of the transformed image is within a threshold of that of the original image, we consider it correct. This ensures that small variations of the steering angle are tolerated in the predicted results. We then compute

neighbour accuracy = #correct predictions / (#original point + #total neighbours).
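A minimal sketch of this computation; model.predict is a placeholder, and the regression branch compares against the model's prediction on the original image, per the DeepTest-style thresholding above:

def neighbor_accuracy(model, original, label, neighbors, threshold=None):
    """Fraction of the original point and its neighbors predicted correctly."""
    points = [original] + neighbors
    if threshold is None:  # classification: compare against the original label
        correct = sum(model.predict(p) == label for p in points)
    else:                  # regression: tolerate small steering-angle deviations
        ref = model.predict(original)
        correct = sum(abs(model.predict(p) - ref) <= threshold for p in points)
    return correct / len(points)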
If a data point has low neighbor accuracy under the pre-trained model, the point lies in a vulnerable region, where a slight change to the image may cause the pre-trained DNN to misclassify the changed input.
[Figure: DeepRobust-W overview. Neighbors generated from the original training/test data with ground-truth labels are fed to the pre-trained DNN under test to compute neighbor accuracy and label strong/weak points; feature vectors extracted by the pre-trained DNN are then used by DeepRobust-W to classify a test point as a strong or weak point.]
we identify the weak points as the ground truth. The highest diversity score among these weak points is chosen as the diversity score threshold.
4 Experimental Design
Image Classification. Similar to many existing works [36, 41, 61, 73, 74, 92] on DNN testing, we use the image classification application of DNNs as the basis of our investigation. This is one of the most popular computer vision tasks, where the model tries to classify the objects in an image or video.
Datasets: We conduct our experiments on three image classification datasets:
F-MNIST [87], CIFAR-10 [37], and SVHN [89].
– CIFAR-10: consists of 50,000 training and 10,000 testing 32x32 color images. Each image belongs to one of ten object classes.
– F-MNIST: consists of 60,000 training images and 10,000 testing 28x28 gray-
scale images. Each image is one of ten fashion product related classes.
– SVHN: consists of 73,257 training images and 26,032 testing images. Each
image is a 32x32 color cropped image of house numbers collected from Google
Street View images.
– WRN: We use the structure with block type (3, 3) and depth 28 from [90], but replace the widening factor 10 with 2 for fewer parameters and faster training.
We train all the models from scratch using widely used hyper-parameters and achieve an accepted level of validation natural accuracy. When training models on CIFAR-10, we pre-process the input images with random augmentation (random translation with dx, dy ∈ [−2, 2] pixels both horizontally and vertically), which is a widely used preprocessing step for this dataset. When training models on the other two datasets, plain images are fed directly into the models. The natural accuracies and robust accuracies of the models are shown in Table 1.
4.2 Evaluation
Evaluation Metric. We evaluate DeepRobust-W and DeepRobust-B for detecting weak points under twelve and nine different DNN-dataset combinations, respectively, in terms of precision, recall, and F1 score. Let us assume that E is the set of weak points detected by our tool and A is the set of true weak points in the ground truth. Then precision = |A ∩ E| / |E| and recall = |A ∩ E| / |A|. The F1 score is a single accuracy measure that considers both precision and recall, defined as F1 = (2 × precision × recall) / (precision + recall). We perform each experiment for two thresholds of neighbor accuracy that define strong vs. weak points: 0.75 and 0.50.
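For instance, with E and A as Python sets, these metrics reduce to a few lines:

def precision_recall_f1(E, A):
    tp = len(A & E)                       # detected weak points that are true
    precision = tp / len(E) if E else 0.0
    recall = tp / len(A) if A else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1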
Baselines. We compare DeepRobust-W and DeepRobust-B against two baselines. A naive baseline (denoted random) randomly selects the same number of points as detected by our proposed method and marks them as weak points. The other baseline (denoted top1) is based on the prediction confidence score: if the confidence for a data point is higher than a pre-defined cutoff, we call it a strong point, and a weak point otherwise. This baseline is based on the intuition that DNNs might not be confident enough when predicting weak points.
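A minimal sketch of the top1 baseline (the cutoff value is a free parameter):

import numpy as np

def top1_baseline(softmax_outputs, cutoff):
    """softmax_outputs: (n_points, n_classes). True marks a predicted strong point."""
    return softmax_outputs.max(axis=1) > cutoff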
5 Results
RQ1b. Given a well-trained model, do the feature representations of the data points vary with their degree of robustness? By analyzing the classifications of the neighbors of weak vs. strong points, we observe that the weaker a point is, the more likely its neighbors are to be classified into different classes. We quantify this observation by computing the diversity of the outputs of a point's neighbors; we adopt the Simpson Diversity Index (λ) [67], as defined in Equation (1).
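Equation (1) is not reproduced here; the sketch below assumes it is the classic Simpson index λ = Σ p_i² (the probability that two randomly drawn neighbor predictions agree), which matches the reported values: λ = 1 when all neighbors are predicted identically, and lower λ means more diverse (mis)classifications.

from collections import Counter

def simpson_index(predicted_classes):
    """λ = Σ p_i², where p_i is the fraction of neighbors predicted as class i."""
    n = len(predicted_classes)
    counts = Counter(predicted_classes)
    return sum((c / n) ** 2 for c in counts.values())

print(simpson_index(['bird'] * 8))                         # 1.0  -> not diverse
print(simpson_index(['bird', 'cat', 'dog', 'plane'] * 2))  # 0.25 -> very diverse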
Table 3 shows the Spearman correlation between neighbor accuracy and λ for the three datasets and the three models on each. Note that while calculating the correlation, we remove points with neighbor accuracy 100%, since many points have 100% neighbor accuracy and would bias the Spearman correlation upward; if we include them, the correlations become even higher. We notice that in every setting, the Spearman correlation is never lower than 0.853. This indicates that neighbor accuracy and diversity are highly correlated with each other. For example, the bird image in Fig. 1a has neighbor accuracy 0.49 and diversity 0.36, while the bird image in Fig. 1e has neighbor accuracy 1 and diversity 1. This shows that the classifier tends to be confused by weak points and mispredicts them into many different classes.

Table 3: Spearman correlation between neighbor accuracy and Simpson Diversity Index. All coefficients are reported with statistical significance (p < 0.05).

Dataset      |      CIFAR-10       |        SVHN         |       F-MNIST
Model        | ResN   WRN    VGG   | ResN   WRN    VGG   | ResN   WRN    VGG
corr. coeff. | 0.853  0.909  0.946 | 0.970  0.984  0.983 | 0.923  0.962  0.8947
Result 1: In the representation space, weak points tend to lie towards the class decision boundary, while strong points lie towards the center. The weaker an image is, the more the model tends to be confused by it, classifying its neighbors into more diverse classes.
A: with varying number of strong/weak points

dataset    (0/1/2)  prec   recall  tp    fp    f1
CIFAR-10   0        0.660  0.518   1290  664   0.581
CIFAR-10   1        0.615  0.599   1490  932   0.607
CIFAR-10   2        0.544  0.699   1740  1460  0.612
SVHN       0        0.677  0.502   1414  674   0.577
SVHN       1        0.575  0.653   1837  1357  0.612
SVHN       2        0.332  0.767   2160  4356  0.463
F-MNIST    0        0.794  0.787   2144  556   0.791
F-MNIST    1        0.746  0.839   2284  777   0.790
F-MNIST    2        0.712  0.871   2372  962   0.783

B: with varying number of neighbors m

dataset    m    prec   recall  tp    fp    f1
CIFAR-10   6    0.662  0.389   967   493   0.490
CIFAR-10   12   0.685  0.384   955   440   0.492
CIFAR-10   25   0.665  0.502   1250  629   0.572
CIFAR-10   50   0.660  0.518   1290  664   0.581
CIFAR-10   200  0.683  0.507   1261  585   0.582
SVHN       6    0.723  0.403   1136  436   0.518
SVHN       12   0.672  0.527   1483  725   0.590
SVHN       25   0.619  0.629   1771  1090  0.624
SVHN       50   0.632  0.605   1703  993   0.618
SVHN       200  0.667  0.550   1550  774   0.603
F-MNIST    6    0.817  0.727   1981  443   0.770
F-MNIST    12   0.784  0.790   2153  592   0.787
F-MNIST    25   0.773  0.787   2143  629   0.780
F-MNIST    50   0.836  0.727   1981  390   0.778
F-MNIST    200  0.778  0.812   2211  632   0.794
the manual labels, and λ is a positive coefficient chosen to reflect a user's tolerance of the deviation. Note that there is no softmax layer (and thus no confidence score) in these regression models, so the top1 baseline method cannot be used here.
Table 7 shows the result when λ = 3. At the 0.75 setting, DeepRobust-W has an f1 score of up to 78.9%, with an average of 58.2%. At the 0.50 setting, DeepRobust-W detects weak points with an average f1 of 47.9%, while it can reach up to 68.2%. It consistently produces better estimates than the random baseline under all settings. It should be noted that our observation holds for all the λ values used in [73], from λ = 1 to 5. This shows that our proposed method DeepRobust-W can be applied to regression problems with more complex natural transformations.
Fig. 8: The t-SNE plot of correctly classified data points from the Self-Driving dataset by the Epoch model. Data points are colored based on neighbor accuracy.
It should also be noted that it is unrealistic to use DeepRobust-B for this task, for two reasons. First, it is impractical to try different variations of an image in real time for a self-driving car, which is a time-sensitive application. Second, DeepRobust-B requires the calculation of the neighbor diversity score; for a regression problem, the predicted values are continuous, so there is a very low probability of any two predictions being equal. Thus, the neighbor diversity score would be the same for every data point and cannot be used for identifying the weak points.

Table 7: Performance of DeepRobust-W for predicting weak points of the Self-Driving dataset.

model      method  | 0.75 neighbor acc.  | 0.50 neighbor acc.
                   | f1     tp     fp    | f1     tp     fp
chauffeur  ours    | 0.417  555    547   | 0.346  339    384
chauffeur  random  | 0.146  194    908   | 0.096  94     629
epoch      ours    | 0.789  4354   1112  | 0.682  2641   1127
epoch      random  | 0.586  3234   2232  | 0.411  1592   2176
dave2      ours    | 0.541  979    471   | 0.409  475    246
dave2      random  | 0.193  350    1100  | 0.121  141    580
Result 4: DeepRobust-W can detect weak points of a self-driving car
dataset with f1 score up to 78.9%, with an average of 58.2%, at neighbor
accuracy cutoff 0.75.
6 Related Work
Adversarial examples. Many works focus on generating adversarial examples to fool DNNs and evaluate their robustness using pixel-based perturbations [9, 17, 23, 25, 31, 36, 48, 49, 54, 63, 80–83]. Other papers [14, 15, 86], like us, proposed more realistic transformations to generate adversarial examples. In particular, Engstrom et al. [14] showed that a simple rotation and translation can fool a DNN-based classifier, and that spatial adversarial robustness is orthogonal to lp-bounded adversarial robustness. However, all these works estimate the overall robustness of a DNN based on its aggregated behavior across many data points. In contrast, we analyze the robustness of individual data points under natural variations and propose methods to detect weak/strong points automatically.
DNN testing. Many researchers [16, 21, 29, 36, 41, 55, 69, 70, 74, 94] proposed
techniques to test DNN. For example, Pei et al. [55] proposed an image transfor-
mation based differential testing framework, which can detect erroneous behavior
by comparing the outputs of an input image across multiple DNNs. Ferit et al.
[16] used fault localization methods to identify suspicious neurons and leveraged
those to generate adversarial test cases.
In contrast, others [8, 29, 64, 73, 78, 92, 94] used metamorphic testing, where the assumption is that the outputs of an original image and its transformed version will be the same under natural transformations. Among them, some use an uncertainty measure to quantify certain types of non-robustness of an input, either to prioritize samples for testing/retraining [8] or to generate test cases [78]. We follow a similar metamorphic property when estimating neighbor accuracy, and our proposed DeepRobust-B also leverages an uncertainty measure. The key differences are: first, we focus on estimating a model's performance on general natural variants of an input rather than on the input itself or only its spatial variants; second, we focus on the task of weak-point detection rather than on prioritizing or generating test cases. We also give detailed analyses of the properties of natural variants and propose a feature-vector-based white-box detection method, DeepRobust-W. Further, we show that our method works across domains (both image classification and self-driving car controllers) and tasks (both classification and regression). Other uncertainty work complements ours in the sense that we can easily leverage the weak points identified by DeepRobust-W and DeepRobust-B to prioritize test cases or to generate more adversarial cases of natural variants.
Another line of work [18, 19, 27, 33, 34, 58, 72] estimates the confidence of a DNN's output. For example, [19] leverages information thrown away by existing models to measure confidence, and [27] shows that NN properties such as depth, width, weight decay, and batch normalization are important factors influencing prediction confidence. Although such methods can provide a confidence measure for an input or its adversarial variants, they do not check its natural robustness property, i.e., how the model behaves under natural variations.
DNN verification. There is also work on verifying properties of a DNN model [7, 12, 24, 30, 56, 62, 83]. Most of it focuses on verifying properties over an lp-norm-bounded input space. Recently, Balunovic et al. [4] provided the first verification technique for verifying a data point's robustness against spatial transformations. However, their technique suffers from scalability issues.
8 Conclusion
In this work, we incorporate data characteristics into the robustness testing of DNN models. We adopt the concept of neighbor accuracy as a measure of the local robustness of a data point on a given model. We explore the properties of neighbor accuracy and find that weak points are often located towards the corresponding class boundaries and that their transformed versions tend to be predicted into more diverse classes. Leveraging these observations, we propose a white-box method and a black-box method to identify weak/strong points and warn a user about potential weaknesses in the given trained model in real time. We design, implement, and evaluate our proposed framework, DeepRobust-W and DeepRobust-B, on three image recognition datasets and one self-driving car dataset (for DeepRobust-W only), with three models for each. The results show that they can effectively identify weak/strong points with high precision and recall.
For future work, other consistency analysis methods [18], e.g., variation ratio and entropy, can be tried. We could potentially attain statistical guarantees for our black-box method by modeling the neighbor accuracy distribution and assuming a certain level of correlation between neighbor accuracy and complexity score. Besides, other definitions of robustness, such as consistency, can be explored. We can also leverage ideas from [8, 78] to prioritize test cases or to generate harder test cases based on the identified weak points. Further, we can potentially adapt existing fixing methods such as [20] to target and fix the weak points.
9 Acknowledgement
We thank Mukul Prasad and Ripon Saha from Fujitsu US for valuable discussions.
This work is supported in part by NSF CCF-1845893 and CCF-1822965.
References
1. Chauffeur model. https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/chauffeur (2016)
2. Epoch model. https://github.com/udacity/self-driving-car/tree/master/steering-models/community-models/cg23 (2016)
3. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N.,
Nushi, B., Zimmermann, T.: Software engineering for machine learning: A case
study. In: Proceedings of the 41st International Conference on Software Engineer-
ing: Software Engineering in Practice. pp. 291–300. ICSE-SEIP ’19, IEEE Press
(2019). https://doi.org/10.1109/ICSE-SEIP.2019.00042
4. Balunovic, M., Baader, M., Singh, G., Gehr, T., Vechev, M.: Certifying geomet-
ric robustness of neural networks. In: Advances in Neural Information Processing
Systems. pp. 15287–15297 (2019)
5. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P.,
Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for
self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
6. Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel,
L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to
end learning for self-driving cars. CoRR abs/1604.07316 (2016), http://arxiv.org/abs/1604.07316
7. Bunel, R., Turkaslan, I., Torr, P.H., Kohli, P., Kumar, M.P.: A unified view of piece-
wise linear neural network verification. In: Proceedings of the 32nd International
Conference on Neural Information Processing Systems. p. 4795–4804. NIPS’18,
Curran Associates Inc., Red Hook, NY, USA (2018)
8. Byun, T., Sharma, V., Vijayakumar, A., Rayadurgam, S., Cofer, D.: Input priori-
tization for testing neural networks (01 2019)
9. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In:
Security and Privacy (SP), 2017 IEEE Symposium on. pp. 39–57. IEEE (2017)
10. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. Lawrence Erl-
baum Associates (1988)
11. Du, X., Xie, X., Li, Y., Ma, L., Liu, Y., Zhao, J.: Deepstellar: Model-
based quantitative analysis of stateful deep learning systems. In: Proceed-
ings of the 2019 27th ACM Joint Meeting on European Software Engineer-
ing Conference and Symposium on the Foundations of Software Engineering. p.
477–487. ESEC/FSE 2019, Association for Computing Machinery, New York, NY,
USA (2019). https://doi.org/10.1145/3338906.3338954
12. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks. In:
International Symposium on Automated Technology for Verification and Analysis.
pp. 269–286. Springer (2017)
13. Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., Madry, A.: A rotation and
a translation suffice: Fooling cnns with simple transformations. arXiv preprint
arXiv:1712.02779 (2017)
14. Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., Madry, A.: Exploring the land-
scape of spatial robustness. In: International Conference on Machine Learning. pp.
1802–1811 (2019)
15. Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., Mądry, A.: A rotation and a
translation suffice: Fooling cnns with simple transformations. In: Proceedings of
the 36th international conference on machine learning (ICML) (2019)
16. Eniser, H.F., Gerasimou, S., Sen, A.: Deepfault: Fault localization for deep neu-
ral networks. In: Hähnle, R., van der Aalst, W. (eds.) Fundamental Approaches
to Software Engineering. pp. 171–191. Springer International Publishing, Cham
(2019)
17. Feinman, R., Curtin, R.R., Shintre, S., Gardner, A.B.: Detecting adversarial sam-
ples from artifacts. arXiv preprint arXiv:1703.00410 (2017)
18. Gal, Y.: Uncertainty in Deep Learning (2016)
19. Gal, Y., Ghahramani, Z.: Dropout as a bayesian approximation: Representing
model uncertainty in deep learning. In: Balcan, M.F., Weinberger, K.Q. (eds.)
Proceedings of The 33rd International Conference on Machine Learning. Proceed-
ings of Machine Learning Research, vol. 48, pp. 1050–1059. PMLR, New York, New
York, USA (20–22 Jun 2016), http://proceedings.mlr.press/v48/gal16.html
20. Gao, X., Saha, R., Prasad, M., Roychoudhury, A.: Fuzz testing based data aug-
mentation to improve robustness of deep neural networks. In: Proceedings of the
42nd International Conference on Software Engineering. ICSE 2020, ACM (2020)
21. Gerasimou, S., Eniser, H.F., Sen, A., Çakan, A.: Importance-driven deep learning
system testing. In: International Conference of Software Engineering (ICSE) (2020)
22. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural
information processing systems. pp. 2672–2680 (2014)
23. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial
examples. In: International Conference on Learning Representations (ICLR) (2015)
24. Gross, D., Jansen, N., Pérez, G.A., Raaijmakers, S.: Robustness verification for
classifier ensembles. In: Hung, D.V., Sokolsky, O. (eds.) Automated Technology for
Verification and Analysis. pp. 271–287. Springer International Publishing, Cham
(2020)
25. Gu, S., Rigazio, L.: Towards deep neural network architectures robust to adversarial
examples. In: International Conference on Learning Representations (ICLR) (2015)
26. Guo, C., Gardner, J., You, Y., Wilson, A.G., Weinberger, K.: Simple black-box
adversarial attacks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of
the 36th International Conference on Machine Learning. Proceedings of Machine
Learning Research, vol. 97, pp. 2484–2493. PMLR, Long Beach, California, USA
(09–15 Jun 2019), http://proceedings.mlr.press/v97/guo19a.html
27. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neu-
ral networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th Inter-
national Conference on Machine Learning. Proceedings of Machine Learning Re-
search, vol. 70, pp. 1321–1330. PMLR, International Convention Centre, Sydney,
Australia (06–11 Aug 2017), http://proceedings.mlr.press/v70/guo17a.html
28. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
29. He, P., Meister, C., Su, Z.: Structure-invariant testing for machine translation. In:
International Conference of Software Engineering (ICSE) (2020)
30. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural
networks. In: International Conference on Computer Aided Verification. pp. 3–29.
Springer (2017)
31. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., Madry, A.: Adversarial
examples are not bugs, they are features (2019), http://arxiv.org/abs/1905.02175
32. Islam, M.J., Nguyen, G., Pan, R., Rajan, H.: A comprehensive study on deep learning bug characteristics. In: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE 2019, ACM (2019)
47. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables
is stochastically larger than the other. Annals of Mathematical Statistics 18(1),
50–60 (1947)
48. Mao, C., Zhong, Z., Yang, J., Vondrick, C., Ray, B.: Metric learning for adversarial
robustness. In: Advances in Neural Information Processing Systems. pp. 478–489
(2019)
49. Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial
perturbations. In: International Conference on Learning Representations (ICLR)
(2017)
50. Mirman, M., Gehr, T., Vechev, M.: Differentiable abstract interpretation for prov-
ably robust neural networks. In: International Conference on Machine Learning.
pp. 3575–3583 (2018)
51. Moon, S., An, G., Song, H.O.: Parsimonious black-box adversarial attacks via effi-
cient combinatorial optimization. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Pro-
ceedings of the 36th International Conference on Machine Learning. Proceedings
of Machine Learning Research, vol. 97, pp. 4636–4645. PMLR, Long Beach, Cali-
fornia, USA (09–15 Jun 2019), http://proceedings.mlr.press/v97/moon19a.html
52. Ozdag, M., Raj, S., Fernandes, S., Velasquez, A., Pullum, L., Jha, S.K.: On the sus-
ceptibility of deep neural networks to natural perturbations. In: AISafety@IJCAI
(2019)
53. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The
limitations of deep learning in adversarial settings. In: 2016 IEEE European Sym-
posium on Security and Privacy (EuroS&P). pp. 372–387. IEEE (2016)
54. Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to
adversarial perturbations against deep neural networks. In: Security and Privacy
(SP), 2016 IEEE Symposium on. pp. 582–597. IEEE (2016)
55. Pei, K., Cao, Y., Yang, J., Jana, S.: Deepxplore: Automated whitebox testing
of deep learning systems. In: Proceedings of the 26th Symposium on Operating
Systems Principles. pp. 1–18. ACM (2017)
56. Pei, K., Cao, Y., Yang, J., Jana, S.: Towards practical verification of machine
learning: The case of computer vision systems. arXiv preprint arXiv:1712.01785
(2017)
57. Pham, H.V., Lutellier, T., Qi, W., Tan, L.: Cradle: Cross-backend validation to
detect and localize bugs in deep learning libraries. In: Proceedings of the 41st
International Conference on Software Engineering. p. 1027–1038. ICSE ’19, IEEE
Press (2019). https://doi.org/10.1109/ICSE.2019.00107
58. Qiu, X., Meyerson, E., Miikkulainen, R.: Quantifying point-prediction uncertainty
in neural networks via residual estimation with an i/o kernel. In: International
Conference on Learning Representations (2020), https://openreview.net/forum?id=rkxNh1Stvr
59. Sawilowsky, S.: New effect size rules of thumb. Journal of Modern Applied Statis-
tical Methods 8, 597–599 (11 2009). https://doi.org/10.22237/jmasm/1257035100
60. Saxena, U.: Automold. https://github.com/UjjwalSaxena/Automold--Road-Augmentation-Library/
61. Sen, K., Marinov, D., Agha, G.: CUTE: A concolic unit testing engine for C. In:
FSE (2005)
62. Seshia, S.A., Desai, A., Dreossi, T., Fremont, D.J., Ghosh, S., Kim, E., Shivaku-
mar, S., Vazquez-Chanlatte, M., Yue, X.: Formal specification for deep neural
networks. In: International Symposium on Automated Technology for Verification
and Analysis. pp. 20–34. Springer (2018)
63. Shaham, U., Yamada, Y., Negahban, S.: Understanding adversarial training: In-
creasing local stability of neural nets through robust optimization. arXiv preprint
arXiv:1511.05432 (2015)
64. Shankar, V., Dave, A., Roelofs, R., Ramanan, D., Recht, B., Schmidt, L.: A sys-
tematic framework for natural perturbations from videos (06 2019)
65. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G.,
Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S.,
Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M.,
Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–503 (2016), http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
66. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale im-
age recognition. In: International Conference on Learning Representations (ICLR)
(2015)
67. Simpson, E.H.: Measurement of diversity. Nature 163(4148), 688 (1949), https://doi.org/10.1038/163688a0
68. Stocco, A., Weiss, M., Calzana, M., Tonella, P.: Misbehaviour prediction for au-
tonomous driving systems. In: Proceedings of 42nd International Conference on
Software Engineering. p. 12 pages. ICSE ’20, ACM (2020)
69. Stocco, A., Weiss, M., Calzana, M., Tonella, P.: Misbehaviour prediction for au-
tonomous driving systems. In: International Conference of Software Engineering
(ICSE) (2020)
70. Sun, Y., Wu, M., Ruan, W., Huang, X., Kwiatkowska, M., Kroening, D.: Concolic
testing for deep neural networks (2018)
71. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.,
Fergus, R.: Intriguing properties of neural networks. In: International Conference
on Learning Representations (ICLR) (2014)
72. Teye, M., Azizpour, H., Smith, K.: Bayesian uncertainty estimation for batch nor-
malized deep networks. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th In-
ternational Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 80, pp. 4907–4916. PMLR, Stockholmsmässan, Stockholm Sweden
(10–15 Jul 2018), http://proceedings.mlr.press/v80/teye18a.html
73. Tian, Y., Pei, K., Jana, S., Ray, B.: Deeptest: Automated testing of deep-neural-
network-driven autonomous cars. In: International Conference of Software Engi-
neering (ICSE), 2018 IEEE conference on. IEEE (2018)
74. Tian, Y., Zhong, Z., Ordonez, V., Kaiser, G., Ray, B.: Testing dnn image classifier
for confusion & bias errors. In: International Conference of Software Engineering
(ICSE) (2020)
75. Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., Mc-
Daniel, P.: Ensemble adversarial training: Attacks and defenses. arXiv preprint
arXiv:1705.07204 (2017)
76. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness may be
at odds with accuracy. In: International Conference on Learning Representations
(ICLR) (2019)
77. Udacity: A self-driving car simulator built with Unity. https://github.com/
udacity/self-driving-car-sim (2017), online; accessed 18 August 2019
78. Udeshi, S., Jiang, X., Chattopadhyay, S.: Callisto: Entropy-based test generation
and data quality assessment for machine learning systems. In: 2020 IEEE 13th
International Conference on Software Testing, Validation and Verification (ICST).
pp. 448–453 (2020)
336 Z. Zhong et al.
79. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.:
Residual attention network for image classification. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 3156–3164 (2017)
80. Wang, J., Dong, G., Sun, J., Wang, X., Zhang, P.: Adversarial sample detection for
deep neural network through model mutation testing. In: Proceedings of the 41st
International Conference on Software Engineering. p. 1245–1256. ICSE ’19, IEEE
Press (2019). https://doi.org/10.1109/ICSE.2019.00126, https://doi.org/10.1109/
ICSE.2019.00126
81. Wang, S., Chen, Y., Abdou, A., Jana, S.: Mixtrain: Scalable training of formally
robust neural networks. arXiv preprint arXiv:1811.02625 (2018)
82. Wang, S., Pei, K., Whitehouse, J., Yang, J., Jana, S.: Efficient formal safety analysis
of neural networks. In: Proceedings of the 32Nd International Conference on Neural
Information Processing Systems. pp. 6369–6379. NIPS’18, Curran Associates Inc.,
USA (2018), http://dl.acm.org/citation.cfm?id=3327345.3327533
83. Wang, S., Pei, K., Whitehouse, J., Yang, J., Jana, S.: Formal security analysis of
neural networks using symbolic intervals. USENIX Security Symposium (2018)
84. Wong, E., Schmidt, F., Metzen, J.H., Kolter, J.Z.: Scaling provable adversarial
defenses. In: Advances in Neural Information Processing Systems. pp. 8400–8409
(2018)
85. Xiao, C., Li, B., Zhu, J.Y., He, W., Liu, M., Song, D.: Generating adversarial
examples with adversarial networks. In: 27th International Joint Conference on
Artificial Intelligence (IJCAI) (2018)
86. Xiao, C., Zhu, J.Y., Li, B., He, W., Liu, M., Song, D.: Spatially transformed adver-
sarial examples. In: International Conference on Learning Representations (ICLR)
(2018)
87. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for bench-
marking machine learning algorithms (2017)
88. Yang, F., Wang, Z., Heinze-Deml, C.: Invariance-inducing regularization using
worst-case transformations suffices to boost accuracy and spatial robustness. In:
Advances in Neural Information Processing Systems 32. pp. 14757–14768 (2019)
89. Yuval Netzer, T.W., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in
natural images with unsupervised feature learning. In: NIPS Workshop on Deep
Learning and Unsupervised Feature Learning (2011)
90. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
91. Zhang, H., Chan, W.K.: Apricot: A weight-adaptation approach to fix-
ing deep learning models. In: 2019 34th IEEE/ACM International Confer-
ence on Automated Software Engineering (ASE). pp. 376–387 (Nov 2019).
https://doi.org/10.1109/ASE.2019.00043
92. Zhang, M., Zhang, Y., Zhang, L., Liu, C., Khurshid, S.: Deeproad: Gan-based
metamorphic autonomous driving system testing. arXiv preprint arXiv:1802.02295
(2018)
93. Zhao, Z., Dua, D., Singh, S.: Generating natural adversarial examples. In: Inter-
national Conference on Learning Representations (ICLR) (2018)
94. Zhou, H., Li, W., Kong, Z., Guo, J., Zhang, Y., Zhang, L., Yu, B., Liu, C.: Deep-
billboard: Systematic physical-world testing of autonomous driving systems. In:
International Conference of Software Engineering (ICSE) (2020)
Understanding Local Robustness of DNNs under Natural Variations 337
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
Test-Comp Contributions
Status Report on Software Testing:
Test-Comp 2021
Dirk Beyer
LMU Munich, Munich, Germany
Abstract. This report describes Test-Comp 2021, the 3rd edition of the
Competition on Software Testing. The competition is a series of annual
comparative evaluations of fully automatic software test generators for C
programs. The competition has a strong focus on reproducibility of its
results and its main goal is to provide an overview of the current state
of the art in the area of automatic test-generation. The competition was
based on 3 173 test-generation tasks for C programs. Each test-generation
task consisted of a program and a test specification (error coverage,
branch coverage). Test-Comp 2021 had 11 participating test generators
from 6 countries.
1 Introduction
Among several other objectives, the Competition on Software Testing (Test-
Comp [4, 5, 6], https://test-comp.sosy-lab.org/2021) showcases every year the state
of the art in the area of automatic software testing. This is the competition's
3rd edition. It provides an overview of the currently
achieved results by tool implementations that are based on the most recent ideas,
concepts, and algorithms for fully automatic test generation. This competition
report describes the (updated) rules and definitions, presents the competition
results, and discusses some interesting facts about the execution of the competition
experiments. The setup of Test-Comp is similar to SV-COMP [8], in terms
of both technical and procedural organization. The results are collected via
BenchExec’s XML results format [16], and transformed into tables and plots
in several formats (https://test-comp.sosy-lab.org/2021/results/). All results are
available in artifacts at Zenodo (Table 3).
This report extends previous reports on Test-Comp [4, 5, 6].
Reproduction packages are available on Zenodo (see Table 3).
Funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 418257054 (Coop).
[email protected]
Competition Goals. In summary, the goals of Test-Comp are the following [5]:
• Establish standards for software test generation. This means, most promi-
nently, to develop a standard for marking input values in programs, define
an exchange format for test suites, agree on a specification language for
test-coverage criteria, and define how to validate the resulting test suites.
• Establish a set of benchmarks for software testing in the community. This
means to create and maintain a set of programs together with coverage
criteria, and to make those publicly available for researchers to be used in
performance comparisons when evaluating a new technique.
• Provide an overview of available tools for test-case generation and a snapshot
of the state of the art in software testing to the community. This means to
compare, independently from particular paper projects and specific techniques,
different test generators in terms of effectiveness and performance.
• Increase the visibility and credits that tool developers receive. This means
to provide a forum for presentation of tools and discussion of the latest
technologies, and to give the participants the opportunity to publish about
the development work that they have done.
• Educate PhD students and other participants on how to set up performance
experiments, how to package tools in a way that supports reproduction, and
how to perform robust and accurate research experiments.
• Provide resources to development teams that do not have sufficient computing
resources and give them the opportunity to obtain results from experiments
on large benchmark sets.
[Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [5]) — from the program under test and the test specification, the test generator produces a test suite (test cases); the test validator executes the test suite and reports a bug report and coverage statistics.]
    format_version: '2.0'

    # old file name: floppy_true-unreach-call_true-valid-memsafety.i.cil.c
    input_files: 'floppy.i.cil-3.c'

    properties:
      - property_file: ../properties/unreach-call.prp
        expected_verdict: true
      - property_file: ../properties/valid-memsafety.prp
        expected_verdict: false
        subproperty: valid-memtrack
      - property_file: ../properties/coverage-branches.prp

    options:
      language: C
      data_model: ILP32

Fig. 2: Example task-definition file for the program floppy.i.cil-3.c
Table 1 lists the two FQL formulas that are used in test specifications of
Test-Comp 2021; there was no change from 2020 (except that special function
__VERIFIER_error does not exist anymore).
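For reference, the contents of the two specification files (coverage-error-call.prp and coverage-branches.prp) are reproduced below as defined by the Test-Comp rules⁵; if in doubt, the rules page is authoritative:

    COVER( init(main()), FQL(COVER EDGES(@CALL(reach_error))) )

    COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) )

The first formula asks for a test that reaches a call of the function reach_error (error coverage); the second asks for a test suite that covers all branches of the program (branch coverage).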
Benchmark Programs. The input programs were taken from the largest and
most diverse open-source repository of software-verification and test-generation
tasks³, which is also used by SV-COMP [8]. As in 2020, we selected all programs
for which the required properties were satisfied (see the issue on GitHub⁴
and the report [6]).
This selection yielded a total of 3 173 test-generation tasks, namely 607 tasks
for category Error Coverage and 2 566 tasks for category Code Coverage. The
test-generation tasks are partitioned into categories, which are listed in
Tables 6 and 7 and described in detail on the competition web site.⁶ Figure 3
illustrates the category composition.
The programs in the benchmark collection contained functions
__VERIFIER_error and __VERIFIER_assume that had a specific prede-
fined meaning. Last year, those functions were removed from all programs
in the SV-Benchmarks collection. More about the reasoning is explained
in the SV-COMP 2021 competition report [8].
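As a hedged sketch of this change (the concrete definition of the replacement function varies from program to program in the repository), a call of the former special function __VERIFIER_error() is now expressed with an ordinary function that is defined in the program itself, for example:

    #include <assert.h>

    /* program-defined replacement for the former built-in __VERIFIER_error() */
    void reach_error(void) { assert(0); }

    /* at the location of the bug (the condition is illustrative): */
    if (input > limit)
        reach_error();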
Category Error-Coverage. The first category demonstrates the ability to dis-
cover bugs. The benchmark set consists of programs that each contain a bug. Every
run is started by a batch script, which produces for every tool and every
test-generation task one of the following scores: 1 point if the validator succeeds
in executing the program under test on a generated test case that triggers the
bug (i.e., the specified function was called), and 0 points otherwise.
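For concreteness, a test case in the test-suite exchange format is a small XML file that lists, in order, the input values returned by the calls of the __VERIFIER_nondet_* functions during one execution; the values below are illustrative:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE testcase PUBLIC "+//IDN sosy-lab.org//DTD test-format testcase 1.1//EN"
        "https://sosy-lab.org/test-format/testcase-1.1.dtd">
    <testcase>
        <input>42</input>
        <input>-1</input>
    </testcase>

The validator compiles the program against a test harness that returns these values one by one and then checks whether the specified function is called (or measures the achieved coverage).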
3 https://github.com/sosy-lab/sv-benchmarks
4 https://github.com/sosy-lab/sv-benchmarks/pull/774
5 https://test-comp.sosy-lab.org/2021/rules.php
6 https://test-comp.sosy-lab.org/2021/benchmarks.php
[Fig. 3: Category structure of Test-Comp 2021 — Cover-Error comprises the subcategories Arrays, BitVectors, ControlFlow, ECA, Floats, Heap, Loops, Recursive, Sequentialized, XCSP, BusyBox-MemSafety, and DeviceDriversLinux64-ReachSafety; Cover-Branches comprises Arrays, BitVectors, ControlFlow, ECA, Floats, Heap, Loops, Recursive, Sequentialized, XCSP, Combinations, BusyBox, DeviceDriversLinux64, SQLite, and MainHeap; both feed into C-Overall.]
4 Reproducibility
In order to support independent reproduction of the Test-Comp results, we
made all major components that are used for the competition available in public
version-control repositories. An overview of the components that contribute to
the reproducible setup of Test-Comp is provided in Fig. 4, and the details are
given in Table 2. We refer to the report of Test-Comp 2019 [6] for a thorough
description of all components of the Test-Comp organization and how we ensure
that all parts are publicly available for maximal reproducibility.
In order to guarantee long-term availability and immutability of the test-
generation tasks, the produced competition results, and the produced test suites,
we also packaged the material and published it at Zenodo (see Table 3). The
archive for the competition results includes the raw results in BenchExec’s
XML exchange format, the log output of the test generators and validator,
and a mapping from file names to SHA-256 hashes. The hashes of the files
are useful for validating the exact contents of a file, and accessing the files
inside the archive that contains the test suites.
To provide transparent access to the exact versions of the test generators that
were used in the competition, all test-generator archives are stored in a public
Git repository. GitLab was used to host the repository for the test-generator
archives due to its generous repository size limit of 10 GB.
Competition Workflow. As illustrated in Fig. 4, the ingredients for a test or
verification run are (a) a test or verification task (which program and which
specification to use), (b) a benchmark definition (which categories and which
options to use), (c) a tool-info module (uniform way to access a tool’s version
string and the command line to invoke), and (d) an archive that contains all
executables that are required and cannot be installed as a standard Ubuntu package.
(a) Each test or verification task is defined by a task-definition file (as shown,
e.g., in Fig. 2). The tasks are stored in the SV-Benchmarks repository and
maintained by the verification and testing community, including the competition
participants and the competition organizer.
(b) A benchmark definition defines the choices of the participating team, that
is, which categories to execute the test generator on and which parameters to
pass to the test generator. The benchmark definition also specifies the resource
limits of the competition runs (CPU time, memory, CPU cores). The benchmark
definitions are created or maintained by the teams and the organizer.
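As an illustration, a benchmark definition in BenchExec's XML format looks roughly as follows; the tool name, resource limits, and task set below are illustrative placeholders rather than the official Test-Comp 2021 settings:

    <?xml version="1.0"?>
    <benchmark tool="klee" timelimit="15 min" memlimit="15 GB" cpuCores="8">
      <rundefinition name="test-comp21_prop-coverage-branches">
        <tasks name="ReachSafety-Loops">
          <includesfile>../sv-benchmarks/c/ReachSafety-Loops.set</includesfile>
          <propertyfile>../sv-benchmarks/c/properties/coverage-branches.prp</propertyfile>
        </tasks>
      </rundefinition>
    </benchmark>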
[Fig. 4: Repositories that contribute to the reproducible Test-Comp setup — (a) test-generation tasks, (b) benchmark definitions, (c) tool-info modules, (d) tester archives.]
Table 4: Competition candidates with tool references and representing jury members
Tester Ref. Jury member Affiliation
CMA-ES Fuzz [33] Gidon Ernst LMU Munich, Germany
CoVeriTest [12, 31] Marie-Christine Jakobs TU Darmstadt, Germany
FuSeBMC [1, 25] Kaled Alshmrany U. of Manchester, UK
HybridTiger [18, 38] Sebastian Ruland TU Darmstadt, Germany
Klee [19, 20] Martin Nowack Imperial College London, UK
Legion [37] Dongge Liu U. of Melbourne, Australia
LibKluzzer [35] Hoang M. Le U. of Bremen, Germany
PRTest [14, 36] Thomas Lemberger LMU Munich, Germany
Symbiotic [21, 22] Marek Chalupa Masaryk U., Brno, Czechia
TracerX [29, 30] Joxan Jaffar National U. of Singapore, Singapore
VeriFuzz [23] Raveendra Kumar M. Tata Consultancy Services, India
[Table: algorithms and techniques used by the participants (columns: Evolutionary Algorithms, Explicit-Value Analysis, Predicate Abstraction, Algorithm Selection, Symbolic Execution, Random Execution, Portfolio, CEGAR; rows: CMA-ES Fuzz, CoVeriTest, FuSeBMC, HybridTiger, Klee, Legion, LibKluzzer, PRTest, Symbiotic, TracerX, VeriFuzz); the checkmark matrix is not recoverable from the extraction.]
8 https://vcloud.sosy-lab.org
Table 6: Quantitative overview over all results; empty cells mark opt-outs; label 'new' indicates first-time participants. [Columns: Participant, Cover-Error (607 tasks), Cover-Branches (2 566 tasks), Overall (3 173 tasks); the score entries are not recoverable from the extraction.]
for time and energy are accumulated over all cores of the CPU. To measure the
CPU energy, we use CPU Energy Meter [17] (integrated in BenchExec [16]).
Further technical parameters of the competition machines are available in the
repository which also contains the benchmark definitions. 9
One complete test-generation execution of the competition consisted of
34 903 single test-generation runs. The total CPU time was 220 days and the
consumed energy 56 kWh for one complete competition run for test generation
(without validation). Test-suite validation consisted of 34 903 single test-suite
validation runs. The total consumed CPU time was 6.3 days. Each tool was
executed several times, in order to make sure no installation issues occur dur-
ing the execution. Including preruns, the infrastructure managed a total of
210 632 test-generation runs (consuming 1.8 years of CPU time) and 207 459
test-suite validation runs (consuming 27 days of CPU time). We did not mea-
sure the CPU energy during preruns.
Quantitative Results. Table 6 presents the quantitative overview of all tools
and all categories. The head row mentions the category and the number of test-
generation tasks in that category. The tools are listed in alphabetical order; every
table row lists the scores of one test generator. We indicate the top three candi-
dates by formatting their scores in bold face and in larger font size. An empty table
cell means that the test generator opted out of the respective main category
9 https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp21
Table 7: Overview of the top-three test generators for each category (measurement values for CPU time and energy rounded to two significant digits). [The table entries are not recoverable from the extraction.]
[Fig. 5 (plot): quantile functions for all participants — CMA-ES Fuzz, CoVeriTest, FuSeBMC, HybridTiger, Klee, Legion, LibKluzzer, PRTest, Symbiotic, TracerX, VeriFuzz; x-axis: cumulative score, y-axis: minimal number of test tasks.]
Fig. 5: Quantile functions for category Overall. Each quantile function illustrates
the quantile (x-coordinate) of the scores obtained by test-generation runs below
a certain number of test-generation tasks (y-coordinate). More details were given
previously [6]. The graphs are decorated with symbols to make them better
distinguishable without color.
Table 8: Alternative rankings; quality is given in score points (sp), CPU time in hours (h), energy in kilowatt-hours (kWh), the first rank measure in kilojoule per score point (kJ/sp), and the second rank measure in score points (sp); measurement values are rounded to 2 significant digits. [The table entries are not recoverable from the extraction.]
Green Testing — Low Energy Consumption. Since a large part of the cost of
test generation is caused by the energy consumption, it might be important to
also consider energy efficiency in rankings, as a complement to the official
Test-Comp ranking. This alternative ranking category uses the energy consumption
per score point as rank measure, CPU Energy / Quality, with the unit kilojoule
per score point (kJ/sp).¹¹ For example, a tool that obtains 1 000 score points
while consuming 10 kWh (36 000 kJ) of CPU energy is ranked with 36 kJ/sp. The
energy is measured using CPU Energy Meter [17], which we use as part of
BenchExec [16].

[Fig. 6 (bar chart): Number of evaluated test generators for each year (top: number of first-time participants; bottom: previous year's participants); the bar values are not recoverable from the extraction.]
New Test Generators. To acknowledge the test generators that participated for the
first time in Test-Comp, the second alternative ranking category lists measures
only for the new test generators, and the rank measure is the quality with the
unit score point (sp). For example, CMA-ES Fuzz, an early prototype, already
obtained a total score of 411 points in category Cover-Branches, and FuSeBMC,
a new tool based on mature components, took second place in its first
participation. This should encourage developers of test generators to
participate with new tools of any maturity level.
6 Conclusion
Test-Comp 2021 was the 3rd edition of the Competition on Software Testing,
and attracted 11 participating teams (see Fig. 6 for the participation numbers and
Table 4 for the details). The competition offers an overview of the state of the art in
automatic software testing for C programs. The competition not only executes
the test generators and collects results, but also validates the achieved coverage
of the test suites, based on the latest version of the test-suite validator TestCov.
As before, the jury and the organizer made sure that the competition follows the
high quality standards of the FASE conference, in particular with respect to the
important principles of fairness, community support, and transparency.
Data Availability Statement. The test-generation tasks and results of the
competition are published at Zenodo, as described in Table 3. All compo-
nents and data that are necessary for reproducing the competition are avail-
able in public version repositories, as specified in Table 2. Furthermore, the
results are presented online on the competition web site for easy access:
https://test-comp.sosy-lab.org/2021/results/.
11 Errata: Table 8 of last year’s report for Test-Comp 2020 contains a typo: The unit of the
energy consumption per score point is kJ/sp (instead of J/sp).
References
1. Alshmrany, K., Menezes, R., Gadelha, M., Cordeiro, L.: FuSeBMC: A white-box
fuzzer for finding security vulnerabilities in C programs (competition contribution).
In: Proc. FASE. LNCS 12649, Springer (2021)
2. Bartocci, E., Beyer, D., Black, P.E., Fedyukovich, G., Garavel, H., Hartmanns, A.,
Huisman, M., Kordon, F., Nagele, J., Sighireanu, M., Steffen, B., Suda, M., Sutcliffe,
G., Weber, T., Yamada, A.: TOOLympics 2019: An overview of competitions in
formal methods. In: Proc. TACAS (3). pp. 3–24. LNCS 11429, Springer (2019).
https://doi.org/10.1007/978-3-030-17502-3_1
3. Beyer, D.: Second competition on software verification (Summary of SV-
COMP 2013). In: Proc. TACAS. pp. 594–609. LNCS 7795, Springer (2013).
https://doi.org/10.1007/978-3-642-36742-7_43
4. Beyer, D.: Competition on software testing (Test-Comp). In: Proc. TACAS (3). pp.
167–175. LNCS 11429, Springer (2019). https://doi.org/10.1007/978-3-030-17502-
3_11
5. Beyer, D.: Second competition on software testing: Test-Comp 2020. In: Proc.
FASE. pp. 505–519. LNCS 12076, Springer (2020). https://doi.org/10.1007/978-3-
030-45234-6_25
6. Beyer, D.: First international competition on software testing (Test-Comp 2019).
Int. J. Softw. Tools Technol. Transf. (2021)
7. Beyer, D.: Results of the 3rd Intl. Competition on Software Testing (Test-Comp
2021). Zenodo (2021). https://doi.org/10.5281/zenodo.4459470
8. Beyer, D.: Software verification: 10th comparative evaluation (SV-COMP 2021). In:
Proc. TACAS (2). LNCS 12652, Springer (2021), preprint available.
9. Beyer, D.: SV-Benchmarks: Benchmark set of 3rd Intl. Competition on Software
Testing (Test-Comp 2021). Zenodo (2021). https://doi.org/10.5281/zenodo.4459132
10. Beyer, D.: Test suites from Test-Comp 2021 test-generation tools. Zenodo (2021).
https://doi.org/10.5281/zenodo.4459466
11. Beyer, D., Chlipala, A.J., Henzinger, T.A., Jhala, R., Majumdar, R.: Gener-
ating tests from counterexamples. In: Proc. ICSE. pp. 326–335. IEEE (2004).
https://doi.org/10.1109/ICSE.2004.1317455
12. Beyer, D., Jakobs, M.C.: CoVeriTest: Cooperative verifier-based testing. In: Proc.
FASE. pp. 389–408. LNCS 11424, Springer (2019). https://doi.org/10.1007/978-3-
030-16722-6_23
13. Beyer, D., Kanav, S.: CoVeriTeam: On-demand composition of cooperative
verification systems. unpublished manuscript (2021)
14. Beyer, D., Lemberger, T.: Software verification: Testing vs. model checking. In:
Proc. HVC. pp. 99–114. LNCS 10629, Springer (2017). https://doi.org/10.1007/978-
3-319-70389-3_7
15. Beyer, D., Lemberger, T.: TestCov: Robust test-suite execution and
coverage measurement. In: Proc. ASE. pp. 1074–1077. IEEE (2019).
https://doi.org/10.1109/ASE.2019.00105
16. Beyer, D., Löwe, S., Wendler, P.: Reliable benchmarking: Requirements
and solutions. Int. J. Softw. Tools Technol. Transfer 21(1), 1–29 (2019).
https://doi.org/10.1007/s10009-017-0469-y
17. Beyer, D., Wendler, P.: CPU Energy Meter: A tool for energy-aware algorithms
engineering. In: Proc. TACAS (2). pp. 126–133. LNCS 12079, Springer (2020).
https://doi.org/10.1007/978-3-030-45237-7_8
18. Bürdek, J., Lochau, M., Bauregger, S., Holzer, A., von Rhein, A., Apel, S., Beyer, D.:
Facilitating reuse in multi-goal test-suite generation for software product lines. In:
Proc. FASE. pp. 84–99. LNCS 9033, Springer (2015). https://doi.org/10.1007/978-
3-662-46675-9_6
19. Cadar, C., Dunbar, D., Engler, D.R.: Klee: Unassisted and automatic generation
of high-coverage tests for complex systems programs. In: Proc. OSDI. pp. 209–224.
USENIX Association (2008)
20. Cadar, C., Nowack, M.: Klee symbolic execution engine in 2019. Int. J. Softw.
Tools Technol. Transf. (2020). https://doi.org/10.1007/s10009-020-00570-3
21. Chalupa, M., Novák, J., Strejček, J.: Symbiotic 8: Parallel and targeted test
generation (competition contribution). In: Proc. FASE. LNCS 12649, Springer
(2021)
22. Chalupa, M., Strejček, J., Vitovská, M.: Joint forces for memory safety checking.
In: Proc. SPIN. pp. 115–132. Springer (2018). https://doi.org/10.1007/978-3-319-
94111-0_7
23. Chowdhury, A.B., Medicherla, R.K., Venkatesh, R.: VeriFuzz: Program-aware
fuzzing (competition contribution). In: Proc. TACAS (3). pp. 244–249. LNCS 11429,
Springer (2019). https://doi.org/10.1007/978-3-030-17502-3_22
24. Cok, D.R., Déharbe, D., Weber, T.: The 2014 SMT competition. JSAT 9, 207–242
(2016)
25. Gadelha, M.R., Menezes, R., Cordeiro, L.: Esbmc 6.1: Automated test-case genera-
tion using bounded model checking. Int. J. Softw. Tools Technol. Transf. (2020).
https://doi.org/10.1007/s10009-020-00571-2
26. Godefroid, P., Sen, K.: Combining model checking and testing. In: Handbook of
Model Checking, pp. 613–649. Springer (2018). https://doi.org/10.1007/978-3-319-
10575-8_19
27. Harman, M., Hu, L., Hierons, R.M., Wegener, J., Sthamer, H., Baresel, A., Roper,
M.: Testability transformation. IEEE Trans. Software Eng. 30(1), 3–16 (2004).
https://doi.org/10.1109/TSE.2004.1265732
28. Holzer, A., Schallhart, C., Tautschnig, M., Veith, H.: How did you
specify your test suite. In: Proc. ASE. pp. 407–416. ACM (2010).
https://doi.org/10.1145/1858996.1859084
29. Jaffar, J., Maghareh, R., Godboley, S., Ha, X.L.: TracerX: Dynamic symbolic
execution with interpolation (competition contribution). In: Proc. FASE. pp. 530–
534. LNCS 12076, Springer (2020). https://doi.org/10.1007/978-3-030-45234-6_28
30. Jaffar, J., Murali, V., Navas, J.A., Santosa, A.E.: Tracer: A symbolic execution
tool for verification. In: Proc. CAV. pp. 758–766. LNCS 7358, Springer (2012).
https://doi.org/10.1007/978-3-642-31424-7_61
31. Jakobs, M.C., Richter, C.: CoVeriTest with adaptive time scheduling (competition
contribution). In: Proc. FASE. LNCS 12649, Springer (2021)
32. Kifetew, F.M., Devroey, X., Rueda, U.: Java unit-testing tool com-
petition: Seventh round. In: Proc. SBST. pp. 15–20. IEEE (2019).
https://doi.org/10.1109/SBST.2019.00014
33. Kim, H.: Fuzzing with stochastic optimization (2020), Bachelor’s Thesis, LMU
Munich
34. King, J.C.: Symbolic execution and program testing. Commun. ACM 19(7), 385–394
(1976). https://doi.org/10.1145/360248.360252
35. Le, H.M.: Llvm-based hybrid fuzzing with LibKluzzer (competition con-
tribution). In: Proc. FASE. pp. 535–539. LNCS 12076, Springer (2020).
https://doi.org/10.1007/978-3-030-45234-6_29
36. Lemberger, T.: Plain random test generation with PRTest. Int. J. Softw. Tools
Technol. Transf. (2020)
37. Liu, D., Ernst, G., Murray, T., Rubinstein, B.: Legion: Best-first concolic testing
(competition contribution). In: Proc. FASE. pp. 545–549. LNCS 12076, Springer
(2020). https://doi.org/10.1007/978-3-030-45234-6_31
38. Ruland, S., Lochau, M., Jakobs, M.C.: HybridTiger: Hybrid model checking
and domination-based partitioning for efficient multi-goal test-suite generation
(competition contribution). In: Proc. FASE. pp. 520–524. LNCS 12076, Springer
(2020). https://doi.org/10.1007/978-3-030-45234-6_26
39. Song, J., Alves-Foss, J.: The DARPA cyber grand challenge: A competi-
tor’s perspective, part 2. IEEE Security and Privacy 14(1), 76–81 (2016).
https://doi.org/10.1109/MSP.2016.14
40. Stump, A., Sutcliffe, G., Tinelli, C.: StarExec: A cross-community infrastructure
for logic solving. In: Proc. IJCAR, pp. 367–373. LNCS 8562, Springer (2014).
https://doi.org/10.1007/978-3-319-08587-6_28
41. Sutcliffe, G.: The CADE ATP system competition: CASC. AI Magazine 37(2),
99–101 (2016)
42. Visser, W., Păsăreanu, C.S., Khurshid, S.: Test-input generation
with Java PathFinder. In: Proc. ISSTA. pp. 97–107. ACM (2004).
https://doi.org/10.1145/1007512.1007526
43. Wendler, P., Beyer, D.: sosy-lab/benchexec: Release 3.6. Zenodo (2021).
https://doi.org/10.5281/zenodo.4317433
Open Access. This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution, and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CoVeriTest with Adaptive Time Scheduling
(Competition Contribution)
Marie-Christine Jakobs and Cedric Richter
1 Test-Generation Approach
[Fig. 1: Integration of the time scheduler into the CoVeriTest workflow — the scheduler assigns time limits limit_V and limit_P to a value analysis and a predicate analysis; the analyses share open and covered test goals and report counterexample (CEX) paths for covered goals.]
goals, which are shared between the analyses, as unreachability queries and let
the analyses prove the unreachability of those goals. A reported counterexample
proves the reachability of a test goal. Therefore, the counterexample is converted
into a test [1] and the test goal is removed from the set of open test goals.
Time Scheduling. Our time scheduler limits the time per iteration round
to 100 s³ and distributes the 100 s based on the expected contribution of the
individual analyses. The idea is that an analysis gets more time if there exist
more paths to open test goals that the analysis is expected to handle well.
Figure 1 shows the integration of our time scheduler into the CoVeriTest
workflow. First, the scheduler samples a set of syntactical counterexample paths ρ,
each of which starts at the beginning of the program and ends in an open test goal.
Then, it estimates for each path ρ the probability P(V_i | ρ) that analysis i detects ρ
as a real counterexample⁴. We estimate the probability P(V_i | ρ) using a
unigram language model [9] in combination with the approach of Richter et al. [10]
for the abstraction of the syntactical paths ρ. Finally, the scheduler assigns a
time budget to analysis i in proportion to the average probability of detecting a
counterexample path on a testing task T (program plus open test goals):

    limit_i^new = 10 s + 80 s · E_{ρ∈T}[P(V_i | ρ)]    (1)
execution. When the sampled paths are indecisive, E_{ρ∈T}[P(V_i | ρ)] becomes
the normalized progress used in the Test-Comp '20 strategy [8]. The normalized
progress describes the relative contribution of an analysis to the goals covered
in the last iteration.
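A minimal sketch of the budget computation of Eq. (1), assuming the average probability for each analysis has already been estimated (illustrative C code, not CoVeriTest's implementation):

    /* Time budget in seconds for analysis i within one 100 s round (Eq. (1)).
     * avg_prob is E_{rho in T}[P(V_i | rho)]; with two analyses whose average
     * probabilities sum to 1, the two budgets add up to 100 s. */
    static double time_budget_seconds(double avg_prob) {
        return 10.0 + 80.0 * avg_prob;
    }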
2 Tool Architecture
CoVeriTest is an extension of the software analysis framework CPAchecker [2]
(version 2.0) and is written in Java. For parsing, we use the Eclipse CDT parser⁵.
For test-case generation, we rely on two instances of CPAchecker’s test-case
generation algorithm, which extracts test cases from counterexamples [1]. One
instance generates test cases based on CPAchecker’s value analysis [4] and the
other instance uses CPAchecker’s predicate analysis [3]. Both analyses apply
counterexample-guided abstraction refinement [7] and use the SMT solver Math-
SAT5 [6]. We interleave the two instances and determine their time slices based
on their expected success on the set of open test goals. To determine the time
slices, we added the adaptive scheduler described in the previous section.
4 Setup
We developed our extension of CoVeriTest in a fork⁶ of CPAchecker and submitted
revision 970d550, which participated in all categories. To run CoVeriTest
5 https://www.eclipse.org/cdt/
6 https://github.com/cedricrupb/cpachecker
CoVeriTest with Adaptive Time Scheduling (Competition Contribution) 361
Note that property.prp is a placeholder for the test specification (coverage-
error-call.prp or coverage-branches.prp). Tests are generated for pro-
grams assuming a 32-bit environment. To support 64-bit environments, one
needs to add the configuration option -64. The generated tests are written to
the folder output/test-suite and adhere to the XML format demanded by the
Test-Comp rules. Additionally, the folder contains the mandatory metadata file.
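A hypothetical invocation could thus look as follows; scripts/cpa.sh is CPAchecker's launcher and -spec its standard specification option, but the CoVeriTest configuration flag shown here is an assumption, so the exact name should be taken from the tool documentation:

    scripts/cpa.sh -coveritest -spec property.prp -64 program.i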
References
1. Beyer, D., Chlipala, A.J., Henzinger, T.A., Jhala, R., Majumdar, R.: Generating
tests from counterexamples. In: Proc. ICSE. pp. 326–335. IEEE (2004),
https://doi.org/10.1109/ICSE.2004.1317455
2. Beyer, D., Keremoglu, M.E.: CPAchecker: A tool for configurable software
verification. In: Proc. CAV. pp. 184–190. LNCS 6806, Springer (2011),
https://doi.org/10.1007/978-3-642-22110-1_16
3. Beyer, D., Keremoglu, M.E., Wendler, P.: Predicate abstraction with adjustable-
block encoding. In: Proc. FMCAD. pp. 189–197. FMCAD (2010),
http://ieeexplore.ieee.org/document/5770949/
4. Beyer, D., Löwe, S.: Explicit-state software model checking based on CEGAR and
interpolation. In: Proc. FASE. pp. 146–162. LNCS 7793, Springer (2013),
https://doi.org/10.1007/978-3-642-37057-1_11
5. Beyer, D., Jakobs, M.: CoVeriTest: Cooperative verifier-based testing. In: Proc.
FASE. pp. 389–408. LNCS 11424, Springer (2019),
https://doi.org/10.1007/978-3-030-16722-6_23
6. Cimatti, A., Griggio, A., Schaafsma, B.J., Sebastiani, R.: The MathSAT5 SMT
solver. In: Proc. TACAS. pp. 93–107. LNCS 7795, Springer (2013),
https://doi.org/10.1007/978-3-642-36742-7_7
7. Clarke, E.M., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided
abstraction refinement for symbolic model checking. J. ACM 50(5), 752–794 (2003),
http://doi.acm.org/10.1145/876638.876643
8. Jakobs, M.: CoVeriTest with dynamic partitioning of the iteration time limit (com-
petition contribution). In: Proc. FASE. pp. 540–544. LNCS 12076, Springer (2020),
https://doi.org/10.1007/978-3-030-45234-6_30
7 https://cpachecker.sosy-lab.org/
9. Jurafsky, D.: Speech & language processing. Pearson Education India (2000)
10. Richter, C., Hüllermeier, E., Jakobs, M.C., Wehrheim, H.: Algorithm selection for
software validation based on graph kernels. JASE 27(1), 153–186 (2020),
https://doi.org/10.1007/s10515-020-00270-x
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
FuSeBMC: A White-Box Fuzzer for Finding
Security Vulnerabilities in C Programs
(Competition Contribution)
Kaled M. Alshmrany(✉)¹, Rafael S. Menezes², Mikhail R. Gadelha³, and Lucas C. Cordeiro⁴
¹ University of Manchester, Manchester, UK, and Institute of Public Administration, Jeddah, Saudi Arabia
[email protected]
² Federal University of Amazonas, Manaus, Brazil
³ SIDIA Instituto de Ciência e Tecnologia, Manaus, Brazil
⁴ University of Manchester, Manchester, UK
[Figure (test-generation flow): the fuzzing engine produces a test suite, from which the selective fuzzer learns in order to generate further test suites.]
Incremental BMC allows FuSeBMC to keep unwinding the program until a property
violation is found or time or memory limits are exhausted. This approach is
advantageous in the Cover-Error category, as finding one error is the primary goal.
    #define N 100000
    ...
    int a, a1[N], a2[N];
    for (a = 0; a < N; a++) {
        a1[a] = __VERIFIER_nondet_int();
        a2[a] = __VERIFIER_nondet_int();
    }
    ...
    for (int x = 0; x < N; x++)
        __VERIFIER_assert(a1[x] == a2[x]);
In this particular example, ESBMC exhausts the time limit before checking
the assertion a1[x] == a2[x]. Apart from that, our verification engines also
show a certain weakness in producing test cases, due to the many optimizations
we perform when converting the program to SMT. In particular, two techniques
affect test-case generation significantly: constant folding and slicing. Constant
folding evaluates constants (which include nondeterministic symbols) and
propagates them throughout the formula during encoding, and slicing removes
expressions that are not on a path to a property violation. These two techniques
can significantly reduce SMT solving time. However, they can remove the
expressions required to trigger a violation when the program is compiled, i.e., a
variable initialization might be optimized away, forcing FuSeBMC to generate a
test case with undefined behavior.
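A hypothetical illustration of this effect (the program is invented for exposition):

    int a = __VERIFIER_nondet_int();   /* does not influence the violation:
                                          may be sliced away during encoding */
    int b = __VERIFIER_nondet_int();
    if (b == 42)
        reach_error();

If the first nondeterministic assignment is sliced away, the SMT model contains no value for a, so the emitted test case may omit that input, and replaying the test case on the original program reads an input that was never defined.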
Regarding our fuzzing engine, we identified a limitation in handling programs
with pointer dereferences. The fuzzing engine keeps track of variables throughout
the program but has issues identifying when they go out of scope. When we try
to generate a test case that triggers a pointer dereference, our fuzzing engine
provides garbage values, and the selective fuzzer might create test cases that do
not reach the error.
4 Software Project
The FuSeBMC source code is written in C++ and is available for download
on GitHub,⁶ which includes the latest release, FuSeBMC v3.6.6. FuSeBMC is
publicly available under the terms of the MIT License.
FuSeBMC from the source code are given in the file README.md (including the
description of all dependencies).
References
1. Clang documentation. http://clang.llvm.org/docs/index.html.
2. Anand, S., Burke, E.K., Chen, T.Y., Clark, J.A., Cohen, M.B., Grieskamp, W.,
Harman, M., Harrold, M.J., McMinn, P.: An orchestrated survey of methodologies
for automated software test-case generation. J. Syst. Softw. 86(8), 1978–2001, 2013.
3. Beyer, D.: Second competition on software testing: Test-Comp 2020. In FASE,
LNCS 12076, pp. 505–519, 2020.
4. Gadelha, M.R., Monteiro, F.R., Morse, J., Cordeiro, L.C., Fischer, B., Nicole, D.A.:
ESBMC 5.0: An industrial-strength C model checker. In ASE, pp. 888–891, 2018.
5. Gadelha, M.R., Monteiro, F.R., Cordeiro, L.C., Nicole, D.A.: ESBMC v6.0: Verifying C
Programs Using k-Induction and Invariant Inference (Competition Contribution).
In TACAS, LNCS 11429, pp. 209–213, 2019.
6. Gadelha, M.R., Menezes, R., Monteiro, F.R., Cordeiro, L.C., Nicole, D.A.: ES-
BMC: scalable and precise test generation based on the floating-point theory -
(competition contribution). In FASE, LNCS 12076, pp. 525–529, 2020.
7. Gadelha, M.R., Cordeiro, L.C., Nicole, D.A.: An Efficient Floating-Point Bit-
Blasting API for Verifying C Programs. In VSTTE, LNCS 12549, pp. 178–195,
2020.
8. Menezes, R., Rocha, H., Cordeiro, L., Barreto, R.: Map2check using LLVM and
KLEE. In TACAS, LNCS 10806, pp. 437–441, 2018.
9. Niemetz, A., Preiner, M., Biere, A.: Boolector 2.0 system description. Journal on
Satisfiability, Boolean Modeling and Computation 9, 53–58 (2014)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
6
https://github.com/kaled-alshmrany/FuSeBMC
Symbiotic 8: Parallel and Targeted Test Generation
(Competition Contribution)
Marek Chalupa, Jakub Novák, and Jan Strejček
1 Test-Generation Approach
Fig. 1. An example of a program (left) and its unsound slice with respect to the call
of error() (middle) and abort() (right). [The program listings are not recoverable from the extraction.]
guarantees that if a test covers a target in the corresponding slice, then it covers
the same target also in the original program. The opposite implication does not
hold due to the unsoundness. Note that tests generated from the slices may not
and usually do not cover all branches in the original program, therefore we still
need to run Klee on the original program.
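A small hypothetical illustration of this guarantee (not the program from Fig. 1):

    int x = __VERIFIER_nondet_int();
    int y = expensive_computation(x);  /* cannot affect reaching error();
                                          removed in the slice */
    if (x > 0)
        error();                       /* slicing target */

A test with x = 1 covers the call of error() in the slice and therefore also in the original program, but branches inside expensive_computation may remain uncovered, which is exactly why Klee is additionally run on the original program.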
2 Software Architecture
All parts of Symbiotic 8 use llvm 10 [7]. We compile the analyzed program
into llvm bitcode with the compiler Clang.
To carry out symbolic execution, we use our fork of the open-source symbolic
executor Klee [1]. The fork has several modifications compared to mainstream
Klee. The main modification is the representation of pointers as segment-offset
pairs, which enables symbolic-sized allocations. Since this year, our fork of Klee
also supports comparison of and arithmetic on symbolic pointers. We use Z3 [4]
as the SMT solver in Klee. The components of Symbiotic are programmed in
C++, and the scripts that schedule and control running these components are
written in Python.
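The segment-offset representation can be pictured with the following sketch; the type and field names are illustrative, not the actual internals of the Klee fork:

    #include <stdint.h>

    /* A pointer value is a pair: the segment identifies the allocation
       (whose size may be symbolic), and the offset is a (possibly symbolic)
       index into that allocation. Pointer comparisons and arithmetic then
       operate on these pairs instead of raw addresses. */
    struct SegmentOffsetPointer {
        uint64_t segment;  /* allocation identifier */
        uint64_t offset;   /* offset within the allocation */
    };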
References
1. C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation
of high-coverage tests for complex systems programs. In OSDI, pages 209–224.
USENIX Association, 2008. http://www.usenix.org/events/osdi08/tech/full_papers/cadar/cadar.pdf.
2. M. Chalupa, T. Jašek, L. Tomovič, M. Hruška, V. Šoková, P. Ayaziová, J. Strejček,
and T. Vojnar. Symbiotic 7: Integration of predator and more (competition contri-
bution). In TACAS, volume 12079 of LNCS, pages 413–417. Springer, 2020. doi:
10.1007/978-3-030-45237-7_31.
3. M. Chalupa, M. Vitovská, T. Jašek, M. Šimáček, and J. Strejček. Symbiotic 6:
generating test cases by slicing and symbolic execution. International Journal on
Software Tools for Technology Transfer, 2020. doi: 10.1007/s10009-020-00573-0.
4. L. de Moura and N. Bjørner. Z3: an efficient SMT solver. In TACAS, volume 4963
of LNCS, pages 337–340. Springer, 2008. doi: 10.1007/978-3-540-78800-3_24.
5. J. C. King. Symbolic execution and program testing. Communications of ACM,
19(7):385–394, 1976. doi: 10.1145/360248.360252.
6. Mark Weiser. Program slicing. IEEE Transactions on Software Engineering,
10(4):352–357, 1984. doi: 10.1109/TSE.1984.5010248.
7. LLVM. http://llvm.org/.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
Author Index