Library   Type         #LOC   Version  Host          AP  M.#C.
bzip2     Compression  8.2k   1.0.8    bzip2         1   2
libass    Rendering    35.4k  0.17.1   ffmpeg        4   2
libexif   Parsing      30.7k  0.6.24   photographer  2   1

TABLE II: Detailed information about our subject programs.

III. IS EXECUTION AMPLIFICATION EFFECTIVE?

[Fig. 2, panels (a) bzip2, (b) boringssl, (c) libass, (d) libexif: lines covered vs. time (hs).]

To study the effectiveness of our in-vivo approach, we implement AFLLIVE and measure the difference in code coverage achieved between the unamplified original executions and the amplified, shadow executions on four target libraries. For each one of them, we choose a host program and one host input to generate unamplified executions, and use a simple CodeQL script to choose interesting amplifier points heuristically.
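As an illustration of this kind of heuristic (not the actual CodeQL query, which is not shown in the paper), a parsing-related function with a buffer-typed parameter can be flagged as a candidate amplifier point; all names and signatures below are hypothetical:

```python
import re

# Hypothetical sketch of the amplifier-point heuristic described above.
# The paper uses a CodeQL script; this Python version is illustrative only.
PARSE_HINTS = re.compile(r"(parse|decode|read|load)", re.IGNORECASE)

def is_candidate_amplifier(name: str, params: list[str]) -> bool:
    """A function is a candidate if its name hints at parsing and it
    takes a pointer parameter (potentially host-provided input data)."""
    takes_buffer = any("*" in p for p in params)
    return bool(PARSE_HINTS.search(name)) and takes_buffer

# Example: function signatures as (name, parameter-types) pairs.
signatures = [
    ("exif_data_load_data", ["ExifData *", "const unsigned char *", "unsigned int"]),
    ("ass_set_fonts",       ["ASS_Library *", "int"]),
]
candidates = [n for n, ps in signatures if is_candidate_amplifier(n, ps)]
```

A real query would of course also inspect call sites and types, as the paper describes; the point here is only that a few lines of pattern matching already yield a useful candidate set that is then manually refined.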
A. Experimental Setup
Libraries and hosts. Table II shows information about the benchmarks selected for this experiment (two more to follow). We randomly picked four widely-used open-source C libraries that parse host-provided input data. These libraries cover a wide range of domains, from cryptography to rendering. Applications using these libraries might attempt to parse untrusted data, and thus any errors present in them might represent potential security vulnerabilities. For every library, we picked one host and one input for that host whose execution we sought to amplify. For our host selection criteria, we focused on programs that were either developed or endorsed by the same group that developed the library. This was done with the intention of minimizing the likelihood of potential crashes stemming from wrong library usage, instead of actual bugs in the library.

For boringssl, we used as host a binary that is supposed to test the encryption functionality of the library, which also generates the one unamplified execution. For bzip2, we used the example application bundled with the source code as host and a compressed version of a text file containing sample text6 to generate the unamplified execution. For libass, we used ffmpeg as host, a large video and audio editing library that integrates subtitle functionality via libass. The unamplified execution was generated by adding a subtitle track with a single subtitle to the shortest possible video. For libexif, we used an example application bundled with the source code and one of the test images7 to generate the unamplified execution.

Amplifier points and constraints are identified using a CodeQL script implementing heuristics to identify parsing-related functions (Section II-A). This script returns potential amplifier points consisting of at most a few hundred functions. We then went through the list, adjusting the automatically inferred constraints based on the function signature and example invocations within the code. Column AP in Table II shows how many of the identified amplifier points were executed during the host execution used for the fuzzing campaign. Column M.#C. in turn shows the median number of constraints specified for each amplifier point, as a "proxy" metric for the amount of effort involved in setting up each subject.

Fuzzing campaigns. For every project, we started 20 in-vivo campaigns initialized with the same original execution using AFLLIVE. All campaigns were run for 24 hours each on an AMD EPYC 7713P 64-core processor with 256GB of RAM.

6 "The quick brown fox jumps over the lazy dog"
7 22-canon_tags.jpg
8 Notice that coverage increases well after the screening loop has finished.

Fig. 2: Coverage-vs-time for test subjects using in-vivo fuzzing. The horizontal, dashed lines indicate the baseline coverage from the original, unamplified execution. The vertical, dashed lines indicate when AFLLIVE switched from the screening loop to the main fuzzing loop, on average.

B. Experimental Results

Presentation. The results are shown in Figure 2. All values reflect coverage within the library, and not on the host. The vertical, dashed line indicates when, on average, AFLLIVE switched from screening to the main fuzzing loop. However, since the screening loop only lasts one minute for every target function executed, it is barely visible in cases where few functions were amplified. The horizontal, dashed line indicates the coverage achieved by the unamplified host execution.

Results. The greatest increases in terms of coverage were obtained in libass and libexif, where AFLLIVE achieved an increase of 38.24% (1706 lines) and 91.79% (1040 lines) over the baseline, unamplified execution, respectively. For bzip2 and boringssl, AFLLIVE managed to achieve an increase in coverage of 18.75% (173 lines) and 0.82% (321 lines), respectively.

The lack of increase in coverage for boringssl is further confirmed by a visual inspection of Figure 2b, where we can see the campaign reach a plateau about 1 hour into the 24-hour fuzzing campaign.8 Although line coverage did not increase after the first few hours, new path-increasing inputs kept being added to the corpus throughout the campaign. We attribute this
to the fact that out of the 2044 functions executed by the host, 1381 were fully covered in terms of lines by the unamplified host execution. This in turn means there was little room for improvement upon the initial line coverage.

The substantial increase in coverage for libass and libexif is observed over the entire day-long fuzzing campaign and appears to further increase beyond our time budget. This suggests that amplifying the right executions can bring tremendous benefits to automatic vulnerability discovery, where the amplified executions reach deep into the code base.

False positives. No crashes were reported in these (previously well-fuzzed) programs during the campaign. This in turn implies that no false positives were reported either, although it is important to keep in mind that the rate of false positives depends on the quality of the specified amplifier constraints.

IV. ONBOARDING LIBRARIES WITHOUT FUZZ DRIVERS

An advantage of in-vivo fuzzing, as shown in the previous section, is that it makes fuzz drivers superfluous. A fuzz driver is a piece of code that acts as glue between an off-the-shelf fuzzer and the library-under-test. The driver sets up an artificial calling context and is responsible for accepting data from the fuzzer and feeding it into the library in the appropriate format.

However, effective fuzz drivers are often manually implemented, which constitutes a major hindrance to the widespread adoption of fuzzing. For instance, Google incentivizes the development of fuzz drivers for important open source projects by paying up to 30k USD for a successful and effective integration [2]. In fact, Google's project OSS-Fuzz [1] is primarily a community-maintained collection of fuzz drivers for open source projects.

An existing approach to overcome this hindrance is to automatically synthesize fuzz drivers. For instance, FUZZGEN [4] leverages a whole-system analysis to infer the library's interface and synthesizes fuzz drivers specifically against that interface. FUDGE [5] scans a repository for usages of the library's API, uses program slicing [21] to extract the corresponding code snippets, synthesizes a fuzz driver candidate for every code snippet by concretizing placeholders, and evaluates the generated fuzz driver candidates by building and running them. IntelliGen [22] also first infers the library's interface, annotated with vulnerability likelihoods, and generates fuzz drivers for the entry functions through hierarchical parameter replacement and type inference. Daisy [23] first dynamically observes how a host system calls the library's API, and then synthesizes fuzz drivers that follow a similar object usage pattern via a series of API calls.

However, these approaches hoist the tested library only very artificially, resulting in a high false positive and false negative rate. The libraries would never be integrated or used in this way in real applications. Approaches that imitate the actual usage as faithfully as possible will still not be as close to fuzzing a library as it is actually used. This is precisely our proposal: We suggest to amplify actual user-generated executions where a library is actually used.

In the following, we compare the effectiveness of automatic fuzz driver generation to in-vivo fuzzing as implemented in AFLLIVE. Like fuzz driver generation techniques, in-vivo fuzzing requires only the source code and little human intervention in the specification of amplifier points and constraints.

A. Experimental Setup

Fuzz driver generators. We selected FUZZGEN [4] and FUDGE [5] according to the following selection criteria. We consider approaches that target C libraries, and that are either themselves publicly available and compilable or whose generated drivers are publicly available and compilable. We also considered the following fuzz driver generators, but excluded them for the following reasons. GraphFuzz [24] focuses on object-oriented libraries, and in the case of C libraries a complete dataflow specification9 must be provided, which we do not have available. For Daisy [23], despite substantial effort, we did not succeed in compiling the available fuzz drivers due to missing dependencies. For IntelliGen [22], neither the tool itself nor fuzz drivers generated by it were publicly available.

For the comparison against FUZZGEN, since the tool itself was not available, we selected three of the seven libraries, as shown in Table III. Out of the four excluded libraries, three had an API that consisted of a single function that accepted a complex struct object which wraps the actual library call and maintains the entire state of the library interaction10. The remaining library was excluded because it could not be compiled (or easily fixed). For the three selected libraries, we used the only driver available for libaom and libvpx and a random fuzz driver (cod2lin) for libgsm.

For the comparison against FUDGE, we selected all fuzz drivers mentioned in the paper, except OpenCV, as shown in Table III. OpenCV was excluded since the entire API consisted of C++ rather than C functions. leptonica and htslib are highly popular libraries used for image processing and high-throughput sequencing data processing, respectively.

Hosts and Original Execution. For every library, we picked one host and one input for that host whose execution we could amplify (cf. Table III). To be fair, we provided each auto-generated fuzz driver with an initial corpus that generates precisely the same values for the library API as our in-vivo fuzzer. Our intention is that the tested libraries execute on the same piece of data during the first run (for instance, they should attempt to decode the same byte-stream in the case of decoders). For our host selection criteria, we focused on programs that were either developed or endorsed by the same group that developed the library. This was done with the intention of minimizing the likelihood of potential crashes stemming from wrong library usage, instead of actual bugs in the library.

Fuzzing campaigns. For all of the five projects, we started 20 in-vivo fuzzing campaigns initialized with the same original execution using AFLLIVE and 20 normal AFL campaigns using

9 https://github.com/hgarrereyn/GraphFuzz/issues/1
10 Example for libhevc: https://android.googlesource.com/platform/external/libhevc/+/refs/heads/main/test/decoder/main.c#563
SOTA     Library    Type               #LOC    Version  Synth. fuzz driver  #LOC  Host         Initial corpus                     AP  M.#C.
FUZZGEN  libaom     Video Codec        693.0k  3613e5d  av1_dec_fuzzer      1131  aomdec       sample av1 file                    4   1.5
         libvpx     Video Codec        0.5k    1.12.0   simple_decoder      482   vpxdec       sample vp9 file                    5   2
         libgsm     Speech compressor  8.7k    1.0.22   cod2lin             371   STL/rpedemo  sample wav file                    3   1
FUDGE    htslib     File parser        99.0k   1.16     hts_open            152   samtools     sample sam and fasta file          2   2
         leptonica  Image processor    320.0k  1.83.0   pix_rotate_shear    68    tesseract    sample png file with english text  18  1

TABLE III: Summary of the setup for each test subject for the comparison against state-of-the-art (SOTA) fuzz driver generators.
AFL via the synthesized fuzz drivers when the campaign is started. We find that a synthesized fuzz driver only exercises

Library  Type               #LOC  Version  Cov.  AP  M.#C.
openssl  Cryptography       1M    3.0.6    60%   93  2
libxml2  Parsing            308k  2.10.3   61%   14  2
opus     Speech compressor  80k   1.3.1    93%   9   2

TABLE IV: Information about libraries and manual test suites.

[Fig. 4: coverage-vs-time plots for (a) openssl, (b) libxml2, (c) opus; panel (d) lists the bugs found in OpenSSL:]

Bug Type           ID
Buffer overflow    CVE-2022-3602
Buffer overflow    PR 19166
Use-after-free     CVE-[blinded]
Denial of Service  PR [blinded]

Fig. 4: Coverage and bugs in test amplification campaigns.

Distributing energy. Ideally, we would like to fuzz every amplifier point that is executed by the test suite for the same amount of time. However, some amplifier points are executed by a large number of test cases, while other amplifier points are executed by just a single test case. So, how much "energy" do we assign to each test case to achieve this objective?

Algorithm 4 Test amplification
Input: Test suite S
Input: Amplifier points F, Types T, Constraints C, Time t0
1: Map test2func = ∅
2: Set funcs = ∅
3: for Test s ∈ S do
4:   test2func[s] = get_exec_amplifiers(s, F)
5:   funcs = funcs ∪ test2func[s]
6: end for
7: executed = |funcs|
8: fuzzed_funcs = ∅
9: while not aborted do
10:   for s in S do
11:     unfuzzed = |test2func[s] − fuzzed_funcs|
12:     if unfuzzed > 0 then
13:       Time budget t1 = unfuzzed/executed
14:       fuzz(F, T, C, exec(s), t0, t1)
15:       fuzzed_funcs = fuzzed_funcs ∪ test2func[s]
16:     end if
17:   end for
18: end while

Algorithm 4 illustrates our algorithm to distribute the available energy evenly over the amplifier points executed by the test suite S. In Lines 1–7, it finds the amplifier functions executed by each test case s ∈ S and counts how many amplifiers are executed in total. In Lines 8–18, it skips test cases that execute no unfuzzed amplifier point (Line 12). Otherwise, it computes the proportion of all executed amplifiers that are executed by test case s and still unfuzzed as the time budget t1 for s (Line 13), and starts a corresponding fuzzing campaign (Line 14). Specifically, the function fuzz implements the proposed in-vivo fuzzing approach as defined in Algorithm 1.

A. Experimental Setup

Table IV shows the selected libraries, the corresponding test suite coverage, and the number of executed amplifier points (AP). We randomly chose libraries from diverse domains that are security-critical, well-fuzzed (5+ years),13 and widely used open-source C libraries. For test amplification, no host or host input is needed, as all libraries had test suites and testing frameworks readily available. Like for the other experiments, the amplifier points were auto-identified using our tool (§ II-A) and manually constrained afterwards.

State of the art. There exists a fuzz driver generator specific for test amplification, called UTOPIA [6]. Given a library-under-test and the gtest or boost test suite, UTOPIA first performs a lightweight static analysis before synthesizing fuzz drivers for the tested library functions. The static analysis is used to identify the precondition of every library function. For every test case, the synthesis first identifies the library functions used in the test case and the constants used as parameters in a corresponding function call, and then generates a fuzz driver for the library functions by rendering the constant library function call parameters subject to fuzzing. For our experiments, we reuse the identified functions and preconditions as amplifier points and constraints using a straightforward translation, to ensure the fairness of the comparison. This demonstrates the versatility of our in-vivo approach, which allows diverse means of automatic amplifier point identification and requires no specific test framework.

Unfortunately, despite several months of experimentation, we realized that on the UTOPIA benchmark programs using the UTOPIA-identified amplifier points and constraints, all crashing inputs generated by UTOPIA (and by our AFLLIVE) only reveal false positives. Upon manual examination, we discovered that the drivers synthesized by UTOPIA (as well as the results of its analysis) did lead to an incorrect usage of the libraries, and thus to a large amount of spurious crashes. To be sure, we repeated the analysis by filtering inputs that did not crash on the most recent version, assuming these bugs would now be fixed, but only found that the remaining crashers were flaky, i.e., they crashed only intermittently when run repeatedly. Since the prototype provided by the authors is highly automated (i.e., it requires little intervention and there is not much room for misuse), we conclude that an experimental comparison would not provide much insight.

13 2016 Commit contains OpenSSL & LibXML2: https://github.com/google/oss-fuzz/commit/a143b9b3
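The bookkeeping of Algorithm 4 can be sketched in a few lines of Python; the `fuzz` call is stubbed out here, and the helper names are merely illustrative of the pseudocode above:

```python
def amplify(test_suite, get_exec_amplifiers, fuzz, t0):
    """Sketch of Algorithm 4: distribute fuzzing energy evenly over the
    amplifier points executed by a test suite (one pass of the loop)."""
    test2func = {s: get_exec_amplifiers(s) for s in test_suite}   # Lines 1-6
    funcs = set().union(*test2func.values()) if test2func else set()
    executed = len(funcs)                                         # Line 7
    fuzzed_funcs = set()
    budgets = {}
    for s in test_suite:
        unfuzzed = len(test2func[s] - fuzzed_funcs)               # Line 11
        if unfuzzed > 0:
            t1 = unfuzzed / executed   # proportional budget (Line 13)
            budgets[s] = t1
            fuzz(s, t0, t1)                                       # Line 14
            fuzzed_funcs |= test2func[s]                          # Line 15
    return budgets

# Toy example: two tests share amplifier point "g"; only the still-unfuzzed
# amplifiers of the second test contribute to its budget.
b = amplify(
    ["t1", "t2"],
    {"t1": {"f", "g"}, "t2": {"g", "h"}}.get,
    lambda s, t0, t1: None,   # stub for the in-vivo fuzzing campaign
    t0=60,
)
```

In the toy example the budgets are 2/3 for `t1` and 1/3 for `t2`, and they sum to 1: each amplifier point receives exactly one share of the total time, which is the stated objective of the algorithm.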
B. Experimental Results

Presentation. Figure 4 shows the average coverage over time and the bugs found during test amplification. The vertical dashed lines indicate a change in test case during the fuzzing campaign (Line 14 of Alg. 4). The horizontal dashed line indicates the initial coverage for the library's test suite. We only measure coverage of the library.

Coverage results. AFLLIVE achieved an increase over the manual test suite by around 600 LoC for openssl and over 300 LoC for libxml2. No increase in coverage was achieved for opus. Closer inspection revealed that the manual test suite is of very high quality and nearly saturated, covering almost 95% of the lines of code in opus. There are five test cases that exercise all of the amplifier points selected for opus (which explains why all of the time budget was invested into one test case). The latter is true also for LibXML2, where the first test case already exercises all but one amplifier point.

For openssl, we see that switching test cases to exercise new amplifier points is effective and, after saturation, code coverage increases again when the next test case is fuzzed. This highlights that, using our approach, once the amplifier points have been identified and their constraints correctly specified, the user is able to set up several fuzzing campaigns with little extra effort. Towards the end of the 24-hour campaign, it is also interesting to note that coverage saturates despite switching to test cases that exercise new amplifier points.

Bug finding results. AFLLIVE discovers 4 bugs in openssl, two of which have previously been found only by manual auditing, including the high-severity PunyCode vulnerability (CVE-2022-3602), and two of which have not previously been known, including a moderate-severity use-after-free (CVE-[blinded]). In terms of false positives, no false positive crashes were reported.

VI. SEMI-AUTOMATED IDENTIFICATION OF AMPLIFIER POINTS AND CONSTRAINTS

The two main concepts of in-vivo fuzzing are amplifier points (APs) and amplifier constraints (ACs). While APs identify interesting functions, the purpose of ACs is to make implicit function preconditions explicit, just like user-defined preconditions in property-based testing (PBT) [18], [19] or user-defined repOK methods in search-based software testing (SBST) [26], [27].

In general, ACs can be written manually to reduce false positives, but they do not need to be added. In terms of manual effort, there is a tradeoff between specifying ACs versus going through the false positives. For instance, suppose AFLLIVE finds a possible null-pointer dereference on a function parameter, but that function is never called with a null pointer. This is a false positive. We allow users of in-vivo fuzzing to encode this implicit assumption explicitly.

A. Semi-Automatic Identification

For our experiments, we used a semi-automated approach. An initial set of APs/ACs was first automatically identified and then manually refined. For automation, we developed a CodeQL script to identify APs (113 LoC) and a Python script to generate ACs (274 LoC).14 For manual refinement:
• For subjects where no executed APs were identified (bzip2, libexif and libgsm), we added the main entry points of the library as APs (via documentation).
• We added (or modified) ACs to ensure these conditions:
  1) (sizeof(buf) = len ∧ len < C), which requires that variable len determines the length of the buffer buf and len is less than the constant C;
  2) (sizeof(buf) < C), which requires that the length of the buffer is smaller than the constant C; or
  3) (is_file(filename)), which requires that the string filename refers to a valid file where fuzzing input will be dumped.
These patterns account for 98% of all ACs.

14 https://anonymous.4open.science/r/afllive-598A/config generator

As an indicator of the additional manual effort for each subject, we note that we either added or removed constraints for no more than 12% of the automatically identified amplifier points across all subjects. Even then, no more than two constraints needed to be added/removed.

In comparison to AC specification, writing a fuzz driver from scratch could take an experienced developer several hours, and it would need to be maintained afterwards. For instance, the driver integrated into OSS-Fuzz [1] for libass was written over the course of two days by a core developer of the project, and iterated upon several times.

B. Ablation Study

In order to study the impact of our additional manual effort to reduce false positives, we compare the effectiveness of AFLLIVE using only the auto-generated APs and ACs to the effectiveness of AFLLIVE using the manually augmented set of APs and ACs. All campaigns were run for 24 hours each on an AMD EPYC 7713P 64-Core processor with 256GB of RAM.

           Inferred config.           Curated config.
Subject    Cov. (#LOC)  T.P.  F.P.    Cov. (#LOC)  T.P.  F.P.
boringssl  40027.95     -     4       39511.95     -     -
bzip2      -            -     -       1096.10      -     -
libass     6553.40      -     -       6168.25      -     -
libexif    -            -     -       2174.95      -     -
htslib     5723.00      -     -       8504.50      7     -
leptonica  3178.85      -     -       5255.75      -     -
libaom     11442.00     -     -       38261.55     -     -
libgsm     -            -     -       1218.40      -     -
libvpx     8203.00      -     -       9565.80      -     -
libxml2    59681.40     -     -       59235.95     -     -
openssl    135903.20    -     4       136162.90    4     -
opus       18804.95     -     -       16642.00     -     -

TABLE V: Coverage and bugs in fully automated campaigns.

Coverage results. Table V shows the average coverage achieved throughout the campaign, along with false and true positives reported, for both the fully automated and manually modified configurations. For all subjects with identifiable amplifier points, coverage achieved via auto-generated APs
and ACs was on par (i.e., same order of magnitude) with the coverage achieved through the semi-automatic approach.

For the subjects where no executed APs were identified automatically (i.e., bzip2, libexif and libgsm), the campaigns failed to run. However, after manually specifying 6 amplifier points and 7 constraints across the three subjects, the campaigns ran and managed to increase coverage significantly over the original execution (see Figure 2, Figure 3).

For some subjects the automatically inferred constraints led to a higher code coverage, such as in the case of boringssl, libass, libxml2 and opus. This can be attributed to the fact that we were overly conservative when manually modifying constraints in an effort to prevent a high false-positive rate.

Bug finding results. Given only the automatically inferred constraints, AFLLIVE failed to find the previously discovered bugs. Expectedly, this also led to a higher number of false positives for two of the subjects (boringssl and openssl), which were also the most complex subjects that we analyzed. Still, no more than five false positives were reported in each case, and they could thus be triaged in a reasonable amount of time (less than a few hours).

VII. RELATED WORK

Automatic unit-level testing. Long before fuzzing entered the stage, the software engineering community studied automatic approaches for unit-level test generation [18], [26], [28]–[30]. Examples of a unit are Java objects or C functions. One major research challenge of automatic unit-level testing has been to minimize the number of false positives, i.e., bugs that only appear during automatic testing, but never in production when the unit is properly used. There are two approaches to tackle this problem: (a) to let the user specify conditions representing the valid usage of that unit [18], [26], and (b) to observe how the unit is used, e.g., during system-level testing, and to enforce the inferred protocol during unit-level testing [31]. For instance, the approach taken by the Daisy [23] fuzz driver generator represents Approach (b), while our AFLLIVE takes Approach (a) to minimize the number of false positives during in-vivo fuzzing.

Valid calling context. Another major research challenge of automatic unit-level testing has been to generate a valid sequence of API calls and construct the required objects to pass in as parameters to these calls. Given the preconditions (called a contract), Randoop [29] constructs the sequence of API calls and objects in a feedback-directed manner, continuously evolving test cases that do not violate the user-provided contract. JQF [19] and CGPT [32] add coverage-guidance. However, fundamentally these tools follow a generational approach where the API calls and objects are generated out of thin air and validated only against a user-provided specification. In contrast, ours is a mutational approach, where we piggyback on a valid sequence of API calls that are passed valid objects. Like the mutational approach on the system level [3], [7], [20], this allows us to reach much deeper into the code. Staying within the neighborhood of a valid program state, there is a low risk of false positives.

Selective symbolic execution. S2E [33] first introduced the "in-vivo approach" for symbolic execution by injecting a symbolic executor into a program binary that would activate whenever an "expansion point" is reached, and collapse the symbolic state ("corsetting") whenever symbolic execution becomes impractical, e.g., for library calls. In contrast, our coverage-guided in-vivo fuzzer does not require the symbolic execution machinery for tracking and solving symbolic states. Our approach is coverage-guided and works even for deployed binaries using actual executions, if non-interference between shadow and original execution is guaranteed by the snapshotting mechanism.

Snapshot fuzzing. The first lightweight snapshot-restore mechanism in fuzzing was the AFL fork server [20]. It would allow the fuzzer to skip the expensive execution prefix during repeated execution of the same program with different inputs. Snappy [34] further explored how to set the fork server as late as possible into the execution of the program. Nyx [14], [35] introduced a proper Virtual Machine (VM)-based snapshot-restore mechanism. In contrast, we relax the constraint that the fuzzer must produce a system-level input and instead propose to use the snapshot-restore mechanism to amplify an original execution at user-specified concrete amplifier points and constraints to generate shadow executions.

In-vivo fuzzing in production. Our long-term vision, assuming several technical challenges are tackled, is to integrate in-vivo fuzzing into the production system, so as to fuzz the entire supply chain of a software system, including all of its dependencies. The idea to integrate bug finding into production is not very far-fetched. For instance, Google is running a no-overhead version of AddressSanitizer [36] on every Android 11 phone and every Chrome browser [37], [38]. Apart from bug finding, Google has long been running Google-Wide Profiling (GWP), which conducts light-weight program analysis across entire fleets of machines [39]. Mozilla implemented the approach for Firefox [40]. The open source community implemented the approach for the Linux kernel [41].

VIII. CONCLUSION

A. Perspective

Existing fuzzers are designed to test a software system in-vitro, i.e., under artificial lab conditions. However, the effectiveness of in-vitro fuzzing is limited [42]. It is these limitations which we sought to address in this paper.

Solving dependency on fuzz driver quality. A fuzzer must first be "glued" to the software via fuzz drivers. Typically, fuzz drivers are tediously developed and continuously updated over months. For instance, Google pays up to 20k USD for fuzz drivers of critical open source software [1], [43]. To reduce some manual effort, recent research has focussed on generating drivers automatically [4], [5], [44]. Whenever a security flaw was found by manual auditing, the developer would add a new fuzz driver through which the fuzzer is able to find the security flaw. While the drivers can be improved over time, this dependency
on driver quality cannot be avoided. OpenSSL has 16 drivers in OSS-Fuzz which have been continuously fuzzed 24/7 over the past six (6) years [45]. In contrast, in-vivo fuzzing eliminates the need for fuzz drivers entirely. Just by amplifying the developer test suite, our in-vivo prototype found a critical bug in "unfuzzed" code of OpenSSL (CVE-2023-0215).

Solving structure-aware fuzzing. A fuzzer's effectiveness depends critically on the quality of the initial seed corpus [46]. For instance, if we are fuzzing a PNG image library, inputs that were generated by mutating valid PNG image files will reach more deeply into the library than a random string of bytes. However, valid input structures are easily broken, and new input structures are difficult to generate by chance. For instance, if none of the seed images contains an optional eXIf chunk specifying some metadata, it will hardly be generated. Recent work, including ours [8], has addressed this by using (or learning) the input structure, and "inventing" the missing data chunks [47]–[50]. However, the critical dependence on initial seeds remains. In contrast, in-vivo fuzzing allows us to define as amplifier point that function in the parser which handles an interesting data chunk, or to set amplifier points deep in the program functionality to entirely skip the parser.

Solving stateful fuzzing. Some software systems require inputs in a certain order. For instance, the Transmission Control Protocol (TCP) requires a three-way handshake between client and server before data can actually be sent. Without knowing precisely the implemented protocol, it is difficult for a fuzzer to generate the right sequence of packets with the correct structure. Recent work, including ours [12], has used mutational, feedback-directed fuzzing that uses response codes, state variables, or human annotations to identify and leverage the sequence of software states for a sequence of inputs/packets [14], [51]. However, these approaches heavily depend on the recorded sequences of packets that are used to seed the mutational fuzzers. In contrast, in-vivo fuzzing allows us to define as amplifier point that function which handles a certain state or state transition.

B. Paper Summary

Our approach allows the user to fuzz a library within the context of a host application by exploring the neighborhood

fuzz-driver generation in terms of both code coverage and bug finding. Empirical evidence is provided by the discovery of seven (7) previously unknown vulnerabilities in htslib, even as this library has been continually fuzzed using synthetic fuzz drivers for seven (7) years as of the time of writing. This not only suggests that execution amplification is effective, but also that real-world applications do indeed interact with libraries in ways that are not properly captured by existing fuzz drivers. Moreover, through test amplification we re-discover a high-severity vulnerability in openssl and also uncover a novel moderate-severity vulnerability, both of which had not been found through fuzzing before. Apart from the vulnerabilities, we find a known bug and a novel one, as well.

We should note that the effectiveness of our approach depends crucially on the choice of amplifier points and constraints. If we choose the wrong amplifiers, we might get false positive crashes; but given the flexibility of our approach, we did not find this to be an obstacle. For our experiments, we developed a CodeQL script to come up with an initial amplifier set.15 Via an interactive process, we refined the constraints (i.e., preconditions) for every function as follows: Whenever a constraint was incorrectly specified, the fuzzer would fail within a few seconds, and the constraint would need an obvious adjustment. Overall, the amplifier identification process took no more than a few hours for every library.

There are still several interesting socio-technical challenges ahead of us. Considering that the largest continuous fuzzing platform, OSS-Fuzz [1], which fuzzes over 1000 open source projects on 100k machines 24/7, is nothing but a collection of manually generated fuzz drivers, we are truly excited about the prospect that in-vivo fuzzing enables fuzzing for every library that is used and compiled in a production environment.

REFERENCES

[1] K. Serebryany, "OSS-Fuzz - Google's continuous fuzzing service for open source software," in USENIX Security. Vancouver, BC: USENIX Association, Aug. 2017.
[2] OSS-Fuzz, "Integration rewards," https://google.github.io/oss-fuzz/getting-started/integration-rewards/, 2021, accessed: 2023-01-11.
[3] LLVM, "Libfuzzer," https://llvm.org/docs/LibFuzzer.html, accessed:
2023-01-11.
of a valid program state induced by an actual host-generated [4] K. Ispoglou, D. Austin, V. Mohan, and M. Payer, “FuzzGen:
execution of that library. We do so by applying coverage- Automatic fuzzer generation,” in 29th USENIX Security Symposium
guided mutation-based fuzzing on the arguments of each (USENIX Security 20). USENIX Association, Aug. 2020, pp.
2271–2287. [Online]. Available: https://www.usenix.org/conference/
function marked as an amplifier point, subject to a set of user- usenixsecurity20/presentation/ispoglou
specified constraints. By using real-world programs, we can [5] D. Babić, S. Bucur, Y. Chen, F. Ivančić, T. King, M. Kusano,
leverage our approach to fuzz the library within a production- C. Lemieux, L. Szekeres, and W. Wang, “Fudge: Fuzz driver generation
at scale,” in Proceedings of the 2019 27th ACM Joint Meeting on
like usage context. Conversely, we can use a test-suite as a European Software Engineering Conference and Symposium on the
host to explore variants of regression tests and corner cases Foundations of Software Engineering, ser. ESEC/FSE 2019. New
identified by the developers. In contrast to a fuzz-driver based York, NY, USA: Association for Computing Machinery, 2019, p.
975–985. [Online]. Available: https://doi.org/10.1145/3338906.3340456
approach, selected amplifier points need not be part of the API,
[6] B. Jeong, J. Jang, H. Yi, J. Moon, J. Kim, I. Jeon, T. Kim, W. Shim, and
implying that our approach can reach deeper into the code. Y. H. Hwang, “Utopia: Automatic generation of fuzz driver using unit
In our experiments we manage to increase coverage sig- tests,” in 2023 IEEE Symposium on Security and Privacy (SP), 2023,
nificantly over non-amplified executions, indicating that am- pp. 2676–2692.
plification is indeed effective. Furthermore, we manage to
outperform existing state-of-the-art approaches for automated 15 https://anonymous.4open.science/r/afllive-598A/README.md#config-file-1
[7] M. Böhme, V.-T. Pham, and A. Roychoudhury, “Coverage-based greybox fuzzing as Markov chain,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 1032–1043. [Online]. Available: https://doi.org/10.1145/2976749.2978428
[8] V.-T. Pham, M. Böhme, A. E. Santosa, A. R. Căciulescu, and A. Roychoudhury, “Smart greybox fuzzing,” IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1980–1997, 2021.
[9] C. Aschermann, T. Frassetto, T. Holz, P. Jauernig, A.-R. Sadeghi, and D. Teuchert, “Nautilus: Fishing for deep bugs with grammars,” in NDSS, 2019.
[10] J. Wang, B. Chen, L. Wei, and Y. Liu, “Superion: Grammar-aware greybox fuzzing,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 724–735.
[11] P. Srivastava and M. Payer, “Gramatron: Effective grammar-aware fuzzing,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2021. New York, NY, USA: Association for Computing Machinery, 2021, pp. 244–256. [Online]. Available: https://doi.org/10.1145/3460319.3464814
[12] V.-T. Pham, M. Böhme, and A. Roychoudhury, “AFLNet: A greybox fuzzer for network protocols,” in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 2020, pp. 460–465.
[13] X. Feng, R. Sun, X. Zhu, M. Xue, S. Wen, D. Liu, S. Nepal, and Y. Xiang, “Snipuzz: Black-box fuzzing of IoT firmware via message snippet inference,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 337–350. [Online]. Available: https://doi.org/10.1145/3460120.3484543
[14] S. Schumilo, C. Aschermann, A. Jemmett, A. Abbasi, and T. Holz, “Nyx-Net: Network fuzzing with incremental snapshots,” in Proceedings of the Seventeenth European Conference on Computer Systems, ser. EuroSys '22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 166–180. [Online]. Available: https://doi.org/10.1145/3492321.3519591
[15] H. Gascon, C. Wressnegger, F. Yamaguchi, D. Arp, and K. Rieck, “Pulsar: Stateful black-box fuzzing of proprietary network protocols,” in Security and Privacy in Communication Networks: 11th EAI International Conference, SecureComm 2015, Dallas, TX, USA, October 26-29, 2015, Proceedings 11. Springer, 2015, pp. 330–347.
[16] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis and transformation,” in CGO, San Jose, CA, USA, Mar. 2004, pp. 75–88.
[17] GitHub, “CodeQL,” https://codeql.github.com/, 2021, accessed: 2023-01-11.
[18] K. Claessen and J. Hughes, “QuickCheck: A lightweight tool for random testing of Haskell programs,” in Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, ser. ICFP '00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 268–279. [Online]. Available: https://doi.org/10.1145/351240.351266
[19] R. Padhye, C. Lemieux, and K. Sen, “JQF: Coverage-guided property-based testing in Java,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 398–401. [Online]. Available: https://doi.org/10.1145/3293882.3339002
[20] A. Fioraldi, D. Maier, H. Eißfeldt, and M. Heuse, “AFL++: Combining incremental steps of fuzzing research,” in Proceedings of the 14th USENIX Conference on Offensive Technologies, ser. WOOT'20. USA: USENIX Association, 2020.
[21] M. Weiser, “Program slicing,” in Proceedings of the 5th International Conference on Software Engineering, ser. ICSE '81. IEEE Press, 1981, pp. 439–449.
[22] M. Zhang, J. Liu, F. Ma, H. Zhang, and Y. Jiang, “IntelliGen: Automatic driver synthesis for fuzz testing,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2021, pp. 318–327.
[23] M. Zhang, C. Zhou, J. Liu, M. Wang, J. Liang, J. Zhu, and Y. Jiang, “Daisy: Effective fuzz driver synthesis with object usage sequence analysis,” in 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2023, pp. 87–98.
[24] H. Green and T. Avgerinos, “GraphFuzz: Library API fuzzing with lifetime-aware dataflow graphs,” in Proceedings of the 44th International Conference on Software Engineering, ser. ICSE '22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1070–1081. [Online]. Available: https://doi.org/10.1145/3510003.3510228
[25] M. Stone, “The ups and downs of 0-days: Our review of 0-days exploited in-the-wild in 2022,” July 2023, accessed: 2023-01-11.
[26] C. Boyapati, S. Khurshid, and D. Marinov, “Korat: Automated testing based on Java predicates,” in Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA '02. New York, NY, USA: Association for Computing Machinery, 2002, pp. 123–133. [Online]. Available: https://doi.org/10.1145/566172.566191
[27] D. Marinov and S. Khurshid, “TestEra: A novel framework for automated testing of Java programs,” in Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001), 2001, pp. 22–31.
[28] G. Fraser and A. Arcuri, “Evolutionary generation of whole test suites,” in International Conference on Quality Software (QSIC). IEEE Computer Society, 2011, pp. 31–40.
[29] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed random test generation,” in ICSE 2007, Proceedings of the 29th International Conference on Software Engineering, Minneapolis, MN, USA, May 2007, pp. 75–84.
[30] P. McMinn, “Search-based software test data generation: A survey,” Software Testing, Verification and Reliability, vol. 14, no. 2, pp. 105–156, 2004. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/stvr.294
[31] S. Elbaum, H. N. Chin, M. B. Dwyer, and J. Dokulil, “Carving differential unit test cases from system test cases,” in Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. SIGSOFT '06/FSE-14. New York, NY, USA: Association for Computing Machinery, 2006, pp. 253–264. [Online]. Available: https://doi.org/10.1145/1181775.1181806
[32] L. Lampropoulos, M. Hicks, and B. C. Pierce, “Coverage guided, property based testing,” Proc. ACM Program. Lang., vol. 3, no. OOPSLA, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3360607
[33] V. Chipounov, V. Kuznetsov, and G. Candea, “S2E: A platform for in-vivo multi-path analysis of software systems,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems. New York, NY, USA: Association for Computing Machinery, 2011, pp. 265–278. [Online]. Available: https://doi.org/10.1145/1950365.1950396
[34] E. Geretto, C. Giuffrida, H. Bos, and E. van der Kouwe, “Snappy: Efficient fuzzing with adaptive and mutable snapshots,” in Proceedings of the 38th Annual Computer Security Applications Conference, ser. ACSAC '22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 375–387. [Online]. Available: https://doi.org/10.1145/3564625.3564639
[35] S. Schumilo, C. Aschermann, A. Abbasi, S. Wörner, and T. Holz, “Nyx: Greybox hypervisor fuzzing using fast snapshots and affine types,” in 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, Aug. 2021, pp. 2597–2614. [Online]. Available: https://www.usenix.org/conference/usenixsecurity21/presentation/schumilo
[36] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov, “AddressSanitizer: A fast address sanity checker,” in Proceedings of the 2012 USENIX Conference on Annual Technical Conference, ser. USENIX ATC'12. USA: USENIX Association, 2012, p. 28.
[37] M. Morehouse, M. Phillips, and K. Serebryany, “Crowdsourced bug detection in production: GWP-ASan and beyond,” in Proceedings of the C++ Russia, 2020.
[38] V. Tsyrklevich, “GWP-ASan: Sampling heap memory error detection in-the-wild,” https://www.chromium.org/Home/chromium-security/articles/gwp-asan, accessed: 2023-01-11.
[39] G. Ren, E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt, “Google-wide profiling: A continuous profiling infrastructure for data centers,” IEEE Micro, pp. 65–79, 2010. [Online]. Available: http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68
[40] C. Holler, “PHC (probabilistic heap checker): A port of Chromium's GWP-ASan project to Firefox,” https://bugzilla.mozilla.org/show_bug.cgi?id=1523268, 2021, accessed: 2023-01-11.
[41] Linux Kernel Developers, “Kernel Electric-Fence (KFENCE),” https://www.kernel.org/doc/html/latest/dev-tools/kfence.html, 2021, accessed: 2023-01-11.
[42] M. Böhme, C. Cadar, and A. Roychoudhury, “Fuzzing: Challenges and reflections,” IEEE Software, vol. 38, no. 3, pp. 79–86, 2021.
[43] OSS-Fuzz Team, “OSS-Fuzz integration rewards,” https://google.github.io/oss-fuzz/getting-started/integration-rewards/, accessed: 2023-01-11.
[44] C. Zhang, X. Lin, Y. Li, Y. Xue, J. Xie, H. Chen, X. Ying, J. Wang, and Y. Liu, “APICraft: Fuzz driver generation for closed-source SDK libraries,” in 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, Aug. 2021, pp. 2811–2828. [Online]. Available: https://www.usenix.org/conference/usenixsecurity21/presentation/zhang-cen
[45] “OpenSSL at OSS-Fuzz: Commit history,” https://github.com/google/oss-fuzz/commits/master/projects/openssl, accessed: 2023-01-11.
[46] A. Herrera, H. Gunadi, S. Magrath, M. Norrish, M. Payer, and A. L. Hosking, “Seed selection for successful fuzzing,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2021. New York, NY, USA: Association for Computing Machinery, 2021, pp. 230–243. [Online]. Available: https://doi.org/10.1145/3460319.3464795
[47] W. You, X. Liu, S. Ma, D. Perry, X. Zhang, and B. Liang, “SLF: Fuzzing without valid seed inputs,” in Proceedings of the 41st International Conference on Software Engineering, ser. ICSE '19. IEEE Press, 2019, pp. 712–723. [Online]. Available: https://doi.org/10.1109/ICSE.2019.00080
[48] Y. Li, B. Chen, M. Chandramohan, S.-W. Lin, Y. Liu, and A. Tiu, “Steelix: Program-state based binary fuzzing,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 627–637. [Online]. Available: https://doi.org/10.1145/3106237.3106295
[49] C. Aschermann, S. Schumilo, T. Blazytko, R. Gawlik, and T. Holz, “Redqueen: Fuzzing with input-to-state correspondence,” in Symposium on Network and Distributed System Security (NDSS), 2019.
[50] A. Fioraldi, D. C. D'Elia, and E. Coppa, “Weizz: Automatic grey-box fuzzing for structured binary formats,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2020. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1–13. [Online]. Available: https://doi.org/10.1145/3395363.3397372
[51] C. Aschermann, S. Schumilo, A. Abbasi, and T. Holz, “Ijon: Exploring deep state spaces via fuzzing,” in 2020 IEEE Symposium on Security and Privacy, ser. S&P 2020, 2020, pp. 1597–1612.
[52] P. Godefroid, “Micro execution,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: Association for Computing Machinery, 2014, pp. 539–549. [Online]. Available: https://doi.org/10.1145/2568225.2568273
[53] W. Gao, V.-T. Pham, D. Liu, O. Chang, T. Murray, and B. I. Rubinstein, “Beyond the coverage plateau: A comprehensive study of fuzz blockers (registered report),” in Proceedings of the 2nd International Fuzzing Workshop, ser. FUZZING 2023. New York, NY, USA: Association for Computing Machinery, 2023, pp. 47–55. [Online]. Available: https://doi.org/10.1145/3605157.3605177
[54] C. Holler, K. Herzig, and A. Zeller, “Fuzzing with code fragments,” in 21st USENIX Security Symposium (USENIX Security 12). Bellevue, WA: USENIX Association, Aug. 2012, pp. 445–458. [Online]. Available: https://www.usenix.org/conference/usenixsecurity12/technical-sessions/presentation/holler
[55] T. Dullien, “Introducing Prodfiler,” https://prodfiler.com/blog/, 2021, accessed: 2023-01-11.
[56] Google, “syzkaller: An unsupervised coverage-guided kernel fuzzer,” https://github.com/google/syzkaller, 2021, accessed: 2023-01-11.