
Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation


Jueon Eom, Yonsei University, Seoul, South Korea ([email protected])
Seyeon Jeong, Suresofttech Inc., Seongnam-si, South Korea ([email protected])
Taekyoung Kwon, Yonsei University, Seoul, South Korea ([email protected])

Abstract

JavaScript interpreters, crucial for modern web browsers, require an effective fuzzing method to identify security-related bugs. However, the strict grammatical requirements for input present significant challenges. Recent efforts to integrate language models for context-aware mutation in fuzzing are promising but lack the necessary coverage guidance to be fully effective. This paper presents a novel technique called CovRL (Coverage-guided Reinforcement Learning) that combines Large Language Models (LLMs) with Reinforcement Learning (RL) from coverage feedback. Our fuzzer, CovRL-Fuzz, integrates coverage feedback directly into the LLM by leveraging the Term Frequency-Inverse Document Frequency (TF-IDF) method to construct a weighted coverage map. This map is key in calculating the fuzzing reward, which is then applied to the LLM-based mutator through reinforcement learning. Through this approach, CovRL-Fuzz enables the generation of test cases that are more likely to discover new coverage areas, thus improving bug detection while minimizing syntax and semantic errors, all without needing extra post-processing. Our evaluation results show that CovRL-Fuzz outperforms state-of-the-art fuzzers in enhancing code coverage and identifying bugs in JavaScript interpreters: CovRL-Fuzz identified 58 real-world security-related bugs in the latest JavaScript interpreters, including 50 previously unknown bugs and 15 CVEs.

CCS Concepts
• Security and privacy → Software security engineering.

Keywords
fuzzing; coverage; reinforcement learning; large language model

ACM Reference Format:
Jueon Eom, Seyeon Jeong, and Taekyoung Kwon. 2024. Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '24), September 16–20, 2024, Vienna, Austria. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3650212.3680389

This work is licensed under a Creative Commons Attribution 4.0 International License.
ISSTA '24, September 16–20, 2024, Vienna, Austria
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0612-7/24/09
https://doi.org/10.1145/3650212.3680389

1 Introduction

JavaScript (JS) interpreters are essential for modern web browsers, enabling interactive web and embedded applications through the parsing, interpreting, compiling, and executing of JavaScript code. With JavaScript being employed as a client-side programming language by 98.9% of web browsers as of January 2024 [59], the security of JavaScript interpreters is crucial. Vulnerabilities can lead to severe security threats, including information disclosure and the bypassing of browser security measures [19, 38]. Given their critical role and complex nature, JavaScript interpreters require rigorous and continuous testing methods like fuzzing.

Previous research on fuzzing JavaScript interpreters primarily falls into two categories: grammar-level and token-level, each addressing the strict grammar requirements of JavaScript. Grammar-level fuzzing aims to generate grammatically correct inputs, ensuring syntactic accuracy [1, 20–22, 45, 46, 58, 60, 61]. Token-level fuzzing techniques adopt a more flexible approach by manipulating sequences of tokens without strict adherence to grammar rules [4, 50]. Both approaches have employed coverage-guided fuzzing, such as AFL [39], to enhance fuzzing effectiveness by promoting an exhaustive examination of code paths [4, 22, 45, 50, 61]. However, the evolving nature of the JavaScript language, with its constantly updating grammar, poses significant challenges. Grammar-level fuzzing focuses intensely on generating mutations that precisely follow syntax rules, limiting mutation diversity. Given fuzzing's ability to produce vast amounts of input per second, it can manage some noise and inaccuracies, even when not strictly adhering to grammar rules. This focus on syntax can paradoxically reduce mutation variety and constrain program path exploration. Token-level fuzzing, although more flexible, struggles with maintaining syntactical correctness over successive mutations, often leading to syntax errors and hindering the discovery of deeper bugs.

To address these challenges, recent advancements have led to research into fuzzing techniques that employ LLMs, which are adept at producing syntactically informed, well-formed inputs for compilers and JavaScript interpreters [65, 67]. A standout example is Fuzz4All [65], which utilizes pretrained Code-LLMs for compiler fuzzing. These models, trained on extensive datasets across various programming languages, can be applied to LLM-based mutation without further finetuning. They inherently grasp the language's context, enabling the generation of inputs that are grammatically accurate and contextually relevant, thereby enhancing fuzzing effectiveness. However, current LLM-based fuzzing methods are generally considered black-box fuzzing and lack integration with internal program information like code coverage. In contrast to black-box fuzzing, coverage-guided fuzzing utilizes internal program data to enhance fuzzing effectiveness. The technique employs


an evolutionary strategy for generating "interesting" seeds aimed at expanding the program's coverage, thus potentially increasing the likelihood of bug discovery. By effectively using code coverage feedback to guide the mutation of inputs, it can uncover bugs more efficiently than traditional black-box methods [7]. However, as we describe below, this is quite challenging.

Problem. LLMs typically generate sentences at the word level, which leads to the assumption that LLM-based mutations operate at the token level. Replacing traditional random mutators with pretrained LLM-based mutators in coverage-guided fuzzing generally reduces error rates but does not enhance coverage. Our experimental results, detailed in Table 1, show that using the AFL fuzzing tool, LLM-based mutations on V8 for five hours resulted in 12-16% lower coverage compared to the baseline in most cases. Even in cases where coverage increased, the rise was only slight. This suggests that while LLM-based mutations reduce errors, their constrained predictions may limit diversity and effectiveness.

We hypothesize that LLM-based mutations, focusing on context, often predict common tokens and unintentionally reduce diversity. This is similar to grammar-level mutations, which target grammatical accuracy but also limit variability. Consequently, in coverage-guided fuzzing, while LLM-based mutators decrease errors, their context-aware approach diminishes diversity, making them less effective than random fuzzing.

Our Approach. To address the limitation of LLM-based mutations, we propose a novel technique that integrates coverage-guided feedback directly into the mutation process. Our approach leverages internal program information to enhance fuzzing effectiveness, aiming to generate diverse mutations that go beyond mere grammatical correctness. We employ Term Frequency-Inverse Document Frequency (TF-IDF) [56] to weight coverage data, establishing a feedback-driven reward system. This method not only increases coverage but also improves the bug detection capabilities of LLM-based fuzzing, eliminating the need for additional post-processing. We term our approach CovRL-Fuzz. Unlike other LLM-based fuzzing techniques, CovRL-Fuzz is the first to effectively integrate LLM-based mutation with coverage-guided fuzzing, thereby distinguishing it from existing methods [65, 67].

To sum up, this paper makes the following contributions:

• We introduce CovRL, a novel method integrating LLMs with coverage feedback by reinforcement learning, using TF-IDF. We directly feed code coverage to LLMs via a new reward scheme.
• We implement CovRL-Fuzz, a new JavaScript interpreter fuzzer that outperforms existing methods in code coverage and bug detection by employing the CovRL technique.
• CovRL-Fuzz identified 58 real-world security-related bugs, including 50 unknown bugs (15 CVEs) in the latest JavaScript interpreters.
• To foster future research, we release our implementation of CovRL-Fuzz at https://github.com/seclab-yonsei/CovRL-Fuzz.

Table 1: Average coverage achieved by four LLM-based mutations compared to random mutation, measured using AFL for 5 hours on the JavaScript interpreter V8 with a single core. Among these variants, the baseline is the Token-Level AFL [50] setting in Table 2, and the prompt setting used was "Please mutate the following program". The experiment for GPT-4 [42] was conducted by requesting and receiving mutations through the OpenAI API [43].

Strategy   LLM                    Error (%)   Coverage (Valid)   Coverage (Total)   Improv Ratio (%)
Random     ✗ (Baseline [50])      88.79%      44,705             53,936             -
Prompt     GPT-4 [42]             27.50%      46,454             46,738             -13.35%
           StarCoder (1B) [31]    82.72%      41,331             45,034             -16.50%
Mask       Incoder (1B) [17]      49.08%      46,427             47,385             -12.15%
           CodeT5+ (220M) [63]    62.68%      55,459             56,576             4.89%

2 Background

2.1 JavaScript Interpreter Fuzzing

Fuzzing is a powerful automated method for finding software bugs [41] and is highly regarded in both academia and industry. However, it faces challenges with JavaScript interpreters due to their need for strictly grammatical input. When inputs are not syntactically correct, the JavaScript interpreter returns a syntax error, while semantic inconsistencies (e.g., errors with reference, type, range, or URI) can lead to semantic errors [21]. In both scenarios, the interpreter's internal logic, which may contain hidden bugs, is not executed. To tackle these challenges, researchers have developed grammar-level and token-level fuzzing methods. The grammar-level approach converts seeds into an Intermediate Representation (IR) to ensure grammatical accuracy of inputs [1, 20–22, 45, 46, 58, 60, 61]. While effective in maintaining syntax, adhering strictly to grammar rules limits mutation diversity, making it difficult to detect bugs caused by grammar violations or unexpected input patterns. On the other hand, token-level fuzzing offers greater flexibility by breaking inputs into tokens and replacing them selectively, enhancing bug detection capabilities [4, 50]. However, it often fails to maintain syntactical correctness due to its random substitution method, which does not consider relationships between tokens. Recent studies also explore bugs in Just-In-Time (JIT) compilers of JavaScript engines through differential testing of behavior with JIT enabled and disabled [3, 62].

LLM-based Fuzzing. Recently, there has been a shift toward using deep learning-based Language Models (LMs) in fuzzing to address the limitations of traditional methods. Initially, RNN-based LMs were used to mutate seeds [11, 18, 29, 35], and more recently, Large Language Models (LLMs) like GPTs [42, 47] and StarCoder [31] have been employed for seed generation and mutation [65, 67], benefiting from extensive training on datasets of diverse programming languages.

Coverage-guided Fuzzing. Coverage-guided fuzzing, which leverages coverage feedback to explore diverse code paths, has proven more effective than traditional black-box methods in detecting software bugs [7]. This approach, exemplified by tools like American Fuzzy Lop (AFL) [39], emphasizes maximizing code coverage through mutations [4, 22, 45, 50, 61], and has been particularly effective in finding security vulnerabilities. Despite the success of coverage-guided fuzzing, existing LLM-based fuzzing techniques, including COMFORT and Fuzz4All, primarily use black-box methods [65, 67] and have not achieved coverage-guided fuzzing for


JavaScript interpreters, as these approaches typically do not incorporate coverage feedback into their mutation processes.

Figure 1: Types of LLM-based Mutation.

2.2 Large Language Models for Code

Following the success of LLMs in Natural Language Processing (NLP) tasks [6, 9, 10], the field of programming languages is advancing with significant contributions from Large Language Models for code (Code-LLMs) such as CodeT5+ [63], Codex [8], InCoder [17], and StarCoder [31]. These advancements are facilitating various downstream tasks, including code completion [68], program synthesis [2, 27, 34, 55], program repair [16, 66], and many others.

LLM-based Mutation. In fuzzing, Code-LLMs have recently shown their effectiveness not only in seed generation but also in mutation [12, 13, 65]. The use of LLMs in mutation can be broadly categorized into two types: mutation by prompt [13, 65] and mutation by mask [12]. Figure 1 illustrates the types of LLM-based mutation. Mutation by prompt involves inputting a code to be mutated along with a mutation request prompt (e.g., "Please mutate the following program.") into a pretrained Code-LLM to generate a mutated seed ((a) in Figure 1). The Code-LLM fully relies on its understanding of the structure and function of the code to perform mutations such as changing variable names or adding loops and conditional statements as requested. Mutation by mask is a methodology where masks are added or replaced in the middle of code, allowing the model to fill in only parts of it, thereby mutating the code ((b) in Figure 1). This mutation technique allows for direct selection of the parts to be mutated, enabling more precise adjustment of the mutation.

We used mutation by mask in our coverage-guided fuzzing as it preserves the overall structure while allowing specific tokens to be mutated. This targeted mutation strategy helps incrementally increase coverage, efficiently explore new code paths, and conserve resources. The advantages of mutation by mask over mutation by prompt in this context are also evidenced in Table 1.

Finetuning LLMs. Methods for finetuning LLMs, including Code-LLMs, are categorized into supervised finetuning (SFT), instruction finetuning [64], and RL-based finetuning [28, 44, 49]. While prompt engineering controls the output of LLMs at inference time through input manipulation alone, SFT, instruction finetuning, and RL-based finetuning aim to steer the model during training time by learning from specific datasets tailored to particular tasks. In particular, RL-based finetuning has proven effective in guiding LLMs using feedback to optimize factual consistency and reduce toxic generation [28, 44, 49]. Recently, there have also been proposals for applying RL-based finetuning to Code-LLMs aimed at generating unit tests that are not only grammatically correct but also capable of solving complex coding tasks [27, 34, 55].

RL-based finetuning consists of the following phases: reward modeling and reinforcement learning. In reward modeling, an LLM-based rewarder is trained to evaluate the suitability of output results when input is provided to the LLM created in the previous phase. There are various approaches to feedback depending on how the rewarder is trained: utilizing an oracle [27, 34, 55], using deep learning models [28], and using human feedback [44]. We also adopt the strategy of employing the JavaScript interpreter as a feedback oracle. In reinforcement learning, training commonly employs Kullback-Leibler (KL) divergence-based optimization. The method is designed to optimize the balance between maximizing rewards and minimizing deviation from the initial training distribution.

3 Design

In this section, we describe the design of CovRL-Fuzz. The key accomplishment of CovRL-Fuzz is integrating coverage-guided feedback directly into the LLM-based mutation process. This ensures effective fuzzing by guiding mutations to be not only grammatically correct but also diverse.

Figure 2 illustrates the workflow of CovRL-Fuzz, which operates in three phases based on a coverage-guided fuzzer. Initially, CovRL-Fuzz selects a seed from the queue. The seed then undergoes an LLM-based mutation where specific tokens are masked and predicted using a masked language model approach, resulting in a new test case (step 1). The technique of mutation by mask helps to maintain structural integrity while exploring new code paths [14, 48]. After mutation, the test case is executed by the target JavaScript interpreter. If the test case discovers new coverage previously unseen, it is considered an interesting seed and is added to the seed queue for further mutation. At the same time, CovRL-Fuzz stores the coverage map measured by the test case and validity information: whether the test case led to syntax errors, semantic errors, or passed successfully. Our rewarding approach uses validity information to impose penalties on inputs that result in syntax or semantic errors. Following this, it generates rewarding signals by multiplying the current coverage map with a coverage-based weight map (step 2). After completing a mutation cycle, we proceed to finetune the LLM-based mutator using CovRL by utilizing the gathered interesting seeds and rewarding signals. We define the notion of one cycle as a predetermined number of mutations. CovRL employs the PPO [51] algorithm, a method that seeks to improve the current model while adhering closely to the previous model's framework. The signal during training prevents the LLM from making syntax or semantic errors and induces prediction to find new coverage (step 3).

Note that we do not perform any heuristic post-processing on the LLM-based mutation, except for CovRL-based finetuning. We demonstrated a minimal error rate in using solely CovRL that is


Figure 2: Workflow of CovRL-Fuzz: The blue-shaded area illustrates the operation of CovRL.

comparable to other latest JavaScript interpreter fuzzing techniques in Section 5.2.

3.1 Phase 1. Mutation by Mask

To mutate the selected seed, CovRL-Fuzz performs a simple masking strategy for mutation by mask (step 1 in Figure 2). Given the input sequence W = {w_1, w_2, .., w_n}, we use three masking techniques: insert, overwrite, and splice. The strategy results in the mask sequence W^MASK = {[MASK], w_3, .., w_k} and the masked sequence W^\MASK = {w_1, w_2, [MASK], .., w_n}. The detailed operations are described as follows:

Insert: Randomly select positions and insert [MASK] tokens into the inputs.
Overwrite: Randomly select positions and replace existing tokens with the [MASK] token.
Splice: Statements within a seed are randomly divided into segments. A portion of these segments is replaced with a segment from another seed with [MASK], formatted as [MASK] statement [MASK].

After generating a masked sequence W^\MASK via masking, the seed is mutated by inferring the masked positions via the LLM-based mutator. The mutation design of CovRL-Fuzz is based on a span-based masked language model (MLM) that can predict variable-length masks [17, 48]. Therefore, the MLM loss function we utilize for mutation can be represented as follows:

    L_MLM(θ) = − Σ_{i=1}^{k} log P_θ(w_i^MASK | w^\MASK, w_{<i}^MASK)    (1)

θ represents the trainable parameters of the model that are optimized during the training process, and k is the number of tokens in W^MASK. w^\MASK denotes the masked input tokens where specific tokens are replaced by mask tokens. w^MASK refers to the original tokens that have been substituted with the mask tokens in the input sequence.

3.2 Phase 2. Coverage-Weighted Rewarding

We developed a method called Coverage-Weighted Rewarding (CWR) to guide the LLM-based mutator. The approach employs TF-IDF to emphasize less common coverage points, effectively focusing on discovering new areas of code coverage (step 2 in Figure 2). TF-IDF prioritizes less frequent tokens by assigning them higher weights, while more common tokens receive lower weights. We apply this method to create a weighted coverage map, focusing on underexplored areas. Figure 3 illustrates the Coverage-Weighted Rewarding (CWR) process in action.

Example 1. Consider the scenario of executing a program depicted by the Control Flow Graph (CFG) in Figure 3 (a). There is a loop C, and branches D and E in the CFG. We executed the program with a test case generated by a fuzzer and obtained the coverage results.

Building on previous RL-based finetuning methods using Code-LLMs [27, 34, 55], we further extend the idea by applying a rewarding signal based on software output. Notably, errors in the JavaScript interpreter can be broadly grouped into syntax errors and semantic errors, which include reference, type, range, and URI errors. Given that W* is the concatenation of the masked sequence W^\MASK and the mask sequence W^MASK, the following returns can be deduced based on input to the target:

    r(W*) = −1.0    if W* causes a syntax error
            −0.5    if W* causes a semantic error
            +R^cov  if W* is passed    (2)

To enhance the LLM-based mutator's ability to discover diverse coverage, we introduce an additional rewarding signal as outlined in Eq. 2. Unlike traditional RL-based fuzzing techniques that use the ratio of current to total coverage, our approach assigns weights based on the frequency of reaching specific coverage points. This method involves adjusting the coverage map using the IDF weight map, calculating the weighted sum for each coverage data point, and normalizing these sums to derive scores.

Initially, we observed that the coverage map functions like Term Frequency (TF) by tracking how often certain coverage points are hit. However, in JavaScript interpreters, identical sections of code in a test case, such as repeated lines like 'a=1; a=1;', can trigger the same coverage area multiple times, leading to redundant counts. To address this, we underscore the importance of recognizing unique instances of coverage. Hence, we introduce TF^cov, defining it as a map that records each coverage point uniquely, reducing redundancy and emphasizing the significance of varied coverage.

    TF^cov = unique coverage map    (3)
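As a toy illustration (not the paper's artifact), the three masking operations of Section 3.1 can be sketched over a list of tokens. Here a seed is whitespace-tokenized and a literal "[MASK]" string stands in for the mutator's sentinel tokens; the real mutator operates on LLM subword tokens and variable-length spans:

```python
import random

MASK = "[MASK]"

def mask_insert(tokens, n=1):
    # Insert: choose random positions and insert [MASK] tokens.
    out = list(tokens)
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), MASK)
    return out

def mask_overwrite(tokens, n=1):
    # Overwrite: choose random positions and replace the existing
    # tokens there with [MASK].
    out = list(tokens)
    for i in random.sample(range(len(out)), n):
        out[i] = MASK
    return out

def mask_splice(stmts, donor_stmts):
    # Splice: replace one statement segment of the seed with a
    # segment from another seed, wrapped as [MASK] statement [MASK].
    i = random.randrange(len(stmts))
    donor = random.choice(donor_stmts)
    return stmts[:i] + [MASK, donor, MASK] + stmts[i + 1:]

seed = "var a = 1 ;".split()
print(mask_overwrite(seed))                         # one token becomes [MASK]
print(mask_splice(["a=1;", "f(a);"], ["while(x){x--;}"]))
```

The masked sequence produced this way is what the span-infilling model then completes, predicting each [MASK] span as in Eq. 1.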

Figure 3: Example of the Coverage-Weighted Rewarding process. (a) represents the control flow graph of an example program. (b) shows the TF^cov maps calculated based on the coverage areas traversed when each Test Case (TC) is executed in the program represented by (a). (c) is the IDF^cov calculated based on the TF^cov maps, which we refer to as the coverage-based weight map. (d) describes the process of calculating R^cov. Additional TCs are element-wise multiplied by IDF^cov to become the weighted coverage map, called the weighted TF^cov map, and the sum of this map is given to the model as the reward R^cov. Note that the calculation of R^cov in this figure is a simplified version of Eq. 5.
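The numbers reported for Figure 3 can be reproduced with a short sketch of Eqs. 3-4 and the figure's simplified reward. Two assumptions are labeled here: N is taken as the number of seed TF maps, which reproduces the figure's weights, and the TC4/TC5 coverage vectors are illustrative guesses chosen to match the reported −0.13 and 0.54 (the actual vectors appear only graphically in the figure):

```python
import math

def tf_cov(raw_counts):
    # Eq. 3: binarize raw hit counts so repeated hits of one
    # coverage point count only once.
    return [1 if c > 0 else 0 for c in raw_counts]

def idf_cov(tf_maps, M):
    # Eq. 4: IDF^cov = (1/sqrt(M)) * log(N / (1 + DF^cov)), where
    # DF^cov counts how many seeds reached each coverage point.
    N = len(tf_maps)
    df = [sum(col) for col in zip(*tf_maps)]
    return [math.log(N / (1 + d)) / math.sqrt(M) for d in df]

def simplified_reward(tf, idf):
    # Figure 3 (d): element-wise product of a test case's TF^cov map
    # with IDF^cov, summed (Eq. 5 additionally applies sigma(log(.))).
    return sum(t * w for t, w in zip(tf, idf))

seeds = [tf_cov([1, 3, 4, 0, 0]),   # TC1 -> [1, 1, 1, 0, 0]
         [1, 1, 0, 0, 1],           # TC2
         [1, 0, 0, 0, 0]]           # TC3
idf = idf_cov(seeds, M=5)
print([round(w, 2) for w in idf])   # [-0.13, 0.0, 0.18, 0.49, 0.18]

# Assumed new test cases: TC4 hits only a well-trodden path,
# TC5 hits rare or previously unexecuted paths.
print(round(simplified_reward([1, 0, 0, 0, 0], idf), 2))  # -0.13 (penalty)
print(round(simplified_reward([1, 0, 0, 1, 1], idf), 2))  # 0.54 (reward)
```

Note how the never-reached point gets the largest weight (0.49) while the point every seed hits gets a negative one, which is exactly the bias toward underexplored coverage that CWR is after.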

Example 2. Assume we obtained coverage from test case 1 (TC1) as follows: [1, 3, 4, 0, 0]. Applying Eq. 3, the coverage transforms into a binary map indicating whether a path was executed (1) or not (0), regardless of the number of times it was executed. Thus, the TF^cov map for TC1 is updated to: [1, 1, 1, 0, 0]. This process is applied identically to the other test cases as well ((b) in Figure 3).

We also define the coverage-based weight map IDF^cov using the TF^cov of each seed to weight which code paths are accessed frequently and which are not, as follows:

    IDF^cov = (1/√M) · log(N / (1 + DF^cov))    (4)

where N denotes the total number of unique coverage obtained. DF^cov denotes the number of seeds that have achieved the specific coverage point. The weight map IDF^cov is obtained by taking the inverse of DF^cov, resulting in higher weights for less common coverage. The variable M denotes the overall size of the coverage map, which we utilized as a scale factor to adjust the weight value. The scaling ensures that the weights remain consistent regardless of the size M of the coverage map, thus preserving the stability of the training process.

Example 3. Consider in Figure 3 (b) that three TF^cov maps, [1, 1, 1, 0, 0], [1, 1, 0, 0, 1], and [1, 0, 0, 0, 0], were generated. Based on these TF^cov maps, we calculate the coverage-based weight map, IDF^cov. By applying Eq. 4, IDF^cov is computed as a weight map with values [-0.13, 0.0, 0.18, 0.49, 0.18] ((c) in Figure 3).

The reward is acquired by taking the weighted sum of TF^cov and IDF^cov to create the weighted coverage map, and is obtained as

    R^cov = σ(log(Σ_{i=1}^{M} tf_{i,t} · idf_{i,t−1}))    (5)

where t represents the current cycle, tf_{i,t} refers to an element in TF^cov_t, and idf_{i,t−1} refers to an element in IDF^cov_{t−1} at the previous time step before updating the weights. σ is a sigmoid function used to map R^cov to a value between 0 and 1.

R^cov is calculated only if the test case is free from any syntax or semantic problems. Our rewarding scheme incentivizes the LLM-based mutator to explore a wider range of coverage by providing higher payouts for test cases that achieve uncommon levels of coverage.

Example 4. Assume we generated two new test cases, TC4 and TC5, through fuzzing and executed the program to obtain their TF^cov maps. We create a weighted TF^cov by element-wise multiplying each TC with IDF^cov, and then calculate R^cov as the sum of the elements according to Eq. 5. In the case of TC4, because it executed mostly the code paths that TC1, TC2, and TC3 had executed, it received a penalty of -0.13, whereas TC5, having executed rare or previously unexecuted code paths, obtained a reward of 0.54 ((d) in Figure 3).

Update Weight Map with Momentum. Although we used the logarithmic function in Eq. 4 to alleviate dramatic changes in IDF^cov, instability can still occur due to abrupt shifts in the reward distribution. To address this issue, we employed momentum. Following each cycle, CovRL updates the IDF weight map. To mitigate dramatic changes in the reward distribution, we use momentum at a rate of α to incorporate the prior weight when recalculating the map. The updated weight map is as follows:

    IDF^cov_{t−1} = α · IDF^cov_{t−1} + (1 − α) · IDF^cov_t    (6)

where IDF^cov_t means the new weight map and IDF^cov_{t−1} means the previous weight map.

3.3 Phase 3. CovRL-Based Finetuning

The fuzzing environment with mutation by mask can be conceptualized as a bandit environment for RL. In this environment, a masked sequence W^\MASK is provided as input (x), and the expected output is a mask sequence W^MASK (y). Inspired by previous studies [27, 34, 55], we finetune our model using the PPO algorithm [51], an actor-critic reinforcement learning method (step 3 in Figure 2). In our situation, it can be implemented by finetuning two LLMs in tandem: one LLM acts as a mutator (actor), while the other LLM serves as a rewarder (critic). We utilize a pretrained LLM to initialize the parameters of both the mutator and the rewarder. The rewarder is trained using Eq. 2. It plays a crucial role in training the mutator. For CovRL-based finetuning with PPO, we define the CovRL loss as follows:

    L_CovRL(θ) = −E_{(x,y)∼D_t} [ R(x,y)_t · log(π_θ^t(y|x) / π^{t−1}(y|x)) ]    (7)

where R(x,y)_t represents the reward of CovRL, and D_t refers to the finetuning dataset that has been collected up to time step t. π_θ^t(y|x) with parameters θ is the trainable RL policy for the current mutator, and π^{t−1}(y|x) represents the policy from the previous mutator. To mitigate overoptimization and maintain the LLM-based mutator's mask prediction ability, we also use KL regularization. The reward after adding the KL regularization is

    R(x,y)_t = r(W*) + log(π_θ^t(y|x) / π^{t−1}(y|x))    (8)

Fuzzing with CovRL Algorithm. Algorithm 1 details one cycle of the fuzzing loop with CovRL. The cycle iterates for a predetermined number of iter_cycle iterations (Lines 6-15). The LLM-based mutator uses a seed chosen from the seed_queue to generate the test case T (Lines 7-8). By executing T on the target interpreter, we obtain validity information I_val and coverage map cov (Line 9). If T is deemed a noteworthy seed, it is added to the seed queue, and the reward for the particular T_interest is calculated and added to the finetuning dataset D_T (Lines 10-15). After completing these iter_cycle iterations, the gathered D_T is utilized as training data to finetune the models: we finetune the mutator M_prev to generate M_cur (Line 20). For finetuning the mutator, we apply reward or penalty to the model using the CovRL loss from Eq. 7.

Algorithm 1: Fuzzing with CovRL
Input: finetuning dataset D_T
1   R_prev: Previous LLM-based rewarder
2   R_cur: Current LLM-based rewarder
3   M_prev: Previous LLM-based mutator
4   M_cur: Current LLM-based mutator
5   Function FuzzOne(seed_queue):
6       for i = 1 to iter_cycle do
7           seed ← SelectSeed(seed_queue)
8           T ← Mutate(M_cur, seed)
9           I_val, cov ← Execute(T)
10          if IsInteresting(T) then
11              T_interest ← T
12              seed_queue.append(T_interest)
13              R_cov ← calcReward(I_val, cov)
14              data ← (T_interest, R_cov)
15              D_T.append(data)
16      FinetuneCovRL(D_T)
17  Function FinetuneCovRL(D_T):
18      R_prev, M_prev ← R_cur, M_cur
19      R_cur ← FinetuneRewarder(R_prev, D_T)
20      M_cur ← FinetuneMutator(M_prev, R_cur, D_T)

4 Implementation

We implemented a prototype of CovRL-Fuzz using pytorch v2.0.0, transformers v4.38.2, and afl 2.52b [39].

Dataset. We collected data from regression test suites in several repositories including V8, JavaScriptCore, ChakraCore, JerryScript, Test262 [57], and js-vuln-db [23] as of December 2022. We then pre-processed the data for training data and seeds, resulting in a collection of 52K unique JavaScript files for our experiments.

Pre-Processing. We performed a simple pre-processing on the regression test suites of the JavaScript interpreters mentioned above to remove comments, filter out grammatical errors, and simplify identifiers. We then used the processed data directly for training. The pre-processing was conducted utilizing the -m and -b options of the UglifyJS tool [40].

Training. We utilized the pretrained Code-LLM, CodeT5+ (220m) [63], as both the rewarder and the mutator. For the process of CovRL-based finetuning, we trained the rewarder and mutator for 1 epoch per mutation cycle. We used a batch size of 256 and a learning rate of 1e-4. The optimization utilized the AdamW optimizer [36] together with a learning rate linear warmup technique. The LLM-based rewarder uses the encoder from CodeT5+ to predict the rewarding signal through a classification approach. We utilized the contrastive search method, incorporating a momentum factor α of 0.6 and a top-k setting of 32, to enhance the effectiveness of CovRL. In addition, we aligned the coverage map size with AFL's recommendations by setting the scaling factor M for the map size. This ensures that the instrumentational capacity is optimized. For moderate-sized software (approx. 10K lines), we employed a map size of 2^16. For larger software exceeding 50K lines, we used a map size of 2^17, striking a balance between granularity and performance. For detailed analysis related to the hyperparameters that we have set, such as finetuning epochs and α, please refer to the Appendix.

5 Evaluation

To evaluate CovRL-Fuzz, we set four research questions.

• RQ1: Is CovRL-Fuzz more effective than other JavaScript interpreter fuzzers?
• RQ2: Is CovRL-Fuzz more effective compared to other fuzzers using LLM-based mutation?
• RQ3: How does each component contribute to the effectiveness of CovRL-Fuzz?
• RQ4: Can CovRL-Fuzz find real-world bugs in JavaScript interpreters?
call the FinetuneCovRL function, which carries out CovRL-based preters?
finetuning (Line 16). The procedure of FinetuneCovRL involves
the finetuning of the LLM-based Rewarder R and the LLM-based 5.1 Experimental Design
Mutator M (Lines 18-20). Initially, we designate the existing model
as R𝑝𝑟𝑒𝑣 and M𝑝𝑟𝑒𝑣 (Line 18). Following that, R𝑝𝑟𝑒𝑣 is finetuned Experimental setup. Our setup included a 64-bit Ubuntu 20.04
using the 𝐷𝑇 to generate a new rewarder R𝑐𝑢𝑟 (Line 19). At this LTS OS on an Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz (64-core).
point, the rewarder has been trained to predict the rewarding sig- Additionally, we harnessed three NVIDIA GeForce RTX 3090 GPUs
nal as described in Eq. 2. By utilizing the finetuned R𝑝𝑟𝑒𝑣 and 𝐷𝑇 , for both training and mutation.
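One cycle of the fuzzing loop in Algorithm 1, together with an Eq. (8)-style KL-adjusted reward, can be sketched in Python. This is a simplified illustration rather than the authors' implementation; the `mutate`, `execute`, and `calc_reward` parameters are hypothetical stand-ins for the fuzzer's real components.

```python
import math
import random

def fuzz_one(seed_queue, mutate, execute, calc_reward, iter_cycle=100):
    """One cycle of the CovRL fuzzing loop (Algorithm 1, simplified):
    mutate a selected seed, execute it on the target interpreter, keep
    interesting test cases, and collect (test case, reward) pairs that
    form the finetuning dataset D_T for the next CovRL update."""
    finetune_data = []
    for _ in range(iter_cycle):
        seed = random.choice(seed_queue)            # SelectSeed (sketch)
        test_case = mutate(seed)                    # LLM-based mutation
        is_valid, coverage = execute(test_case)     # validity info + coverage map
        if coverage.get("new_edges", 0) > 0:        # IsInteresting (sketch)
            seed_queue.append(test_case)
            reward = calc_reward(is_valid, coverage)
            finetune_data.append((test_case, reward))
    return finetune_data                            # handed to FinetuneCovRL

def kl_adjusted_reward(r_w, logp_current, logp_previous):
    """Eq. (8) sketch: combine the coverage-based reward r(W) with the
    log-ratio of the current policy to the previous policy."""
    return r_w + (logp_current - logp_previous)
```

Here `finetune_data` plays the role of D_T; in CovRL-Fuzz the rewarder and the mutator would then be finetuned on it once per cycle.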

Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation ISSTA ’24, September 16–20, 2024, Vienna, Austria
Benchmarks. We tested it on four JavaScript interpreters, using the latest versions as of January 2023: JavaScriptCore (2.38.1), ChakraCore (1.13.0.0-beta), V8 (11.4.73), and JerryScript (3.0.0). We also conducted additional experiments on QuickJS (2021-03-27), Jsish (3.5.0), escargot (bd95de3c), Espruino (2v20), and Hermes (0.12.0) for the real-world bug detection experiments.
We built each target JavaScript interpreter with AddressSanitizer (ASAN) [52] to detect bugs related to abnormal memory access, and in debug mode to find bugs related to undefined behavior.

Fuzzing Campaign. For a fair evaluation, we used the same set of 100 valid seeds. For RQ1 and RQ2, we operated on 3 CPU cores, considering the other fuzzing approaches. For RQ3, we used a single CPU core. For RQ4, we also used 3 CPU cores and conducted experiments with the four major JavaScript interpreters and five additional ones. To account for the randomness of fuzzing, we executed each fuzzer five times and then averaged the coverage results. Additionally, to ensure fairness, the measured results of each experiment include the finetuning time spent by CovRL; the average finetuning time is 10 minutes, occurring every 2.5 hours. Each tool was run on the four JavaScript interpreters with default configurations, the details of which can be found in Table 2.

Metrics. We use three metrics for evaluation.
• Code Coverage represents the range of the software's code that has been executed. We adopt edge coverage from AFL's coverage bitmap, following the FairFuzz [30] and Evaluate-Fuzz-Testing [25] settings. We compared coverage in two categories: total and valid. Total refers to the coverage across all test cases, while valid refers to the coverage for valid test cases only. We also employed the Mann-Whitney U-test [37] to assess statistical significance and verified that all p-values were less than 0.05.
• Error Rate measures the rate of syntax errors and semantic errors in the generated test cases. The metric provides insight into how effectively each method explores the core logic of the target software. For detailed analysis, semantic errors are categorized into type errors, reference errors, URI errors, and internal errors based on the ECMA standard [15]. It should be noted that while COMFORT [67] utilized jshint [24] for measurement, focusing its error rate on syntax errors, we used the JavaScript interpreters themselves, allowing us to measure an error rate that includes both syntax and semantic errors.
• Bug Detection is what the fuzzer is ultimately trying to find.

Table 2: Baseline fuzzers targeting JavaScript interpreters. CGF indicates the use of coverage-guided fuzzing, LLM denotes the usage of LLMs, and Mutation Level refers to the unit of mutation.

Fuzzer | CGF | LLM | Mutation Level | Post Processing
Coverage-guided baselines:
AFL (w/Dict) [39] | ✓ | | Bit/Byte |
Superion [61] | ✓ | | Grammar |
Token-Level AFL [50] | ✓ | | Token |
LM baselines:
Montage [29] | | | Grammar | ✓
COMFORT [67] | | ✓ | Grammar | ✓
CovRL-Fuzz | ✓ | ✓ | Token |

Table 3: Comparison with other JavaScript interpreter fuzzers listed in Table 2.

Target | Fuzzer | Error (%) | Coverage (Valid) | Coverage (Total) | Improv Ratio (%)
V8 | AFL (w/Dict) | 96.90 | 29,929 | 33,531 | 134.79
V8 | Superion | 77.35 | 33,812 | 36,985 | 112.87
V8 | Token-Level AFL | 84.10 | 39,582 | 42,303 | 86.11
V8 | Montage | 56.24 | 38,856 | 40,155 | 96.06
V8 | Montage (w/o Import) | 94.08 | 33,487 | 36,338 | 116.66
V8 | COMFORT | 79.66 | 44,324 | 46,522 | 69.23
V8 | CovRL-Fuzz | 48.68 | 75,240 | 78,729 | -
JSC | AFL (w/Dict) | 74.42 | 18,343 | 20,496 | 215.86
JSC | Superion | 72.02 | 17,619 | 19,772 | 227.42
JSC | Token-Level AFL | 69.70 | 52,385 | 53,719 | 20.51
JSC | Montage | 42.34 | 55,511 | 56,861 | 13.85
JSC | Montage (w/o Import) | 93.72 | 43,861 | 47,754 | 35.57
JSC | COMFORT | 79.64 | 36,074 | 36,542 | 77.16
JSC | CovRL-Fuzz | 48.59 | 61,137 | 64,738 | -
Chakra | AFL (w/Dict) | 81.32 | 83,038 | 87,587 | 27.30
Chakra | Superion | 42.63 | 92,314 | 94,237 | 18.32
Chakra | Token-Level AFL | 90.64 | 92,621 | 95,677 | 16.54
Chakra | Montage | 82.21 | 101,470 | 103,589 | 7.63
Chakra | Montage (w/o Import) | 94.72 | 90,940 | 98,643 | 13.03
Chakra | COMFORT | 79.47 | 81,171 | 83,142 | 34.11
Chakra | CovRL-Fuzz | 54.87 | 105,121 | 111,498 | -
Jerry | AFL (w/Dict) | 77.32 | 9,307 | 14,259 | 63.03
Jerry | Superion | 86.23 | 8,944 | 15,061 | 54.35
Jerry | Token-Level AFL | 80.52 | 14,361 | 17,152 | 35.53
Jerry | Montage | 95.55 | 13,114 | 13,285 | 74.98
Jerry | Montage (w/o Import) | 95.34 | 12,662 | 15,598 | 49.03
Jerry | COMFORT | 79.83 | 12,268 | 14,026 | 65.74
Jerry | CovRL-Fuzz | 58.84 | 20,844 | 23,246 | -
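Because the error rate is measured with the interpreters themselves, each test case's stderr can be mapped to the ECMA error categories directly. Below is a minimal sketch of such a classifier; the function names and the string matching are our own illustration, not the paper's tooling, and in the real pipeline the string would come from running the target interpreter on the test case.

```python
import re

# ECMA-262 error classes used for the error-rate metric; everything
# except SyntaxError counts as a semantic error.
ERROR_TYPES = ("SyntaxError", "TypeError", "ReferenceError",
               "URIError", "InternalError")

def classify_error(stderr: str) -> str:
    """Map an interpreter's stderr output to an error category,
    or 'valid' if no known error class is reported."""
    for name in ERROR_TYPES:
        if re.search(rf"\b{name}\b", stderr):
            return name
    return "valid"

def is_semantic_error(kind: str) -> bool:
    """Semantic errors are all recognized categories except SyntaxError."""
    return kind in ERROR_TYPES and kind != "SyntaxError"
```

Counting the classified outcomes over all unique test cases yields the syntax/semantic breakdown reported in the evaluation.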

5.2 RQ1. Comparison with Existing Fuzzers
To answer RQ1, we compare CovRL-Fuzz with fuzzers targeting JavaScript interpreters from open-source projects, focusing on those that are either coverage-guided [39, 50, 61] or LM-based [29, 65, 67], listed in Table 2, with a 24-hour timeout. In the case of Montage, it imports code from its test suite corpus, which might affect coverage by increasing the amount of executed code. As a result, we included a version of Montage (w/o Import) in our experimental study, which does not import the other test suites. In the case of COMFORT, we evaluated it solely as a black-box fuzzer, excluding the differential testing component. We focused on finding bugs in general JavaScript interpreters, so we did not include JIT fuzzers [3, 62] targeting the JIT compilers embedded in some JavaScript engines.

Code Coverage. Table 3 depicts the valid and total coverage for each fuzzing technique. The results of our evaluation demonstrate that CovRL-Fuzz outperforms state-of-the-art JavaScript interpreter fuzzers. Our observation revealed that CovRL-Fuzz attained the highest coverage across all target interpreters, resulting in an average increase of 102.62%/98.40%/19.49%/57.11% in edge coverage. To emphasize the effectiveness of CovRL-Fuzz, we monitored the growth trend of edge coverage, depicted in Figure 4. In every experiment, CovRL-Fuzz consistently achieved the highest edge coverage more

ISSTA ’24, September 16–20, 2024, Vienna, Austria Jueon Eom, Seyeon Jeong, and Taekyoung Kwon
rapidly than any other fuzzer. In contrast to the coverage-guided baselines, CovRL-Fuzz immediately and significantly achieved higher coverage. This suggests that the LLM-based mutation of CovRL-Fuzz is more effective than the heuristic mutations of coverage-guided fuzzing. CovRL-Fuzz also achieved high coverage compared to the LM baselines. However, in ChakraCore, there was only a marginal difference in coverage between Montage and CovRL-Fuzz, which can be attributed to Montage's strategy of importing and executing code from its own test suite corpus, achieving higher coverage as a result. In this respect, without the code import feature (Montage w/o Import), CovRL-Fuzz recorded significantly higher coverage than Montage. Additionally, as shown in Figure 4, while Montage's coverage tends to converge over time, CovRL-Fuzz's coverage continues to increase. This suggests that CovRL-Fuzz is likely to obtain even more coverage than Montage as time progresses. Note that, while the other LM baselines did not account for training time, CovRL-Fuzz included the time required for CovRL-based finetuning during the experiment. Additionally, we observed that CovRL-Fuzz continues to increase coverage even near the 24-hour mark, displaying its effectiveness in obtaining coverage.

Figure 4: Number of edges covered by CovRL-Fuzz and other JavaScript interpreter fuzzers. The solid line represents the average coverage, while the shaded region depicts the range between the lowest and highest values across five runs.

Syntax and Semantic Correctness. CovRL-Fuzz is not a grammar-level fuzzing approach that post-processes for syntax and semantic validity. However, it is assumed that CovRL-Fuzz, which uses RL-based finetuning driven by the reward signal derived from testing results, can achieve higher validity than random fuzzing (Token-Level AFL). To verify this assumption, we evaluated the error rate of unique test cases.
The experimental results are shown in Table 3. CovRL-Fuzz demonstrated a lower error rate than Token-Level AFL for all JavaScript interpreters. Furthermore, CovRL-Fuzz showed a lower error rate in comparison to most of the fuzzers. While it did not achieve the lowest error rate in JavaScriptCore (JSC) and ChakraCore (Chakra), CovRL-Fuzz still induced a significantly lower error rate than most of the baselines. Please note that the high error rate of Montage (w/o Import) is due to its inability to access functions from the other test suites. For a more detailed analysis of the error rate, we analyzed the types of errors triggered by the fuzzers on V8, which is the largest and most dependable JavaScript interpreter, as shown in Figure 5. The results showed that CovRL-Fuzz triggered fewer syntax errors in comparison to the coverage-guided baselines. Furthermore, it also produced fewer syntax and semantic errors than the LM baselines, even without using the post-processing techniques used by COMFORT and Montage. These results indicate that CovRL-Fuzz is successful in reducing error rates exclusively through CovRL, without requiring heuristic post-processing.

Figure 5: The error rate of generated test cases on V8. The four error types, excluding syntax error, are classified as semantic errors.

Table 4: Unique bugs discovered by CovRL-Fuzz and other JavaScript interpreter fuzzers.

Target Bug Type AFL Superion TokenAFL Montage COMFORT CovRL-Fuzz
JSC Undefined Behavior ✓ ✓
JSC Out-of-bounds Read ✓
Chakra Undefined Behavior ✓ ✓ ✓
Chakra Out of Memory ✓
Chakra Out of Memory ✓
Jerry Undefined Behavior ✓ ✓ ✓ ✓
Jerry Memory Leak ✓ ✓ ✓
Jerry Undefined Behavior ✓ ✓ ✓
Jerry Undefined Behavior ✓
Jerry Heap Buffer Overflow ✓
Jerry Out of Memory ✓
Jerry Stack Overflow ✓
Jerry Undefined Behavior ✓
Jerry Heap Buffer Overflow ✓
Total 2 3 4 0 1 14

Finding bugs. To determine whether the coverage improvement and low error rate achieved by CovRL-Fuzz aid in detecting bugs, we conducted experiments with JavaScript interpreters compiled in debug mode with ASAN. We relied on the output reports generated

by ASAN for stack trace analysis to eliminate duplicate bugs. We also manually analyzed and categorized the results by bug type.
Table 4 shows the number and types of unique bugs found by CovRL-Fuzz and the compared fuzzers. CovRL-Fuzz discovered the most unique bugs. In detail, CovRL-Fuzz found 14 unique bugs, and 9 of these bugs were exclusively detected by CovRL-Fuzz, including a stack overflow and heap buffer overflows. These results highlight its effectiveness in bug detection. As observed in the experimental results, LM-based fuzzers, despite achieving higher coverage, tend to find fewer bugs, while heuristic fuzzers, although achieving lower coverage, generally find more bugs. Irrespective of this trend, CovRL-Fuzz demonstrated superior performance, effectively discovering the most bugs.

5.3 RQ2. Comparison of Fuzzers Using LLM-Based Mutation
To answer RQ2, we conducted 24-hour experiments with Fuzz4All, a state-of-the-art LLM-based fuzzing technique using mutation by prompting that is applicable to compilers. While TitanFuzz [12] and FuzzGPT [13] both employ LLM-based mutations, their use of hand-crafted annotations, prompts, and mutation patterns is specifically designed for deep learning libraries. This specialization makes them challenging to adopt for JavaScript interpreters, which is why they were not included in our experimental subjects.
Table 5 presents the results of comparing the coverage and error rate between CovRL-Fuzz and Fuzz4All [65]. Since the coverage and error rates of Fuzz4All and CovRL-Fuzz did not show significant differences, it is difficult to assert that the coverage improvement and error rate achieved by Fuzz4All significantly contributed to finding bugs. Table 6 shows the bugs found by both Fuzz4All and CovRL-Fuzz. The bugs identified by Fuzz4All were only a small subset of those discovered by CovRL-Fuzz. These results mean that our method, which combines coverage-guided fuzzing and LLM-based mutation through CovRL, is more useful for bug detection than the state-of-the-art LLM-based fuzzing technique.

Table 5: Comparison with the state-of-the-art fuzzer using LLM-based mutation.

Target | Fuzzer | Error (%) | Coverage (Valid) | Coverage (Total) | Improv Ratio (%)
V8 | Fuzz4All | 48.56 | 57,524 | 60,153 | 30.88
V8 | CovRL-Fuzz | 48.68 | 75,240 | 78,729 | -
JSC | Fuzz4All | 60.16 | 47,705 | 48,765 | 32.76
JSC | CovRL-Fuzz | 48.59 | 61,137 | 64,738 | -
Chakra | Fuzz4All | 50.77 | 97,723 | 99,329 | 12.25
Chakra | CovRL-Fuzz | 54.87 | 105,121 | 111,498 | -
Jerry | Fuzz4All | 72.30 | 18,895 | 22,681 | 2.49
Jerry | CovRL-Fuzz | 58.84 | 20,844 | 23,246 | -

Table 6: Unique bugs discovered by CovRL-Fuzz and the compared LLM-based fuzzer.

Target | Bug Type | Fuzz4All | CovRL-Fuzz
JSC | Undefined Behavior | | ✓
JSC | Out-of-bounds Read | | ✓
Chakra | Undefined Behavior | | ✓
Chakra | Out of Memory | | ✓
Chakra | Out of Memory | | ✓
Jerry | Undefined Behavior | ✓ | ✓
Jerry | Memory Leak | ✓ | ✓
Jerry | Undefined Behavior | | ✓
Jerry | Undefined Behavior | | ✓
Jerry | Heap Buffer Overflow | | ✓
Jerry | Out of Memory | | ✓
Jerry | Stack Overflow | | ✓
Jerry | Undefined Behavior | | ✓
Jerry | Heap Buffer Overflow | ✓ | ✓
Total | | 3 | 14

5.4 RQ3. Ablation Study
To answer RQ3, we conducted an ablation study on the two key components of CovRL-Fuzz, CovRL and CWR, based on coverage-guided fuzzing, with a 5-hour timeout. Table 7 shows the error rates and coverage for the different CovRL and CWR variants.
Impact of CovRL. To evaluate the impact of CovRL, we conducted a comparison with w/o LLM (TokenAFL [50]), which uses token-level heuristic mutation; LLM w/o CovRL, which simply applies LLM-based mutation to coverage-guided fuzzing; and LLM w/CovRL, which represents our CovRL-Fuzz. The experimental results indicated that while LLM w/o CovRL successfully decreased the error rate in comparison to w/o LLM, it did not significantly improve, and even slightly decreased, coverage. In contrast, LLM w/CovRL demonstrated coverage improvement for all targets. These findings suggest that applying LLM-based mutation with CovRL in coverage-guided fuzzing is effective.
Impact of CWR. To evaluate the impact of the CWR used in CovRL-Fuzz compared to other rewards, we included w/o RL and two additional reward processes in our experiments. For rewarding, we compared two types of rewards: Coverage Reward (CR) and Coverage-Rate Reward (CRR). CR is a simple binary rewarding process we designed, where a reward of 1 is given to test cases that find new coverage and a penalty of 0 is assigned to those that do not. CRR, on the other hand, is a reward used in traditional RL-based fuzzing techniques [5, 32, 33], calculated as the ratio of the current coverage to the total cumulative coverage.
In the experimental results, w/CR and w/CRR showed little to no increase in coverage compared to w/o RL. However, CovRL-Fuzz (w/CWR) achieved the highest coverage, both valid and total, and exhibited a low error rate. These results suggest that coverage-guided fuzzing using CWR effectively contributes to improving coverage with LLM-based mutation.

5.5 RQ4. Real-World Bugs
To answer RQ4, we evaluated the ability of CovRL-Fuzz to find real-world bugs during a specific period of fuzzing. Specifically, we investigated how many real-world bugs CovRL-Fuzz can find and whether it can discover previously unknown bugs. Thus, we evaluated whether CovRL-Fuzz could find real-world bugs over 2 weeks of fuzzing per target. We tested the latest version of each target interpreter as of January 2023. Table 8 summarizes the bugs found by CovRL-Fuzz. A total of 58 bugs were identified, of which 50 were previously unknown bugs (15 have been registered as CVEs). Out of the discovered bugs, 45 were confirmed by developers, and 18 have been fixed. The CVEs we identified have an average risk score of 7.5 according to CVSS v3.1, with some reaching as high as 9.8.

Table 7: The ablation study with each variant. Improv (%) refers to the improvement ratio in total coverage compared to the baseline (w/o LLM for the CovRL rows, w/o RL for the CWR rows).

Impact of CovRL
Target | Variant | Error (%) | Coverage (Valid) | Coverage (Total) | Improv (%)
V8 | w/o LLM | 88.79 | 44,705 | 53,936 | -
V8 | LLM w/o CovRL | 62.68 | 55,459 | 56,576 | 4.89
V8 | LLM w/CovRL | 61.53 | 71,319 | 74,574 | 38.26
JSC | w/o LLM | 87.45 | 35,406 | 37,461 | -
JSC | LLM w/o CovRL | 55.40 | 41,523 | 42,385 | 13.14
JSC | LLM w/CovRL | 49.60 | 56,370 | 58,340 | 55.74
Chakra | w/o LLM | 78.98 | 81,393 | 83,785 | -
Chakra | LLM w/o CovRL | 45.25 | 86,043 | 86,858 | 3.67
Chakra | LLM w/CovRL | 58.42 | 96,257 | 98,221 | 17.23
Jerry | w/o LLM | 87.39 | 12,312 | 14,795 | -
Jerry | LLM w/o CovRL | 78.48 | 12,833 | 14,068 | -4.91
Jerry | LLM w/CovRL | 58.59 | 17,481 | 19,855 | 34.20

Impact of CWR
Target | Variant | Error (%) | Coverage (Valid) | Coverage (Total) | Improv (%)
V8 | w/o RL | 62.68 | 55,459 | 56,576 | -
V8 | CovRL w/CR | 71.77 | 55,678 | 57,735 | 2.05
V8 | CovRL w/CRR | 74.15 | 57,401 | 61,331 | 8.40
V8 | CovRL w/CWR | 61.53 | 71,319 | 74,574 | 31.81
JSC | w/o RL | 55.40 | 41,523 | 42,385 | -
JSC | CovRL w/CR | 53.00 | 47,116 | 49,083 | 15.80
JSC | CovRL w/CRR | 69.57 | 37,230 | 43,369 | 2.32
JSC | CovRL w/CWR | 49.60 | 56,370 | 58,340 | 37.64
Chakra | w/o RL | 45.25 | 86,043 | 86,858 | -
Chakra | CovRL w/CR | 67.50 | 92,465 | 94,145 | 8.39
Chakra | CovRL w/CRR | 65.47 | 91,427 | 94,785 | 9.13
Chakra | CovRL w/CWR | 58.42 | 96,257 | 98,221 | 13.08
Jerry | w/o RL | 78.48 | 12,833 | 14,068 | -
Jerry | CovRL w/CR | 73.35 | 16,689 | 18,629 | 32.42
Jerry | CovRL w/CRR | 75.34 | 16,118 | 18,584 | 32.10
Jerry | CovRL w/CWR | 58.59 | 17,481 | 19,855 | 41.14
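The two scalar baseline rewards ablated in Table 7 can be written out directly. This is a sketch following the definitions of CR and CRR in Section 5.4; the function names are our own.

```python
def coverage_reward(new_edges: int) -> int:
    """CR: binary reward, 1 if the test case discovered new coverage,
    0 otherwise."""
    return 1 if new_edges > 0 else 0

def coverage_rate_reward(current_coverage: int, cumulative_coverage: int) -> float:
    """CRR: the ratio of the current coverage to the total cumulative
    coverage, as used in prior RL-based fuzzing work."""
    if cumulative_coverage == 0:
        return 0.0
    return current_coverage / cumulative_coverage
```

CWR, by contrast, weights individual coverage-map entries (cf. the weighted coverage map described earlier in the paper) rather than collapsing coverage feedback to a single scalar as CR and CRR do.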

CovRL-Fuzz found a variety of bugs, including undefined behaviors such as assertion failures, as well as memory bugs such as buffer overflow and use-after-free. Please refer to the Appendix for a detailed list of bugs found by CovRL-Fuzz. Note that the experiment was carried out using only 3 cores and for a relatively short duration. In contrast, other fuzzing techniques have utilized an average of around 30 cores and conducted their experiments over a whole month [29, 50, 61, 67]. Despite these significant constraints, CovRL-Fuzz was still able to find a substantial number of unknown bugs. This suggests that CovRL-Fuzz is effective at finding real-world bugs within JavaScript interpreters.

Case Study. Figure 6a shows a minimized test case generated by CovRL-Fuzz. The code triggered an out-of-bounds read bug in ChakraCore 1.13.0, causing an abnormal termination of the JavaScript interpreter. The original seed does not assign await to var c. CovRL-Fuzz changed it to var c=await n(); and added the await statement on line 5, and also changed the condition of the if conditional. This caused the logic to call await n(); repeatedly, which ultimately led to the bug.
Figure 6b shows a minimized test case generated by CovRL-Fuzz that causes a heap buffer overflow in the release version of JerryScript 3.0.0. The bug occurs when a function declaration comes on the line following the declaration of a static initialization block in a class. When the parser read the statement, it did not correctly distinguish the range of the static initialization block. As a result, memory corruption occurred when parsing the function statement. In contrast to other fuzzing tools, CovRL-Fuzz is largely free of grammatical restrictions and allows for context-aware mutation; this capability led to the discovery of the bug. Our case study further demonstrates the effectiveness of CovRL-Fuzz in detecting real-world bugs.

1  function i ( t ) { }
2  async function n ( t ) {
3      if ( t instanceof i ) {
4          let c = await i ( ) ;
5          await c >> i ( n ) ;
6      } else {
7          var c = await n ( ) ;
8      }
9  }
10 n ( true ) ;

a) The test case that triggers an out-of-bounds read on ChakraCore 1.13.0.0-beta (#13).

1  class s extends WeakMap {
2      static {} ;
3  }
4  function f ( )

b) The test case that triggers a heap buffer overflow on JerryScript 3.0.0 (#24).

Figure 6: Minimized test cases for bugs found by CovRL-Fuzz.

Table 8: The summary of bugs found by CovRL-Fuzz.

Target | Version | Total | Confirmed | Unknown (Fixed) | CVE
V8 | 11.4.73 | 2 | 2 | 0 (0) | 0
JavaScriptCore | 2.38.1 | 3 | 3 | 1 (1) | 0
ChakraCore | 1.13.0.0-beta | 9 | 8 | 7 (0) | 0
JerryScript | 3.0.0 | 14 | 2 | 12 (0) | 10
QuickJS | 2021-03-27 | 2 | 2 | 2 (1) | 1
Jsish | 3.5.0 | 4 | 4 | 4 (0) | 2
escargot | bd95de3c | 21 | 21 | 21 (14) | 0
Espruino | 2v20 | 2 | 2 | 2 (2) | 2
Hermes | 0.12.0 | 1 | 1 | 1 (0) | 0
Total | | 58 | 45 | 50 (18) | 15

6 Related Work
LLM-based Fuzzing. Although not related to JavaScript interpreter fuzzing, there have been proposals to test Python deep learning libraries using LLMs [12, 13]. TitanFuzz [12] utilized Codex [8] for seed generation and InCoder [17] for mutation by mask. It also aimed to select interesting seeds, not by using internal program information such as coverage, but by utilizing static analysis information about the seeds. FuzzGPT [13] was designed to guide mutations around buggy functions by either training on GitHub bug report data or utilizing snippets from such reports.
RL-based Fuzzing. RL-based fuzzing approaches [5, 32, 33] seek to enhance performance by incorporating code coverage feedback into deep learning models, such as deep neural networks (DNNs) and recurrent neural networks (RNNs), rather than relying solely on coverage for seed selection. They provide feedback using coverage as a reward. For this purpose, they process the data into a quantified reward, namely the ratio of the current coverage relative to the total cumulative coverage. We previously referred to this reward as the Coverage-Rate Reward (CRR).
Neural Network-based coverage guidance. Similar to our approach, there have been attempts to use incremental learning for

coverage guidance, such as NEUZZ [54] and MTFuzz [53]. However, unlike our method, they use neural networks to identify mutation positions that influence branching behaviors. Consequently, NEUZZ and MTFuzz face the same limitations as traditional fuzzing when applied to software requiring strict grammar adherence, as they do not directly control the mutations themselves. In contrast, our approach leverages an LLM to mutate the seeds directly, effectively addressing these constraints. By allowing the LLM to handle mutations, we ensure that the generated inputs adhere to grammatical rules to a significant extent.

7 Discussion
We discuss the following properties of CovRL-Fuzz:

Time spent between fuzzing and finetuning. As mentioned in the experimental setup, we calculated the fuzzing time to ensure fairness, including the time spent on finetuning in the experimental results. On average, finetuning runs for 10 minutes every 2.5 hours of fuzzing. Despite including the finetuning time in the experiment, CovRL-Fuzz achieved high coverage while also decreasing the error rate.

Performance bottleneck. In our experimental environment, we observed that CovRL-Fuzz experienced approximately twice the delay in comparison to Token-Level AFL. While the degree of the discrepancy may vary depending on the target software, our results were generally consistent. Specifically, we found that 73% of the time required to generate a single test case was consumed by the LLM-based mutation process, potentially creating a performance bottleneck. Despite this, our experimental results demonstrated that even though the mutation speed was significantly slowed by the LLM-based approach, CovRL-Fuzz still achieved higher performance. This indicates the efficiency of our methodology, suggesting that the performance improvements brought by the LLM-based mutations outweigh the drawbacks of the increased processing time.

Catastrophic forgetting during finetuning. We mitigated catastrophic forgetting by applying two simple strategies: finetuning with a small learning rate and using some of the original seed data for finetuning. These measures prevented forgetting in our experiments. However, we cannot entirely rule out the possibility of future occurrences. This issue is not unique to our work but is a general problem associated with finetuning LLMs. We believe that using the latest prevention techniques [26] could address this issue more effectively in the future.

Supporting other targets. Through finetuning, the core idea of guiding coverage information directly with the LLM-based mutator is actually language-agnostic, which suggests its applicability to other language interpreters or compilers. However, our focus was more on analyzing the suitability of our idea against existing techniques than on supporting various languages. Therefore, we conducted experiments only on JavaScript interpreters, which we deemed to have the most impact. Extending to other targets is left as future work.

Application to other LLMs. CovRL-Fuzz is not limited to a specific LLM but is a general strategy applicable to other open-source LLMs as well. Among the base models listed in Table 1, we chose CodeT5+ [63] for our experiments because it showed the most promising results despite being the smallest in size. Due to computational resource constraints, we conducted experiments using LLMs with a maximum size of 1B. Even with these constraints, our technique is designed to be model-agnostic, and we believe it can be effectively applied to larger LLMs. We leave this exploration for future work.

8 Conclusion
We introduced CovRL-Fuzz, a novel LLM-based coverage-guided fuzzing framework that integrates coverage-guided reinforcement learning for the first time. Our approach directly integrates coverage feedback into LLM-based mutation, enhancing coverage-guided fuzzing to reduce syntax limitations while enabling effective testing that achieves broader exploration of hidden paths in the JavaScript interpreter. Our evaluation results affirmed the superior efficacy of the CovRL-Fuzz methodology in comparison to existing fuzzing strategies. Impressively, it discovered 58 real-world security-related bugs with 15 CVEs in JavaScript interpreters; among these, 50 were previously unknown bugs. We believe that our methodology paves the way for future studies focused on harnessing LLMs with coverage feedback for software testing.

Acknowledgments
We deeply thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00439762, Developing Techniques for Analyzing and Assessing Vulnerabilities and Tools for Confidentiality Evaluation in Generative AI Models).

References
[1] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep Bugs with Grammars. In Proceedings of the 2019 Network and Distributed System Security Symposium. 15 pages. https://doi.org/10.14722/ndss.2019.23412
[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv:2108.07732 [cs.PL]
[3] Lukas Bernhard, Tobias Scharnowski, Moritz Schloegel, Tim Blazytko, and Thorsten Holz. 2022. JIT-picking: Differential fuzzing of JavaScript engines. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 351–364.
[4] Tim Blazytko, Matt Bishop, Cornelius Aschermann, Justin Cappos, Moritz Schlögel, Nadia Korshun, Ali Abbasi, Marco Schweighauser, Sebastian Schinzel, Sergej Schumilo, et al. 2019. GRIMOIRE: Synthesizing structure while fuzzing. In 28th USENIX Security Symposium (USENIX Security 19). 1985–2002.
[5] Konstantin Böttinger, Patrice Godefroid, and Rishabh Singh. 2018. Deep reinforcement fuzzing. In 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 116–122.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33. 1877–1901.
[7] Charlie Miller. 2008. Fuzz by number. https://www.ise.io/wp-content/uploads/2019/11/cmiller_cansecwest2008.pdf. Accessed: 2024-01-12.
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374 [cs.LG]
[9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv:2204.02311 [cs.CL]

[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv:2110.14168 [cs.LG]
[11] Chris Cummins, Pavlos Petoumenos, Alastair Murray, and Hugh Leather. 2018. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 95–105.
[12] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023).
[13] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023. Large language models are edge-case fuzzers: Testing deep learning libraries via FuzzGPT. arXiv:2304.02014 [cs.SE]
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs.CL]
[15] ECMA International. 1997. ECMAScript language specification. https://www.ecma-international.org/ecma-262/. Accessed: 2023-08-15.
[16] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1469–1481.
[17] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh International Conference on Learning Representations.
[18] Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&Fuzz: Machine learning for input fuzzing. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017). IEEE, 50–59. https://doi.org/10.1109/ASE.2017.8115618
[19] Google. 2017. Chromium Issue 729991. https://bugs.chromium.org/p/chromium/issues/detail?id=729991. Accessed: 2023-08-14.
[20] Samuel Groß, Simon Koch, Lukas Bernhard, Thorsten Holz, and Martin Johns. 2023. FUZZILLI: Fuzzing for JavaScript JIT Compiler Vulnerabilities. In NDSS.
[21] HyungSeok Han, DongHyeon Oh, and Sang Kil Cha. 2019. CodeAlchemist: Semantics-Aware Code Generation to Find Vulnerabilities in JavaScript Engines. In Proceedings 2019 Network and Distributed System Security Symposium. 15 pages. https://doi.org/10.14722/ndss.2019.23263
[22] Xiaoyu He, Xiaofei Xie, Yuekang Li, Jianwen Sun, Feng Li, Wei Zou, Yang Liu, Lei Yu, Jianhua Zhou, Wenchang Shi, et al. 2021. SoFi: Reflection-augmented fuzzing for JavaScript engines. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2229–2242. https://doi.org/10.1145/3460120.3484823
[23] Choongwoo Han. 2010. js-vuln-db. https://github.com/tunz/js-vuln-db. Accessed: 2023-08-15.
[24] JSHint. 2013. JSHint: A JavaScript Code Quality Tool. https://jshint.com/. Accessed: 2023-08-15.
[25] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2123–2138. https://doi.org/10.1145/3243734.3243804
[26] Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. [n. d.]. Understanding Catastrophic Forgetting in Language Models via Implicit Inference. In The Twelfth International Conference on Learning Representations.
[27] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328.
[28] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv:2309.00267 [cs.CL]
[29] Suyoung Lee, HyungSeok Han, Sang Kil Cha, and Sooel Son. 2020. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 2613–2630. https://www.usenix.org/system/files/sec20-lee-suyoung.pdf
[30] Caroline Lemieux and Koushik Sen. 2018. FairFuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 475–485.
[31] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[32] Xiaoting Li, Xiao Liu, Lingwei Chen, Rupesh Prajapati, and Dinghao Wu. 2022. ALPHAPROG: reinforcement generation of valid programs for compiler fuzzing. In Proceedings of the AAAI Conference on Artificial Intelligence. 12559–12565.
[33] Xiaoting Li, Xiao Liu, Lingwei Chen, Rupesh Prajapati, and Dinghao Wu. 2022. FuzzBoost: Reinforcement Compiler Fuzzing. In Information and Communications Security: 24th International Conference, ICICS 2022, Canterbury, UK, September 5–8, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 359–375.
[34] Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023. RLTF: Reinforcement Learning from Unit Test Feedback. arXiv:2307.04349 [cs.AI]
[35] Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu. 2019. DeepFuzz: Automatic generation of syntax valid C programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence. 1044–1051. https://doi.org/10.1609/aaai.v33i01.33011044
[36] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[37] Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.
[38] Matt Molinyawe, Abdul-Aziz Hariri, and Jasiel Spelman. 2016. $hell on Earth: From Browser to System Compromise. In Black Hat USA.
[39] Michal Zalewski. 2013. AFL: American Fuzzy Lop. https://lcamtuf.coredump.cx/afl/. Accessed: 2023-08-15.
[40] Mihai Bazon. 2010. UglifyJS. https://github.com/mishoo/UglifyJS. Accessed: 2023-08-14.
[41] Barton P Miller, Lars Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33, 12 (Dec. 1990), 32–44. https://doi.org/10.1145/96267.96279
[42] OpenAI. 2024. GPT-4. https://openai.com/gpt-4. Accessed: 2024-03-22.
[43] OpenAI. 2024. OpenAI API. https://openai.com/index/openai-api. Accessed: 2024-07-12.
[44] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[45] Soyeon Park, Wen Xu, Insu Yun, Daehee Jang, and Taesoo Kim. 2020. Fuzzing JavaScript engines with aspect-preserving mutation. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1629–1642. https://doi.org/10.1109/SP40000.2020.00067
[46] Jibesh Patra and Michael Pradel. 2016. Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. Technical Report. TU Darmstadt, Department of Computer Science.
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[48] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (Jan. 2020), 5485–5551. https://dl.acm.org/doi/abs/10.5555/3455716.3455856
[49] Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, et al. 2023. Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback. arXiv:2306.00186 [cs.CL]
[50] Christopher Salls, Chani Jindal, Jake Corina, Christopher Kruegel, and Giovanni Vigna. 2021. Token-Level Fuzzing. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2795–2809. https://www.usenix.org/system/files/sec21-salls.pdf
[51] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG]
[52] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). 309–318.
[53] Dongdong She, Rahul Krishna, Lu Yan, Suman Jana, and Baishakhi Ray. 2020. MTFuzz: fuzzing with a multi-task neural network. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 737–749.
[54] Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, and Suman Jana. 2019. NEUZZ: Efficient fuzzing with neural program smoothing. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 803–817. https://doi.org/10.1109/SP.2019.00052
[55] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. 2023. Execution-based code generation using deep reinforcement learning. arXiv:2301.13816 [cs.LG]
[56] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[57] Technical Committee 39, ECMA International. 2010. Test262. https://github.com/tc39/test262. Accessed: 2023-08-15.
[58] Spandan Veggalam, Sanjay Rawat, Istvan Haller, and Herbert Bos. 2016. IFuzzer: An evolutionary interpreter fuzzer using genetic programming. In Computer Security – ESORICS 2016: 21st European Symposium on Research in Computer Security, Heraklion, Greece, September 26–30, 2016, Proceedings, Part I. Springer, Cham, 581–601. https://doi.org/10.1007/978-3-319-45744-4_29


[59] W3Techs. 2024. Usage statistics of JavaScript as client-side programming language on websites. https://w3techs.com/technologies/details/cp-javascript. Accessed: 2024-01-17.
[60] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2017. Skyfire: Data-driven seed generation for fuzzing. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 579–594. https://doi.org/10.1109/SP.2017.23
[61] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2019. Superion: Grammar-aware greybox fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 724–735. https://doi.org/10.1109/ICSE.2019.00081
[62] Junjie Wang, Zhiyi Zhang, Shuang Liu, Xiaoning Du, and Junjie Chen. 2023. FuzzJIT: Oracle-Enhanced Fuzzing for JavaScript Engine JIT Compiler. In 32nd USENIX Security Symposium (USENIX Security 23). 1865–1882.
[63] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. arXiv:2305.07922 [cs.CL]
[64] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations.
[65] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2023. Universal fuzzing via large language models. arXiv preprint arXiv:2308.04748 (2023).
[66] Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 959–971.
[67] Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang, Xiaoyang Sun, Lizhong Bian, Haibo Wang, and Zheng Wang. 2021. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. ACM, 435–450. https://doi.org/10.1145/3453483.3454054
[68] Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 21–29.

Received 2024-04-12; accepted 2024-07-03
