Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation

Abstract

JavaScript interpreters, crucial for modern web browsers, require an effective fuzzing method to identify security-related bugs. However, the strict grammatical requirements for input present significant challenges. Recent efforts to integrate language models for context-aware mutation in fuzzing are promising but lack the necessary coverage guidance to be fully effective. This paper presents a novel technique called CovRL (Coverage-guided Reinforcement Learning) that combines Large Language Models (LLMs) with Reinforcement Learning (RL) from coverage feedback. Our fuzzer, CovRL-Fuzz, integrates coverage feedback directly into the LLM by leveraging the Term Frequency-Inverse Document Frequency (TF-IDF) method to construct a weighted coverage map. This map is key in calculating the fuzzing reward, which is then applied to the LLM-based mutator through reinforcement learning. Through this approach, CovRL-Fuzz enables the generation of test cases that are more likely to discover new coverage areas, thus improving bug detection while minimizing syntax and semantic errors, all without needing extra post-processing. Our evaluation results show that CovRL-Fuzz outperforms state-of-the-art fuzzers in enhancing code coverage and identifying bugs in JavaScript interpreters: CovRL-Fuzz identified 58 real-world security-related bugs in the latest JavaScript interpreters, including 50 previously unknown bugs and 15 CVEs.

CCS Concepts

• Security and privacy → Software security engineering.

Keywords

fuzzing; coverage; reinforcement learning; large language model

ACM Reference Format:
Jueon Eom, Seyeon Jeong, and Taekyoung Kwon. 2024. Fuzzing JavaScript Interpreters with Coverage-Guided Reinforcement Learning for LLM-Based Mutation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '24), September 16–20, 2024, Vienna, Austria. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3650212.3680389

This work is licensed under a Creative Commons Attribution 4.0 International License.
ISSTA '24, September 16–20, 2024, Vienna, Austria
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0612-7/24/09
https://doi.org/10.1145/3650212.3680389

1 Introduction

JavaScript (JS) interpreters are essential for modern web browsers, enabling interactive web and embedded applications through the parsing, interpreting, compiling, and executing of JavaScript code. With JavaScript being employed as a client-side programming language by 98.9% of web browsers as of January 2024 [59], the security of JavaScript interpreters is crucial. Vulnerabilities can lead to severe security threats, including information disclosure and the bypassing of browser security measures [19, 38]. Given their critical role and complex nature, JavaScript interpreters require rigorous and continuous testing methods like fuzzing.

Previous research on fuzzing JavaScript interpreters primarily falls into two categories: grammar-level and token-level, each addressing the strict grammar requirements of JavaScript. Grammar-level fuzzing aims to generate grammatically correct inputs, ensuring syntactic accuracy [1, 20–22, 45, 46, 58, 60, 61]. Token-level fuzzing techniques adopt a more flexible approach by manipulating sequences of tokens without strict adherence to grammar rules [4, 50]. Both approaches have employed coverage-guided fuzzing, such as AFL [39], to enhance fuzzing effectiveness by promoting an exhaustive examination of code paths [4, 22, 45, 50, 61]. However, the evolving nature of the JavaScript language, with its constantly updating grammar, poses significant challenges. Grammar-level fuzzing focuses intensely on generating mutations that precisely follow syntax rules, limiting mutation diversity. Given fuzzing's ability to produce vast amounts of input per second, it can manage some noise and inaccuracies, even when not strictly adhering to grammar rules. This focus on syntax can paradoxically reduce mutation variety and constrain program path exploration. Token-level fuzzing, although more flexible, struggles with maintaining syntactical correctness over successive mutations, often leading to syntax errors and hindering the discovery of deeper bugs.

To address these challenges, recent advancements have led to research into fuzzing techniques that employ LLMs, which are adept at producing syntactically informed, well-formed inputs for compilers and JavaScript interpreters [65, 67]. A standout example is Fuzz4All [65], which utilizes pretrained Code-LLMs for compiler fuzzing. These models, trained on extensive datasets across various programming languages, can be applied to LLM-based mutation without further finetuning. They inherently grasp the language's context, enabling the generation of inputs that are grammatically accurate and contextually relevant, thereby enhancing fuzzing effectiveness. However, current LLM-based fuzzing methods are generally considered black-box fuzzing and lack integration with internal program information like code coverage. In contrast to black-box fuzzing, coverage-guided fuzzing utilizes internal program data to enhance fuzzing effectiveness. The technique employs
an evolutionary strategy for generating "interesting" seeds aimed at expanding the program's coverage, thus potentially increasing the likelihood of bug discovery. By effectively using code coverage feedback to guide the mutation of inputs, it can uncover bugs more efficiently than traditional black-box methods [7]. However, as we describe below, this is quite challenging.

Problem. LLMs typically generate sentences at the word level, which leads to the assumption that LLM-based mutations operate at the token level. Replacing traditional random mutators with pretrained LLM-based mutators in coverage-guided fuzzing generally reduces error rates but does not enhance coverage. Our experimental results, detailed in Table 1, show that using the AFL fuzzing tool, LLM-based mutations on V8 for five hours resulted in 12–16% lower coverage compared to the baseline in most cases. Even in cases where coverage increased, the rise was only slight. This suggests that while LLM-based mutations reduce errors, their constrained predictions may limit diversity and effectiveness.

Table 1: Average coverage achieved by four LLM-based mutations compared to random mutation, measured using AFL for 5 hours on the JavaScript interpreter V8 with a single core. Among these variants, the baseline is the Token-Level AFL [50] setting in Table 2, and the prompt setting used was "Please mutate the following program". The experiment for GPT-4 [42] was conducted by requesting and receiving mutations through the OpenAI API [43].

Strategy  LLM                  Error (%)  Coverage (Valid)  Coverage (Total)  Improv Ratio (%)
Random    ✗ (Baseline [50])    88.79%     44,705            53,936            -
Prompt    GPT-4 [42]           27.50%     46,454            46,738            -13.35%
Prompt    StarCoder (1B) [31]  82.72%     41,331            45,034            -16.50%
Mask      Incoder (1B) [17]    49.08%     46,427            47,385            -12.15%
Mask      CodeT5+ (220M) [63]  62.68%     55,459            56,576            4.89%

We hypothesize that LLM-based mutations, focusing on context, often predict common tokens and unintentionally reduce diversity. This is similar to grammar-level mutations, which target grammatical accuracy but also limit variability. Consequently, in coverage-guided fuzzing, while LLM-based mutators decrease errors, their context-aware approach diminishes diversity, making them less effective than random fuzzing.

Our Approach. To address the limitation of LLM-based mutations, we propose a novel technique that integrates coverage-guided feedback directly into the mutation process. Our approach leverages internal program information to enhance fuzzing effectiveness, aiming to generate diverse mutations that go beyond mere grammatical correctness. We employ Term Frequency-Inverse Document Frequency (TF-IDF) [56] to weight coverage data, establishing a feedback-driven reward system. This method not only increases coverage but also improves the bug detection capabilities of LLM-based fuzzing, eliminating the need for additional post-processing. We term our approach CovRL-Fuzz. Unlike other LLM-based fuzzing techniques, CovRL-Fuzz is the first to effectively integrate LLM-based mutation with coverage-guided fuzzing, thereby distinguishing it from existing methods [65, 67].

To sum up, this paper makes the following contributions:

• We introduce CovRL, a novel method integrating LLMs with coverage feedback by reinforcement learning, using TF-IDF. We directly feed code coverage to LLMs via a new reward scheme.
• We implement CovRL-Fuzz, a new JavaScript interpreter fuzzer that outperforms existing methods in code coverage and bug detection by employing the CovRL technique.
• CovRL-Fuzz identified 58 real-world security-related bugs, including 50 unknown bugs (15 CVEs) in the latest JavaScript interpreters.
• To foster future research, we release our implementation of CovRL-Fuzz at https://github.com/seclab-yonsei/CovRL-Fuzz.

2 Background

2.1 JavaScript Interpreter Fuzzing

Fuzzing is a powerful automated method for finding software bugs [41] and is highly regarded in both academia and industry. However, it faces challenges with JavaScript interpreters due to their need for strictly grammatical input. When inputs are not syntactically correct, the JavaScript interpreter returns a syntax error, while semantic inconsistencies (e.g., errors with reference, type, range, or URI) can lead to semantic errors [21]. In both scenarios, the interpreter's internal logic, which may contain hidden bugs, is not executed. To tackle these challenges, researchers have developed grammar-level and token-level fuzzing methods. The grammar-level approach converts seeds into an Intermediate Representation (IR) to ensure grammatical accuracy of inputs [1, 20–22, 45, 46, 58, 60, 61]. While effective in maintaining syntax, adhering strictly to grammar rules limits mutation diversity, making it difficult to detect bugs caused by grammar violations or unexpected input patterns. On the other hand, token-level fuzzing offers greater flexibility by breaking inputs into tokens and replacing them selectively, enhancing bug detection capabilities [4, 50]. However, it often fails to maintain syntactical correctness due to its random substitution method, which does not consider relationships between tokens. Recent studies also explore bugs in Just-In-Time (JIT) compilers of JavaScript engines through differential testing of behavior with JIT enabled and disabled [3, 62].

LLM-based Fuzzing. Recently, there has been a shift toward using deep learning-based Language Models (LMs) in fuzzing to address the limitations of traditional methods. Initially, RNN-based LMs were used to mutate seeds [11, 18, 29, 35], and more recently, Large Language Models (LLMs) like GPTs [42, 47] and StarCoder [31] have been employed for seed generation and mutation [65, 67], benefiting from extensive training on diverse programming language datasets.

Coverage-guided Fuzzing. Coverage-guided fuzzing, which leverages coverage feedback to explore diverse code paths, has proven more effective than traditional black-box methods in detecting software bugs [7]. This approach, exemplified by tools like American Fuzzy Lop (AFL) [39], emphasizes maximizing code coverage through mutations [4, 22, 45, 50, 61], and has been particularly effective in finding security vulnerabilities. Despite the success of coverage-guided fuzzing, existing LLM-based fuzzing techniques, including COMFORT and Fuzz4All, primarily use black-box methods [65, 67] and have not achieved coverage-guided fuzzing for JavaScript interpreters.
Figure 2: Workflow of CovRL-Fuzz: The blue-shaded area illustrates the operation of CovRL.
comparable to other latest JavaScript interpreter fuzzing techniques in Section 5.2.

3.1 Phase 1. Mutation by Mask

To mutate the selected seed, CovRL-Fuzz performs a simple masking strategy for mutation by mask (① in Figure 2). Given the input sequence W = {w_1, w_2, ..., w_n}, we use three masking techniques: insert, overwrite, and splice. The strategy results in the mask sequence W^MASK = {[MASK], w_3, ..., w_k} and the masked sequence W^\MASK = {w_1, w_2, [MASK], ..., w_n}. The detailed operations are described as follows:

Insert: Randomly select positions and insert [MASK] tokens into the inputs.
Overwrite: Randomly select positions and replace existing tokens with the [MASK] token.
Splice: Statements within a seed are randomly divided into segments. A portion of these segments is replaced with a segment from another seed with [MASK], formatted as [MASK] statement [MASK].

After generating a masked sequence W^\MASK via masking, the seed is mutated by infilling the masked positions via the LLM-based mutator. The mutation design of CovRL-Fuzz is based on a span-based masked language model (MLM) that can predict variable-length masks [17, 48]. Therefore, the MLM loss function we utilize for mutation can be represented as follows:

    L_MLM(θ) = Σ_{i=1}^{k} −log P_θ(w_i^MASK | w^\MASK, w_{<i}^MASK)    (1)

θ represents the trainable parameters of the model that are optimized during the training process, and k is the number of tokens in W^MASK. w^\MASK denotes the masked input tokens where specific tokens are replaced by mask tokens. w^MASK refers to the original tokens that have been substituted with the mask tokens in the input sequence.

3.2 Phase 2. Coverage-Weighted Rewarding

We developed a method called Coverage-Weighted Rewarding (CWR) to guide the LLM-based mutator. The approach employs TF-IDF to emphasize less common coverage points, effectively focusing on discovering new areas of code coverage (② in Figure 2). TF-IDF prioritizes less frequent tokens by assigning them higher weights, while more common tokens receive lower weights. We apply this method to create a weighted coverage map, focusing on underexplored areas. Figure 3 illustrates the Coverage-Weighted Rewarding (CWR) process in action.

Example 1. Consider the scenario of executing a program depicted by the Control Flow Graph (CFG) in Figure 3 (a). There is a loop C, and branches D and E in the CFG. We executed the program with a test case generated by a fuzzer and obtained the coverage results.

Building on previous RL-based finetuning methods using Code-LLMs [27, 34, 55], we further extend the idea by applying a rewarding signal based on software output. Notably, errors in the JavaScript interpreter can be broadly grouped into syntax errors and semantic errors, which include reference, type, range, and URI errors. Given that W* is the concatenation of the masked sequence W^\MASK and the mask sequence W^MASK, the following returns can be deduced based on input to the target:

             ⎧ −1.0    if W* is a syntax error
    r(W*) =  ⎨ −0.5    if W* is a semantic error     (2)
             ⎩ +R_cov  if W* is passed

To enhance the LLM-based mutator's ability to discover diverse coverage, we introduce an additional rewarding signal as outlined in Eq. 2. Unlike traditional RL-based fuzzing techniques that use the ratio of current to total coverage, our approach assigns weights based on the frequency of reaching specific coverage points. This method involves adjusting the coverage map using the IDF weight map, calculating the weighted sum for each coverage data point, and normalizing these sums to derive scores.

Initially, we observed that the coverage map functions like Term Frequency (TF) by tracking how often certain coverage points are hit. However, in JavaScript interpreters, identical sections of code in a test case, such as repeated lines like 'a=1; a=1;', can trigger the same coverage area multiple times, leading to redundant counts. To address this, we underscore the importance of recognizing unique instances of coverage. Hence, we introduce TF^cov, defining it as a map that records each coverage point uniquely, reducing redundancy and emphasizing the significance of varied coverage:

    TF^cov = unique coverage map    (3)
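The three masking operations from Phase 1 (Section 3.1) can be sketched as follows. This is a simplified illustration on lists of string tokens with hypothetical helper names; the actual mutator operates on the LLM tokenizer's tokens.

```python
import random

MASK = "[MASK]"

def insert_mask(tokens, rng):
    # Insert: place a [MASK] token at a randomly selected position.
    pos = rng.randrange(len(tokens) + 1)
    return tokens[:pos] + [MASK] + tokens[pos:]

def overwrite_mask(tokens, rng):
    # Overwrite: replace an existing token with the [MASK] token.
    pos = rng.randrange(len(tokens))
    return tokens[:pos] + [MASK] + tokens[pos + 1:]

def splice_mask(stmts_a, stmts_b, rng):
    # Splice: replace a statement segment of one seed with a segment
    # from another seed, formatted as "[MASK] statement [MASK]".
    pos = rng.randrange(len(stmts_a))
    donor = rng.choice(stmts_b)
    return stmts_a[:pos] + [MASK, donor, MASK] + stmts_a[pos + 1:]

rng = random.Random(0)
print(overwrite_mask(["var", "a", "=", "1", ";"], rng))
```

The masked positions produced this way are then infilled by the span-based MLM mutator.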
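The rewarding signal of Eq. 2 and the unique-coverage binarization of Eq. 3 can be sketched as follows. This is a minimal illustration with hypothetical names; in practice TF^cov is built from AFL's coverage bitmap rather than a plain list.

```python
def reward(outcome: str, r_cov: float) -> float:
    """Eq. 2: map an interpreter outcome for W* to a scalar reward."""
    if outcome == "syntax_error":
        return -1.0
    if outcome == "semantic_error":  # reference/type/range/URI errors
        return -0.5
    return r_cov  # test case passed: coverage-weighted reward R_cov

def tf_cov(hit_counts: list[int]) -> list[int]:
    """Eq. 3: binarize raw hit counts so repeated hits of the same
    coverage point (e.g. 'a=1; a=1;') are counted only once."""
    return [1 if c > 0 else 0 for c in hit_counts]

print(tf_cov([1, 3, 4, 0, 0]))  # TC1 from Example 2 -> [1, 1, 1, 0, 0]
```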
Figure 3: Example of the Coverage-Weighted Rewarding process. (a) represents the control flow graph of an example program. (b) shows the TF^cov maps calculated based on the coverage areas traversed when each Test Case (TC) is executed in the program represented by (a). (c) is the IDF^cov calculated based on the TF^cov maps, which we refer to as the Coverage-based Weight Map. (d) describes the process of calculating R_cov. Additional TCs are element-wise multiplied by IDF^cov to become the weighted coverage map, called the Weighted TF^cov map, and the sum of this map is given to the model as the reward R_cov. Note that the calculation of R_cov in this figure is a simplified version of Eq. 5.

Example 2. Assume we obtained coverage from test case 1 (TC1) as follows: [1, 3, 4, 0, 0]. Applying Eq. 3, the coverage transforms into a binary map indicating whether a path was executed (1) or not (0), regardless of the number of times it was executed. Thus, the TF^cov map for TC1 is updated to: [1, 1, 1, 0, 0]. This process is applied identically to the other test cases as well ((b) in Figure 3).

We also define the coverage-based weight map IDF^cov using the TF^cov of each seed to weight which code paths are accessed frequently and which are not, as follows:

    IDF^cov = (1/√M) · log(N / (1 + DF^cov))    (4)

where N denotes the total number of seeds. DF^cov denotes the number of seeds that have achieved the specific coverage point. The weight map IDF^cov is obtained by taking the inverse of DF^cov, resulting in higher weights for less common coverage. The variable M denotes the overall size of the coverage map, which we utilize as a scale factor to adjust the weight value. The scaling ensures that the weights remain consistent regardless of the size M of the coverage map, thus preserving the stability of the training process.

Example 3. Consider in Figure 3 (b) that three TF^cov maps, [1, 1, 1, 0, 0], [1, 1, 0, 0, 1], and [1, 0, 0, 0, 0], were generated. Based on these TF^cov maps, we calculate the coverage-based weight map, IDF^cov. By applying Eq. 4, IDF^cov is computed as a weight map with values [-0.13, 0.0, 0.18, 0.49, 0.18] ((c) in Figure 3).

The reward is acquired by taking the element-wise product of TF^cov and IDF^cov to create the weighted coverage map, which is then summed and normalized to obtain:

    R_cov = σ(log(Σ_{i=1}^{M} tf_{i,t} · idf_{i,t−1}))    (5)

where t represents the current cycle, tf_{i,t} refers to an element in TF_t^cov, and idf_{i,t−1} refers to an element in IDF_{t−1}^cov at the previous time step before updating the weights. σ is a sigmoid function used to map R_cov to a value between 0 and 1.

R_cov is calculated only if the test case is free from any syntax or semantic problems. Our rewarding scheme incentivizes the LLM-based mutator to explore a wider range of coverage by providing higher payouts for test cases that achieve uncommon levels of coverage.

Example 4. Assume we generated two new test cases, TC4 and TC5, through fuzzing and executed the program to obtain their TF^cov maps. We create a weighted TF^cov by element-wise multiplying each TC with IDF^cov, and then calculate R_cov as the sum of the elements according to Eq. 5. In the case of TC4, because it executed mostly the code paths that TC1, TC2, and TC3 had executed, it received a penalty of -0.13, whereas TC5, having executed rare or previously unexecuted code paths, obtained a reward of 0.54 ((d) in Figure 3).

Update Weight Map with Momentum. Although we use the logarithmic function in Eq. 4 to alleviate dramatic changes in IDF^cov, instability can still occur due to abrupt shifts in the reward distribution. To address this issue, we employ momentum. Following each cycle, CovRL updates the IDF weight map. To mitigate dramatic changes in the reward distribution, we use momentum at a rate of α to incorporate the prior weight when recalculating the map. The updated weight map is as follows:

    IDF_t^cov = α · IDF_{t−1}^cov + (1 − α) · IDF_t^cov    (6)

where IDF_t^cov denotes the new weight map and IDF_{t−1}^cov the previous weight map.

3.3 Phase 3. CovRL-Based Finetuning

The fuzzing environment with mutation by mask can be conceptualized as a bandit environment for RL. In this environment, a masked sequence W^\MASK is provided as input (x), and the expected output is a mask sequence W^MASK (y). Inspired by previous studies [27, 34, 55], we finetune our model using the PPO algorithm [51], an actor-critic reinforcement learning method (③ in Figure 2). In our setting, it can be implemented by finetuning two LLMs in tandem: one LLM acts as the mutator (actor), while the other LLM serves as the rewarder (critic). We utilize a pretrained LLM to initialize the parameters of both the mutator and the rewarder. The rewarder is trained using Eq. 2 and plays a crucial role in training the mutator. For CovRL-based finetuning with PPO, we define the CovRL loss as follows:

    L_CovRL(θ) = −E_{(x,y)∼D_t}[ R(x, y)_t · log(π_θ^t(y|x) / π^{t−1}(y|x)) ]    (7)
Algorithm 1: Fuzzing with CovRL

Input: finetuning dataset D_T
 1  R_prev: Previous LLM-based rewarder
 2  R_cur: Current LLM-based rewarder
 3  M_prev: Previous LLM-based mutator
 4  M_cur: Current LLM-based mutator
 5  Function FuzzOne(seed_queue):
 6    for i = 1 to iter_cycle do
 7      seed ← SelectSeed(seed_queue)
 8      T ← Mutate(M_cur, seed)
 9      I_val, cov ← Execute(T)
10      if IsInteresting(T) then
11        T_interest ← T
12        seed_queue.append(T_interest)
13      R_cov ← calcReward(I_val, cov)
14      data ← (T_interest, R_cov)
15      D_T.append(data)
16    FinetuneCovRL(D_T)
17  Function FinetuneCovRL(D_T):
18    R_prev, M_prev ← R_cur, M_cur
19    R_cur ← FinetuneRewarder(R_prev, D_T)
20    M_cur ← FinetuneMutator(M_prev, R_cur, D_T)

where R(x, y)_t represents the reward of CovRL, and D_t refers to the finetuning dataset that has been collected up to time step t. π_θ^t(y|x) with parameters θ is the trainable RL policy for the current mutator, and π^{t−1}(y|x) represents the policy of the previous mutator.

To mitigate overoptimization and maintain the LLM-based mutator's mask prediction ability, we also use KL regularization. The reward after adding the KL regularization is:

    R(x, y)_t = r(W*) + log(π_θ^t(y|x) / π^{t−1}(y|x))    (8)

We finetune the mutator M_prev to generate M_cur (Line 20). For finetuning the mutator, we apply a reward or penalty to the model using the CovRL loss from Eq. 7.

4 Implementation

We implemented a prototype of CovRL-Fuzz using PyTorch v2.0.0, transformers v4.38.2, and AFL 2.52b [39].

Dataset. We collected data from regression test suites in several repositories including V8, JavaScriptCore, ChakraCore, JerryScript, Test262 [57], and js-vuln-db [23] as of December 2022. We then pre-processed the data for training data and seeds, resulting in a collection of 52K unique JavaScript files for our experiments.

Pre-Processing. We performed simple pre-processing on the regression test suites of the JavaScript interpreters mentioned above to remove comments, filter out grammatical errors, and simplify identifiers. We then used the processed data directly for training. The pre-processing was conducted utilizing the -m and -b options of the UglifyJS tool [40].

Training. We utilized the pretrained Code-LLM CodeT5+ (220M) [63] as both the rewarder and the mutator. For the process of CovRL-based finetuning, we trained the rewarder and mutator for 1 epoch per mutation cycle. We used a batch size of 256 and a learning rate of 1e-4. The optimization utilized the AdamW optimizer [36] together with a linear learning rate warmup technique. The LLM-based rewarder uses the encoder from CodeT5+ to predict the rewarding signal through a classification approach. We utilized the contrastive search method, incorporating a momentum factor α of 0.6 and a top-k setting of 32, to enhance the effectiveness of CovRL. In addition, we aligned the coverage map size with AFL's recommendations by setting the scaling factor M to the map size. This ensures that the instrumentation capacity is optimized. For moderate-sized software (approx. 10K lines), we employed a map size of 2^16. For larger software exceeding 50K lines, we used a map size of 2^17, striking a balance between granularity and performance. For detailed analysis related to the hyperparameters that we have set, such as finetuning epochs and α, please refer to the Appendix.
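To make the coverage-weighted rewarding of Section 3.2 concrete, the following sketch computes the weight map of Eq. 4, the simplified pre-sigmoid reward of Figure 3 (Eq. 5), and the momentum update of Eq. 6. This is a minimal illustration, not the CovRL-Fuzz implementation: the TC4/TC5 coverage maps are hypothetical values chosen to reproduce the numbers quoted in Examples 3 and 4.

```python
import math

def idf_cov(tf_maps):
    """Eq. 4: IDF^cov = (1/sqrt(M)) * log(N / (1 + DF^cov)), where N is
    the number of seeds, DF^cov counts how many seeds reached each
    coverage point, and M is the coverage-map size."""
    n, m = len(tf_maps), len(tf_maps[0])
    df = [sum(tc[i] for tc in tf_maps) for i in range(m)]
    return [math.log(n / (1 + df[i])) / math.sqrt(m) for i in range(m)]

def weighted_sum(tf, idf):
    """Simplified Eq. 5 (as in Figure 3): element-wise product of a
    unique-coverage map and the weight map, summed. The full reward
    additionally applies a log and a sigmoid."""
    return sum(t * w for t, w in zip(tf, idf))

def update_idf(prev, cur, alpha=0.6):
    """Eq. 6: momentum update blending the previous and new weight maps."""
    return [alpha * p + (1 - alpha) * c for p, c in zip(prev, cur)]

seeds = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 1], [1, 0, 0, 0, 0]]  # TC1-TC3
idf = idf_cov(seeds)
print([round(w, 2) for w in idf])                    # [-0.13, 0.0, 0.18, 0.49, 0.18]
print(round(weighted_sum([1, 1, 0, 0, 0], idf), 2))  # TC4 (assumed map) -> -0.13
print(round(weighted_sum([1, 0, 0, 1, 1], idf), 2))  # TC5 (assumed map) -> 0.54
```

With these inputs the sketch reproduces the weight map of Example 3 and the TC4 penalty / TC5 reward of Example 4.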
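The KL-regularized reward (Eq. 8) and the per-sample CovRL loss (Eq. 7) from Section 3.3 can be sketched numerically as follows, assuming scalar sequence log-probabilities under the current and previous mutator policies. All values are illustrative; the real finetuning computes these over batches with PPO.

```python
def kl_regularized_reward(r_w, logp_cur, logp_prev):
    """Eq. 8: R(x, y)_t = r(W*) + log(pi_t(y|x) / pi_{t-1}(y|x))."""
    return r_w + (logp_cur - logp_prev)

def covrl_loss(reward_t, logp_cur, logp_prev):
    """Eq. 7 for one (x, y) pair: -R(x, y)_t * log(pi_t / pi_{t-1});
    the expectation is a mean over the finetuning dataset D_t."""
    return -reward_t * (logp_cur - logp_prev)

# Illustrative values: the current policy assigns the sampled mask
# sequence slightly higher log-probability than the previous one.
r_w, logp_cur, logp_prev = 0.54, -1.2, -1.5
R = kl_regularized_reward(r_w, logp_cur, logp_prev)
print(round(R, 2), round(covrl_loss(R, logp_cur, logp_prev), 3))  # 0.84 -0.252
```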
Benchmarks. We tested CovRL-Fuzz on four JavaScript interpreters, using the latest versions as of January 2023: JavaScriptCore (2.38.1), ChakraCore (1.13.0.0-beta), V8 (11.4.73), and JerryScript (3.0.0). We also conducted additional experiments on QuickJS (2021-03-27), Jsish (3.5.0), escargot (bd95de3c), Espruino (2v20), and Hermes (0.12.0) for real-world bug detection experiments.

We built each target JavaScript interpreter with AddressSanitizer (ASAN) [52] to detect bugs related to abnormal memory access, and in debug mode to find bugs related to undefined behavior.

Fuzzing Campaign. For a fair evaluation, we used the same set of 100 valid seeds. For RQ1 and RQ2, we operated on 3 CPU cores, considering other fuzzing approaches. For RQ3, we used a single CPU core. For RQ4, we also used 3 CPU cores and conducted experiments with the four major JavaScript interpreters and five additional ones. To account for the randomness of fuzzing, we executed each fuzzer five times and then averaged the coverage results. Additionally, to ensure fairness in fuzzing, we measured the results of each experiment including the finetuning time through CovRL. The average finetuning time is 10 minutes, occurring every 2.5 hours.

Each tool was run on the four JavaScript interpreters with default configurations; details can be found in Table 2.

Metrics. We use three metrics for evaluation.

• Code Coverage represents the range of the software's code that has been executed. We adopt edge coverage from AFL's coverage bitmap, following the FairFuzz [30] and Evaluate-Fuzz-Testing [25] settings. We conducted a comparison of coverage in two categories: total and valid. Total refers to the coverage across all test cases, while valid refers to the coverage for valid test cases. We also employed the Mann-Whitney U-test [37] to assess statistical significance and verified that all p-values were less than 0.05.
• Error Rate measures the rate of syntax errors and semantic errors in the generated test cases. The metric provides insight into how effectively each method explores the core logic of the target software. For detailed analysis, semantic errors are categorized into type errors, reference errors, URI errors, and internal errors based on the ECMA standard [15]. It should be noted that while COMFORT [67] utilized jshint [24] for measurement, focusing their error rate on syntax errors, we used the JavaScript interpreters themselves, allowing us to measure the error rate including both syntax and semantic errors.
• Bug Detection counts the unique bugs each fuzzer finds, which is ultimately what the fuzzer is trying to achieve.

Table 2: Baseline fuzzers targeting JavaScript interpreters. CGF indicates the use of coverage-guided fuzzing, LLM denotes the usage of LLMs, and Mutation Level refers to the unit of mutation.

Fuzzer                 CGF  LLM  Mutation Level  Post Processing
Coverage-guided Baselines:
AFL (w/Dict) [39]      ✓         Bit/Byte
Superion [61]          ✓         Grammar
Token-Level AFL [50]   ✓         Token
LM Baselines:
Montage [29]                     Grammar         ✓
COMFORT [67]                ✓    Grammar         ✓
CovRL-Fuzz             ✓    ✓    Token

Table 3: Comparison with other JavaScript interpreter fuzzers listed in Table 2.

Target  Fuzzer                Error (%)  Coverage (Valid)  Coverage (Total)  Improv Ratio (%)
V8      AFL (w/Dict)          96.90%     29,929            33,531            134.79%
V8      Superion              77.35%     33,812            36,985            112.87%
V8      Token-Level AFL       84.10%     39,582            42,303            86.11%
V8      Montage               56.24%     38,856            40,155            96.06%
V8      Montage (w/o Import)  94.08%     33,487            36,338            116.66%
V8      COMFORT               79.66%     44,324            46,522            69.23%
V8      CovRL-Fuzz            48.68%     75,240            78,729            -
JSC     AFL (w/Dict)          74.42%     18,343            20,496            215.86%
JSC     Superion              72.02%     17,619            19,772            227.42%
JSC     Token-Level AFL       69.70%     52,385            53,719            20.51%
JSC     Montage               42.34%     55,511            56,861            13.85%
JSC     Montage (w/o Import)  93.72%     43,861            47,754            35.57%
JSC     COMFORT               79.64%     36,074            36,542            77.16%
JSC     CovRL-Fuzz            48.59%     61,137            64,738            -
Chakra  AFL (w/Dict)          81.32%     83,038            87,587            27.30%
Chakra  Superion              42.63%     92,314            94,237            18.32%
Chakra  Token-Level AFL       90.64%     92,621            95,677            16.54%
Chakra  Montage               82.21%     101,470           103,589           7.63%
Chakra  Montage (w/o Import)  94.72%     90,940            98,643            13.03%
Chakra  COMFORT               79.47%     81,171            83,142            34.11%
Chakra  CovRL-Fuzz            54.87%     105,121           111,498           -
Jerry   AFL (w/Dict)          77.32%     9,307             14,259            63.03%
Jerry   Superion              86.23%     8,944             15,061            54.35%
Jerry   Token-Level AFL       80.52%     14,361            17,152            35.53%
Jerry   Montage               95.55%     13,114            13,285            74.98%
Jerry   Montage (w/o Import)  95.34%     12,662            15,598            49.03%
Jerry   COMFORT               79.83%     12,268            14,026            65.74%
Jerry   CovRL-Fuzz            58.84%     20,844            23,246            -

5.2 RQ1. Comparison with Existing Fuzzers

To answer RQ1, we compare CovRL-Fuzz with fuzzers targeting JavaScript interpreters from open-source projects, focusing on those that are either coverage-guided [39, 50, 61] or LM-based [29, 65, 67], listed in Table 2, with a 24-hour timeout. In the case of Montage, it imports code from its test suite corpus, which might affect coverage by increasing the amount of executed code. As a result, we included a version of Montage (w/o Import) in our experimental study, which does not import the other test suites. In the case of COMFORT, we evaluated it solely as a black-box fuzzer, excluding the differential testing component. We focused on finding bugs in general JavaScript interpreters, so we did not include JIT fuzzers [3, 62] targeting JIT compilers embedded in some JavaScript engines.

Code Coverage. Table 3 depicts the valid and total coverage for each fuzzing technique. The results of our evaluation demonstrate that CovRL-Fuzz outperforms state-of-the-art JavaScript interpreter fuzzers. Our observation revealed that CovRL-Fuzz attained the highest coverage across all target interpreters, resulting in an average increase of 102.62%/98.40%/19.49%/57.11% in edge coverage. To emphasize the effectiveness of CovRL-Fuzz, we monitored the growth trend of edge coverage, depicted in Figure 4. In every experiment, CovRL-Fuzz consistently achieved the highest edge coverage more
Figure 4: Edge coverage achieved by CovRL-Fuzz and other JavaScript interpreter fuzzers over time. The solid line represents the average coverage, while the shaded region depicts the range between the lowest and highest values across five runs.
Table 5: Comparison with the state-of-the-art fuzzer using LLM-based mutation.

  Target   Fuzzer       Error (%)   Coverage (Valid)   Coverage (Total)   Improv Ratio (%)
  V8       Fuzz4All     48.56       57,524             60,153             -
           CovRL-Fuzz   48.68       75,240             78,729             30.88
  JSC      Fuzz4All     60.16       47,705             48,765             -
           CovRL-Fuzz   48.59       61,137             64,738             32.76
  Chakra   Fuzz4All     50.77       97,723             99,329             -
           CovRL-Fuzz   54.87       105,121            111,498            12.25
  Jerry    Fuzz4All     72.30       18,895             22,681             -
           CovRL-Fuzz   58.84       20,844             23,246             2.49

Table 6: Unique bugs discovered by CovRL-Fuzz and the compared LLM-based fuzzer.

  Target   Bug Type               Fuzz4All   CovRL-Fuzz
  JSC      Undefined Behavior                ✓
  JSC      Out-of-bounds Read                ✓
  Chakra   Undefined Behavior                ✓
  Chakra   Out of Memory                     ✓
  Chakra   Out of Memory                     ✓
  Jerry    Undefined Behavior     ✓          ✓
  Jerry    Memory Leak            ✓          ✓
  Jerry    Undefined Behavior                ✓
  Jerry    Undefined Behavior                ✓
  Jerry    Heap Buffer Overflow              ✓
  Jerry    Out of Memory                     ✓
  Jerry    Stack Overflow                    ✓
  Jerry    Undefined Behavior                ✓
  Jerry    Heap Buffer Overflow   ✓          ✓
  Total                           3          14

by ASAN for stack trace analysis to eliminate duplicate bugs. We also manually analyzed and categorized the results by bug types. Table 4 shows the number and types of unique bugs found by CovRL-Fuzz and the compared fuzzers. CovRL-Fuzz discovered the most unique bugs compared to the other fuzzers. In detail, CovRL-Fuzz found 14 unique bugs, and 9 of these bugs were exclusively detected by CovRL-Fuzz, including a stack overflow and a heap buffer overflow. These results highlight its effectiveness in bug detection. As observed in the experimental results, LM-based fuzzers, despite achieving higher coverage, tend to find fewer bugs, while heuristic fuzzers, although achieving lower coverage, generally find more bugs. Irrespective of this trend, CovRL-Fuzz demonstrated superior performance, effectively discovering the most bugs.

5.3 RQ2. Comparison of Fuzzers Using LLM-Based Mutation

To answer RQ2, we conducted 24-hour experiments with Fuzz4All, a state-of-the-art LLM-based fuzzing technique that applies mutation by prompting and is applicable to compilers. While TitanFuzz [12] and FuzzGPT [13] both employ LLM-based mutations, their use of handcrafted annotations, prompts, and mutation patterns is specifically designed for deep learning libraries. This specialization makes them challenging to adopt for JavaScript interpreters, which is why they were not included in our experimental subjects.

Table 5 presents the results of comparing the coverage and error rate between CovRL-Fuzz and Fuzz4All [65]. While the coverage and error rates of Fuzz4All and CovRL-Fuzz did not show significant differences, it is difficult to assert that the coverage improvement and error rate achieved by Fuzz4All have significantly contributed to finding bugs. Table 6 shows the bugs found by both Fuzz4All and CovRL-Fuzz. The bugs identified by Fuzz4All were only a small subset of those discovered by CovRL-Fuzz. These results mean that our method, which combines coverage-guided fuzzing and LLM-based mutation through CovRL, is more useful in bug detection than the state-of-the-art LLM-based fuzzing technique.

5.4 RQ3. Ablation Study

To answer RQ3, we conducted an ablation study on the two key components of CovRL-Fuzz, CovRL and CWR, based on coverage-guided fuzzing, with a 5-hour timeout. Table 7 shows the error rates and coverage according to different CovRL and CWR variants.

Impact of CovRL. To evaluate the impact of CovRL, we conducted a comparison with w/o LLM (TokenAFL [50]), which uses token-level heuristic mutation; LLM w/o CovRL, which simply applies LLM-based mutation to coverage-guided fuzzing; and LLM w/ CovRL, which represents our CovRL-Fuzz. The experimental results indicated that while LLM w/o CovRL successfully decreased the error rate in comparison to w/o LLM, it did not significantly improve, or even slightly decreased, coverage. In contrast, LLM w/ CovRL demonstrated coverage improvement for all targets. These findings suggest that applying LLM-based mutation with CovRL in coverage-guided fuzzing is effective.

Impact of CWR. To evaluate the impact of the CWR used in CovRL-Fuzz compared to other rewards, we included w/o RL and two additional reward processes in our experiments. For rewarding, we conducted a comparison between two types of rewards: Coverage Reward (CR) and Coverage-Rate Reward (CRR). CR is a simple binary rewarding process we designed, where a reward of 1 is given to test cases that find new coverage, and a penalty of 0 is assigned to those that do not. CRR, on the other hand, is a reward used in traditional RL-based fuzzing techniques [5, 32, 33], calculated as the ratio of current coverage to the total cumulative coverage.

In the experimental results, w/ CR and w/ CRR showed little to no increase in coverage compared to w/o RL. However, CovRL-Fuzz achieved the highest coverage, both valid and total, and exhibited a low error rate. These results suggest that the coverage-guided fuzzing technique using CWR effectively contributes to improving coverage with LLM-based mutation.

5.5 RQ4. Real-World Bugs

To answer RQ4, we evaluated the ability of CovRL-Fuzz to find real-world bugs during a specific period of fuzzing. Specifically, we investigated how many real-world bugs CovRL-Fuzz can find and whether it can discover previously unknown bugs. Thus, we evaluated whether CovRL-Fuzz can find real-world bugs over 2 weeks for each target. We tested the latest version of each target interpreter as of January 2023. Table 8 summarizes the bugs found by CovRL-Fuzz. A total of 58 bugs were identified, of which 50 were previously unknown bugs (15 have been registered as CVEs). Out of the discovered bugs, 45 were confirmed by developers, and 18 have been fixed. The CVEs we identified have an average risk score of 7.5 according to CVSS v3.1, with some reaching as high as 9.8.
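For concreteness, the two baseline rewards compared in the ablation (CR and CRR) can be sketched in a few lines of Python. The function names and the set-based coverage representation are illustrative, not taken from the paper's implementation:

```python
def cr_reward(edges: set, seen: set) -> int:
    # Coverage Reward (CR): binary signal -- 1 if the test case
    # reaches any previously unseen edge, otherwise 0.
    return 1 if edges - seen else 0

def crr_reward(edges: set, seen: set) -> float:
    # Coverage-Rate Reward (CRR): ratio of the current test case's
    # coverage to the total cumulative coverage, as in prior
    # RL-based fuzzers [5, 32, 33].
    total = seen | edges
    return len(edges) / len(total) if total else 0.0

# A corpus has already exercised edges {1, 2, 3}; a new test case
# covers edges {3, 4}.
print(cr_reward({3, 4}, {1, 2, 3}))   # 1 (edge 4 is new)
print(crr_reward({3, 4}, {1, 2, 3}))  # 0.5 (2 of 4 cumulative edges)
```

CovRL's CWR differs from both in that, as described earlier, it weights each edge by a TF-IDF-based rarity score rather than treating all edges equally, so test cases exercising rarely hit edges earn larger rewards.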
Table 7: The ablation study with each variant. Improv (%) refers to the improvement ratio compared to the baseline.
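The improvement ratio can be illustrated with a one-liner, assuming (as the Table 5 figures suggest) that it is the relative gain in total edge coverage over the baseline fuzzer:

```python
def improv_ratio(ours: int, baseline: int) -> float:
    # Improvement ratio (%): relative gain in total edge coverage
    # over the baseline fuzzer's total.
    return (ours - baseline) / baseline * 100

# Total-coverage pairs (CovRL-Fuzz vs. Fuzz4All) from Table 5:
print(round(improv_ratio(78_729, 60_153), 2))   # V8: 30.88
print(round(improv_ratio(64_738, 48_765), 2))   # JSC: 32.76
print(round(improv_ratio(111_498, 99_329), 2))  # Chakra: 12.25
print(round(improv_ratio(23_246, 22_681), 2))   # Jerry: 2.49
```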
coverage guidance, such as NEUZZ [54] and MTFuzz [53]. However, unlike our method, they use neural networks to identify mutation positions that influence branching behaviors. Consequently, NEUZZ and MTFuzz face the same limitations as traditional fuzzing when applied to software requiring strict grammar adherence, as they do not directly control the mutations themselves. In contrast, our approach leverages an LLM to mutate the seeds directly, effectively addressing these constraints. By allowing the LLM to handle mutations, we ensure that the generated inputs adhere to grammatical rules to a significant extent.

7 Discussion

We discuss five properties of CovRL-Fuzz in the following:

Time spent between fuzzing and finetuning. As mentioned in the experimental setup, we calculated the fuzzing time to ensure fairness in fuzzing, including the time spent on finetuning in the experimental results. On average, finetuning runs for 10 minutes every 2.5 hours of fuzzing. Despite including the finetuning time in the experiment, CovRL-Fuzz achieved high coverage while also decreasing the error rate.

Performance bottleneck. In our experimental environment, we observed that CovRL-Fuzz experienced approximately twice the delay compared to Token-Level AFL. While the degree of the discrepancy may vary depending on the target software, our results were generally consistent. Specifically, we found that 73% of the time required to generate a single test case was consumed by the LLM-based mutation process, potentially creating a performance bottleneck. Despite this, our experimental results demonstrated that even though the mutation speed was significantly slowed by the LLM-based approach, CovRL-Fuzz still achieved higher performance. This indicates the efficiency of our methodology, suggesting that the performance improvements brought by the LLM-based mutations outweigh the drawbacks of the increased processing time.

Catastrophic forgetting for finetuning. We mitigated catastrophic forgetting by applying two simple strategies: finetuning with a small learning rate and using some of the original seed data for finetuning. These measures have prevented forgetting in our experiments. However, we cannot entirely rule out the possibility of future occurrences. This issue is not unique to our work but is a general problem associated with finetuning LLMs. We believe that using the latest prevention techniques [26] could more effectively address this issue in the future.

Supporting other targets. The core idea of guiding the LLM-based mutator directly with coverage information through finetuning is actually language-agnostic, which suggests its applicability to other language interpreters or compilers. However, our focus was more on analyzing the suitability of our idea against existing techniques than on supporting various languages. Therefore, we conducted experiments only on JavaScript interpreters, which we deemed to have the most impact. Extending to other targets is left as future work.

Application to other LLMs. CovRL-Fuzz is not limited to a specific LLM but is a general strategy applicable to other open-source LLMs as well. Among the base models listed in Table 1, we chose CodeT5+ [63] for our experiments because it showed the most promising results despite being the smallest in size. Due to computational resource constraints, we conducted experiments using LLMs with a maximum size of 1B parameters. Even with these constraints, our technique is designed to be model-agnostic, and we believe it can be effectively applied to larger LLMs. We leave this exploration for future work.

8 Conclusion

We introduced CovRL-Fuzz, a novel LLM-based coverage-guided fuzzing framework that integrates coverage-guided reinforcement learning for the first time. Our approach directly integrates coverage feedback into LLM-based mutation, enhancing coverage-guided fuzzing to reduce syntax limitations while enabling effective testing that achieves broader and hidden path exploration of the JavaScript interpreter. Our evaluation results affirmed the superior efficacy of the CovRL-Fuzz methodology in comparison to existing fuzzing strategies. Impressively, it discovered 58 real-world security-related bugs, with 15 CVEs, in JavaScript interpreters; among these, 50 were previously unknown bugs. We believe that our methodology paves the way for future studies focused on harnessing LLMs with coverage feedback for software testing.

Acknowledgments

We deeply thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00439762, Developing Techniques for Analyzing and Assessing Vulnerabilities and Tools for Confidentiality Evaluation in Generative AI Models).

References

[1] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep Bugs with Grammars. In Proceedings 2019 Network and Distributed System Security Symposium. 15 pages. https://doi.org/10.14722/ndss.2019.23412
[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv:2108.07732 [cs.PL]
[3] Lukas Bernhard, Tobias Scharnowski, Moritz Schloegel, Tim Blazytko, and Thorsten Holz. 2022. JIT-picking: Differential fuzzing of JavaScript engines. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 351–364.
[4] Tim Blazytko, Matt Bishop, Cornelius Aschermann, Justin Cappos, Moritz Schlögel, Nadia Korshun, Ali Abbasi, Marco Schweighauser, Sebastian Schinzel, Sergej Schumilo, et al. 2019. GRIMOIRE: Synthesizing structure while fuzzing. In 28th USENIX Security Symposium (USENIX Security 19). 1985–2002.
[5] Konstantin Böttinger, Patrice Godefroid, and Rishabh Singh. 2018. Deep reinforcement fuzzing. In 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 116–122.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33. 1877–1901.
[7] Charlie Miller. 2008. Fuzz by number. https://www.ise.io/wp-content/uploads/2019/11/cmiller_cansecwest2008.pdf. Accessed: 2024-01-12.
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374 [cs.LG]
[9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv:2204.02311 [cs.CL]
[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv:2110.14168 [cs.LG]
[11] Chris Cummins, Pavlos Petoumenos, Alastair Murray, and Hugh Leather. 2018. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 95–105.
[12] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023).
[13] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023. Large language models are edge-case fuzzers: Testing deep learning libraries via FuzzGPT. arXiv:2304.02014 [cs.SE]
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs.CL]
[15] ECMA International. 1997. ECMAScript language specification. https://www.ecma-international.org/ecma-262/. Accessed: 2023-08-15.
[16] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1469–1481.
[17] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh International Conference on Learning Representations.
[18] Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&Fuzz: Machine learning for input fuzzing. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017). IEEE, 50–59. https://doi.org/10.1109/ASE.2017.8115618
[19] Google. 2017. Chromium Issue 729991. https://bugs.chromium.org/p/chromium/issues/detail?id=729991. Accessed: 2023-08-14.
[20] Samuel Groß, Simon Koch, Lukas Bernhard, Thorsten Holz, and Martin Johns. 2023. FUZZILLI: Fuzzing for JavaScript JIT Compiler Vulnerabilities. In NDSS.
[21] HyungSeok Han, DongHyeon Oh, and Sang Kil Cha. 2019. CodeAlchemist: Semantics-Aware Code Generation to Find Vulnerabilities in JavaScript Engines. In Proceedings 2019 Network and Distributed System Security Symposium. 15 pages. https://doi.org/10.14722/ndss.2019.23263
[22] Xiaoyu He, Xiaofei Xie, Yuekang Li, Jianwen Sun, Feng Li, Wei Zou, Yang Liu, Lei Yu, Jianhua Zhou, Wenchang Shi, et al. 2021. SoFi: Reflection-augmented fuzzing for JavaScript engines. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2229–2242. https://doi.org/10.1145/3460120.3484823
[23] Choongwoo Han. 2010. js-vuln-db. https://github.com/tunz/js-vuln-db. Accessed: 2023-08-15.
[24] JSHint. 2013. JSHint: A JavaScript Code Quality Tool. https://jshint.com/. Accessed: 2023-08-15.
[25] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2123–2138. https://doi.org/10.1145/3243734.3243804
[26] Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. [n. d.]. Understanding Catastrophic Forgetting in Language Models via Implicit Inference. In The Twelfth International Conference on Learning Representations.
[27] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328.
[28] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv:2309.00267 [cs.CL]
[29] Suyoung Lee, HyungSeok Han, Sang Kil Cha, and Sooel Son. 2020. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 2613–2630. https://www.usenix.org/system/files/sec20-lee-suyoung.pdf
[30] Caroline Lemieux and Koushik Sen. 2018. FairFuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 475–485.
[31] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[32] Xiaoting Li, Xiao Liu, Lingwei Chen, Rupesh Prajapati, and Dinghao Wu. 2022. ALPHAPROG: Reinforcement generation of valid programs for compiler fuzzing. In Proceedings of the AAAI Conference on Artificial Intelligence. 12559–12565.
[33] Xiaoting Li, Xiao Liu, Lingwei Chen, Rupesh Prajapati, and Dinghao Wu. 2022. FuzzBoost: Reinforcement Compiler Fuzzing. In Information and Communications Security: 24th International Conference, ICICS 2022, Canterbury, UK, September 5–8, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 359–375.
[34] Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023. RLTF: Reinforcement Learning from Unit Test Feedback. arXiv:2307.04349 [cs.AI]
[35] Xiao Liu, Xiaoting Li, Rupesh Prajapati, and Dinghao Wu. 2019. DeepFuzz: Automatic generation of syntax valid C programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence. 1044–1051. https://doi.org/10.1609/aaai.v33i01.33011044
[36] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[37] Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.
[38] Matt Molinyawe, Abdul-Aziz Hariri, and Jasiel Spelman. 2016. $hell on Earth: From Browser to System Compromise. In Black Hat USA.
[39] Michal Zalewski. 2013. AFL: American Fuzzy Lop. https://lcamtuf.coredump.cx/afl/. Accessed: 2023-08-15.
[40] Mihai Bazon. 2010. UglifyJS. https://github.com/mishoo/UglifyJS. Accessed: 2023-08-14.
[41] Barton P Miller, Lars Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33, 12 (Dec. 1990), 32–44. https://doi.org/10.1145/96267.96279
[42] OpenAI. 2024. GPT-4. https://openai.com/gpt-4. Accessed: 2024-03-22.
[43] OpenAI. 2024. OpenAI API. https://openai.com/index/openai-api. Accessed: 2024-07-12.
[44] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[45] Soyeon Park, Wen Xu, Insu Yun, Daehee Jang, and Taesoo Kim. 2020. Fuzzing JavaScript engines with aspect-preserving mutation. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1629–1642. https://doi.org/10.1109/SP40000.2020.00067
[46] Jibesh Patra and Michael Pradel. 2016. Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. Technical Report. TU Darmstadt, Department of Computer Science.
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[48] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (Jan. 2020), 5485–5551. https://dl.acm.org/doi/abs/10.5555/3455716.3455856
[49] Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, et al. 2023. Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback. arXiv:2306.00186 [cs.CL]
[50] Christopher Salls, Chani Jindal, Jake Corina, Christopher Kruegel, and Giovanni Vigna. 2021. Token-Level Fuzzing. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2795–2809. https://www.usenix.org/system/files/sec21-salls.pdf
[51] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG]
[52] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). 309–318.
[53] Dongdong She, Rahul Krishna, Lu Yan, Suman Jana, and Baishakhi Ray. 2020. MTFuzz: Fuzzing with a multi-task neural network. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 737–749.
[54] Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, and Suman Jana. 2019. NEUZZ: Efficient fuzzing with neural program smoothing. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 803–817. https://doi.org/10.1109/SP.2019.00052
[55] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. 2023. Execution-based code generation using deep reinforcement learning. arXiv:2301.13816 [cs.LG]
[56] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[57] Technical Committee 39, ECMA International. 2010. Test262. https://github.com/tc39/test262. Accessed: 2023-08-15.
[58] Spandan Veggalam, Sanjay Rawat, Istvan Haller, and Herbert Bos. 2016. IFuzzer: An evolutionary interpreter fuzzer using genetic programming. In Computer Security – ESORICS 2016: 21st European Symposium on Research in Computer Security, Heraklion, Greece, September 26-30, 2016, Proceedings, Part I. Springer, Cham, 581–601. https://doi.org/10.1007/978-3-319-45744-4_29
[59] W3Techs. 2024. Usage statistics of JavaScript as client-side programming language on websites. https://w3techs.com/technologies/details/cp-javascript. Accessed: 2024-01-17.
[60] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2017. Skyfire: Data-driven seed generation for fuzzing. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 579–594. https://doi.org/10.1109/SP.2017.23
[61] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2019. Superion: Grammar-aware greybox fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 724–735. https://doi.org/10.1109/ICSE.2019.00081
[62] Junjie Wang, Zhiyi Zhang, Shuang Liu, Xiaoning Du, and Junjie Chen. 2023. FuzzJIT: Oracle-Enhanced Fuzzing for JavaScript Engine JIT Compiler. In 32nd USENIX Security Symposium (USENIX Security 23). 1865–1882.
[63] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. arXiv:2305.07922 [cs.CL]
[64] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations.
[65] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2023. Universal fuzzing via large language models. arXiv preprint arXiv:2308.04748 (2023).
[66] Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: Revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 959–971.
[67] Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang, Xiaoyang Sun, Lizhong Bian, Haibo Wang, and Zheng Wang. 2021. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. ACM, 435–450. https://doi.org/10.1145/3453483.3454054
[68] Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 21–29.

Received 2024-04-12; accepted 2024-07-03