
Information and Software Technology 171 (2024) 107468

Contents lists available at ScienceDirect

Information and Software Technology


journal homepage: www.elsevier.com/locate/infsof

Effective test generation using pre-trained Large Language Models and mutation testing
Arghavan Moradi Dakhel ∗, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh,
Michel C. Desmarais
Department of Computer and Software Engineering, Polytechnique Montreal, Montreal, H3T 1J4, Quebec, Canada

ARTICLE INFO

Dataset link: https://github.com/ExpertiseModel/MuTAP

Keywords: Test generation, Large language model, Mutation testing

ABSTRACT

Context: One of the critical phases in the software development life cycle is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. While the code coverage of generated tests was usually assessed, the literature has acknowledged that the coverage is weakly correlated with the efficiency of tests in bug detection.

Objective: To improve over this limitation, in this paper, we introduce MuTAP (Mutation Test case generation using Augmented Prompt) for improving the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing.

Methods: Our goal is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks.

Results: Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully-automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation.

Conclusion: Our findings suggest that although LLMs can serve as a useful tool to generate test cases, they require specific post-processing steps to enhance the effectiveness of the generated test cases, which may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and testing corner cases in PUTs.

1. Introduction

Testing is an important yet expensive step in the software development lifecycle. Generating effective tests is a time-consuming and tedious task for developers. Unit tests are essential as they form the basis of the test automation pyramid [1,2]. Unit tests check if a function or a component works as expected in isolation. A unit test consists of two components: the first component is a set of test inputs for the Program Under Test (PUT), while the second component is the test oracle that indicates the intended behavior (output) of the PUT and is, therefore, capable of exposing bugs by verifying the correctness of the PUT on test inputs [3]. A test oracle can be in the format of assertions.

The automatic generation of unit tests is an important topic in Software Engineering (SE). It aims to reduce developers' testing efforts. Developing good-quality unit tests can prevent bugs in software products. There are different tools for automatically generating unit tests and test suites that are either based on random test generators [4,5], dynamic symbolic execution [6,7], or search-based approaches [8,9]. However, these techniques have some drawbacks and often generate tests with no assertion or too general assertions, or tests with assertions that cannot effectively assess the intended behavior of the PUT [10,11].

∗ Corresponding author.
E-mail addresses: [email protected] (A.M. Dakhel), [email protected] (A. Nikanjam), [email protected]
(V. Majdinasab), [email protected] (F. Khomh), [email protected] (M.C. Desmarais).

https://doi.org/10.1016/j.infsof.2024.107468
Received 31 August 2023; Received in revised form 2 April 2024; Accepted 3 April 2024
Available online 6 April 2024
0950-5849/© 2024 Elsevier B.V. All rights reserved.

Considering these shortcomings, researchers have recently been exploring the possibility of leveraging Machine Learning-based code synthesis techniques for generating better unit tests [12–16]. Specifically, these approaches have been exploring the potential of Large Language Models (LLMs) with the transformer architecture, such as Codex [17], which has achieved good performance in automatic program synthesis [17–21]. Among such efforts, Bareiß et al. [12] evaluate Codex's performance for test case generation by using a few-shot learning approach. Their findings on a limited set of 18 Java methods show that their approach is comparable to feedback-directed test generation. ATHENATEST [22] leveraged the BART transformer model [23] after fine-tuning it on a set of real Java functions and their corresponding tests. They also reported achieving comparable coverage to EvoSuite [9] after an assessment of five Java projects. Lemieux et al. [16] proposed CODAMOSA, which uses test cases generated by Codex, consisting of only the prefix (inputs) of a test case without any test oracles, to improve search-based testing techniques. Their reported results obtained on 27 Python projects show that CODAMOSA surpasses the baseline search-based technique, Pynguin [24], and Codex in terms of code coverage. Although the preliminary results of these studies and others [13–15,25] are promising, none of these studies attempted to improve the bug detection capability of generated tests. Moreover, it has been acknowledged in the literature that while test coverage is a useful metric for evaluating the quality of tests, it is weakly correlated with the efficiency of tests in bug detection [26–28].

Mutation Testing (MT) is a white box testing technique to assess the capability of a test in revealing bugs. MT has been widely studied and successfully used in SE to assess the effectiveness of test cases [29,30]. MT involves injecting artificial changes based on real faults into a PUT, resulting in mutated versions of the PUT known as mutants. The more mutants a test case kills, the more effective it is in identifying real bugs. The surviving mutants highlight the weaknesses of a test case, and the ultimate goal is for the test cases to be able to detect all mutants, i.e., kill them. Mutants are not only useful for assessing the effectiveness of test cases but can also be used as a means for designing more effective test cases [9].

In this paper, we present the first study that leverages MT to enhance and evaluate the effectiveness of test cases generated by LLMs for Python programs in terms of fault revealing capabilities. Our approach aims to optimize test cases for bug detection rather than code coverage. Our proposed technique, MuTAP, employs an LLM as its main Component (LLMC) and starts by feeding a prompt to the LLMC in order to generate test cases. The initial prompt includes the PUT and instructions for generating test cases by using zero-shot and few-shot learning. Next, MuTAP assesses the syntax of the generated test cases and re-prompts its LLMC to rectify any detected syntax issues. After fixing syntax errors, MuTAP proceeds to appraise the intended behavior of the generated test cases. This is achieved by comparing the output of the test oracles on certain test inputs to the expected return values of the PUT using the same test inputs, thereby correcting any unintended behavior in the test oracles.

Subsequently, MuTAP applies MT to examine the effectiveness of test cases in killing mutants of PUTs. As surviving mutants highlight the limitations of the generated test cases, MuTAP re-prompts its LLMC to generate new test cases for the PUTs that have surviving mutants by augmenting the initial prompt with both the initial test cases and the surviving mutants. MuTAP halts the process of augmenting the initial prompt when either the final test cases can effectively detect all mutants or there are no surviving mutants left that have not already been used to augment the initial prompt.

We employ two types of LLMs as the LLMC of MuTAP: Codex, which is designed for code-related tasks, and llama-2-chat, which is optimized for dialog use cases and versatile enough to accommodate a range of tasks, including programming. We evaluate MuTAP on both synthetic bugs of 164 PUTs [17] and 1710 buggy programs collected from a Python bug repairing benchmark [31].

Our results indicate that our proposed approach generates effective test cases with an average Mutation Score (MS, the ratio of killed mutants to the total number of mutants) of 93.57%, outperforming both Pynguin (a state-of-the-art fully-automated test generation tool) and the conventional LLM-based zero-shot/few-shot learning techniques. Furthermore, our approach detects up to 468 (28%) more buggy code snippets written by humans than other comparable methods in our evaluation. Remarkably, it identifies 79 (17%) human-written buggy code snippets that none of the other techniques are able to detect. To summarize, this paper makes the following contributions:

• We present the first study on leveraging MT to generate test cases with LLMs.
• We propose a prompt-based learning technique to improve the effectiveness of test cases by augmenting the prompts with both initial test cases and surviving mutants of a PUT.
• We assess the effectiveness of generated tests in detecting bugs in real and synthetic buggy versions of PUTs.
• We make the proposed technique, MuTAP, publicly available online [32] for other researchers/practitioners to replicate or build upon our work.

The rest of this paper is organized as follows. Section 2 introduces a motivating example. Section 3 describes the different steps of our approach. We present our experimental setup, research questions, and experimental results in Section 4. We discuss our findings and the potential use cases of our approach in Section 5. Threats to the validity of our results are reviewed in Section 6. We briefly review the related works in Section 7. Finally, we conclude the paper in Section 8, highlighting some avenues for future works.

2. Motivating example

In this section, we present an example in Fig. 1 showing how our proposed approach generates effective test cases. Suppose we have 10 mutants {SM0, SM1, …, SM9} for the Program Under Test, PUT, in Fig. 1. The PUT is a function that takes a certain expected input and produces a desired output upon applying its functionality. The goal of our proposed technique, MuTAP (Mutation Test case generation using Augmented Prompt), is to generate effective test cases for the PUT in a way that ensures killing the maximum number of mutants.

The function any_int() in Fig. 1 receives 3 inputs and returns True if all 3 inputs are integers and one of the inputs is equal to the sum of the two others. Otherwise, it returns False. In the first step, MuTAP uses the initial prompt, ①, to run a query on the LLM Component (LLMC) and generates initial test cases for this Program Under Test (PUT). Component ② in Fig. 1 shows the initial test cases generated by LLMC after the refining step. We named it Initial Unit Test, IUT. In Section 3, we discuss the refining step (syntax and intended behavior fixing) of our approach in detail. The IUT kills 6 out of 10 mutants of the PUT. The 4 remaining mutants reveal the weaknesses of the generated test, meaning that the IUT needs new test cases with assertions to kill the injected bugs in those 4 mutants.

To address this limitation and generate more effective test cases, MuTAP augments the initial prompt with two new components: the first one is the response of the model to the initial prompt after fixing its syntax and intended behavior, the IUT, and the second one is the mutant component, ③ in Fig. 1. MuTAP initiates the construction of the mutant component by using the first "Survived Mutant" of the PUT, which we refer to as SM0. The red highlight in SM0 shows the injected bug in the PUT. The injected bug changes the second statement in the condition of the inner if in the PUT in a way that the sum of the first and last input of function any_int() is no longer equal to the middle input. Since there is no test case in the IUT to verify that its middle input, y, is equal to the sum of its first and last inputs, x and z, the IUT is not able to kill this mutant.

MuTAP uses the concatenation of these three components, ①, ②, and ③, to re-prompt the LLMC. Component ④ in Fig. 1 shows the new set of test cases generated by LLMC appended to the IUT after the refining step. We named it Augmented Unit Test, AUT0. This unit test has two more assertions compared to the IUT, and one of them, highlighted in red, kills the mutant SM0.


Fig. 1. Different steps of MuTAP on a PUT. ② is a set of test cases generated by the initial prompt ① for the PUT, and ④ is a set of test cases obtained after augmenting the initial prompt with the surviving mutant SM0. ③′ shows the mutant component after updating it with another surviving mutant of the PUT that we named SM1.
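For readers who do not have Fig. 1 at hand, the PUT of the motivating example is the HumanEval task any_int(); the following is a plausible reconstruction of its reference solution (the exact code used in the paper is the one shown in Fig. 1):

def any_int(x, y, z):
    # True only when all three inputs are integers and one of them
    # equals the sum of the other two; otherwise False.
    if isinstance(x, int) and isinstance(y, int) and isinstance(z, int):
        return (x + y == z) or (x + z == y) or (y + z == x)
    return False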

MuTAP applies AUT0 to the mutants of the PUT again. If there are any remaining surviving mutants, MuTAP iterates the augmentation process by updating the mutant component with another surviving mutant, provided it has not been used to augment the prompt previously. MuTAP utilizes each mutant individually because sometimes new test cases that address one mutant can also kill the remaining surviving mutants. Moreover, due to the limited length of the prompt and the non-constant length of mutants, applying each surviving mutant separately is a more practical approach. ③′ in Fig. 1 shows an example of how the mutant component is updated using another surviving mutant. We call this mutant SM1. Unit test ④′ shows a new set of test cases including one assertion that detects SM1. MuTAP iterates the augmentation process until either the final test cases can kill all the mutants, or there are no surviving mutants left that have not already been used to augment the initial prompt.

The final test cases generated by our proposed technique, MuTAP, kill 9 out of 10 mutants of this example PUT, increasing its MS from 60% (6 out of 10) to 90% (9 out of 10). This result can be compared to Pynguin [24], the state-of-the-art automatic test generation tool for the Python programming language, which generates a test case for this PUT with only a 40% MS. This tool uses a search-based generation technique [33] and randomly mutates the test values within a test case to generate new test cases. The random nature of this method results in a low chance of generating a new test case that can kill the surviving mutants of the PUT.

Algorithm 1: MuTAP
Input: PUT, LLMC, initial_prompt_type
/* INS1, INS2, INS3, INS4 and INS_fix are global variables holding natural language instructions for the prompts */
Output: FUT // Final Unit Test
// Initial Prompt
1 initial_prompt ← GenerateInitialPrompt(PUT, initial_prompt_type)
2 raw_IUT ← LLMC(initial_prompt)
// Syntax Fixer and Intended Behavior Repair
3 IUT ← Refining(raw_IUT, PUT)
// Mutation Testing
4 mutants ← MutPy(PUT)
5 MS, surviving_mutant ← MutationTesting(IUT, mutants)
6 while MS < 100% and surviving_mutant ≠ {} do
7     SM ← surviving_mutant.pop()
      // Prompt Augmentation
8     augmented_prompt ← AugmentingPrompt(initial_prompt, IUT, SM)
9     raw_AUT ← LLMC(augmented_prompt)
10    fixed_AUT ← Refining(raw_AUT, PUT)
11    IUT ← IUT.append(fixed_AUT)
12    MS, augmnt_surviving_mutant ← MutationTesting(IUT, mutants)
13    surviving_mutant ← surviving_mutant ∩ augmnt_surviving_mutant
14 end
// Oracle Minimization
15 FUT ← OracleMinimization(IUT, mutants)
16 return FUT

3. Approach

In this section, we discuss the different steps of our approach. Fig. 2 shows an overview of our proposed approach and Algorithm 1 presents the sequence of its different steps. MuTAP initiates the process by invoking an initial prompt on its LLMC to generate test cases for a specific PUT. Subsequently, it applies refining steps to repair the syntax and intended behavior of the generated test cases. Once the test cases are corrected, MuTAP proceeds to the MT step by generating various mutants for the PUT and calculating the MS. If there are any surviving mutants, leading to MS < 100%, MuTAP uses those surviving mutants in different iterations to augment the initial prompt. It then re-prompts its LLMC with the augmented prompt in each iteration of the augmentation step to generate new test cases. At the end of each iteration, MuTAP recalculates the MS on the same set of mutants to determine whether the new test cases eliminate all surviving mutants or whether there are remaining surviving mutants, prompting further iterations. If MS=100% or all the surviving mutants have already been incorporated into the augmentation step, MuTAP proceeds to the Oracle Minimization step to eliminate redundant test cases.
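Read as running code, Algorithm 1 is a loop of prompting, refining, and mutation testing. The following minimal Python sketch mirrors that loop; the helper names (generate_initial_prompt, refine, mutation_testing, augment_prompt, minimize_oracles) are placeholders with the same roles as in the algorithm, not the released implementation:

def mutap(put_source, llmc, prompt_type, mutants):
    # Build the zero-shot or few-shot initial prompt and query the LLM component.
    prompt = generate_initial_prompt(put_source, prompt_type)
    unit_test = refine(llmc(prompt), put_source)           # syntax + intended-behavior repair
    score, surviving = mutation_testing(unit_test, mutants)
    used = set()
    while score < 100 and (surviving - used):
        mutant = (surviving - used).pop()
        used.add(mutant)
        # Augment the initial prompt with the current test cases and one surviving mutant.
        augmented = augment_prompt(prompt, unit_test, mutant)
        unit_test += refine(llmc(augmented), put_source)
        score, surviving = mutation_testing(unit_test, mutants)
    # Drop assertions that do not improve the mutation score.
    return minimize_oracles(unit_test, mutants)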


Fig. 2. The proposed methodology for generating and evaluating tests using LLMs.

3.1. Initial prompt

LLMs are capable of performing the tasks that they have already been trained for. Fine-tuning LLMs to perform a new task is computationally expensive. Also, there are LLMs, such as Codex, that show very good performance in generating code but, since they are closed-source, fine-tuning them for a new task is impossible.

Prompt-based learning [34,35] is an effective technique to adapt LLMs to new tasks. A prompt is a combination of natural language and/or programming language context and is used as an input to LLMs. There are studies showing that putting a natural language instruction as a hint (zero-shot learning) [15,16,36] or several examples (few-shot learning) [37–39] in the prompt increases the capability of LLMs in performing a new task.

MuTAP employs both zero-shot and few-shot learning to build the initial prompt and calls LLMC on them separately. This step is shown in Algorithm 2. In more detail, we employ zero-shot and few-shot learning as follows:

Algorithm 2: GenerateInitialPrompt
Input: PUT, initial_prompt_type
Output: initial_prompt
1 if initial_prompt_type == "zero-shot" then
2     initial_prompt ← CONCAT(INS1, PUT, INS2)
3 else
4     if initial_prompt_type == "few-shot" then
5         initial_prompt ← CONCAT(pair(M, UT), PUT) // M: Method, UT: Unit Test
6     end
7 end
8 return initial_prompt

• zero-shot: The initial prompt generated by the zero-shot technique contains three units [16]. The component indicated by ① in Fig. 1 shows an example of such a prompt. The first unit in this component is an instruction in natural language named INS1 and it clarifies the task by asking: "Generate test cases for the following code". The second unit is the Program Under Test (PUT) and the last unit is a set of instructions in a programming language named INS2. INS2 acts as a hint to indicate the desired output for LLMC. The concatenation of (INS1, PUT, INS2) builds the initial prompt for zero-shot learning (Line 2 in Algorithm 2).

• few-shot: Prompt generation based on few-shot learning uses a chain of inputs and expected outputs related to the downstream task. There are different approaches for presenting the pairs of inputs and outputs in the prompt. We follow a recent approach to build the initial prompt with the few-shot strategy in MuTAP [39]. Considering the maximum possible length of tokens for LLMC (4k tokens in our study), the few-shot prompt includes two different demonstrative examples of a Method (M), as an example PUT, and a Unit Test (UT) as follows (Line 5 in Algorithm 2):

<code>M_1</code>\n<test>UT_1</test>\n
<code>M_2</code>\n<test>UT_2</test>\n
<code>PUT_i</code>\n<test>

The methods provided in the illustrative examples within the few-shot prompt serve as instances of a PUT and a UT containing its respective test cases. There is no natural language description of the PUT in the initial prompt since such descriptions may not always be available, and MuTAP relies on the ability of LLMC to synthesize code context. MuTAP calls the initial prompt, zero-shot or few-shot, on LLMC and then passes the inferred output to the next step (Line 2 in Algorithm 1).

3.2. Refining

In this section, we describe the process of refining the generated test cases in MuTAP, which includes fixing syntactical errors and intended behavior repair. The details are shown in Algorithm 3.

3.2.1. Syntax Fixer

The test cases generated by LLMC may have syntax errors (missing brackets, uncompleted lines, etc.). Listing 1 shows the unit test generated by the LLMC, before the refining step, for the PUT in the motivating example from Fig. 1. The last line in this example has a syntax error. Since MuTAP needs to execute the test function for the MT and prompt augmentation steps, samples with syntax errors cannot be used. However, sometimes a small change in the output of LLMC can fix the syntactic error and convert it into an executable test case, for example completing the last line in Listing 1 or removing it.

def test():
    assert any_int(3, 2, 5) == True
    assert any_int(3, 2, 2) == True
    assert any_int(5.2, -2.2, 2) == True
    assert any_int(1, 2, 4 ==

Listing 1: The unit test before the refining step for the PUT in the motivating example presented in Fig. 1.

MuTAP starts this step by first compiling the test function; if any syntax error arises, MuTAP uses the capability of its LLMC to fix syntax errors, similar to other studies [35,40]. To do so, LLMC is called on a new prompt to fix the syntax error in its own output (Procedure SyntaxFixer in Algorithm 3). The syntax fixing prompt consists of two parts. The first part is a natural language instruction, INS_fix, "Fix the syntax errors in the following code snippet", and the second part is the test function generated by LLMC on the initial prompt (Lines 7–8 in Algorithm 3). If the syntax error persists even after re-prompting the LLMC, MuTAP employs the Python parser to identify the erroneous line. It then retains the lines preceding the problematic line, ensuring they remain free of syntax errors (Line 13 in Algorithm 3).

3.2.2. Intended behavior repair

Based on the initial prompt, LLMC generates different test cases that are serialized as assertion oracles by calling the PUT on certain inputs and comparing the returned output of the PUT with the expected output, or ground truth, for example {assert any_int(3, 2, 5) == True}. However, it is possible for the LLMC to generate test cases that assert a wrong test output. That is, for some test cases, LLMC does not generate the expected return output of the PUT. The lack of a natural language description of the PUT in the initial prompt could potentially lead to the generation of test cases that do not accurately reflect the intended behavior of the method. In Listing 1, the first test case asserts a correct test output, whereas in the second and third test cases the test output is incorrect.

An assertion with a wrong test output may fail on mutants, not because it detects the bug, but because of the unintended behavior of the assertion. These failures cause confusion about the effectiveness of test cases. So, this step of MuTAP aims at repairing the intended behavior of the assertion oracles in the test cases (Procedure IntendedBehaviorFixer in Algorithm 3).

For each assertion in the test, MuTAP runs the PUT over the test inputs and compares the returned output of the PUT with the asserted output. If the returned output of the PUT is the same as the asserted output in the oracle, then MuTAP considers it an assertion oracle with the correct intended behavior. Otherwise, it repairs those assertions by replacing the asserted output with the expected output of the PUT (Lines 22–27 in Algorithm 3). MuTAP omits those assertions for which the input types fail on the PUT, for example, if the PUT expected a list of integers but the test input is a string. The final outcome of this step is named the Initial Unit Test (IUT), which is the set of test cases generated by LLMC after refinement, as shown by ② in Fig. 1 for the unit test in Listing 1.
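As a concrete illustration of the two refining sub-steps (truncating at the first syntactically broken line, and repairing asserted outputs by executing the PUT), the following self-contained Python sketch shows one possible way to implement them. The helper names and the simple positional-argument assertion format are simplifying assumptions, not the exact MuTAP implementation:

import ast

def truncate_to_valid_prefix(test_src: str) -> str:
    # Drop trailing lines until the test function parses (mirrors Procedure SyntaxCheck).
    lines = test_src.splitlines()
    while lines:
        try:
            ast.parse("\n".join(lines))
            return "\n".join(lines)
        except SyntaxError as err:
            cut = (err.lineno or len(lines)) - 1
            # Keep only the lines before the offending one, always making progress.
            lines = lines[:cut] if cut < len(lines) else lines[:-1]
    return ""

def repair_assertions(put, test_inputs):
    # Re-derive the expected outputs by running the PUT (mirrors IntendedBehaviorFixer).
    repaired = []
    for args in test_inputs:
        try:
            expected = put(*args)
        except TypeError:
            continue  # drop assertions whose input types fail on the PUT
        repaired.append(f"assert {put.__name__}({', '.join(map(repr, args))}) == {expected!r}")
    return repaired

# Example with the motivating-example PUT:
# repair_assertions(any_int, [(3, 2, 5), (3, 2, 2), (5.2, -2.2, 2)])
# -> ['assert any_int(3, 2, 5) == True',
#     'assert any_int(3, 2, 2) == False',
#     'assert any_int(5.2, -2.2, 2) == False']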


Algorithm 3: Refining
Input: raw_IUT, PUT
Output: IUT // The Initial Unit Test after refining
1 syntax_fixed_IUT ← SyntaxFixer(raw_IUT)
2 IUT ← IntendedBehaviorFixer(syntax_fixed_IUT, PUT)
3 return IUT
4
5 Procedure SyntaxFixer(raw_IUT)
6     if not AST.parse(raw_IUT) then
7         syntax_fixed_prompt ← CONCAT(INS_fix, raw_IUT)
8         syntax_fixed_IUT ← LLMC(syntax_fixed_prompt)
9     end
10    syntax_fixed_IUT ← SyntaxCheck(syntax_fixed_IUT)
11    return syntax_fixed_IUT
12
13 Procedure SyntaxCheck(syntax_fixed_IUT)
14    if AST.parse(syntax_fixed_IUT) then
15        return syntax_fixed_IUT
16    else
17        return SyntaxCheck(syntax_fixed_IUT[:error_line])
18    end
19
20 Procedure IntendedBehaviorFixer(syntax_fixed_IUT, PUT)
21    fixed_IUT ← {}
22    foreach test_case ∈ syntax_fixed_IUT do
23        expected_output ← PUT(test_case.input)
24        if expected_output ≠ test_case.output then
25            test_case.output ← expected_output
26        fixed_IUT.append(test_case)
27    end
28    return fixed_IUT

3.3. Mutation Testing (MT)

MT assesses the quality and effectiveness of test cases. Mutants are built by injecting artificial bugs into the PUT to simulate defects. If the test cases fail on a mutant, we consider it a killed mutant; otherwise, it survives, meaning that the test cases within the unit test are not able to detect it. The presence of surviving mutants highlights the shortcomings of test cases, suggesting the need to either add a new test case or improve an existing one. The Mutation Score (MS) represents the effectiveness of test cases by calculating the ratio of killed mutants out of all mutants of a PUT.

Algorithm 4 presents the details of this step. Similar to other studies [41], MuTAP uses MutPy [42] to generate different mutants for each PUT and calculate the MS (Lines 3–7 in Algorithm 4). Executing test cases on each mutant involves performing some preliminary setup. For this purpose, MuTAP uses Python's "setuptools.find_packages" to locate and install the required packages, such as "math", "NumPy", "pandas", "pytest", and others. Additionally, MuTAP implements setup functions that are responsible for creating temporary directories, which are utilized during the execution of the test cases on the mutants. After executing the test cases on the mutants and calculating the MS, MuTAP properly tears down the setup by removing the temporary directories.

Algorithm 4: MutationTesting
Input: UT, mutants
Output: MS, surviving_mutant
1 surviving_mutant ← {}
2 foreach mut ∈ mutants do
3     if exec(mut, UT) then
4         surviving_mutant.append(mut)
5     end
6 end
7 MS ← (#(mutants) − #(surviving_mutant)) / #(mutants)
8 return MS, surviving_mutant

As shown on Lines 5–9 in Algorithm 1, if the MS of a PUT reaches 100%, MuTAP passes the test cases to the oracle minimization step (Section 3.5 and Line 16 in Algorithm 1); otherwise, it collects the list of surviving mutants and transfers them to the prompt augmentation step (Section 3.4).

3.4. Prompt augmentation

Algorithm 5 shows the details of this step. If there is any surviving mutant from the previous step, MuTAP augments the initial prompt, zero-shot or few-shot, by adding four new components (Line 1 in Algorithm 5). The first component is the IUT, the initial unit test generated by LLMC after refinement. The second component is an instruction in natural language named INS3 that clarifies the shortcoming of the IUT: "The test function, test(), cannot detect the fault in the following code". The third component is one of the surviving mutants of the PUT, named SM. The last component, INS4, is an instruction in natural and programming language: the natural language context clarifies the task by asking to "Provide a new test case to detect the fault in prior code" and the programming language context acts only as a hint to guide LLMC in generating the output. An example is shown by ③ in Fig. 1.

Algorithm 5: AugmentingPrompt
Input: initial_prompt, UT, SM
Output: augmented_prompt
1 augmented_prompt ← CONCAT(initial_prompt, UT, INS3, SM, INS4)
2 return augmented_prompt

MuTAP re-prompts LLMC and repeats the refining step on the generated output. Then, it appends the newly generated test cases to the IUT, which we call the augmented unit test (Lines 10–12 in Algorithm 1). The augmented unit test and the mutants of the PUT are passed to the MT step (Line 13 in Algorithm 1). MuTAP recursively repeats prompt augmentation until either the final test cases kill all the mutants (MS = 100%) or there is no surviving mutant left that has not been used in the augmentation process (Line 14 in Algorithm 1). As an example of updating the mutant component in Fig. 1, ③ is changed to ③′ by replacing SM0 with SM1. ④′ indicates the test cases generated by LLMC after iterating the process on the next surviving mutant.

3.5. Oracle Minimization

The test cases generated by the LLMC usually contain redundant assertions. Also, the augmentation process may add more redundant assertions to the final unit test. Presenting all of them (with redundancy) as the final output can cause confusion for developers. In the final step, similar to previous tools that generate mutation-driven test oracles [9,43], MuTAP minimizes the number of assertions by utilizing a greedy technique to eliminate the redundant assertions that do not improve the MS. This step is presented in Algorithm 6. MuTAP starts by tracking the number of mutants that each assertion kills and then chooses the test case containing the assertion that kills the maximum number of mutants. This process is then repeated by adding the test cases containing the next assertions that detect the most mutants (Lines 4–10 in Algorithm 6). If adding a new assertion increases the MS, MuTAP keeps the test case and its assertion. Otherwise, the test case is discarded as redundant.

Algorithm 6: OracleMinimization
Input: UT, mutants
Output: FUT
1 MS_old ← 0
2 FUT ← {}
3 sorted_UT ← sort(UT) // sort test cases in UT in descending order of MS
4 foreach test_case ∈ sorted_UT do
5     FUT.append(test_case)
6     MS, surviving_mutant ← MutationTesting(FUT, mutants)
7     if MS > MS_old then
8         MS_old ← MS
9     else
10        FUT.delete(test_case)
11    end
12 end
13 return FUT
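A compact Python sketch of this greedy minimization is shown below. It assumes each candidate test case is paired with the set of mutant identifiers it kills (how those sets are obtained is left to the mutation-testing step); it illustrates the strategy of Algorithm 6 rather than MuTAP's exact code:

def minimize_oracles(candidates):
    # candidates: list of (test_case, killed_mutant_ids) pairs.
    # Consider the strongest assertions first (descending number of killed mutants).
    ordered = sorted(candidates, key=lambda c: len(c[1]), reverse=True)
    kept, covered = [], set()
    for test_case, killed in ordered:
        # Keep a test case only if it kills at least one not-yet-covered mutant,
        # i.e., only if it actually raises the mutation score.
        if killed - covered:
            kept.append(test_case)
            covered |= killed
    return kept

# Example: three assertions with overlapping kill sets; the second one is redundant.
final = minimize_oracles([("t1", {1, 2, 3}), ("t2", {2, 3}), ("t3", {4})])
# final == ["t1", "t3"]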


4. Evaluation

In this section, we describe the evaluations we designed and conducted to investigate the following research questions:

RQ1 How effective are test cases generated by MuTAP in comparison to test cases generated by automatic test generation tools?
In this RQ, we aim to assess test cases in revealing bugs. We compare MuTAP-generated test cases with a state-of-the-art automatic test generation tool and also with the output of LLMC before the refining and augmenting steps as baselines. We evaluate the effectiveness of test cases on synthetic and real buggy code.

RQ2 How do the different parts of MuTAP perform?
In this RQ, we aim to assess the individual impact of the different components of our proposed method, MuTAP, on the improvement of syntax fixing, intended behavior repair, and test case effectiveness.

RQ3 What is the performance of MuTAP for each mutation type?
Since we utilize various types of mutants in the prompt augmentation step, it is worth exploring whether MuTAP performs differently for each type. Through this RQ, we aim to ascertain if the improvement is limited to a specific mutant type or if it is broadly observed across various mutant types.

4.1. Experimental setup

In this section, we present our experimental setup. We describe the automatic test generation tool used to compare our results, clarify the LLMC of MuTAP and its setup, and indicate the baselines and datasets used in our experiments.

We conducted the experiments on the Cedar cluster of Compute Canada, which offers a 32-core CPU, 1 TB of storage, and one v100l GPU with 32 GB of GPU memory, and on a system running Linux 5.15.0-69-generic with an AMD FX(tm)-6300 six-core CPU, 512 GB of storage, and 16 GB of memory.

4.1.1. Experimental settings

We conducted an experiment for each PUT. In each experiment, we repeated the runs 10 times for MuTAP with its different configurations (e.g., Codex and zero-shot) and also for Pynguin. The MS reported is the median of the 10 runs. When selecting unit test candidates from the output generated by LLMC in the different components of MuTAP, we considered two criteria: the candidate should contain both the keyword assert and the function name of the PUT. If, after 10 runs, LLMC failed to generate an output containing these two keywords, we categorized the PUT as problematic, i.e., as a case for which MuTAP cannot generate a test case. We followed the same approach for Pynguin; if, after 10 runs, Pynguin does not generate a test case for a PUT, we categorize the PUT as problematic.

We also have an internal repetition in each run, regarding the syntax fixing step. We run the syntax fixing prompt for each unit test on the LLMC for up to 10 iterations. If the syntax error remains unresolved even after 10 iterations, MuTAP employs the Python parser to locate the erroneous line. It then retains the lines preceding the buggy line, ensuring they are free of syntax errors. If the removal of lines results in the absence of any remaining test cases (all test cases prove non-compilable), we classify the PUT as problematic.

We also conducted a statistical test to determine whether the effectiveness of test cases generated by MuTAP is significantly different from those generated by Pynguin and other comparable methods. Due to the importance of killed/detected mutants/buggy code snippets in evaluating the effectiveness of test cases, and due to the limited number of PUTs in our datasets, we conducted the statistical test on categorical data to determine whether the proportion of mutants/buggy code snippets killed/detected by MuTAP is significantly different from that of other comparable methods. To apply the statistical test on the categorical data, we employed the Chi-square test with a df of 1 and a p-value threshold of 5% (0.05). The result of the statistical test indicates whether the proportion of mutants/buggy code detected by MuTAP is significantly different from that of other comparable methods. In this statistical test, the null hypothesis states that there is no significant difference between the proportion of mutants/buggy code detected by MuTAP and that of other methods (p-value > 0.05). A p-value < 0.05 indicates that the null hypothesis can be rejected and that the improvement in the effectiveness of test cases generated by MuTAP is statistically significant. We also calculated the effect size on the proportion of detected mutants/buggy code (based on the chi-square test) and reported its magnitude [44,45].

4.1.2. Comparable tool

Pynguin [41] is a well-known fully-automated test generation tool for dynamically typed programming languages such as Python. It uses different search-based algorithms to satisfy code coverage criteria, i.e., branch coverage. Pynguin first takes Python code (a method, a module, etc.) as input and collects information such as variable types, method names, and dependencies. Then it uses one of the search-based test generation algorithms (MIO [46], MOSA [47], DynaMOSA [48], etc.) to generate test cases. It randomly mutates (deletes, inserts, replaces) different values and statements within a test case to generate new test cases and executes them over the PUT to ensure their correctness. Finally, it generates assertions for the test cases using an MT engine [41].

For our experiments, we employ Pynguin 0.17.0 with DynaMOSA [48]. According to the evaluation of Pynguin [48], DynaMOSA shows the best performance compared to the other algorithms in generating test cases with this tool. We set the timeout of test generation to 600 s, which is the default setting of the tool.

4.1.3. Large Language Model Component (LLMC)

We employ two different LLMs as the LLMC of MuTAP. The first one is OpenAI's Codex, designed specifically for code generation tasks [17]. We use Code-davinci-002, with a temperature of 0.8. A lower temperature causes less variation in the outputs of the model, while a higher temperature increases the variation of the output and thus the chance of generating useful test cases over different iterations. The evaluation of CODAMOSA [16] shows that 0.8 is a reasonable temperature to generate useful test cases with Codex.

The second LLM is Meta's llama-2-chat, which has been iteratively refined using Reinforcement Learning with Human Feedback (RLHF) and is appropriate for dialog use cases [49]. Similar to Codex, we have configured the model's temperature to be 0.8. Furthermore, the model provides three distinct roles as the structure of the prompt: system, user, and assistant. These roles serve the purpose of clarifying each component of the prompt to the model by assigning specific components to each role. This structure follows the model's training procedure. The system role defines the model's behavior, designating it as an assistant with a specific type of task. For instance, it clarifies whether the model operates as a Python programming assistant or serves as a poet that tries to find memorable names for variables. Given the model's training for a conversational (chat) setup, the user role is assigned to the user's prompt or request, while the assistant role encompasses the model's response related to the user's prompt. Different combinations of these roles can be utilized in each prompt to tailor the interaction with the model according to the specific requirements [49].


Table 1
List of the mutation operators in our experiments used by MutPy, sorted in alphabetical order.

Operator | Example | Mutant
AOD — arithmetic operator deletion | result.append(numbers[-1]) | result.append(numbers[1])
AOR — arithmetic operator replacement | return number % 1.0 | return number * 1.0
ASR — assignment operator replacement | current_depth += 1 | current_depth -= 1
BCR — break continue replacement | if i % j != 0: break | if i % j != 0: continue
COD — conditional operator deletion | if not string: return ' ' | if string: return ' '
COI — conditional operator insertion | if balance < 0: return True | if (not balance < 0): return True
EHD — exception handler deletion | except: pass | except: raise
EXS — exception swallowing | except: return False | except: pass
LCR — logical connector replacement | if s[-1] == 'y' or s[-1] == 'Y': | if s[-1] == 'y' and s[-1] == 'Y':
ROR — relational operator replacement | if c[n] <= 1: | if c[n] >= 1:
SIR — slice index remove | l[::3] = sorted(l[::3]) | l[::3] = sorted(l[:])
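To make the operators in Table 1 concrete, the sketch below applies an ROR-style mutation by hand to a small function and shows a test that kills the mutant; the function and test are invented for illustration and are not taken from the benchmarks:

def is_positive(n):
    return n > 0              # original PUT

def is_positive_ror_mutant(n):
    return n >= 0             # ROR: '>' replaced by '>='

def test_case(put):
    # A boundary-value assertion that distinguishes the two versions.
    assert put(0) == False

test_case(is_positive)                 # passes on the original
# test_case(is_positive_ror_mutant)    # raises AssertionError, i.e., the mutant is killed

If a unit test kills, say, 9 of the 10 mutants produced this way for a PUT, its mutation score for that PUT is 90%.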

In our experiments, the role of the system is defined as {You are a Python coding assistant. Always answer with Python code.} for all types of prompts, including zero-shot, few-shot, and augmented prompts. To handle the zero-shot prompt, we only set the user role content to be the concatenation of (INS1, PUT_i, INS2). For the few-shot prompt, we define the content of the assistant role as a set of demonstrative examples of a Method (M) and a Unit Test (UT), while the user role content is set to PUT_i. As for the augmented prompt, its various components are set up as follows:

{user: Initial Prompt,
 assistant: IUT,
 user: concat(INS3, SM_i, INS4)}

For both LLMs, the maximum number of generated tokens is set to 250 for generating test cases and 20 tokens for syntax fixing, based on previous studies on similar tasks [16,50]. The stop word is defined as a quotation mark (") for the zero-shot prompt and as </test> for the few-shot prompt. For the rest of the hyperparameters, we keep the models' default values.

It is important to note that MuTAP is not limited to these two models, and its LLMC can be replaced with any other LLM as required.

4.1.4. Baselines

In addition to Pynguin, we propose two baselines for each LLM to evaluate our proposed method, MuTAP.

Before-refining: The first baseline is the output of the initial prompt on LLMC (Codex or llama-2-chat), without fixing syntax errors or repairing the intended behavior. Since assertions with unintended return values can fail on mutants or buggy code and present invalid effectiveness, we omit those assertions in this baseline to avoid this side effect. If the output of the model has syntax errors, we consider it a wrong test and consequently consider the task problematic or unsolved.

After-refining: The second baseline is the output of the initial prompt on LLMC (Codex or llama-2-chat), after applying the following steps: Refining (Section 3.2) and Oracle Minimization (Section 3.5).

4.1.5. Mutant generator

To apply MT, we need to generate different mutant versions of a PUT by injecting bugs into its different lines. For this purpose, we use MutPy version 2.0 [42]. MutPy is an MT tool for code in Python 3.3+. It relies on different mutation operators to generate the mutants. The list of mutation operators used in our experiment, with corresponding examples, is shown in Table 1. MutPy injects one operator at a time to generate a mutant, if the operator is applicable to the PUT.

Table 2
Datasets used for evaluation.

Dataset | # PUTs | Mutants\Bugs | Description
HumanEval [17] | 164 | 1260 | Synthetically generated bugs from PUTs
Refactory [31] | 5 | 1710 | Real student buggy code on 5 assignments

4.1.6. Datasets

To conduct our experiments, we use two different datasets, as shown in Table 2. The first one is HumanEval [17], which is a dataset to evaluate LLMs that generate code. It has 164 human-written programming problems at easy to medium levels. We consider each programming problem in this dataset as a PUT. Each problem has different attributes such as descriptions and reference solutions. We use the reference solution of each task as a PUT. The PUTs in the HumanEval dataset are either self-contained or dependent on public libraries, such as NumPy. The average number of lines of code (LOC) across all the PUTs in this dataset is 11.5.

The second one, Refactory [31], is a benchmark for Python bug repairing [51]. It has 1710 buggy student submissions for 5 assignments of a Python programming course. Each assignment has a correct reference solution that we use as a PUT. The PUTs in the Refactory dataset are all self-contained. The average number of LOC across all the PUTs in this dataset is 8.4. The advantage of this dataset is that its buggy code snippets were written by humans, which gives us the opportunity to evaluate test cases generated by MuTAP on real bugs and compare them with Pynguin and our baselines.

Both datasets employed in this study are in Python. Python is a very common and widely used programming language across various domains [52], making it a representative and well-suited choice for our investigation. Moreover, various LLMs show better performance in synthesizing Python code. Notably, LLMs like Codex are specifically trained and optimized for this language. MuTAP can be leveraged for Java as it is language-agnostic by design.

4.2. Experimental results

In this section, we discuss our findings for each RQ.

4.2.1. RQ1: How effective are test cases generated by MuTAP in comparison to test cases generated by automatic test generation tools?

Since our study focuses on MT to improve the effectiveness of test cases, we compare MuTAP with Pynguin and our baselines in terms of MS, number of killed mutants, and number of PUTs with 100% MS. The MS reflects the effectiveness of the test cases generated by the different methods in revealing bugs, excluding PUTs for which the method failed to generate test cases (problematic PUTs). However, the number of killed mutants, in comparison to the total number of mutants (1260), provides additional insight by reflecting the total number of surviving mutants for both PUTs with test cases and those without.
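For reference, the MS values reported in the following tables follow the definition used in Algorithm 4:

\[
MS = \frac{|\text{mutants}| - |\text{surviving mutants}|}{|\text{mutants}|} \times 100\%
\]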


Table 3
Evaluation result of test cases generated by MuTAP and other methods on synthetic buggy programs on the HumanEval dataset. "MS (Avg.)" represents the average MS with its standard deviation over all PUTs in the HumanEval dataset. The MS for each PUT is the median of 10 runs. "Problematic PUT" refers to the percentage of PUTs without an accurate test case and "PUT with MS = 100%" denotes the percentage of PUTs for which the test cases achieve an MS of 100%, both out of 164 PUTs. "Killed Mut" represents the absolute number of killed mutants over all PUTs.

Prompt | Model | Method | # test cases (Avg.) | Problematic PUT (%) | MS (Avg.) (%) ± std | # Killed Mut (out of 1260) | Effect Size (ES) | PUT MS = 100% (%)
– | – | Pynguin | 1.5 (min = 1, max = 4) | 18.9 | 65.94 ± 30.78 | 649*** | Large | 28.22
Zero-shot | Codex | Before-refining | 1.5 (min = 1, max = 3) | 44.51 | 72.15 ± 26.95 | 296*** | Large | 11.04
Zero-shot | Codex | After-refining | 2.1 (min = 1, max = 3) | 18.29 | 76.82 ± 24.35 | 749** | Medium | 24.54
Zero-shot | Codex | MuTAP | 2.5 (min = 1, max = 4) | 18.29 | 89.13 ± 20.32 | 869 | – | 41.72
Zero-shot | llama-2-chat | Before-refining | 1.2 (min = 1, max = 3) | 41.46 | 62.60 ± 28.82 | 318*** | Large | 17.79
Zero-shot | llama-2-chat | After-refining | 2.2 (min = 1, max = 5) | 0 | 84.04 ± 17.41 | 1059* | Small | 53.98
Zero-shot | llama-2-chat | MuTAP | 2.5 (min = 1, max = 5) | 0 | 91.98 ± 13.03 | 1159 | – | 68.09
Few-shot | Codex | Before-refining | 1.5 (min = 1, max = 3) | 23.78 | 72.68 ± 26.23 | 508*** | Large | 15.95
Few-shot | Codex | After-refining | 2.2 (min = 1, max = 3) | 16.46 | 82.73 ± 21.91 | 829** | Medium | 34.97
Few-shot | Codex | MuTAP | 2.6 (min = 1, max = 7) | 16.46 | 92.02 ± 13.55 | 922 | – | 49.69
Few-shot | llama-2-chat | Before-refining | 1.5 (min = 1, max = 3) | 36.58 | 64.51 ± 24.11 | 325*** | Large | 22.69
Few-shot | llama-2-chat | After-refining | 2.5 (min = 1, max = 5) | 0 | 85.16 ± 16.36 | 1073* | Small | 57.05
Few-shot | llama-2-chat | MuTAP | 2.6 (min = 1, max = 7) | 0 | 93.57 ± 11.18 | 1179 | – | 69.93

Results of the Chi-square test with a p-value threshold of 5%: significant at ***p < 0.001, **p < 0.01, *p < 0.05.
Magnitude of the effect size: negligible at ES < 0.1, Small at ES ≥ 0.1, Medium at ES ≥ 0.3, Large at ES ≥ 0.5.
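The significance stars and effect-size magnitudes in Table 3 (and Table 4) come from a chi-square test on categorical counts, compared pairwise against MuTAP. Below is a sketch of how such a test and a phi-style effect size can be computed with SciPy for a 2×2 table of killed versus surviving mutants; the exact table construction and effect-size measure are assumptions on our part (the paper cites [44,45] for its measure), and the counts are only illustrative, taken from the "Killed Mut" column:

from math import sqrt
from scipy.stats import chi2_contingency

# Rows: method (MuTAP vs. a comparison method); columns: killed vs. surviving mutants.
table = [[1179, 1260 - 1179],   # e.g., MuTAP (few-shot, llama-2-chat)
         [649, 1260 - 649]]     # e.g., Pynguin

chi2, p_value, dof, _ = chi2_contingency(table)
n = sum(sum(row) for row in table)
effect_size = sqrt(chi2 / n)    # phi coefficient, a common choice for 2x2 tables

print(f"chi2={chi2:.2f}, p={p_value:.3g}, df={dof}, phi={effect_size:.2f}")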

Given that the goal of MT is to generate test cases with MS=100%, the total number of PUTs with a 100% MS demonstrates the improvement achieved by MuTAP in this respect compared to Pynguin and the baselines. Absolute values in Tables 3 and 4 are reported based on the unit tests with the median MS.

HumanEval dataset. Table 3 shows the obtained results for the HumanEval dataset. Prior to syntax fixing and intended behavior repair (before-refining), the test cases generated by Codex and llama-2-chat are incorrect for 44.51% and 41.46% of PUTs, respectively, when using the zero-shot initial prompt. However, they managed to kill 295 and 318 mutants (out of 1260), respectively.

The initial prompt has a more pronounced impact on the output of Codex compared to llama-2-chat. Switching the initial prompt to few-shot decreases the number of PUTs without test cases to 23.78%, while also raising the number of killed mutants to 508, when using Codex as the LLMC. On the other hand, when using llama-2-chat, the number of PUTs without test cases reduces to 36.58%, and the number of killed mutants increases from 318 to 325. This difference in performance could be attributed to llama-2-chat being more suitable for dialog prompts: using a prompt with a pair of demonstrative inputs and outputs, devoid of natural language context, does not improve the model's performance significantly.

In contrast, Pynguin, as the state-of-the-art automatic test generation tool, outperforms the output of both LLMs before-refining, by killing 649 mutants and failing to generate test cases for 18.9% of PUTs.

After applying the post-processing steps of syntax fixing and intended behavior repair, MuTAP with both LLMs performs better than Pynguin in terms of killing more mutants. Notably, when using both zero-shot and few-shot prompts, llama-2-chat is able to generate correct test cases for all PUTs after-refining. However, their effectiveness in terms of killing mutants is measured at 84.04% and 85.16% with the zero-shot and few-shot prompts, respectively.

On the other hand, the MS of test cases generated by Codex after refining is 76.82% and 82.73% with the zero-shot and few-shot prompts, respectively. Despite this improvement, Codex still fails to generate correct test cases for 18.29% (with zero-shot) and 16.46% (with few-shot) of PUTs after refining.

MuTAP enhances the effectiveness of test cases generated by both LLMs, Codex and llama-2-chat, achieving an MS of 89.13% and 91.98% with the zero-shot prompt, and an MS of 92.02% and 93.57% with the few-shot prompt, respectively. In particular, MuTAP with the few-shot prompt, when using llama-2-chat as its LLMC, manages to kill 1179 mutants out of 1260 and generates test cases with MS=100% for up to 70% of PUTs, demonstrating a remarkable improvement in the effectiveness of test cases compared to Pynguin, with its 649 killed mutants and 28.22% of PUTs with MS=100%. As the results of our statistical test on the proportion of killed mutants over all PUTs also show, the effectiveness of test cases generated by MuTAP is significantly different from that of those generated by the other comparable methods (all p-values are below 0.05). Also, the magnitude of the effect size on the proportion of mutants killed by MuTAP with its different configurations, compared to the other alternative methods, is non-negligible.

Refactory dataset. To evaluate the performance of MuTAP on buggy programs, we employ the Refactory dataset. To evaluate the results on this dataset, we select the unit tests with the median MS out of 10 runs generated by the different methods and apply them to the buggy code in Refactory to assess their effectiveness in detecting real buggy code snippets. We report the absolute total number of buggy code snippets, over all PUTs, that are detected by the unit tests with the median MS generated by the different methods.

Table 4, which shows the results on this dataset, confirms our findings on HumanEval. Overall, our proposed method, MuTAP, detects more buggy programs compared to Pynguin and the other baseline methods.

MuTAP with few-shot learning, while using llama-2-chat, identifies 468 more buggy code snippets compared to Pynguin and 111 more compared to After-refining. Furthermore, MuTAP, while using llama-2-chat as its LLMC, discovers 79 buggy code snippets that were detected by neither Pynguin nor llama-2-chat's test cases after the After-refining process. When using Codex, MuTAP detects 73 buggy code snippets that were missed by both Pynguin and Codex's test cases after the After-refining stage. Moreover, MuTAP excels in generating more effective test cases, with an average of 2.6 test cases after applying the greedy optimization. As the results of our statistical test on this dataset also show, the proportion of buggy code detected by MuTAP is significantly different from that of the other comparable methods. The stars indicate the degree of statistical significance over the alternative methods (always compared to MuTAP with a specific configuration). Moreover, the magnitude of the effect size on the proportion of buggy code detected by MuTAP with its different configurations, compared to the other alternative methods, is non-negligible.

def derivative(xs: list):
    return [(i * x) for i, x in enumerate(xs)][1:]

Listing 2: A sample PUT for which MuTAP, incorporating Codex, was unable to generate test cases.
1179 mutants out of 1260 and generates test cases with MS=100%


Overall, MuTAP using both llama-2-chat and Codex demonstrates


better performance compared to Pynguin in terms of killing mutants
and detecting buggy code. The effectiveness of these test cases in
1 def test_case_0 ():
detecting bugs is improved through post-processing steps of refining
2 float_0 = 890.6
3 list_0 = [float_0 , float_0 , float_0 , and prompt augmentation.
float_0 ]
4 var_0 = derivative ( list_0 )
5 assert len(var_0 ) == 3 Finding 1: MuTAP generates more effective test cases com-
pared to Pynguin and conventional zero-shot and few-shot
Listing 3: The only test case generated by Pynguin for the PUT in
learning on LLM. The number of MuTAP’s test cases is not
Listing 2.
much greater than the output of other methods after minimization. Additionally, the LLM with a dialog setup performs better on the augmented prompt. In conclusion, the effectiveness of LLM-generated test cases can be enhanced through prompt augmentation using surviving mutants and post-processing refinement.

The challenges that Pynguin faces in generating valid and effective test cases for certain PUTs can be attributed to some of its limitations. While Pynguin is capable of generating test cases for self-contained PUTs and PUTs with dependencies on public libraries, such as NumPy, it sometimes exhibits limitations in generating effective tests for such PUTs. However, this limitation is not observed when LLMs are employed in MuTAP for test case generation. Moreover, for PUTs that incorporate generators like 'yield' or iterators such as list comprehensions, Pynguin encounters difficulties in generating corresponding test cases. In contrast, MuTAP demonstrates no limitations in generating test cases for PUTs with such constructs.

In addition, type information is crucial for generating high-quality test cases. While this information is not available in dynamically typed languages like Python, Pynguin extracts some of the type information during its test case generation process if available. However, one of the factors impacting Pynguin's ability to generate effective test cases is the potential presence of incorrect, incomplete, or missing type information. In MuTAP, the semantics of the function and variable names (input/output names) in PUTs enable the LLMC to make assumptions about variable types, which helps it generate more effective test cases; Pynguin cannot benefit from this information.

Conversely, MuTAP encounters challenges in generating effective test cases for PUTs whose syntax does not provide comprehensive information about their functionality, since no description of the functionality is incorporated in the prompt. An illustrative example of such PUTs is shown in Listing 2, where MuTAP, incorporating Codex, is unable to generate test cases for the PUT. In contrast, Listing 3 shows the only test case generated by Pynguin for the same PUT in Listing 2. Although the PUT involves an iterator within a list comprehension, Pynguin can generate test inputs and capture the test output. However, the assertion compares the length of the output list with an expected value, which does not accurately represent the functionality of the PUT in Listing 2. The MS for this test case generated by Pynguin is 25%.

Listing 4 presents the unit test generated by MuTAP, incorporating llama-2-chat, for the example PUT in Listing 2. The test cases within this unit test are more comprehensive in evaluating the functionality of the PUT and prove more effective in revealing bugs, achieving an MS of 100%.

def test():
    assert derivative([1]) == []
    assert derivative([1, 2, 3]) == [2, 6]
    assert derivative([1, 2, 3, 4]) == [2, 6, 12]
    assert derivative([3, 1, 2, 4, 5]) == [1, 4, 12, 20]
    assert derivative([1.0, 2.0, 3.0]) == [2.0, 6.0]
    assert derivative(['a', 'b', 'c', 'd', 'e']) == ['b', 'cc', 'ddd', 'eeee']

Listing 4: The test cases generated by MuTAP, incorporating llama-2-chat, for the PUT in Listing 2.

4.2.2. RQ2: How do the different parts of MuTAP perform?
Syntax Fixer: On average, the percentage of test cases with syntax errors is 38.98% and 26.48% when using the zero-shot and few-shot prompts, respectively, with Codex. When employing llama-2-chat, this percentage is 33.85% and 26.32% with the zero-shot and few-shot prompts, respectively.

When considering syntax errors, three factors contribute to decreasing them in the output of LLMs. The first factor is the type of initial prompt. As shown in Table 5 on the HumanEval dataset, few-shot learning results in fewer syntax errors in the output of both LLMs. Specifically, when using Codex, the percentage of syntax errors decreases from 44.79% to 29.03% for after-refining, and for MuTAP, it decreases from 33.17% to 23.93%. With llama-2-chat as the LLMC, the percentage of syntax errors decreases from 38.03% to 26.99% for after-refining, and from 29.66% to 25.64% for MuTAP.

The second impactful factor, which is also the primary factor, is the Syntax Fixing component. As shown in Table 5, when using Codex, this component in MuTAP on average fixes 14.5% of syntax errors by utilizing the LLMC and addresses 81.37% of syntax errors by omitting the lines causing the errors. On the other hand, when using llama-2-chat as the LLMC of MuTAP, the Syntax Fixing component, on average, resolves 32.31% of syntax errors through re-prompting the LLMC, and 60.73% of the errors by omitting the problematic lines.

The final factor contributing to the reduction of syntax errors in test cases is the prompt augmentation process in MuTAP. By augmenting the prompt with the IUT, the occurrence of syntax errors in the output of Codex with the zero-shot technique decreases from 44.79% to 33.17%. Similarly, with llama-2-chat and the zero-shot prompt, the percentage of syntax errors reduces from 38.03% to 29.66%. Augmenting the prompt with the IUT provides illustrative examples of test cases and serves a similar purpose to the demonstrative examples in the few-shot learning prompt, effectively reducing syntax errors in the output of LLMs.

Our finding on the Refactory dataset shows that MuTAP generates test cases with syntax errors for only one PUT (out of 5) when using Codex and zero-shot learning. Moreover, none of those syntax errors could be fixed by re-prompting the LLMC. On the other hand, for both initial prompt types, syntax errors decrease to zero when using llama-2-chat.

Intended Behavior Repair: In the case of repairing intended behavior, two distinct factors contribute to reducing the error rate in assertion oracles. As shown in Table 6, the Intended Behavior Repair step, when using Codex as the LLMC, on average fixes 83.98% (82.21% with zero-shot and 85.75% with few-shot) and 89.86% (89.71% with zero-shot and 90.00% with few-shot) of incorrect behaviors in after-refining and MuTAP, respectively. When utilizing llama-2-chat, this step repairs, on average, 84.35% and 95.96% of unintended behavior in after-refining and MuTAP, respectively.
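To make the repair step concrete, the following is a minimal sketch of how an Intended Behavior Repair pass can be approximated for primitive assertions of the form "assert call == value": the call is re-executed on the PUT (assumed to be correct) and the expected value is rewritten. The function name, the edge-case handling, and the example PUT are illustrative assumptions, not MuTAP's actual implementation.

import ast

def repair_intended_behavior(put_source: str, assertion: str) -> str:
    # Illustrative repair of a single primitive assertion: re-run the call on
    # the (assumed correct) PUT and rewrite the expected value.
    namespace: dict = {}
    exec(put_source, namespace)                       # load the PUT
    stmt = ast.parse(assertion.strip()).body[0]
    if not isinstance(stmt, ast.Assert):
        return assertion
    test = stmt.test
    if not (isinstance(test, ast.Compare) and isinstance(test.ops[0], ast.Eq)):
        return assertion                              # only handle `assert call == value`
    call_src = ast.unparse(test.left)                 # e.g. "choose_num(6, 10)"
    try:
        actual = eval(call_src, namespace)            # intended output on the PUT
    except Exception:
        return assertion                              # keep the original if the call crashes
    return f"assert {call_src} == {actual!r}"

# Example with a hypothetical PUT: the wrong oracle (== 12) is rewritten to == 10.
put = ("def choose_num(x, y):\n"
      "    if x > y:\n        return -1\n"
      "    if y % 2 == 0:\n        return y\n"
      "    if x == y:\n        return -1\n"
      "    return y - 1\n")
print(repair_intended_behavior(put, "assert choose_num(6, 10) == 12"))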


Table 4
Evaluation results on buggy programs in the Refactory dataset. The "Bug Detected" column shows the absolute number of real buggy programs detected by each method. "MS (Avg.)" represents the average MS with its standard deviation over the 5 PUTs in the Refactory dataset. The MS for each PUT is the median of 10 runs.

Prompt      Model         Method          # test cases (avg)        MS (Avg.) (%) ± std   Bug Detected (out of 1710)   Effect Size (ES)
–           –             Pynguin         1.25 (min = 1, max = 4)   55.93 ± 32.45         1155***                      Large
Zero-shot   Codex         After-refining  1.2 (min = 1, max = 2)    66.11 ± 17.77         1356**                       Medium
Zero-shot   Codex         MuTAP           1.6 (min = 1, max = 3)    77.91 ± 19.24         1437                         –
Zero-shot   llama-2-chat  After-refining  1.2 (min = 1, max = 3)    76.93 ± 12.97         1478***                      Medium
Zero-shot   llama-2-chat  MuTAP           2.2 (min = 1, max = 4)    94.40 ± 11.20         1594                         –
Few-shot    Codex         After-refining  1.6 (min = 1, max = 3)    67.93 ± 16.94         1411**                       Medium
Few-shot    Codex         MuTAP           2.2 (min = 1, max = 4)    83.73 ± 14.31         1529                         –
Few-shot    llama-2-chat  After-refining  2.1 (min = 1, max = 4)    76.93 ± 12.97         1512***                      Medium
Few-shot    llama-2-chat  MuTAP           2.2 (min = 1, max = 4)    94.40 ± 11.20         1623                         –

The asterisks report the result of the Chi-square test with a p-value threshold of 5%: significant at *** p < 0.001, ** p < 0.01, * p < 0.05.
The magnitude of the effect size: negligible at ES < 0.1, Small at ES ≥ 0.1, Medium at ES ≥ 0.3, Large at ES ≥ 0.5.
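As a rough illustration of the statistical comparison reported in the table notes, the sketch below runs a Chi-square test on two bug-detection counts and maps an effect-size estimate onto the magnitude labels above. The 2x2 contingency layout and the use of the phi coefficient as the effect-size statistic are assumptions made for this sketch; the table notes do not spell out the exact computation.

from scipy.stats import chi2_contingency
import math

def compare_bug_detection(detected_a: int, detected_b: int, total: int = 1710):
    # 2x2 table: detected vs. undetected buggy programs for two methods (assumed layout).
    table = [[detected_a, total - detected_a],
             [detected_b, total - detected_b]]
    chi2, p_value, dof, _ = chi2_contingency(table)
    effect_size = math.sqrt(chi2 / (2 * total))       # phi coefficient for a 2x2 table
    if effect_size >= 0.5:
        magnitude = "Large"
    elif effect_size >= 0.3:
        magnitude = "Medium"
    elif effect_size >= 0.1:
        magnitude = "Small"
    else:
        magnitude = "Negligible"
    return p_value, effect_size, magnitude

# e.g. MuTAP (llama-2-chat, few-shot) vs. Pynguin on the Refactory programs
print(compare_bug_detection(1623, 1155))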

Table 5
Syntax error fixing of test cases. The syntax error rate shows the ratio of unit tests with syntax errors.

Model         Method          Prompt      # iterations (avg)   Syntax error rate   Fixed by model   Fixed by omitting lines
Codex         After-refining  Zero-shot   9.1                  44.79%              16.44%           60.27%
Codex         After-refining  Few-shot    9.5                  29.03%              12.96%           83.33%
Codex         MuTAP           Zero-shot   9.7                  33.17%              16.18%           79.41%
Codex         MuTAP           Few-shot    9.5                  23.93%              12.82%           84.62%
llama-2-chat  After-refining  Zero-shot   7.1                  38.03%              30.64%           63.86%
llama-2-chat  After-refining  Few-shot    6.8                  26.99%              31.81%           57.96%
llama-2-chat  MuTAP           Zero-shot   6.9                  29.66%              32.17%           61.05%
llama-2-chat  MuTAP           Few-shot    6.8                  25.64%              32.45%           60.40%
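The omission strategy behind the "Fixed by omitting lines" column can be sketched as follows: when re-prompting the LLMC does not yield compilable code, the line that triggers the SyntaxError (typically a truncated last line) is dropped until the unit test compiles. The helper below is an illustrative approximation of that fallback, not MuTAP's exact code.

def drop_unparsable_lines(test_source: str) -> str:
    # Repeatedly remove the line reported by the SyntaxError until the unit test compiles.
    lines = test_source.splitlines()
    while lines:
        try:
            compile("\n".join(lines), "<unit_test>", "exec")
            return "\n".join(lines)
        except SyntaxError as err:
            bad = min(max((err.lineno or len(lines)) - 1, 0), len(lines) - 1)
            del lines[bad]
    return ""

broken = "def test():\n    assert choose_num(6, 10) == 10\n    assert choose_num(11,"
print(drop_unparsable_lines(broken))   # keeps only the lines that parse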

In addition to the Intended Behavior Repair step, the prompt augmentation step in MuTAP significantly reduces the occurrence of unintended behavior in test cases. Table 6 shows that when, for example, llama-2-chat is employed as the LLMC alongside the few-shot initial prompt, 63.25% of the test cases generated by the model have assertion errors. This indicates that, within this context, only about 36.75% of the test oracles generated by the model demonstrate the correct behavior, including accurate test outputs. Subsequently, this figure drops to 10.75% after augmenting the initial prompt with the IUT and surviving mutants. This demonstrates that 89.25% of the test oracles generated by MuTAP have the correct behavior, encompassing accurate test outputs.

When using Codex with a zero-shot prompt, the assertions with unintended behavior, such as a wrong test output, decrease from 63.63% to 19.38%. Similarly, with llama-2-chat and the few-shot prompt, the assertions with unintended behavior decrease from 63.25% to 10.75%. The reason behind this improvement can be attributed to the usage of IUTs (Initial Unit Tests) in MuTAP for augmenting the initial prompt. These IUTs already represent the intended behavior of the PUT, thereby assisting the LLM in suggesting test cases with less unintended behavior (i.e., fewer wrong test outputs). Also, on the Refactory dataset, MuTAP repaired all assertions with incorrect behavior in the output of both the initial and augmented prompts.

Unlike for syntax errors, the prompt type does not significantly help with unintended behavior in assertions. The combination of the Intended Behavior Repair step and the prompt augmentation process improves the effectiveness of test cases, ensuring that they align with the intended behavior of the PUT.

Surviving Mutants Representation: We also investigated the impact of the surviving mutants' order on MS during prompt augmentation. Fig. 3 illustrates the effect of augmenting the prompt with a random order of surviving mutants over 5 runs for all PUTs. For this comparison, we randomly selected one of the surviving mutants of each PUT with MS < 100% and utilized it to augment the initial prompt. We then calculated the average MS for all PUTs. Subsequently, we randomly chose the second surviving mutant for the remaining PUTs with MS < 100% (if any), repeated the augmentation process as a second iteration, and calculated the average MS for all PUTs again. We continue to iterate this process (x axis in Fig. 3) until either there are no more PUTs with MS < 100% or there are no more surviving mutants that have not been utilized in the augmentation process.

As shown in Fig. 3, each data point represents an iteration of the augmentation step and the average MS for all PUTs across five runs, derived from a random selection of surviving mutants. The shaded area illustrates the standard error of the average MS across these five runs in each iteration. The results show that the standard error over the 5 runs in each iteration is not significant. However, during the initial iterations (up to 7 iterations), the standard error around the average MS over the five runs is greater than what is observed in the final iterations. That is because, after several iterations over the augmentation step with different surviving mutants, the improvement in MS stalls. Notably, more than 90% of the MS is achieved by using only half of the surviving mutants, and the improvement in MS stalls after a certain iteration of the augmentation step for the different LLMs. For example, when using Codex as the LLMC with zero-shot learning, the MS stops improving even though, on average, 27 surviving mutants (out of 226) are not utilized in the prompt augmentation step. Similarly, in few-shot learning, this number is 24 (out of 106).

Our results for RQ2 demonstrate that test cases generated by LLMs, regardless of the prompt type, require post-processing, such as syntax correction or intended behavior repair, in order to function properly and detect bugs effectively. Also, the order of the surviving mutants used to augment the prompt does not significantly impact the MS gain.

Finding 2: The Syntax Fixing and Intended Behavior Repair steps fix up to 95.94% and 89.86% of syntax and functional errors in test cases, respectively. The prompt augmentation in MuTAP decreases the unintended behavior in the output of LLMs significantly (by 44.36% using Codex and by 52.5% using llama-2-chat). Furthermore, only a small number of mutants (up to 27) do not contribute to the improvement of MS.
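A minimal sketch of the augmentation loop discussed above is given below; llm_generate and run_mutation_testing stand in for the model call and the MS computation, both of which are placeholders rather than MuTAP's real interfaces, and the prompt wording is only indicative.

import random

def augment_until_adequate(put_source, initial_prompt, llm_generate,
                           run_mutation_testing, max_iterations=10):
    # Generate an initial unit test, then keep adding one surviving mutant to the
    # prompt until MS reaches 100% or no unused surviving mutant remains.
    test_suite = llm_generate(initial_prompt)
    ms, surviving = run_mutation_testing(put_source, test_suite)
    tried = set()
    for _ in range(max_iterations):
        candidates = [m for m in surviving if m not in tried]
        if ms == 100.0 or not candidates:
            break
        mutant = random.choice(candidates)          # order has little effect on the final MS
        tried.add(mutant)
        prompt = (initial_prompt
                  + "\n# The test function below cannot detect the fault in:\n" + mutant
                  + "\n" + test_suite
                  + "\n# Provide a new test case to detect the fault:\n")
        test_suite = test_suite + "\n" + llm_generate(prompt)
        ms, surviving = run_mutation_testing(put_source, test_suite)
    return test_suite, ms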


Table 6
Evaluation results of Intended Behavior Repair. The Assertion Error Rate shows the ratio of assertions with wrong behavior.

Model         Method          Prompt      Assertion Error Rate   Repaired   Not repaired
Codex         After-refining  Zero-shot   63.63%                 82.21%     17.79%
Codex         After-refining  Few-shot    62.84%                 85.75%     14.25%
Codex         MuTAP           Zero-shot   19.38%                 89.71%     10.29%
Codex         MuTAP           Few-shot    18.36%                 90.00%     10.71%
llama-2-chat  After-refining  Zero-shot   60.27%                 81.80%     18.19%
llama-2-chat  After-refining  Few-shot    63.25%                 86.90%     13.09%
llama-2-chat  MuTAP           Zero-shot   23.40%                 94.06%     5.94%
llama-2-chat  MuTAP           Few-shot    10.75%                 94.91%     5.09%

Table 7
Evaluation of killed mutants for each type of injected operator into PUTs.
Each cell reports the number of killed mutants (with its percentage) and the total number of mutants, per method: Pynguin, then MuTAP with Zero-shot Codex, Zero-shot llama-2-chat, Few-shot Codex, and Few-shot llama-2-chat.
Type   Killed Total   Killed Total   Killed Total   Killed Total   Killed Total
AOD 13 (39.39%) 33 28 (87.50%) 32 39 (86.67%) 45 32 (94.12%) 34 40 (88.89%) 45
AOR 248 (67.39%) 368 336 (91.55%) 367 410 (91.52%) 448 347 (92.53%) 375 417 (93.08%) 448
ASR 45 (60.00%) 75 60 (80.00%) 75 79 (94.05%) 84 64 (85.33%) 75 79 (94.05%) 84
BCR 2 (40.00%) 5 2 (40.00%) 5 5 (55.56%) 9 2 (40.00%) 5 6 (66.67%) 9
COD 8 (53.33%) 15 15 (100.00%) 15 16 (72.73%) 22 17 (100.00%) 17 17 (77.27%) 22
COI 130 (81.76%) 159 154 (96.86%) 159 216 (95.15%) 227 164 (98.80%) 166 218 (96.04%) 227
EHD 1 (100.00%) 1 0 (0.00%) 0 2 (100.00%) 2 1 (100.00%) 1 2 (100.00%) 2
EXS 0 (0.00%) 0 1 (100.00%) 1 1 (100.00%) 1 1 (100.00%) 1 1 (100.00%) 1
LCR 14 (45.16%) 31 23 (74.19%) 31 37 (86.05%) 43 27 (81.82%) 33 39 (90.70%) 43
ROR 174 (66.67%) 261 227 (87.31%) 260 316 (94.61%) 334 239 (91.57%) 261 320 (95.81%) 334
SIR 10 (33.33%) 30 23 (76.67%) 30 38 (84.44%) 45 28 (82.35%) 34 40 (88.89%) 45
Total 645 (65.95%) 978 869 (89.13%) 975 1159 (91.98%) 1260 922 (92.02%) 1002 1179 (93.57%) 1260
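For intuition on how such operator-level mutants are produced (MutPy [42] is used for the actual experiments), the toy transformer below applies an AOR-style change by swapping + and - in a hypothetical PUT; the other operator types in Table 1 follow the same AST-rewriting pattern. This is an illustrative sketch, not MutPy's implementation.

import ast

class ArithmeticOperatorReplacement(ast.NodeTransformer):
    # Toy AOR mutator: swaps + and - wherever they appear in a binary operation.
    SWAP = {ast.Add: ast.Sub, ast.Sub: ast.Add}

    def visit_BinOp(self, node):
        self.generic_visit(node)
        replacement = self.SWAP.get(type(node.op))
        if replacement is not None:
            node.op = replacement()
        return node

put = ("def choose_num(x, y):\n"
       "    if x > y:\n        return -1\n"
       "    if y % 2 == 0:\n        return y\n"
       "    if x == y:\n        return -1\n"
       "    return y - 1\n")
tree = ast.parse(put)
mutant = ast.unparse(ArithmeticOperatorReplacement().visit(tree))
print(mutant)   # the last statement becomes `return y + 1`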

4.2.3. RQ3: What is the performance of MuTAP for each mutation type?
In this RQ, we evaluate the performance of MuTAP on different mutant types. We report the total number of mutants and the number of mutants killed by each method on the HumanEval dataset in Table 7. We report the performance of Pynguin and MuTAP per mutant type to help the comparison. The total number of mutants of each type is different for each method since the number of problematic PUTs is not the same for all methods. The MS for each type/method indicates the ratio of killed mutants out of the total number of mutants of that type. Our findings indicate that the improvement in the effectiveness of test cases generated by MuTAP is distributed among different types of mutants. The diversity of mutant types is correlated with the PUTs in our dataset. In our dataset, Arithmetic Operator Replacement (AOR), Conditional Operator Insertion (COI), and Relational Operator Replacement (ROR) are the more prevalent types. Conversely, Exception Handler Deletion (EHD) and Exception Handler Swallowing (EXS) are less common in our dataset (an example for each mutant type is shown in Table 1). Although the number of mutants in the EHD and EXS categories is small, both Pynguin and MuTAP with Codex and llama-2-chat faced challenges in generating test cases to detect these types. This limitation may stem from the fact that the majority of test cases (test input/output) generated by Pynguin and MuTAP are designed to assess the standard behavior of the PUT, rather than addressing exceptional or unexpected behavior.

In general, MuTAP shows better or similar performance for all mutant types compared to Pynguin. Consider ASR as an example, where MuTAP shows better performance: test cases generated by Pynguin identified 45 mutants in this category, while test cases generated by MuTAP using llama-2-chat and the few-shot prompt identified 79 mutants in this category (out of 84). Additionally, for another type of mutant, Break Continue Replacement (BCR), there is no improvement in the number of killed mutants when using Codex and llama-2-chat as the LLMC for MuTAP with a zero-shot initial prompt. The number of killed mutants increases by only one additional mutant when using llama-2-chat and a few-shot initial prompt with MuTAP. This result also highlights the limitation of MuTAP in leveraging this type of mutant, during the prompt augmentation step, to improve the effectiveness of test cases in detecting them. For other types of mutants, such as SIR or LCR, MuTAP demonstrates significant improvement compared to Pynguin, with 33.33% vs. 88.89% and 45.16% vs. 90.70%, respectively.

Finding 3: All the different types of mutants contribute to enhancing the effectiveness of test cases generated by MuTAP. However, augmenting the initial prompt with several less frequent types of mutants in MuTAP, such as EHD, EXS, and BCR, does not result in a significant increase in the number of killed mutants compared to Pynguin.

5. Discussion

5.1. Automatic test case generation

MuTAP leverages the code synthesis capabilities of LLMs and employs prompt-based learning to assist developers in generating effective test cases without the need for the computationally expensive fine-tuning of LLMs.

LLMs are able to generate test cases that are more effective than those generated by Pynguin in terms of revealing bugs. Listing 5 shows a sample test case generated by Pynguin for the PUT of our motivating example in Section 2. While Pynguin generates the test case shown in Listing 5 by creating test inputs as random integers and mutating those values to generate new test cases, LLMs produce test cases such as those in Listing 6 that are more natural-looking and correlated with the input/output types and the functionality of the PUT. However, test cases generated by LLMs require post-processing to become more effective in detecting bugs. Our results show that augmenting the prompt with surviving mutants and refining test cases (syntax and intended behavior) helps LLMs generate more effective test cases in terms of fault detection.
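Effectiveness in fault detection is measured through the MS of a unit test, which can be computed along the following lines. The exec-based harness shown here is a simplified stand-in for MuTAP's evaluation step and assumes the generated test function is named test().

def mutation_score(test_source: str, put_source: str, mutant_sources: list) -> float:
    # MS = percentage of mutants on which the test fails ("is killed"),
    # given that the test passes on the original PUT.
    def passes(program: str) -> bool:
        scope: dict = {}
        try:
            exec(program + "\n" + test_source, scope)   # define the PUT (or mutant) and test()
            scope["test"]()                             # run the assertions
            return True
        except Exception:                               # AssertionError or crash -> killed
            return False

    if not passes(put_source):
        raise ValueError("the unit test must pass on the original PUT")
    if not mutant_sources:
        return 100.0
    killed = sum(1 for m in mutant_sources if not passes(m))
    return 100.0 * killed / len(mutant_sources)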


Fig. 3. The impact of using surviving mutants in different random orders on the MS. Results are averaged over 5 runs; the line shows the mean and the shaded area shows the standard error. Each data point represents the average MS for all PUTs across the five runs in an iteration, wherein the surviving mutants were randomly selected for the prompt augmentation process.
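The aggregation behind Fig. 3 can be reproduced with a few lines of Python: for every augmentation iteration, the MS values of the five runs are reduced to a mean and a standard error. The numbers below are made up for illustration only.

import statistics

def ms_curve_with_stderr(ms_per_run):
    # ms_per_run: one list per run, holding the average MS after each iteration.
    curve = []
    for iteration_values in zip(*ms_per_run):
        mean = statistics.mean(iteration_values)
        stderr = statistics.stdev(iteration_values) / len(iteration_values) ** 0.5
        curve.append((mean, stderr))
    return curve

runs = [[62.1, 74.0, 81.5], [60.8, 75.2, 80.9], [63.4, 73.1, 82.0],
        [61.0, 74.8, 81.1], [62.6, 73.9, 81.7]]      # 5 runs x 3 iterations (illustrative)
print(ms_curve_with_stderr(runs))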

Regarding the problematic PUTs, in the LLM baseline (before-refining), the presence of problematic PUTs may be attributed to both syntax and behavioral errors in the model's output. In MuTAP, for instance, when it employs llama-2-chat as its LLMC and applies the refining step, it successfully generates test cases for all PUTs, showing the model's ability to handle diverse constructs. Conversely, for Pynguin, its limitation in handling PUTs that incorporate specific constructs, such as generators and iterators [41], potentially influences the number of problematic PUTs. In calculating the average MS, we have already excluded PUTs for which Pynguin, the baselines, and MuTAP could not generate test cases. However, this limitation is reflected in the number of killed mutants, since we report the number of killed mutants against the total number of mutants (1260) for both PUTs with test cases and those without. We contend that reporting various metrics (percentage of problematic PUTs, MS, number of killed mutants, and percentage of tasks with MS = 100%) offers a fair comparison between the different methods, highlighting the effectiveness of LLMs in test case generation for PUTs with different constructs. This is in contrast to a sophisticated tool like Pynguin, which, despite its intricate algorithm, encounters difficulties in handling specific programming constructs, leaving a different number of PUTs without test cases.

Developers can use MuTAP to generate test cases that are effective in terms of fault detection, with the help of LLMs. Additionally, MuTAP can be integrated into the test generation component of GitHub Copilot Labs [53] to suggest more effective test cases to developers. Since the mutants can be generated automatically, prompt augmentation can be applied without human engagement.

def test_case_0():
    int_0 = -2973
    int_1 = 815
    bool_0 = module_0.any_int(int_0, int_0, int_1)
    assert bool_0 is False

Listing 5: A sample test case generated by Pynguin for the PUT in the motivating example presented in Fig. 1.

5.2. Prompting

In the zero-shot initial prompt, one of the fixed instructions, INS_2, indicates a fixed function name for the test function. The primary reason behind using the name "def test()" in the initial zero-shot prompt for the test function is twofold. Firstly, it serves as a hint to the model, indicating our expectation for a set of test cases organized within a function. Secondly, it facilitates the automation of experiments and provides an identifier or tag for automatically collecting the entire test function from the output generated by the model. Despite this, given that we already incorporate the PUT with its name in the initial prompt, it signals the model to generate a test function specifically for the mentioned PUT. In some instances, we have observed that the model (i.e., llama-2-chat) even replaced the simple test function name provided in the prompt with a more readable name of the form "test_PUT_Name()", such as "test_any_int()". To evaluate the impact of allowing the LLM to determine the function name, we randomly sampled 20% of the programming tasks from the HumanEval dataset, resulting in 33 sample cases. We omitted the test function name from the initial zero-shot prompt in INS_2, subsequently executed it using llama-2-chat for these sample PUTs, and applied a Mann–Whitney U-test with a significance level of 5% (alpha = 0.05). The result of the test showed no significant difference in the MS or the effectiveness of test cases following this modification in the initial prompt (MS = 86.22% and standard deviation = 19.34% on the sample set). While this preliminary analysis shows that using a predefined name as the test function name in the prompt does not affect the effectiveness of test cases in revealing bugs, the identifiers suggested by the LLM are more readable and relevant to the functionality of the PUTs. However, a comprehensive study involving human subjects is needed to validate this result.

To address syntax errors in test cases through re-prompting the LLMC, we used a fixed instruction (INS_fix), similar to the approach proposed by Zhang et al. [35], to leverage the LLM for syntax error fixing. If the model fails to fix the syntax error, MuTAP omits the line containing the error, based on the suggestion provided by Lemieux et al. [16]. This is because our observations indicate that the primary cause of syntax errors in the test cases generated by the model is often the last line of the test function, when the model is unable to complete it. We observed that re-prompting the LLMC to address syntax errors sometimes resolves the error line by completing the last line of the test function, but it introduces new lines, with one of them remaining incomplete. This primarily contributes to the lower rate of syntax error repairs by the LLM in our results. Consequently, we finalize our syntax fixing step by omitting the error line (i.e., the incomplete line in the test function).

In addition, for INS_3 and INS_4, the prompt serves as an instruction that calls the LLMC to generate a test case capable of killing the surviving mutant included in the prompt. As illustrated in Listing 6, the LLMC is able to explain the differences between the surviving mutant and the PUT, pinpoint the specific line containing the bug, and subsequently produce a test case that detects that difference. However, including information about the buggy line of the mutant in the prompt could potentially enhance the effectiveness of the prompt augmentation step.

5.3. Execution time

The open-access API of Codex has a limit on the number of requests (20 per minute) and the number of tokens (40,000 per minute). For this reason, our experiment needs to stop calling the API once in a while in order to not exceed the limit. As a result, we present the processing time analysis using llama-2-chat. The overall processing time of MuTAP on the HumanEval dataset while using llama-2-chat is on average 39.75 s per task with zero-shot learning (with a min of 16.16 and a max of 56.66 s) and 42.11 s with the few-shot prompt (with a min of 18.2 and a max of 64.2 s). This includes, on average, building the initial prompts and calling them on the LLMC (10.26 s), syntax fixing including calling the syntax fixing prompt on the LLMC (10.3 s), intended behavior repair (0.38 s), MS calculation (1.7 s), creating augmented prompts and calling them on the LLM (12.05 s), and greedy optimization (1.4 s). It is noteworthy that, following the prompt augmentation step, MuTAP must reiterate the syntax fixing, intended behavior repair, and greedy steps, which are already included in the overall processing time. Among all steps of MuTAP, the most time-consuming ones are those that entail inferring output from the LLM. Conversely, the overall processing time on the same dataset with Pynguin, to complete searching the required space, is on average 44.16 s, with a min of 2.7 s and a max of 10 min, which is the default timeout of the tool.

5.4. The benefit of dialog LLM

Our findings indicate that the dialog setup of llama-2-chat provides MuTAP with the flexibility to assign distinct roles to each component of the augmented prompt. For instance, by assigning the IUT to an assistant role during the prompt augmentation process, the likelihood of repeating the initial tests in the generated output is reduced, while the chance of generating new test cases for detecting surviving mutants is increased. Listing 6 illustrates an example of how llama-2-chat effectively synthesizes the difference between the PUT and one of its surviving mutants, explains the differences, and subsequently generates a new test case to detect the fault.

# ###########
// prompt
# ###########
{ "role": "system", "content": "You are a Python coding assistant. Always answer with Python code." },
{ "role": "user", "content":
# Generate test cases for the following code:
```python
def choose_num(x, y):
    if x > y:
        return -1
    if y % 2 == 0:
        return y
    if x == y:
        return -1
    return y - 1
```
},

{ "role": "assistant", "content":
```python
#test cases
def test():
    assert choose_num(6, 10) == 10
    assert choose_num(11, 6) == -1
    assert choose_num(8, 12) == 12
```
},

{ "role": "user", "content":
# The test function, test(), cannot detect the fault in the following code:
```python
def choose_num(x, y):
    if x > y:
        return -1
    if y % 2 == 0:
        return y
    if x != y:
        return -1
    return y - 1
```
# Provide a new test case to detect the fault in the prior code:
```python
#test case
def test():
    assert
}

# #################################
// Generated output of llama-2-chat
# #################################
"Here is a new test case to test the fault in the previous code:"
choose_num(5, 5) == -1

```python
def test():
    assert choose_num(5, 5) == -1
```
"This test case checks for the scenario where both x and y are equal. In the modified code, the condition if x != y will evaluate to False, and the expected result is -1. However, in the original code, the condition if x == y will evaluate to True, and the incorrect result would be y - 1, which is 4 instead of -1."

Listing 6: A sample of an augmented prompt and the inference output on llama-2-chat.

5.5. Evaluation metrics

Prior studies [18,25,37] that involve the generation of assertions through LLMs have employed the "exact match" as one of their evaluation metrics. Exact match calculates the percentage of test cases generated by LLMs (the inferred output) that lexically match the ground truth test cases (the expected output). However, CEDAR [37] has already discussed the inadequacy of exact match as a suitable metric for assessing assertions produced by LLMs. The reason is that the model often generates assertions that are semantically correct but may not precisely match the ground truth. In our study, MuTAP executed each test case, including its assertions, both on the PUT and on its mutants to assess their correctness and effectiveness, reporting their MS. MS is a metric frequently used in prior studies and it serves as an effective metric for evaluating the quality of the test oracle [3]. While, in this paper, we focus on improving the effectiveness of test cases in terms of fault detection, there are other metrics, such as test coverage, that can assess other quality aspects of a test case. Improving MS does not necessarily lead to good coverage, and test coverage is weakly correlated with the efficiency of tests in fault detection [26] and is

challenged as a measure of test effectiveness in revealing faults [27,28], which can make it challenging for our proposed method to perform well on both metrics [30,43].

Furthermore, other researchers have reported that approximately 60% of the test cases generated by Codex encounter compilation issues due to syntax errors [54]. The incorporation of the syntax correction and intended behavior repair steps in our proposed method, MuTAP, significantly enhances the utility of the tests generated by LLMs.

5.6. Surviving mutants

We augment the prompt at each iteration for each PUT with a single surviving mutant. The average numbers of mutants for all PUTs in HumanEval and Refactory are 6.6 and 4.2, and the average numbers of surviving mutants are 3.6 and 1.8, respectively. Using a combination of surviving mutants to augment the prompt could impact the speed of reaching 100% MS. However, not all surviving mutants used in prompt augmentation contribute to improving MS; sometimes new test cases that address one mutant can also kill the remaining surviving mutants.

6. Threats to validity

Internal validity. In this study, we employed two different prompt-based learning techniques: zero-shot and few-shot. However, we did not explore the potential impact of altering the natural language instructions or the demonstrative examples (for few-shot learning) within our prompts. Modifying these instructions or utilizing different demonstrative examples more closely aligned with the PUT's functionality could potentially enhance the results. As demonstrated by our results in RQ2, including the IUT in the prompt during the augmentation steps reduced the instances of unintended behavior in test oracles. Conversely, using, for example, lengthy natural language instructions might potentially have an adverse effect on the results.

We did not integrate additional information about the syntax error in the IUT or the injected bug in the mutants, such as error messages or error lines, into the prompt. It is worth considering that including additional details about the errors or bugs may enhance the LLMC's performance in repairing its initial output.

Additionally, we acknowledge that the greedy algorithm employed in our approach to minimize the number of test oracles might not be the most optimal solution for minimizing test oracles while maximizing MS. However, prior studies [9,43] using the same method to minimize the number of assertions have demonstrated its effectiveness in reducing the number of test oracles within test cases, along with its ease of implementation.

Finally, among the different types of assertions, we only focus on generating primitive ones in this study. Other assertion types can be explored in future studies.

Construct validity. We employ the notions of mutant killability and bug detection as metrics to gauge the effectiveness of test cases, given that the primary objective of testing is to uncover bugs. Coverage has been employed in various other studies to assess test case quality [15,16]. However, it has been demonstrated that, while there exists a correlation between coverage and bug detection, they may not consistently align in ranking different testing strategies, as observed in the realm of fuzz testing [55].

It is important to note that the bugs present in mutants are artificial and might not directly correspond to real-world faults. To address this concern, we have employed the Refactory [31] dataset, a bug-repairing dataset that contains real faulty programs developed by students.

External validity. For our experiments, we used two datasets containing Python programming tasks, which could potentially pose external challenges to the validity of our findings. The requirement for executable Python programs is essential to run the generated tests against both the accurate and buggy versions (real or mutated) of the PUT, and this consideration guided our choice of datasets. However, since we did not make any specific assumptions while selecting the datasets, our results can be extended to other Python programs.

MuTAP, like some other well-known test case generation techniques [9,41], operates under the assumption that the PUT is not buggy. However, this assumption can introduce limitations in generating test cases when the correct PUT is not accessible. Future studies can focus on leveraging LLMs to generate test cases even in the presence of bugs in PUTs.

Finally, it should be acknowledged that the technique proposed and the evaluations conducted in this paper are conceptually adaptable to languages beyond Python. However, the current implementation of MuTAP is tailored to Python programs, meaning our existing results cannot be extended to cover other programming languages.

Reliability validity. For the purpose of enabling other researchers to replicate or expand upon our study, we provide a replication package [32]. However, the ongoing enhancement of LLMs could potentially pose a challenge to achieving an exact replication of our results.

7. Related work

Bareiß et al. [12] studied the impact of few-shot learning across various downstream tasks, including test case and test oracle generation. They compared the performance of few-shot learning with automatic test generation tools. The investigation was conducted on a set of Java methods sourced from different benchmarks. The outcomes indicated that LLMs possess the capability to generate test cases and test oracles that exactly match (in lexical terms) the ground truth tests within the benchmark projects. Furthermore, their test coverage was found to be comparable with test cases generated by automatic test generation tools.

Schäfer et al. [15] undertook an effort to generate test cases by prompting Codex. Their investigation concentrated on 25 JavaScript packages. The prompt in their study encompassed the implementation of the PUT and also usage examples of APIs extracted from documentation. In instances where a test case proved unsuccessful on the PUT, their method incorporated the encountered error message into the prompt and re-prompted Codex. Their findings demonstrated that the process of enhancing the prompt with such additional information facilitated Codex in producing correct test cases with sufficient coverage.

LIBRO [56] used issue reports (both title and body) as few-shot prompts to generate bug-reproducing test cases. The final test cases were incorporated into appropriate test classes and ranked based on their validity. The results revealed an enhancement in generating correct test cases to reproduce bugs compared to state-of-the-art tools.

CEDAR [37], rather than employing fixed demonstrative examples in few-shot learning, aimed to retrieve demonstrative examples related to each PUT and incorporate them into the prompt. They assessed their method based on the lexical match, termed "exact match", between generated assertions and the ground truth in a benchmark. While their proposed approach demonstrates enhanced performance in achieving exact matches between assertions and the ground truth, it necessitates an extensive pool of code samples for the selection of appropriate demonstrative examples for each PUT.

ATHENATEST [22] employed the BART transformer model [23], which they fine-tuned using a collection of Java functions and their corresponding tests. They reported test coverage comparable to that of EvoSuite [9] upon evaluating the generation of test cases for five Java projects.

TOGA [18] engaged in fine-tuning CodeBERT using the PUT's docstring along with the prefix of a test case featuring a masked assertion. Their goal was to synthesize the assertion. Subsequently, they formulated the whole test oracles by incorporating a test oracle grammar and generating a set of assertions. This set was then subjected to ranking through a neural network ranker based on their lexical match to


ground truth test oracles. Although they reported results akin to those of EvoSuite [9] in bug detection, their focus is only on synthesizing the assertions. However, synthesizing an assertion is not the main challenge; generating effective and meaningful test oracles poses a significant challenge.

CODAMOSA [16] combined the test cases generated by Codex with those derived from Pynguin in cases where Pynguin's test case generation halted and failed to enhance test coverage. CODAMOSA achieves higher test coverage on various Python benchmarks compared to Pynguin. It is worth noting that, akin to other studies, CODAMOSA concentrated solely on test coverage improvement, and its generated test cases lacked assertion oracles for bug detection within programs.

Two additional studies employed Codex to simultaneously generate code and corresponding test cases based on a given problem description. Subsequently, they used these test cases to filter out buggy suggestions produced by Codex [13,14]. For code generation, they employed the problem description as a prompt, and for test case generation, they used the same problem description along with the PUT and a natural language instruction.

Although prior research has explored diverse strategies for generating test cases using LLMs like Codex and assessed them in terms of test coverage or lexical match with ground truth tests, none of these studies specifically focused on leveraging MT to enhance the effectiveness of the generated test cases.

8. Conclusion

In this paper, we proposed MuTAP as a means of improving and evaluating the ability of pre-trained LLMs to generate effective test cases. MuTAP first prompts its LLMC to generate test cases using zero-shot and few-shot learning. After identifying and correcting any potential syntax and return value errors in the generated test cases, MuTAP evaluates their effectiveness by conducting MT. Then, it uses the surviving mutants of each PUT, if any, as well as the initial inadequate test case, to augment the initial prompt. It re-prompts its LLMC using the augmented prompt to regenerate new test cases that are capable of detecting the surviving mutants.

We assessed the effectiveness of the test cases generated by LLMs in identifying bugs in real and synthetic buggy programs. On average, test cases generated by MuTAP successfully identify 86.72% of the buggy code in a bug-repairing benchmark when using the LLM designed for code generation, Codex. When employing the LLM with the dialog setup, llama-2-chat, MuTAP further improves its performance, detecting 94.06% of the buggy code and outperforming both an automatic test generation tool and zero-shot and few-shot learning techniques on LLMs. This underscores the advantage of employing LLMs as the core of an automatic test generation tool, as conventional automatic generation tools such as Pynguin lack access to the insights embedded in surviving mutants.

Although the current version of MuTAP employs two different LLMs to generate test cases for Python programs, its design and evaluation methodology are fundamentally adaptable to various programming languages and models. Therefore, as future work, it can easily be expanded to encompass other programming languages or to incorporate new LLMs. Future studies can also explore the relationship between modifying the structure and specifics of the prompt and how this could potentially impact the effectiveness of the generated unit tests.

CRediT authorship contribution statement

Arghavan Moradi Dakhel: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Amin Nikanjam: Writing – review & editing, Writing – original draft, Validation. Vahid Majdinasab: Validation, Conceptualization. Foutse Khomh: Writing – review & editing, Supervision, Conceptualization. Michel C. Desmarais: Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Replication package link: https://github.com/ExpertiseModel/MuTAP.

Acknowledgments

This work is partially supported by the Canada Research Chair, Fonds de Recherche du Québec (FRQ), the Canadian Institute for Advanced Research (CIFAR), and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Appendix

To compare the effectiveness of MuTAP to other methods at the PUT level, we employ the Vargha–Delaney Â12 effect size [57], similar to previous studies that applied effect size at the PUT level [41,48], and calculate the effect size metric on the MS of unit tests generated in 10 runs per PUT with each method.

Fig. 4 presents the distribution of the Vargha–Delaney Â12 effect size [57] for the different methods on the MS of unit tests generated in 10 runs per PUT in the HumanEval dataset, all compared to MuTAP using llama-2-chat as its best performing LLMC. We compared the effect size of MS at the PUT level between MuTAP with its best-performing LLMC configuration (llama-2-chat) and the other comparable methods in our study. The following numbers report where MuTAP (using llama-2-chat) performs better (Â12 > 0.5) or worse (Â12 < 0.5) than the other comparable methods. In this comparison, we excluded the improvement achieved by MuTAP over the problematic PUTs of the other comparable methods.

Fig. 4. Distribution of the effect size (Â12) for each PUT on MS values across the various methods in comparison to MuTAP, using its best performing LLMC, llama-2-chat. Â12 > 0.5 signifies that MuTAP outperforms the other alternative method, whereas Â12 < 0.5 indicates the opposite. Â12 = 0.5 indicates that there is no statistical difference between MuTAP and the alternative method.

MuTAP with zero-shot performed better than Pynguin on 61 PUTs and worse on 20 PUTs. MuTAP with the same setup performed better than before-refining on 55 PUTs and worse on 0 PUTs, while it works better than after-refining on 29 PUTs and worse on 0 PUTs. When MuTAP used llama-2-chat as its LLMC with a few-shot initial prompt, it performed better than Pynguin on 66 PUTs and worse on 17 PUTs. MuTAP with few-shot works better than before-refining on 63 PUTs, better than after-refining on 40 PUTs, and worse on 0 PUTs for both


comparable methods. The reason behind having no PUT where MuTAP performs worse than before-refining and after-refining is that MuTAP improves the test cases of before-refining and after-refining by applying the augmentation step and adding test cases that are more effective in revealing bugs to the initial test cases. Thus, test cases generated by MuTAP do not perform worse than the initial test cases generated by the LLMC.

References

[1] J. Shore, S. Warden, The Art of Agile Development, second ed., O'Reilly, 2021.
[2] S. Siddiqui, Learning Test-Driven Development, O'Reilly, 2021.
[3] T. Xie, Augmenting automatically generated unit-test suites with regression oracle checking, in: ECOOP 2006–Object-Oriented Programming: 20th European Conference, Nantes, France, July 3-7, 2006, Proceedings 20, Springer, 2006, pp. 380–403.
[4] M. Selakovic, M. Pradel, R. Karim, F. Tip, Test generation for higher-order functions in dynamic languages, Proc. ACM Program. Lang. 2 (OOPSLA) (2018) 1–27.
[5] E. Arteca, S. Harner, M. Pradel, F. Tip, Nessie: automatically testing JavaScript APIs with asynchronous callbacks, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1494–1505.
[6] K. Sen, D. Marinov, G. Agha, CUTE: A concolic unit testing engine for C, ACM SIGSOFT Softw. Eng. Notes 30 (5) (2005) 263–272.
[7] P. Godefroid, N. Klarlund, K. Sen, DART: Directed automated random testing, in: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005, pp. 213–223.
[8] G. Fraser, A. Arcuri, Evolutionary generation of whole test suites, in: 2011 11th International Conference on Quality Software, IEEE, 2011, pp. 31–40.
[9] G. Fraser, A. Arcuri, EvoSuite: automatic test suite generation for object-oriented software, in: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 416–419.
[10] A. Panichella, S. Panichella, G. Fraser, A.A. Sawant, V.J. Hellendoorn, Revisiting test smells in automatically generated tests: limitations, pitfalls, and opportunities, in: 2020 IEEE International Conference on Software Maintenance and Evolution, ICSME, IEEE, 2020, pp. 523–533.
[11] F. Palomba, D. Di Nucci, A. Panichella, R. Oliveto, A. De Lucia, On the diffusion of test smells in automatically generated test code: An empirical study, in: Proceedings of the 9th International Workshop on Search-Based Software Testing, 2016, pp. 5–14.
[12] P. Bareiß, B. Souza, M. d'Amorim, M. Pradel, Code generation tools (almost) for free? a study of few-shot, pre-trained language models on code, 2022, arXiv preprint arXiv:2206.01335.
[13] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, W. Chen, CodeT: Code generation with generated tests, 2022, arXiv preprint arXiv:2207.10397.
[14] S.K. Lahiri, A. Naik, G. Sakkas, P. Choudhury, C. von Veh, M. Musuvathi, J.P. Inala, C. Wang, J. Gao, Interactive code generation via test-driven user-intent formalization, 2022, arXiv preprint arXiv:2208.05950.
[15] M. Schäfer, S. Nadi, A. Eghbali, F. Tip, Adaptive test generation using a large language model, 2023, arXiv preprint arXiv:2302.06527.
[16] C. Lemieux, J.P. Inala, S.K. Lahiri, S. Sen, CODAMOSA: Escaping coverage plateaus in test generation with pre-trained large language models, in: 45th International Conference on Software Engineering, ICSE, 2023.
[17] M. Chen, J. Tworek, H. Jun, Q. Yuan, H.P.d.O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, 2021, arXiv preprint arXiv:2107.03374.
[18] E. Dinella, G. Ryan, T. Mytkowicz, S.K. Lahiri, TOGA: a neural method for test oracle generation, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 2130–2141.
[19] C.B. Clement, D. Drain, J. Timcheck, A. Svyatkovskiy, N. Sundaresan, PyMT5: multi-mode translation of natural language and Python code with transformers, 2020, arXiv preprint arXiv:2010.03150.
[20] M. Tufano, D. Drain, A. Svyatkovskiy, S.K. Deng, N. Sundaresan, Unit test case generation with transformers and focal context, 2020, arXiv preprint arXiv:2009.05617.
[21] A. Moradi Dakhel, V. Majdinasab, A. Nikanjam, F. Khomh, M.C. Desmarais, Z.M.J. Jiang, GitHub Copilot AI pair programmer: Asset or liability? J. Syst. Softw. 203 (2023) 111734, http://dx.doi.org/10.1016/j.jss.2023.111734.
[22] M. Tufano, D. Drain, A. Svyatkovskiy, S.K. Deng, N. Sundaresan, Unit test case generation with transformers and focal context, 2021, arXiv preprint arXiv:2009.05617.
[23] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019, arXiv preprint arXiv:1910.13461.
[24] S. Lukasczyk, F. Kroiß, G. Fraser, An empirical study of automated unit test generation for Python, Empir. Softw. Eng. 28 (2) (2023) 36.
[25] M. Tufano, D. Drain, A. Svyatkovskiy, N. Sundaresan, Generating accurate assert statements for unit test cases using pretrained transformers, in: Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test, 2022, pp. 54–64.
[26] X. Cai, M.R. Lyu, The effect of code coverage on fault detection under different testing profiles, in: Proceedings of the 1st International Workshop on Advances in Model-Based Testing, A-MOST '05, ACM, New York, NY, USA, 2005, pp. 1–7, https://doi.org/10.1145/1083274.1083288.
[27] R. Gopinath, C. Jensen, A. Groce, Code coverage for suite evaluation by developers, in: Proceedings of the 36th International Conference on Software Engineering, 2014, pp. 72–82.
[28] H. Hemmati, How effective are code coverage criteria? in: 2015 IEEE International Conference on Software Quality, Reliability and Security, IEEE, 2015, pp. 151–156.
[29] Y. Jia, M. Harman, An analysis and survey of the development of mutation testing, IEEE Trans. Softw. Eng. 37 (5) (2011) 649–678, http://dx.doi.org/10.1109/TSE.2010.62.
[30] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, M. Harman, Mutation testing advances: an analysis and survey, in: Advances in Computers, vol. 112, Elsevier, 2019, pp. 275–378.
[31] Y. Hu, U.Z. Ahmed, S. Mechtaev, B. Leong, A. Roychoudhury, Re-factoring based program repair applied to programming assignments, in: 2019 34th IEEE/ACM International Conference on Automated Software Engineering, ASE, IEEE, 2019, pp. 388–398.
[32] A. Moradi Dakhel, A. Nikanjam, V. Majdinasab, F. Khomh, M.C. Desmarais, The replication package, 2023, https://github.com/ExpertiseModel/MuTAP.
[33] A. Arcuri, G. Fraser, Parameter tuning or default values? An empirical investigation in search-based software engineering, Empir. Softw. Eng. 18 (2013) 594–623.
[34] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Comput. Surv. 55 (9) (2023) 1–35.
[35] J. Zhang, J. Cambronero, S. Gulwani, V. Le, R. Piskac, G. Soares, G. Verbruggen, Repairing bugs in Python assignments using large language models, 2022, arXiv preprint arXiv:2209.14876.
[36] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., CodeBERT: A pre-trained model for programming and natural languages, 2020, arXiv preprint arXiv:2002.08155.
[37] N. Nashid, M. Sintaha, A. Mesbah, Retrieval-based prompt selection for code-related few-shot learning, 2023.
[38] T. Brown, B. Mann, N. Ryder, M. Subbiah, J.D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Adv. Neural Inf. Process. Syst. 33 (2020) 1877–1901.
[39] T. Ahmed, P. Devanbu, Few-shot training LLMs for project-specific code-summarization, 2022, arXiv preprint arXiv:2207.04237.
[40] H. Joshi, J. Cambronero, S. Gulwani, V. Le, I. Radicek, G. Verbruggen, Repair is nearly generation: Multilingual program repair with LLMs, 2022, arXiv preprint arXiv:2208.11640.
[41] S. Lukasczyk, G. Fraser, Pynguin: Automated unit test generation for Python, in: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 168–172.
[42] K. Hałas, MutPy: a mutation testing tool for Python 3.x source code, 2019, https://github.com/mutpy/mutpy.
[43] G. Fraser, A. Zeller, Mutation-driven generation of unit tests and oracles, in: Proceedings of the 19th International Symposium on Software Testing and Analysis, 2010, pp. 147–158.
[44] T. Dybå, V.B. Kampenes, D.I. Sjøberg, A systematic review of statistical power in software engineering experiments, Inf. Softw. Technol. 48 (8) (2006) 745–755.
[45] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, Academic Press, 2013.
[46] A. Arcuri, Test suite generation with the many independent objective (MIO) algorithm, Inf. Softw. Technol. 104 (2018) 195–206.
[47] A. Panichella, F.M. Kifetew, P. Tonella, Reformulating branch coverage as a many-objective optimization problem, in: 2015 IEEE 8th International Conference on Software Testing, Verification and Validation, ICST, IEEE, 2015, pp. 1–10.
[48] A. Panichella, F.M. Kifetew, P. Tonella, Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets, IEEE Trans. Softw. Eng. 44 (2) (2017) 122–158.
[49] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, 2023, arXiv preprint arXiv:2307.09288.
[50] D. Shrivastava, H. Larochelle, D. Tarlow, Repository-level prompt generation for large language models of code, 2022, arXiv preprint arXiv:2206.12839.
[51] Y. Hu, U.Z. Ahmed, S. Mechtaev, B. Leong, A. Roychoudhury, Refactory, 2023, https://github.com/githubhuyang/refactory.
[52] S. Cass, Top programming languages 2020, 2020, https://spectrum.ieee.org/top-programming-language-2020.
[53] I. Alvarado, I. Gazit, A. Wattenberger, GitHub Copilot Labs, 2023, https://githubnext.com/projects/copilot-labs/.
[54] M.L. Siddiq, J. Santos, R.H. Tanvir, N. Ulfat, F.A. Rifat, V.C. Lopes, Exploring the effectiveness of large language models in generating unit tests, 2023, arXiv preprint arXiv:2305.00418.
[55] M. Böhme, L. Szekeres, J. Metzman, On the reliability of coverage-based fuzzer benchmarking, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1621–1633.
[56] S. Kang, J. Yoon, S. Yoo, Large language models are few-shot testers: Exploring LLM-based general bug reproduction, 2022, arXiv preprint arXiv:2209.11515.
[57] A. Vargha, H.D. Delaney, A critique and improvement of the CL common language effect size statistics of McGraw and Wong, J. Educ. Behav. Stat. 25 (2) (2000) 101–132.
