APRMCTS: Improving LLM-based Automated Program Repair with Iterative Tree Search

Haichuan Hu, Quanjun Zhang
Nanjing University of Science and Technology

Abstract

Automated Program Repair (APR) attempts to fix software bugs without human intervention and plays a crucial role in software development and maintenance. Recently, with the advances in Large Language Models (LLMs), a rapidly increasing number of APR techniques have been proposed with remarkable performance. However, existing LLM-based APR techniques typically adopt trial-and-error strategies, which suffer from two major drawbacks: (1) inherently limited patch effectiveness due to local exploration, and (2) low search efficiency due to redundant exploration. In this paper, we propose APRMCTS, which uses iterative tree search to improve LLM-based APR. APRMCTS incorporates Monte Carlo Tree Search (MCTS) into patch searching by performing a global evaluation of the explored patches and selecting the most promising one for subsequent refinement and generation. APRMCTS thereby avoids falling into local optima and improves the efficiency of patch searching. Our experiments on 835 bugs from Defects4J demonstrate that, when integrated with GPT-3.5, APRMCTS can fix a total of 201 bugs, outperforming all state-of-the-art baselines. Besides, APRMCTS helps GPT-4o-mini, GPT-3.5, Yi-Coder-9B, and Qwen2.5-Coder-7B fix 30, 27, 37, and 28 more bugs, respectively. More importantly, APRMCTS achieves a significant performance advantage while using small patch sizes (16 and 32), notably fewer than the 500 and 10,000 patches adopted in previous studies. We also conduct preliminary experiments on SWE-bench Lite, and the results show that APRMCTS can fix 164 of the 300 bugs, demonstrating its potential across a wide range of real-world defect datasets (e.g., SWE-bench). In terms of cost, compared to existing LLM-based APR methods, APRMCTS takes less time and reduces monetary costs by over 50%. Our extensive study demonstrates that APRMCTS exhibits good effectiveness and efficiency, with particular advantages in addressing complex bugs.


1 Introduction

Automated Program Repair (APR) attempts to fix buggy programs by automatically generating patches zhang2023survey. A typical APR process involves two main steps: (1) generating plausible patches that pass all test cases, and (2) verifying the correctness of these patches through manual inspection. Traditional APR techniques can be generally classified into three categories: template-based 44; 45, heuristic-based Simfix:2018; 10.1145/3468264.3468600, and constraint-based 196; 106. Among them, template-based APR leverages well-designed templates to match buggy code patterns, and is widely regarded as state-of-the-art. Despite its effectiveness, template-based APR is inherently constrained by its dependency on predefined templates, which limits its ability to handle previously unseen software bugs.

Over the past few years, researchers have introduced a large number of learning-based approaches, which utilize deep learning to enhance repair capabilities by extracting bug-fixing patterns from existing code repositories yuan2022circle. Compared to traditional APR, learning-based APR demonstrates superior generalization, enabling it to address bugs that are not present in the training data. Recently, with the rapid advancements of Large Language Models (LLMs) in software engineering tasks zhang2023surveyse (e.g., unit testing zhang2025large; shang2025large; zhang2025exploring), numerous LLM-based APR techniques have emerged zhang2024survey. Hossain et al. DBLP:journals/pacmse/Hossain0Z0CLNT24 comprehensively discuss the impact of various prompts and contexts on the effectiveness of LLM-based APR. ChatRepair DBLP:conf/issta/Xia024 uses GPT-3.5 to fix a total of 162 bugs on Defects4J DBLP:conf/issta/JustJE14, making it one of the most representative LLM-based methods. Other studies DBLP:conf/icse/XiaWZ23; zhang2024pre further demonstrate the effectiveness of LLMs in different repair scenarios, such as programming problems.

However, existing state-of-the-art LLM-based APR techniques typically follow a serial, single-path trial-and-error strategy, where a candidate patch is generated, validated against test cases, and then refined based on the test outcomes. While straightforward, this strategy may suffer from two key limitations: local optima in effectiveness and redundant exploration in efficiency. First, it lacks the ability to leverage historical search information, making the repair process prone to getting trapped in local optima. Second, it generates patches in an unstructured and memoryless manner, often resulting in redundant or near-duplicate patches and inefficient use of computational resources. These limitations hinder the model’s capacity to explore promising regions of the search space and adapt its repair strategy based on prior attempts. As a result, current methods often struggle to efficiently discover high-quality patches, especially for complex bugs.

To address these issues, we propose APRMCTS, which helps improve LLM-based APR by utilizing a multi-round iterative tree search method combined with CoT and self-evaluation to generate patches. Unlike the trial-and-error repair paradigm, APRMCTS adopts an evaluate-and-improve approach to guide the model toward the correct repair path. Through effective global patch evaluation, APRMCTS can rapidly identify erroneous paths, backtrack to earlier promising candidates, and gradually converge toward the correct patch. For APRMCTS, each iteration of patch search can be divided into four stages: Patch Selection, Patch Generation, Patch Evaluation, and Patch Tree Updating. In the patch selection stage, APRMCTS first selects an explored patch from the patch tree according to the UCT (Upper Confidence Bounds Applied to Trees) value. Then, in the patch generation stage, APRMCTS prompts LLMs to perform repairs on the selected patch through CoT, and further conducts self-reflection on the generated patches. In the patch evaluation stage, the generated patches are validated for correctness on test cases. For those patches that fail the tests, APRMCTS assesses their quality and then adds them to the patch tree. Specifically, we adopt LLM-as-Judge and Test-as-Judge strategies adaptively for evaluation based on whether the test cases are sufficient. In the patch tree updating stage, back propagation is performed from the selected patch upwards to the root node of the patch tree. After a certain number of iterations (16 and 32 in our work), APRMCTS outputs all the plausible patches found for patch validation.

Compared with prior LLM-based APR techniques, APRMCTS has the following advantages.

(1) Multi-path + Long-trajectory Search.

• Multi-path. APRMCTS leverages Monte Carlo Tree Search (MCTS), which enables the model to simultaneously investigate multiple paths instead of expending the entire budget on a single, potentially unproductive path. This breadth keeps the search from being trapped in local optima, an outcome especially common when fixing complex bugs.
• Long-trajectory. APRMCTS conducts deep, incremental exploration, steadily converging on a correct patch rather than halting after the first misstep. Such extended trajectories are indispensable for bugs that require multiple rounds of trial‑and‑error to isolate and resolve their root causes.

In terms of results, APRMCTS can fix 201 out of 835 bugs on Defects4J, surpassing all 10 state-of-the-art baselines.

(2) Flexibility and generality. APRMCTS is flexible, as it works seamlessly with any LLM. APRMCTS is also generalizable to different search algorithms. Although we adopt the representative MCTS to demonstrate the effectiveness of APRMCTS, it can be replaced by other search algorithms, such as the beam search mentioned in Section 6.1.

(3) High efficiency. APRMCTS adopts a rigorous patch‑evaluation module to discard low‑quality candidate patches early, so the limited search budget is concentrated on the most promising patches, boosting both repair efficiency and success rate. For example, APRMCTS adopts a smaller patch size (16 and 32) than that used in previous studies (e.g., 10000 DBLP:conf/icse/JiangL021, 500 DBLP:conf/issta/Xia024).

This paper makes the following contributions:

• We propose APRMCTS, which utilizes tree search to optimize the LLM-based APR process, representing a new technological endeavor in the field of APR. APRMCTS offers multiple advantages, such as flexible architecture, preferable effectiveness, and efficiency.
• We evaluate APRMCTS against 10 state-of-the-art baselines (including learning-based, template-based and LLM-based APR techniques) and 13 representative LLMs. Experimental results show that APRMCTS outperforms existing baselines, fixing 108 and 93 bugs on Defects4J-v1.2 and Defects4J-v2, respectively.
• We implement APRMCTS with seven best-performing LLMs. The results show that APRMCTS can fix 20% more bugs compared to vanilla LLMs on average, demonstrating its model-agnostic nature in enhancing the APR capabilities of diverse LLMs.
• We validate the multi-language (Python/Java) and multi-type (Repository/Competition) bug repair capability of APRMCTS on ConDefects. Compared to ChatRepair DBLP:conf/issta/Xia024, we find that APRMCTS is faster and reduces monetary costs by over 50%.

2 Background and Motivation

2.1 Automated Program Repair

Automated Program Repair (APR) aims to assist developers in localizing and fixing program bugs automatically. Traditional APR techniques can be classified as heuristic-based Simfix:2018; 10.1145/3468264.3468600, constraint-based 196; 106 and template-based 44; 45. Modern APR methods, primarily based on deep learning, have improved upon the shortcomings of previous APR methods. Learning-based methods VulRepair; 49; TransR strike a balance between performance and effectiveness while offering stronger generalization capabilities. As part of learning-based methods, Neural Machine Translation (NMT) techniques DBLP:conf/icse/MengWZSLH23; DBLP:conf/icse/ZhuSZXZ23; DBLP:conf/kbse/YeML0M22; DBLP:conf/icse/YeMM22; DBLP:conf/sigsoft/ZhuSXZY0Z21; DBLP:conf/icse/JiangL021 have been extensively studied in recent years. They share the insight that APR can be viewed as an NMT problem that aims to translate buggy code into correct code. LLM-based methods DBLP:conf/sigsoft/XiaZ22; DBLP:conf/iclr/FriedAL0WSZYZL23; DBLP:conf/kbse/XiaDZ23; hu2025repair; hu2024gpto1killbugsevaluation further leverage the code-related capabilities of LLMs to fix bugs through zero-shot or few-shot methods, reducing the dependence on high-quality training datasets. Xia et al. DBLP:conf/icse/XiaWZ23 conducted an extensive study of LLM-based APR techniques based on various LLMs (e.g., Codex DBLP:journals/corr/abs-2107-03374, GPT-NeoX DBLP:journals/corr/abs-2204-06745, CodeT5 DBLP:conf/emnlp/0034WJH21, InCoder DBLP:conf/iclr/FriedAL0WSZYZL23), demonstrating the superiority of LLM-based APR. More recently, ChatRepair DBLP:conf/issta/Xia024 utilizes GPT-3.5 to fix bugs and obtains state-of-the-art results. Our work thoroughly investigates various types of modern LLMs and comprehensively evaluates their capacity to fix bugs.

Building upon this foundation, we draw inspiration from previous works DBLP:journals/tse/GouesNFW12; DBLP:conf/icse/YeM24 and adopt an iterative algorithm to optimize the performance of LLMs on APR. We employ a search-based approach, integrating LLMs with the MCTS algorithm. The method we propose, APRMCTS, can serve as an LLM-based APR framework that suits various LLMs.

2.2 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is used to enhance decision-making capabilities in complex scenarios and has shown significant results in strategy games such as Go. MCTS is a multi-round iterative algorithm in which each round generally involves four key phases DBLP:journals/tciaig/BrownePWLCRTPSC12: selection, which uses the UCT strategy to identify a promising starting point for exploration; expansion, where new nodes are added; evaluation, which scores the newly expanded nodes; and back propagation, which updates the node values based on the evaluation results. Compared to other search methods, such as Depth-First Search (DFS) and Breadth-First Search (BFS), which tend to get trapped in local errors or face a massive search space, MCTS can strike a balance between efficiency and effectiveness. Previous work (MathBlackBox zhang2024accessing) uses MCTS to guide GPT-4 in solving Olympiad-level math problems. Recently, researchers dainese2024generatingcodeworldmodels; li2024rethinkmctsrefiningerroneousthoughts; delorenzo2024makecountllmbasedhighquality have found that MCTS helps improve the efficiency of code-related tasks such as code generation, test generation, and program debugging. Although our work is general to different search algorithms, we implement it using the iterative tree search algorithm MCTS. This approach enables LLMs to iteratively select, search, and evaluate patches, resulting in the generation of a higher number of correct patches at a lower cost.

2.3 Motivation Example

To better illustrate the limitations of existing LLM-based APR methods, we present a motivating example in this section. As shown in Figure 1, we use a real-world bug, Jsoup_54 from Defects4J, and evaluate three typical LLM-based APR methods (i.e., single-path search, genetic algorithm, and sampling) on it. We find that none of the three methods works effectively. Since the order of function call parameters is incorrect, and there are many possible values for the parameters, it is not feasible to find the correct solution through direct sampling within a limited sample size. Single-path search keeps trying to fix the first incorrect patch it generates and ignores other potential solutions. The genetic algorithm also fails because it lacks an effective patch evaluation mechanism to maintain a high-quality patch pool.

Figure 1: Motivation Example of APRMCTS

We further attempt patch search using MCTS and find that it successfully fixes Jsoup_54. This is because MCTS enables the model to select and prioritize search paths. Although the model initially explores incorrect paths, the MCTS algorithm leverages the patch evaluation mechanism to promptly terminate search along those erroneous paths, instead expanding the search scope and ultimately identifying the correct patch. Based on this example, we can observe that although existing APR methods can leverage LLMs to improve repair effectiveness, they still lack efficient patch search strategies to handle complex bugs. In this paper, we employ MCTS combined with well-designed patch evaluation strategies to guide LLMs in efficient patch search.

3 Approach

In this section, we introduce the concepts used in APRMCTS, the overall workflow of APRMCTS and each stage within the process. Figure 2 illustrates the workflow of APRMCTS, which consists of four stages. In the patch selection stage, as detailed in Section 3.2.1, a partial patch is selected from the patch tree with the goal of refining it into a plausible candidate. In the patch generation stage, as detailed in Section 3.2.2, new patches are generated based on the selected partial patch, leveraging Chain-of-Thought (CoT) reasoning and self-reflection techniques to enhance the quality of generated patches. In the patch evaluation stage, as detailed in Section 3.2.3, the generated patches are scored by two evaluation strategies: LLM-as-Judge and Test-as-Judge. In the patch tree updating stage, as detailed in Section 3.2.4, the entire patch tree is updated to reflect the state of all patches.

3.1 Concepts

Before introducing the workflow, we first explain the concepts used in APRMCTS.

• Patch Tree. APRMCTS organizes the explored patches in the form of a patch tree. The root node of the tree is the original bug, which can be considered as a special patch. Newly discovered patches are added to the patch tree as child nodes.
• Parent Patch. If patch $a$ is the parent patch of patch $b$, it means we generate $b$ based on $a$.
• Child Patch. If patch $a$ is the child patch of patch $b$, it means we generate $a$ based on $b$.
• Patch Size. The number of candidate patches generated for a bug.

3.2 Stages & Modules

Given a buggy program, the repair process begins by treating the original buggy code as a special form of patch, which is initialized as the root node of the patch tree. As the repair proceeds, newly generated patches are incrementally added as child nodes to their parent patches within the tree.

3.2.1 Patch Selection

In the patch selection stage, APRMCTS aims to identify the most promising patch from the patch tree, which will then be refined into new candidate patches in subsequent stages. In this work, we use the Upper Confidence Bound for Trees (UCT) as the selection criterion. UCT takes into account both the average quality of child patches and the degree of exploration, thus providing a more comprehensive assessment of a patch’s potential correctness. A higher UCT indicates that starting to search from the corresponding patch is more likely to lead to a plausible patch. In the standard MCTS process, UCT is defined as follows:

$$UCT_{j}=\bar{X}_{j}+C\sqrt{\frac{2\ln N_{C}}{N_{j}}} \tag{1}$$

where $\bar{X}_{j}$ is the average reward of all possible actions, $N_{C}$ is the number of times the parent node has been visited, $N_{j}$ is the number of times the child node $j$ has been visited, and $C$ is a constant that balances exploitation and exploration. During the patch selection stage, APRMCTS calculates the UCT value for each patch and selects the patch with the highest UCT from the existing patch tree.
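The selection rule in Equation 1 can be illustrated with a minimal Python sketch. The `PatchNode` structure, the use of a stored Q-value in place of $\bar{X}_{j}$, the handling of unvisited nodes, and the use of the root's own visit count as $N_{C}$ are all assumptions made for illustration; only the formula and the exploration constant 0.7 come from the text.

```python
import math
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class PatchNode:
    """One explored patch in the patch tree (hypothetical structure)."""
    code: str
    q_value: float = 0.0                 # stands in for the average reward in Eq. 1
    visits: int = 0                      # N_j
    parent: Optional["PatchNode"] = None
    children: List["PatchNode"] = field(default_factory=list)


def uct(node: PatchNode, c: float = 0.7) -> float:
    """Equation 1. Unvisited nodes score +inf so they are tried first (assumption)."""
    if node.visits == 0:
        return math.inf
    # N_C: visits of the parent; the root has no parent, so its own count is used.
    n_c = node.parent.visits if node.parent is not None else node.visits
    return node.q_value + c * math.sqrt(2 * math.log(max(n_c, 1)) / node.visits)


def all_nodes(root: PatchNode) -> Iterator[PatchNode]:
    """Yield every patch in the tree, root included."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def select_patch(root: PatchNode, c: float = 0.7) -> PatchNode:
    """Return the patch with the highest UCT value in the whole tree."""
    return max(all_nodes(root), key=lambda n: uct(n, c))


root = PatchNode("buggy code", q_value=0.0, visits=3)
a = PatchNode("patch A", q_value=0.6, visits=2, parent=root)
b = PatchNode("patch B", q_value=0.5, visits=1, parent=root)
root.children = [a, b]
print(select_patch(root).code)  # patch B: lower Q-value, but explored less often
```

In this toy tree, patch B wins despite its lower Q-value because the exploration term grows as $N_{j}$ shrinks, which is the exploitation/exploration trade-off that $C$ controls.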

3.2.2 Patch Generation

In the patch generation stage, APRMCTS aims to generate new candidate patches based on the partial patch returned by the patch selection stage. To this end, APRMCTS employs a self-refinement strategy that integrates CoT and Self-Reflection, thereby enhancing the quality of the model’s outputs. Specifically, APRMCTS interprets the current state of the bug from the selected partial patch and performs a comprehensive analysis of the buggy lines and the errors reported by the test cases. Based on this analysis, it modifies and refines the partial patch to generate new candidate patches. These newly generated patches may repeat the mistakes of the previously explored partial patches, or fall into a new mistake, thus updating the state of the bug. For a given LLM $\pi$ , the conditional probability distribution of generating a new patch $a^{\prime}$ from a previously explored partial patch $a$ is formalized as follows:

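$$\pi(a^{\prime}|a)=\prod_{k=1}^{K}\pi(a^{\prime}_{k}|a^{\prime}_{<k},a) \tag{2}$$

where $a^{\prime}_{k}$ denotes the $k$-th token of $a^{\prime}$, $a^{\prime}_{<k}$ denotes the tokens preceding it, and $K$ is the number of tokens in $a^{\prime}$.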
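As a small worked example of Equation 2, the probability of a whole patch is the product of its per-token conditional probabilities, which is usually accumulated in log space to avoid underflow. The token probabilities below are made-up numbers, not model outputs, and `patch_log_prob` is a hypothetical helper name.

```python
import math
from typing import Sequence


def patch_log_prob(token_probs: Sequence[float]) -> float:
    """Log of Equation 2: the sum over k of the log-probability of token k."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for a three-token patch.
probs = [0.9, 0.8, 0.5]
print(math.exp(patch_log_prob(probs)))  # approximately 0.9 * 0.8 * 0.5 = 0.36
```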

APR-specific CoT. We design a specialized prompt tailored for the bug repair task to guide the model to articulate its understanding of the bug and its intended approach to repairing the bug. By leveraging CoT, APRMCTS attempts to generate the patch $a^{\prime}$ through a step-by-step process that promotes transparency and structured thinking. This process enables the model to identify and formulate repair actions based on its interpretation of the buggy behavior. Moreover, by incorporating feedback from failed test cases into its CoT, the model can revise or adapt its repair strategy accordingly.

Self-Reflection. After generating a patch $a^{\prime}$ , we further prompt the model to reflect on its output through a self-reflection mechanism. This process encourages the model to critically evaluate the generated patch, identify potential errors, and revise its solution accordingly. By enabling this self-correction step, the model is able to produce higher-quality and more reliable patches.

3.2.3 Patch Evaluation

In the patch evaluation stage, APRMCTS aims to assess the correctness and quality of the patches returned in the previous stage, thus guiding LLMs toward identifying potentially correct patches. After the Patch Generation stage, we execute test cases to verify the correctness of the generated patches $a^{\prime}$ . If a patch passes all test cases, it is marked as a plausible patch and retained for further human inspection. If it fails any test case, it is treated as a partial patch that needs to be refined later, and is added as a new patch node to the existing patch tree for continued exploration. Following this, APRMCTS performs a quality assessment of the generated patches using two evaluation strategies: LLM-as-Judge and Test-as-Judge.

LLM-as-Judge. This strategy utilizes LLMs to score the quality of generated patches in scenarios where test coverage is limited. For example, a significant portion of bugs in the Defects4J dataset are associated with only a single fault-triggering test case. In such cases, relying solely on test outcomes may lead to sparse reward signals, which reduces the accuracy of the evaluation and the effectiveness of the repair process. To address this issue, APRMCTS employs LLM-as-Judge to evaluate patch quality based on semantic and contextual information rather than exclusively on test results. The input to the evaluation model includes test cases, test results, buggy code, candidate patches, surrounding code context, the reasoning trace of CoT, and the reflection output. The raw score generated by the LLM is further normalized under defined constraints to ensure consistency and fairness in reward computation. The final reward $R(a)$ is defined as follows:

$$R(a)=\begin{cases}0,&\text{if }Score(a)\leq 0\\ 1,&\text{if }Score(a)\geq 100\\ \frac{Score(a)}{100},&\text{otherwise}\end{cases} \tag{3}$$

$$\mathbb{E}[R]=\frac{1}{N}\sum_{i=1}^{N}R_{i} \tag{4}$$

To handle edge cases, we design several adjustment strategies. For patches that fail to compile, the reward is set to -1. For patches that are identical to their parent patch, a penalty coefficient of 0.5 is applied to the original reward. Since the scores provided by the LLM fluctuate, we also need to calculate the expected value of $R$. As shown in Equation 4, the expected value of $R$ is obtained by sampling the reward $R$ $N$ times ($N$ is set to 5 in our study) and calculating the average, which helps balance worst-case and average outcomes. The patch $a^{\prime}$ is then encapsulated into a tree node and added to the patch tree. Besides, we adopt a self-evaluation strategy, where the same LLM is used for both patch generation and evaluation. This design choice reduces computational overhead during the tree search process, and our experimental results indicate that self-evaluation contributes positively to the overall effectiveness of the repair strategy.
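The reward computation described by Equations 3 and 4 and the edge-case rules can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, the penalty is assumed to be a multiplication by 0.5, and it is assumed to be applied after averaging the sampled rewards.

```python
from statistics import mean
from typing import Sequence


def normalize_score(score: float) -> float:
    """Equation 3: clamp a raw 0-100 judge score into [0, 1]."""
    return min(max(score / 100.0, 0.0), 1.0)


def llm_judge_reward(scores: Sequence[float],
                     compiles: bool,
                     same_as_parent: bool) -> float:
    """Expected reward from N sampled judge scores (Equation 4) plus edge cases."""
    if not compiles:
        return -1.0                     # compilation failure: fixed reward of -1
    expected = mean(normalize_score(s) for s in scores)
    if same_as_parent:
        expected *= 0.5                 # penalty coefficient (assumed multiplicative)
    return expected


# Five hypothetical judge scores, matching N = 5 in the text.
print(llm_judge_reward([120, 80, 60, -10, 50], compiles=True, same_as_parent=False))
```

The five scores normalize to 1, 0.8, 0.6, 0, and 0.5, so the expected reward is about 0.58; the same patch would receive about 0.29 if it merely repeated its parent.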

Test-as-Judge. This strategy is designed for bug-fixing datasets with sufficient test cases (e.g., ConDefects), where each bug is associated with more than ten test cases that cover a wide range of scenarios and boundary conditions. In this case, also supported by prior works zhang2024appt; DBLP:conf/icse/YeMM22; chen2021evaluatinglargelanguagemodels, we believe that relying on test execution results provides a highly reliable basis for evaluating patch quality. Specifically, as shown in Equation 5, the reward $R$ is computed as the proportion of passed test cases, representing the test pass rate of the candidate patch:

$$R(a)=\frac{|\text{T}_{passed}|}{|\text{T}_{total}|} \tag{5}$$

$$\mathbb{E}[R]=R \tag{6}$$
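Equations 5 and 6 amount to using the test pass rate directly as the reward, with no repeated sampling because test execution is deterministic. A minimal sketch, assuming test outcomes are available as a list of booleans and using the hypothetical name `pass_rate_reward`:

```python
from typing import Sequence


def pass_rate_reward(results: Sequence[bool]) -> float:
    """Equation 5: fraction of test cases that the candidate patch passes."""
    if not results:
        raise ValueError("at least one test case is required")
    return sum(results) / len(results)


# Hypothetical outcome: 7 of 10 test cases pass.
print(pass_rate_reward([True] * 7 + [False] * 3))  # 0.7
```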

3.2.4 Patch Tree Updating

In addition to using $R$ to immediately assess the quality of patches after each generation, we also draw on MCTS and use a Q-value to evaluate the quality of patches throughout the entire search process. The Q-value depends not only on the patch’s own quality $R$ but also on the quality of its child patches. After the reward $R$ is calculated for the generated patches, we update the Q-value of their parent patches using Equation 7:

$$Q^{\prime}(a)=\beta\ \frac{\sum_{j=1}^{n}(Q_{j}\cdot N_{j})}{\sum_{j=1}^{n}N_{j}}+(1-\beta)\ Q(a) \tag{7}$$

where $\beta$ is a forgetting factor that ranges from 0 to 1, $n$ is the number of child patches of $a$, and $Q_{j}$ and $N_{j}$ are the Q-value and visit count of the $j$-th child patch. The closer $\beta$ is to 1, the less the new Q-value is influenced by the old value. In our work, we set $\beta$ to 0.8.
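A minimal sketch of the Q-value update in Equation 7, followed by an upward pass from the selected patch to the root. The `Node` structure is hypothetical, and incrementing the visit count of every node on the path is an assumption borrowed from standard MCTS rather than something the text specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


def updated_q(old_q: float,
              children: Sequence[Tuple[float, int]],
              beta: float = 0.8) -> float:
    """Equation 7. `children` holds one (Q_j, N_j) pair per child patch."""
    total_visits = sum(n for _, n in children)
    if total_visits == 0:
        return old_q                    # no visited children: keep the old value
    child_avg = sum(q * n for q, n in children) / total_visits
    return beta * child_avg + (1 - beta) * old_q


@dataclass(eq=False)
class Node:
    """Hypothetical patch-tree node holding only what the update needs."""
    q: float
    visits: int = 1
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def backpropagate(node: Optional[Node], beta: float = 0.8) -> None:
    """Walk from the selected patch up to the root, refreshing Q and visits."""
    while node is not None:
        node.visits += 1                # assumed bookkeeping, as in standard MCTS
        node.q = updated_q(node.q, [(c.q, c.visits) for c in node.children], beta)
        node = node.parent


print(updated_q(0.5, [(0.9, 2), (0.3, 1)]))  # 0.8 * 0.7 + 0.2 * 0.5, about 0.66
```

With $\beta = 0.8$, the visit-weighted child average (0.7 here) dominates and the parent's previous Q-value (0.5) contributes only 20% of the new value.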

In each iteration, APRMCTS goes through the above four stages to search for and evaluate new patches, and then initiates the next round of searching based on the patches found and the evaluation results. Upon completing all search iterations, we manually validate the recorded plausible patches. If they match the developer patches or are semantically or syntactically equivalent to them, we consider them correct patches; otherwise, they are deemed wrong patches.
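The four stages can be tied together in a short loop. The sketch below is a skeleton under stated assumptions, not the actual implementation: every stage is passed in as a callable with a hypothetical signature, one patch is generated per iteration, and plausible patches are recorded without being added to the tree.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Patch:
    """Hypothetical patch-tree node."""
    code: str
    reward: float = 0.0
    parent: Optional["Patch"] = None
    children: List["Patch"] = field(default_factory=list)


def search(buggy_code: str,
           select: Callable[[Patch], Patch],
           generate: Callable[[Patch], str],
           passes_all_tests: Callable[[str], bool],
           evaluate: Callable[[str], float],
           backpropagate: Callable[[Patch], None],
           iterations: int = 16) -> List[str]:
    """Run the search loop and return every plausible patch found."""
    root = Patch(buggy_code)            # the original bug is the root "patch"
    plausible: List[str] = []
    for _ in range(iterations):
        parent = select(root)           # 1. patch selection (e.g., highest UCT)
        code = generate(parent)         # 2. patch generation (CoT + reflection)
        if passes_all_tests(code):      # 3. patch evaluation
            plausible.append(code)      # kept for later manual validation
            continue
        child = Patch(code, evaluate(code), parent)
        parent.children.append(child)   # failing patch becomes a new tree node
        backpropagate(parent)           # 4. patch tree updating
    return plausible


# Toy run with stub stages: the second generated patch passes the tests.
candidates = iter(["bad patch 1", "good patch", "bad patch 2"])
found = search("buggy code",
               select=lambda root: root,
               generate=lambda parent: next(candidates),
               passes_all_tests=lambda code: code == "good patch",
               evaluate=lambda code: 0.5,
               backpropagate=lambda parent: None,
               iterations=3)
print(found)  # ['good patch']
```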

4 Experimental Setup

4.1 Research Questions

We evaluate APRMCTS on the following research questions:

• RQ1: How does APRMCTS compare against the state-of-the-art APR techniques?
• RQ2: How does APRMCTS compare with using vanilla LLMs for APR?
• RQ3: How much impact does each component of APRMCTS have on the overall effectiveness?
• RQ4: How effective is APRMCTS in fixing bugs across multiple languages and types?
• RQ5: Can APRMCTS fix more bugs with a larger patch size?
• RQ6: How does the cost of APRMCTS compare to existing methods?

4.2 Datasets

We evaluate APRMCTS on three widely adopted benchmarks: QuixBugs, Defects4J, and ConDefects. These datasets are commonly used in the APR literature zhang2024survey; jiang2023impact; cao2025study and span multiple programming languages and bug types. QuixBugs DBLP:conf/oopsla/LinKCS17 is a small but popular defect dataset that includes 40 function-level program bugs in both Java and Python versions; we only use the Java part. Defects4J DBLP:conf/issta/JustJE14 is a collection of bugs from real Java open-source projects, including 395 bugs from Defects4J-v1.2 and 440 bugs from Defects4J-v2. ConDefects wu2023condefectsnewdatasetaddress is a competition-type defect dataset containing 526 Python bugs and 477 Java bugs. We select the Python subset to evaluate the multilingual and multi-type bug repair capabilities of APRMCTS.

4.3 Baselines

We compare APRMCTS against ten state-of-the-art APR baselines from different categories, including five learning-based ones (i.e., SelfAPR DBLP:conf/kbse/YeML0M22, ITER ye2024iter, CURE DBLP:conf/icse/JiangL021, RewardRepair DBLP:conf/icse/YeMM22, Recoder DBLP:conf/sigsoft/ZhuSXZY0Z21), two template-based ones (i.e., Repatt jiang2023enhancingredundancybasedautomatedprogram and GAMMA zhang2023gammarevisitingtemplatebasedautomated), and four LLM-based ones (i.e., RAPGen DBLP:journals/corr/abs-2306-17077, GAMMA zhang2023gammarevisitingtemplatebasedautomated, ChatRepair DBLP:conf/issta/Xia024, RepairAgent bouzenia2024repairagentautonomousllmbasedagent); GAMMA is counted in both the template-based and LLM-based categories. Specifically, ITER iteratively perturbs correct programs to generate buggy-correct sample pairs and learns repair experience through self-supervised training. RAPGen DBLP:journals/corr/abs-2306-17077 combines retrieval-augmented generation (RAG) with APR, learning bug-fixing patterns from similar bugs that have been fixed. RepairAgent bouzenia2024repairagentautonomousllmbasedagent employs an agent technique to further enhance the repair effectiveness based on LLMs. GAMMA zhang2023gammarevisitingtemplatebasedautomated revises a variety of fix templates from template-based APR techniques and transforms them into mask patterns. Additionally, we select a total of 13 LLMs with varying parameter sizes as baselines, consisting of five 3B models, six 7–9B models, and two API-accessible models.

4.4 Evaluation Metrics

We consider three widely used metrics zhao2024enhancingautomatedprogramrepair; xin2024practicalusefulautomatedprogram; yang2024revisitingunnaturalnessautomatedprogram to evaluate the effectiveness of both APRMCTS and baselines, and the quality of the generated patches. The definitions of the metrics are listed as follows.

• Correct Fix (CF) is defined as the number of correctly fixed bugs, i.e., bugs whose patch passes all the tests and is manually checked to ensure semantic or syntactic equivalence to the developer patch.
• Plausible Fix (PF) is defined as the number of bugs whose patch passes all the tests, with no further check applied.
• Exact-Match (EM) is defined as the number of fixes that exactly match the developer patch.
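The three metrics can be computed from per-bug records as in the sketch below. The `BugResult` fields are hypothetical; in particular, `judged_correct` stands for the outcome of the manual equivalence check, which cannot be automated here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class BugResult:
    """Outcome for one bug (hypothetical record layout)."""
    passes_all_tests: bool   # the best patch found passes every test
    judged_correct: bool     # manual check against the developer patch
    exact_match: bool        # patch text is identical to the developer patch


def compute_metrics(results: Iterable[BugResult]) -> Dict[str, int]:
    """Count Correct Fixes, Plausible Fixes, and Exact Matches."""
    results = list(results)
    return {
        "CF": sum(r.passes_all_tests and r.judged_correct for r in results),
        "PF": sum(r.passes_all_tests for r in results),
        "EM": sum(r.exact_match for r in results),
    }


# Four made-up bugs: three plausible, two of them correct, one an exact match.
demo = [
    BugResult(True, True, True),
    BugResult(True, True, False),
    BugResult(True, False, False),
    BugResult(False, False, False),
]
print(compute_metrics(demo))  # {'CF': 2, 'PF': 3, 'EM': 1}
```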

4.5 Implementation Details

To implement APRMCTS, we use the API provided by OpenAI and the models available on Hugging Face for initialization. We use tiktoken to count the number of tokens consumed in API calls and calculate the costs. The temperature is set to 0.9, max_token is set to 8000, and the patch size is set to 16. For the primary model (GPT-3.5), we conduct extra experiments with the patch size set to 32. The exploration constant is set to 0.7, alpha is set to 0.8, and branch and max_expansion are set to 1 and 3, respectively. We implement APRMCTS based on the PyTorch and Transformers frameworks. All experiments are conducted with two NVIDIA Tesla V100 GPUs on one Ubuntu 20.04 server.

Table 1: Comparison with baselines on Defects4J and QuixBugs (correct/plausible fix).

Category	Method	Model	Patch Size	Defects4J-v1.2	Defects4J-v2	Total	QuixBugs
APR	SelfAPR DBLP:conf/kbse/YeML0M22	T5	150	65/74	45/47	110/121	-
	ITER ye2024iter	T5	1000	59/89	19/36	78/125	-
	CURE DBLP:conf/icse/JiangL021	GPT-2	5000	57/-	19/-	76/-	26
	RAPGen DBLP:journals/corr/abs-2306-17077	CodeT5	-	72/-	53/-	125/-	-
	RewardRepair DBLP:conf/icse/YeMM22	Transformer	200	45/-	45/-	90/-	20
	Recoder DBLP:conf/sigsoft/ZhuSXZY0Z21	TreeGen	100	53/-	19/-	72/-	31
	Repatt jiang2023enhancingredundancybasedautomatedprogram	-	1200	40/70	35/68	75/138	-
	GAMMA zhang2023gammarevisitingtemplatebasedautomated	GPT-3.5	250	82/108	45/-	127/-	22
	ChatRepair DBLP:conf/issta/Xia024	GPT-3.5	500	114/-	48/-	162/-	40
	RepairAgent bouzenia2024repairagentautonomousllmbasedagent	GPT-3.5	117	92/96	72/90	164/186	-
LLM	Stable-Code-3B	-	16	31/49	27/50	58/99	20
	Calme-3.1-3B	-	16	25/44	20/42	45/86	19
	Starcoder2-3B	-	16	19/35	24/44	43/79	18
	Qwen2.5-Coder-3B	-	16	44/68	43/70	87/138	27
	Llama-3.2-3B	-	16	32/53	27/42	59/95	21
	Phi-3.5-mini	-	16	28/52	29/53	57/105	19
	DeciLM-7B	-	16	23/42	22/41	45/83	19
	Falcon-7B	-	16	8/21	10/25	18/46	4
	Yi-Coder-9B	-	16	48/73	58/93	106/166	31
	Llama-3.1-8B	-	16	43/71	43/68	86/139	25
	Qwen2.5-Coder-7B	-	16	38/66	41/70	79/132	25
	GPT-4o-mini	-	16	67/89	61/81	128/170	35
	GPT-3.5	-	16	69/92	63/84	132/176	36
Ours	APRMCTS	GPT-3.5	16	86/112	73/104	159/216	40
Ours	APRMCTS	GPT-3.5	32	108/146	93/134	201/280	40

5 Evaluation and Results

5.1 RQ1: Comparison with State-of-the-Arts

Experimental Design. In RQ1, we aim to evaluate the performance of APRMCTS. We consider 10 prior APR approaches and 13 LLMs as baselines. To eliminate potential interference caused by model size, we select the 7 best-performing models of different sizes and types to serve as the underlying models for APRMCTS in the subsequent experiments.

Overall Performance. Table 1 presents the comparison results of APRMCTS and baselines on the Defects4J and QuixBugs benchmarks. On the Defects4J dataset, APRMCTS obtains the highest number of bug-fixes (201), fixing 37 more bugs than the second-place RepairAgent and also outperforming other search-based methods (e.g., ITER). In particular, APRMCTS fixes 108 and 93 bugs on Defects4J-v1.2 and Defects4J-v2, ranking second and first, respectively. Although APRMCTS fixes 6 fewer bugs than ChatRepair on Defects4J-v1.2, this is acceptable given the difference in patch size settings: ChatRepair generates and tests an average of 500 candidate patches per bug, while APRMCTS generates only 32 candidate patches per bug. In addition, APRMCTS is able to provide more plausible fixes than previous studies. Specifically, APRMCTS obtains a total of 280 plausible fixes, 94 more than RepairAgent. We list the number of project-level bug-fixes in Table 2. When comparing APRMCTS against RepairAgent and ChatRepair, we find that the bug-fix distribution among the three methods shows considerable consistency. APRMCTS significantly outperforms the other two baselines on Compress, JacksonDatabind, and Jsoup. We also evaluate APRMCTS on the QuixBugs dataset. The results show that APRMCTS is capable of fixing all the bugs in QuixBugs.

Table 2: Results of APRMCTS (GPT-3.5, 32 patch) on Defects4J. Core is short for JacksonCore, Xml is short for JacksonXml, Databind is short for JacksonDatabind, Collect is short for Collections.

APRMCTS	Closure	Chart	Lang	Math	Mockito	Time	Cli	Codec	Collect	Compress	Csv	Gson	Core	Databind	Xml	JxPath	Jsoup	Total
# Bugs	174	26	64	106	38	26	39	18	4	47	16	18	26	112	6	22	93	835
Plausible	45	13	29	45	8	6	14	8	0	23	8	6	5	31	1	3	35	280
Correct	28	12	24	32	8	4	12	5	0	15	7	4	4	18	1	1	26	201
RepairAgent	27	11	17	29	6	2	8	9	1	10	6	3	5	11	1	0	18	164
ChatRepair	37	15	21	32	6	3	5	8	0	2	3	3	3	9	1	0	14	162

Overlap Analysis. Figure 3 shows the Venn diagram of the bugs fixed by RAP-Gen DBLP:journals/corr/abs-2306-17077, RewardRepair DBLP:conf/icse/YeMM22, SelfAPR DBLP:conf/kbse/YeML0M22, CURE DBLP:conf/icse/JiangL021 and APRMCTS on Defects4J-v1.2 and Defects4J-v2. Note that RAP-Gen has 13 and 6 duplicate patches on Defects4J-v1.2 and Defects4J-v2, so the actual number of bugs fixed by RAP-Gen should be 106 (59 + 47). Figure 3 shows that APRMCTS has strong repair capabilities, fixing 48 and 52 unique bugs on Defects4J-v1.2 and Defects4J-v2, respectively, that none of the other four baselines fix. Additionally, we take the two best-performing LLM-based baselines, RepairAgent bouzenia2024repairagentautonomousllmbasedagent and ChatRepair DBLP:conf/issta/Xia024, and perform a separate overlap analysis with APRMCTS. Figure 4(a) and Figure 4(b) show that 54 and 25 bugs can be repaired by all three methods on Defects4J-v1.2 and Defects4J-v2, respectively, indicating that all three approaches are highly effective and overlap considerably. This is because the three methods utilize the same backbone model. Despite that, APRMCTS still fixes 18 and 25 unique bugs on Defects4J-v1.2 and Defects4J-v2, respectively, ranking second and first among the three methods, which demonstrates the advantage of APRMCTS.
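The shared and unique counts in an overlap analysis of this kind reduce to set operations over the bug identifiers that each tool fixes. The following is a minimal sketch under stated assumptions: the function name overlap_summary is ours, and the bug IDs are made-up placeholders rather than the actual fix sets of APRMCTS, RepairAgent, or ChatRepair.

```python
from typing import Dict, Set


def overlap_summary(fixed: Dict[str, Set[str]], target: str) -> Dict[str, int]:
    """Count bugs fixed by every tool and bugs fixed only by `target`."""
    others = [bugs for name, bugs in fixed.items() if name != target]
    shared_by_all = set.intersection(*fixed.values())
    unique_to_target = fixed[target] - set().union(*others)
    return {
        "shared_by_all": len(shared_by_all),
        "unique_to_target": len(unique_to_target),
    }


# Placeholder bug IDs for illustration only; NOT the paper's results.
fixed_bugs = {
    "APRMCTS": {"Bug_1", "Bug_2", "Bug_3", "Bug_4"},
    "RepairAgent": {"Bug_2", "Bug_4", "Bug_5"},
    "ChatRepair": {"Bug_2", "Bug_3", "Bug_4"},
}

summary = overlap_summary(fixed_bugs, "APRMCTS")
print(summary)
```

With real data, each set would hold the Defects4J bug IDs (project and number) for which a tool produced a correct patch.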

Case Study. To better illustrate the advantages of APRMCTS, we provide several notable fixes. APRMCTS can fix both Gson_15 and Lang_16, which ChatRepair DBLP:conf/issta/Xia024 reports as unique fixes. We further present a unique fix produced by APRMCTS, shown in Figure 5. Cli_19 is a function-level bug from Defects4J-v2 that cannot be fixed by simply replacing one or several buggy lines. Instead, fixing it requires modifying the function in multiple places, which makes it difficult for APR tools, and no baseline can fix it. The key to fixing Cli_19 lies in understanding that the action tokens.add(token) is necessary under all conditional branches. As shown in Figure 5, APRMCTS arrives at a correct patch that differs from the developer patch but is semantically equivalent.

5.2 RQ2: Comparison with LLMs

Experimental Design. We have demonstrated that APRMCTS achieves impressive performance compared with a range of APR techniques and LLMs. In this section, we further investigate the extent to which APRMCTS improves performance across different underlying LLMs, and whether these improvements are attributable to our proposed framework rather than to the inherent capabilities of the models themselves. To this end, we select the seven best-performing LLMs across the model scale categories in RQ1 and apply our framework to them.

Table 3: Comparison of correct/plausible fixes between Vanilla LLMs and APRMCTS on Defects4J and QuixBugs, broken down by three types of bugs: single-line (SL), single-hunk (SH) and single-function (SF).

Category	Model	Patch Size	SL	SH	SF	Defects4J	QuixBugs
3B	Qwen2.5-Coder-3B	16	56/79	13/25	18/34	87/138	27
	Qwen2.5-Coder-3B (APRMCTS)	16	60/86	13/24	22/43	95/153	-
	Stable-Code-3B	16	39/56	4/12	15/31	58/99	20
	Stable-Code-3B (APRMCTS)	16	40/58	5/13	17/35	62/106	-
7B-9B	Yi-Coder-9B	16	60/77	16/30	30/59	106/166	31
	Yi-Coder-9B (APRMCTS)	16	73/90	26/37	44/63	143/190	-
	Llama-3.1-8B	16	48/63	12/21	26/55	86/139	25
	Llama-3.1-8B (APRMCTS)	16	54/75	14/26	27/61	95/162	-
	Qwen2.5-Coder-7B	16	46/62	11/18	22/52	79/132	25
	Qwen2.5-Coder-7B (APRMCTS)	16	61/78	16/34	30/59	107/171	-
API	GPT-4o-mini	16	65/72	27/37	36/61	128/170	35
	GPT-4o-mini (APRMCTS)	16	78/92	32/45	48/71	158/208	40
	GPT-3.5	16	67/73	29/38	36/65	132/176	36
	GPT-3.5 (APRMCTS)	16	84/96	31/46	44/74	159/216	40
	GPT-3.5 (APRMCTS)	32	104/121	42/64	55/95	201/280	40

Results and Analysis. Table 3 presents the performance improvements achieved by APRMCTS across different underlying models. The repair capabilities of all seven LLMs improve after applying APRMCTS. Among these, Yi-Coder-9B, Qwen2.5-Coder-7B, GPT-4o-mini and GPT-3.5 show the most significant improvements, with 37, 28, 30 and 27 additional bug-fixes, respectively. Moreover, with the patch size set to 32, GPT-3.5 (APRMCTS) fixes 201 bugs, 69 more than vanilla GPT-3.5. Llama-3.1-8B and Qwen2.5-Coder-3B show smaller improvements, with 9 and 8 additional bug-fixes, respectively.

In terms of bug types, the success rate for fixing single-line (SL) and single-hunk (SH) bugs is significantly higher than that for single-function (SF) bugs. For the former two types, LLMs can pinpoint the exact location of the buggy lines, and the logic of the buggy programs is relatively simple, requiring less modification than SF bugs. SF bugs are therefore harder for LLMs to fix. Compared to Vanilla LLMs, APRMCTS significantly enhances the effectiveness of LLMs in fixing SF bugs, with GPT-4o-mini fixing 12 more SF bugs, GPT-3.5 fixing 8 more, Yi-Coder-9B fixing 14 more, Qwen2.5-Coder-7B fixing 8 more, and Qwen2.5-Coder-3B fixing 4 more. This indicates that APRMCTS has a particular advantage in fixing complex bugs.

5.3 RQ3: Effectiveness of Each Component

Experimental Design. In RQ3, we perform an ablation study to validate the effectiveness of each component, including test information, CoT prompting and search/evaluation. We incrementally incorporate each component into our method to observe its impact on performance.

5.3.1 RQ3.1: Effectiveness of Test Information

Table 4: Comparison of the number of bug-fixes with test information vs. without test information.

	Qwen2.5-Coder-3B	Stable-Code-3B	Yi-Coder-9B	Llama-3.1-8B	Qwen2.5-Coder-7B	GPT-4o-mini	GPT-3.5
without test	75	49	94	77	71	107	114
with test	87( $\uparrow$ 12)	58( $\uparrow$ 9)	106( $\uparrow$ 12)	86( $\uparrow$ 9)	79( $\uparrow$ 8)	128( $\uparrow$ 21)	132( $\uparrow$ 18)

As shown in Table 4, test information positively impacts the repair effectiveness of all LLMs, with the most significant improvements observed in GPT-4o-mini and GPT-3.5, which fix 21 and 18 more bugs, respectively.

5.3.2 RQ3.2: Effectiveness of CoT

We apply CoT on top of Vanilla LLMs to guide them to state their reasoning process before generating patches. We compare CoT with another popular reasoning strategy, Tree of Thought (ToT), and with Vanilla LLMs. As shown in Table 5, most LLMs improve with CoT compared to Vanilla LLMs. Yi-Coder-9B and GPT-3.5 improve the most, with correct fixes (CF) increasing by 31 and 7 and plausible fixes (PF) increasing by 32 and 10, respectively. When using ToT, GPT-4o-mini, Llama-3.1-8B, and Stable-Code-3B see decreases of 7, 19, and 2 in CF, respectively. Overall, CoT generally performs better than ToT across the 7 LLMs.

Table 5: Comparison between Vanilla LLMs, Chain of Thought (CoT), and Tree of Thought (ToT).

Method	CF	PF	EM
GPT-4o-mini (CoT)	131( $\uparrow$ 3)	174( $\uparrow$ 4)	52
GPT-4o-mini (ToT)	121( $\downarrow$ 7)	176( $\uparrow$ 6)	42
GPT-4o-mini (Vanilla)	128	170	46
GPT-3.5 (CoT)	139( $\uparrow$ 7)	186( $\uparrow$ 10)	55
GPT-3.5 (ToT)	134( $\uparrow$ 2)	181( $\uparrow$ 5)	49
GPT-3.5 (Vanilla)	132	176	47
Yi-Coder-9B (CoT)	137( $\uparrow$ 31)	198( $\uparrow$ 32)	54
Yi-Coder-9B (ToT)	116( $\uparrow$ 10)	188( $\uparrow$ 22)	46
Yi-Coder-9B (Vanilla)	106	166	49
Llama-3.1-8B (CoT)	93( $\uparrow$ 7)	128( $\downarrow$ 11)	41
Llama-3.1-8B (ToT)	67( $\downarrow$ 19)	107( $\downarrow$ 32)	28
Llama-3.1-8B (Vanilla)	86	139	39
Qwen2.5-Coder-7B (CoT)	79(-)	132(-)	45
Qwen2.5-Coder-7B (ToT)	84( $\uparrow$ 5)	141( $\uparrow$ 9)	39
Qwen2.5-Coder-7B (Vanilla)	79	132	41
Qwen2.5-Coder-3B (CoT)	93( $\uparrow$ 6)	151( $\uparrow$ 13)	47
Qwen2.5-Coder-3B (ToT)	92( $\uparrow$ 5)	144( $\uparrow$ 6)	51
Qwen2.5-Coder-3B (Vanilla)	87	138	38
Stable-Code-3B (CoT)	60( $\uparrow$ 2)	102( $\uparrow$ 3)	28
Stable-Code-3B (ToT)	56( $\downarrow$ 2)	98( $\downarrow$ 1)	26
Stable-Code-3B (Vanilla)	58	99	27

EM (exact match) measures LLMs’ ability to match the ground-truth patches written by developers, and a low EM may indicate the overfitting problem qi2015analysis. The improvement in EM brought by CoT is relatively stable, with GPT-4o-mini improving by 2.83%, GPT-3.5 by 2.86%, Qwen2.5-Coder-7B by 3.03%, Qwen2.5-Coder-3B by 3.59%, Llama-3.1-8B by 3.98% and Stable-Code-3B by 3.7%.
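The text does not state how these percentages are derived. One reading that reproduces several of them, and which is only an assumption on our part, is the change in the ratio EM/PF (in percentage points) between the CoT and Vanilla rows of Table 5. The helper names below are ours.

```python
from typing import Tuple


def em_rate(em: int, pf: int) -> float:
    """Share of plausible fixes (PF) that exactly match the developer patch (EM), in percent."""
    return 100.0 * em / pf


def em_gain(cot: Tuple[int, int], vanilla: Tuple[int, int]) -> float:
    """Percentage-point change in EM/PF from Vanilla to CoT; inputs are (EM, PF) pairs."""
    return em_rate(*cot) - em_rate(*vanilla)


# (EM, PF) pairs copied from Table 5.
gpt4o_mini = em_gain(cot=(52, 174), vanilla=(46, 170))
qwen_7b = em_gain(cot=(45, 132), vanilla=(41, 132))
qwen_3b = em_gain(cot=(47, 151), vanilla=(38, 138))

print(f"GPT-4o-mini:      {gpt4o_mini:.2f}")
print(f"Qwen2.5-Coder-7B: {qwen_7b:.2f}")
print(f"Qwen2.5-Coder-3B: {qwen_3b:.2f}")
```

Under this assumed definition the three values round to 2.83, 3.03, and 3.59, which agrees with the figures quoted above for these models.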

5.3.3 RQ3.3: Effectiveness of Search and Evaluation

Table 6: Correct fix comparison between Vanilla LLMs, CoT and APRMCTS (patch size $\leq$ 32).

Patch Size	4	8	12	16	32
Qwen2.5-Coder-3B (Vanilla)	54	69	80	87	-
Qwen2.5-Coder-3B (CoT)	59	79	88	93	-
Qwen2.5-Coder-3B (APRMCTS)	58	78	87	95	-
Stable-Code-3B (Vanilla)	36	47	54	58	-
Stable-Code-3B (CoT)	37	49	55	60	-
Stable-Code-3B (APRMCTS)	37	50	57	62	-
Qwen2.5-Coder-7B (Vanilla)	45	63	69	79	-
Qwen2.5-Coder-7B (CoT)	60	81	94	99	-
Qwen2.5-Coder-7B (APRMCTS)	65	87	100	107	-
Llama-3.1-8B (Vanilla)	61	74	81	86	-
Llama-3.1-8B (CoT)	55	76	88	93	-
Llama-3.1-8B (APRMCTS)	56	72	87	97	-
Yi-Coder-9B (Vanilla)	78	94	101	106	-
Yi-Coder-9B (CoT)	100	119	130	137	-
Yi-Coder-9B (APRMCTS)	109	130	138	143	-
GPT-4o-mini (Vanilla)	105	115	123	128	-
GPT-4o-mini (CoT)	101	117	126	131	-
GPT-4o-mini (APRMCTS)	127	147	155	158	-
GPT-3.5 (Vanilla)	111	120	125	132	-
GPT-3.5 (CoT)	118	127	132	139	-
GPT-3.5 (APRMCTS)	134	148	154	159	201

To evaluate the impact of the search and evaluation components on the overall effectiveness of APRMCTS, we compare its performance against CoT-enhanced and vanilla LLM baselines. As shown in Table 6, at patch size 16 all seven LLMs fix more bugs with APRMCTS than with CoT alone or with Vanilla LLMs. In particular, GPT-3.5 (APRMCTS) fixes 20 more bugs than GPT-3.5 (CoT), GPT-4o-mini (APRMCTS) fixes 27 more bugs than GPT-4o-mini (CoT), and Yi-Coder-9B (APRMCTS) fixes 6 more bugs than Yi-Coder-9B (CoT). Furthermore, when the patch size increases to 32, GPT-3.5 with APRMCTS fixes 42 additional bugs.

When comparing the performance of LLMs of different sizes, we find that larger models such as GPT-4o-mini, GPT-3.5, Yi-Coder-9B and Qwen2.5-Coder-7B show more significant improvement than smaller models such as Qwen2.5-Coder-3B and Stable-Code-3B. For GPT-4o-mini and GPT-3.5, 90% (27/30) and 74% (20/27) of the overall improvement in bug-fixes from Vanilla LLMs to APRMCTS is attributable to search and evaluation, respectively. For Yi-Coder-9B, Qwen2.5-Coder-7B, Qwen2.5-Coder-3B, and Stable-Code-3B, this proportion is 16% (6/37), 28.6% (8/28), 25% (2/8), and 50% (2/4), respectively, computed from the patch-size-16 column of Table 6. This indicates that the largest models benefit more from searching than small-scale models. This is because large-scale models are more accurate in patch evaluation, and accurate evaluation helps guide the search in the right direction.
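The proportions quoted above follow from a simple ratio: the gain of APRMCTS over CoT divided by the gain of APRMCTS over the vanilla model, both taken at patch size 16 in Table 6. The sketch below works this out for four models; the function name search_share is ours, not the paper's.

```python
def search_share(vanilla: int, cot: int, aprmcts: int) -> float:
    """Fraction of the total gain (APRMCTS - Vanilla) obtained beyond CoT,
    i.e. the part attributed to search and evaluation."""
    total_gain = aprmcts - vanilla
    if total_gain <= 0:
        raise ValueError("APRMCTS must outperform the vanilla model")
    return (aprmcts - cot) / total_gain


# (Vanilla, CoT, APRMCTS) correct-fix counts at patch size 16, copied from Table 6.
table6 = {
    "GPT-4o-mini": (128, 131, 158),
    "GPT-3.5": (132, 139, 159),
    "Yi-Coder-9B": (106, 137, 143),
    "Stable-Code-3B": (58, 60, 62),
}

shares = {model: search_share(*counts) for model, counts in table6.items()}
for model, share in shares.items():
    print(f"{model}: {share:.0%}")
```

Note that this ratio measures the relative contribution of search within each model's own gain; the absolute number of extra fixes (the numerator) is what separates GPT-4o-mini and GPT-3.5 from the smaller models.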

We also observe that as patch size increases, search and evaluation start playing a more significant role. For Llama-3.1-8B, when patch size is between 8 and 12, the number of bug-fixes by APRMCTS is slightly lower than that of CoT. However, as patch size increases, the performance of APRMCTS gradually ties that of CoT (when patch size = 14) and then surpasses it (when patch size $>$ 14). Qwen2.5-Coder-3B exhibits the same trend, with APRMCTS outperforming CoT when patch size exceeds 14. It indicates that as patch size increases, APRMCTS is able to resolve more complex bugs that other methods cannot solve.

5.4 RQ4: Effectiveness on Multi-lingual and Multi-type Bugs

Experimental Design. In RQ 1-3, we have validated the effectiveness of APRMCTS on project-level Java bugs (e.g., Defects4J). To further validate the repair capability of APRMCTS on bugs of different types and in different languages, we perform extra experiments on the ConDefects-Python dataset. We compare APRMCTS with ChatRepair, GPT-3.5 and AlphaRepair. To ensure fairness, we follow ChatRepair and employ GPT-3.5 as the experimental LLM.

Results and Analysis. As shown in Table 7, when patch size = 48 (16 iterations, 3 patches per iteration), APRMCTS obtains 211 plausible fixes and 204 correct fixes, which is 40 more plausible fixes and 39 more correct fixes than vanilla GPT-3.5. Since the patch size for ChatRepair is set to 500, APRMCTS achieves this improvement over the underlying LLM with less than one-tenth of ChatRepair's patch size. When we increase the number of search iterations to 32 and the patch size to 96, the performance of APRMCTS is further enhanced, with 287 plausible fixes and 264 correct fixes, surpassing ChatRepair by 23 correct and 38 plausible fixes. Additionally, we find that Test-as-Judge enables LLMs to quickly generate patches that satisfy simple test cases, and then iteratively refine the details of the patches through complex test cases until all boundary conditions are met. Compared to allowing the model to search patches without evaluation, Test-as-Judge guides LLMs in the right direction for repairs, improving the efficiency of patch search.

Table 7: Results on ConDefects-Python (correct/plausible fix).

ChatRepair	GPT-3.5	AlphaRepair	APRMCTS (48 patch)	APRMCTS (96 patch)
241/249	165/171	142/160	204/211	264/287

The above experimental results demonstrate that APRMCTS has significant advantages over previous methods and vanilla LLMs in repairing bugs across multiple languages (Java/Python) and multiple types (Repository/Competition).

5.5 RQ5: Effectiveness of Large Patch Size

Experimental Design. RQ 1-4 have demonstrated APRMCTS’s effectiveness with small patch size (e.g., 16, 32). To further investigate the impact of large patch size, we select GPT-3.5 for extreme testing. We increase the patch size from 32 to 500 (50 iterations, 10 patches per iteration) to align with ChatRepair’s configuration.

Table 8: APRMCTS (GPT-3.5) can fix 11 more bugs with larger patch size (32 $\rightarrow$ 500) on Defects4J.

Project	Bugfix
Chart	3 ✓
Cli	25 ✓, 14 ✗, 19 ✓, 38 ✓
Closure	53 ✗, 55 ✓, 104 ✓
Codec	2 ✓
JacksonDatabind	17 ✓
Jsoup	26 ✓, 55 ✓, 75 ✗
Math	48 ✗, 58 ✗
Time	15 ✓

Results and Analysis. We list the newly fixed bugs in Table 8, where ✓ represents a correct fix and ✗ represents a plausible but incorrect fix. A larger patch size (500) yields 16 more plausible fixes, 11 of which are correct. However, as the patch size increases, the number of newly fixed bugs decreases significantly. This indicates that APRMCTS has already approached its upper limit.
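The diminishing return can be made concrete by dividing the number of newly fixed bugs by the number of extra candidate patches per bug. The sketch below uses only figures from Tables 6 and 8; the total of 212 correct fixes at patch size 500 is derived here as 201 + 11, and the function name marginal_gain is ours.

```python
def marginal_gain(fixes_before: int, fixes_after: int,
                  size_before: int, size_after: int) -> float:
    """Additional correct fixes obtained per additional candidate patch per bug."""
    return (fixes_after - fixes_before) / (size_after - size_before)


# GPT-3.5 (APRMCTS) on Defects4J: 159 correct fixes at patch size 16 and
# 201 at patch size 32 (Table 6); 11 further correct fixes at 500 (Table 8).
gain_16_to_32 = marginal_gain(159, 201, 16, 32)
gain_32_to_500 = marginal_gain(201, 201 + 11, 32, 500)

print(f"16 -> 32:  {gain_16_to_32:.3f} fixes per extra patch")
print(f"32 -> 500: {gain_32_to_500:.4f} fixes per extra patch")
```

Going from 16 to 32 patches yields about 2.6 new fixes per extra patch, while going from 32 to 500 yields roughly 0.02, a drop of about two orders of magnitude.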

5.6 RQ6: Cost Analysis

Experimental Design. In RQ6, we aim to analyze the differences between APRMCTS and existing APR tools in terms of patch size, time, token consumption, and monetary cost. Specifically, we select ChatRepair and RepairAgent as baselines, and use the cost on Defects4J for comparison.

Table 9: Cost analysis of APRMCTS, ChatRepair, and RepairAgent on Defects4J.

Method	Patch/Bug	Time/Bug	Token/Bug	Money/Bug	Charge/1k tokens
ChatRepair (2024) sobania2023analysisautomaticbugfixing	500	$\leq$ 5 hours	210,000	$0.42	$0.002
ChatRepair (today’s price)	500	$\leq$ 5 hours	210,000	$0.14	-
RepairAgent (2024) bouzenia2024repairagentautonomousllmbasedagent	117	920 seconds	270,000	$0.14	-
APRMCTS (2025)	16	23.64 min	20,000	$0.03	$0.0015
APRMCTS (2025)	32	50 min	40,000	$0.06	$0.0015

Results and Analysis. The comparison results are shown in Table 9. With the patch size set to 32, smaller than that of both baselines, APRMCTS spends an average of 50 minutes per bug, shorter than the up to 5 hours reported for ChatRepair. Moreover, APRMCTS has a significant advantage in token consumption and monetary cost per bug: it uses 40,000 tokens per bug on average, which is only 19% of the 210,000 tokens reported by ChatRepair and 14.8% of the 270,000 tokens reported by RepairAgent. In terms of pricing, we calculate based on the current API price. The cost of APRMCTS is $0.06 per bug, which is 43% of the cost of ChatRepair ($0.14) and RepairAgent ($0.14).
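The cost figures in Table 9 can be reproduced with simple arithmetic. The sketch below assumes a single flat price per 1k tokens (no distinction between input and output tokens), which is how the table presents the charge; the function name cost_per_bug is ours.

```python
def cost_per_bug(tokens_per_bug: int, price_per_1k_tokens: float) -> float:
    """Monetary cost per bug in USD under a flat per-1k-token price."""
    return tokens_per_bug / 1000 * price_per_1k_tokens


# Token counts and per-1k-token prices copied from Table 9.
aprmcts_16 = cost_per_bug(20_000, 0.0015)
aprmcts_32 = cost_per_bug(40_000, 0.0015)
chatrepair_2024 = cost_per_bug(210_000, 0.002)

# Ratios quoted in the text for APRMCTS with patch size 32.
token_ratio_chatrepair = 40_000 / 210_000
token_ratio_repairagent = 40_000 / 270_000
cost_ratio = 0.06 / 0.14

print(f"APRMCTS (16): ${aprmcts_16:.2f}, APRMCTS (32): ${aprmcts_32:.2f}")
print(f"ChatRepair (2024 price): ${chatrepair_2024:.2f}")
print(f"Tokens vs ChatRepair: {token_ratio_chatrepair:.0%}")
print(f"Tokens vs RepairAgent: {token_ratio_repairagent:.1%}")
print(f"Cost vs baselines: {cost_ratio:.0%}")
```

A cost ratio of about 43% corresponds to a reduction of roughly 57% relative to either baseline at today's price.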

6 Discussion

6.1 Implementing APRMCTS with Other Search Algorithms

To demonstrate the flexibility of APRMCTS, we replace MCTS with other search algorithms (e.g., beam search). Specifically, we initialize a patch pool of size 4. In each iteration, we apply the beam search algorithm to refine each patch in the patch pool, evaluate the newly generated patches, and retain the 4 highest-scoring patches for the next iteration. The beam width is set to 5, and the number of iterations is set to 3. We conduct comparative experiments using Qwen2.5-Coder-7B. The results show that beam search achieves 149 plausible fixes and 88 correct fixes, fixing 9 more bugs than the vanilla model and 19 fewer bugs than MCTS. This demonstrates that the APRMCTS framework remains effective with other search algorithms, and also indicates that MCTS outperforms beam search in the bug repair scenario.
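The beam search variant described above can be sketched as follows. This is an illustrative outline, not the authors' implementation. We assume that "beam width" means the number of refinements generated per pooled patch, and that current pool members compete with their refinements for the next pool; the source specifies neither. The refine and score callables are hypothetical stand-ins for LLM-based patch refinement and patch evaluation, and are replaced here by toy integer functions so the example runs on its own.

```python
import heapq
from typing import Callable, Hashable, List, TypeVar

Patch = TypeVar("Patch", bound=Hashable)


def beam_search_repair(
    initial_pool: List[Patch],
    refine: Callable[[Patch, int], List[Patch]],
    score: Callable[[Patch], float],
    pool_size: int = 4,
    beam_width: int = 5,
    iterations: int = 3,
) -> List[Patch]:
    """Iteratively refine a pool of patches, keeping the top `pool_size` by score."""
    pool = heapq.nlargest(pool_size, initial_pool, key=score)
    for _ in range(iterations):
        # Assumption: parents stay in the running, so the best score never regresses.
        candidates = list(pool)
        for patch in pool:
            candidates.extend(refine(patch, beam_width))
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        pool = heapq.nlargest(pool_size, unique, key=score)
    return pool


# Toy stand-ins (hypothetical): a "patch" is an integer, the best patch equals TARGET.
TARGET = 12


def toy_refine(patch: int, n: int) -> List[int]:
    """Pretend refinement: propose the next n integers."""
    return [patch + step for step in range(1, n + 1)]


def toy_score(patch: int) -> float:
    """Pretend evaluation: closer to TARGET is better."""
    return -abs(patch - TARGET)


best_pool = beam_search_repair([0, 1, 2, 3], toy_refine, toy_score)
print(best_pool[0])
```

In an actual repair setting, refine would prompt the LLM with a patch and its test feedback, and score would combine test results with a model-based judgment of the patch.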

6.2 Analysis of Data Leakage

Since GPT can only be accessed via API, we cannot determine its training data, which poses a risk of data leakage zhang2023critical. To address this issue, we take the following actions. For the open-source models (e.g., Qwen), we carefully examine their pre-training datasets and confirm that there is no overlap with the benchmarks. For the black-box models (e.g., GPT), we follow prior work DBLP:conf/issta/Xia024 and include the ConDefects dataset in our evaluation to mitigate the risk of data leakage. We also follow prior works DBLP:conf/issta/Xia024; DBLP:conf/sigsoft/XiaZ22 and compare the patches generated by GPT with the reference developer fixes. On Defects4J, we find that GPT-3.5 generates 61 patches that are identical to the developer patches. Even after removing these 61 patches, APRMCTS still correctly fixes 49 (55 $\rightarrow$ 49) unique bugs that are beyond the reach of RepairAgent and ChatRepair. In addition, we conduct supplementary experiments on ConDefects-Python using another open-source model, Qwen2.5-Coder-32B. We compare the developer-written patches with the model-generated patches and find that Qwen2.5-Coder-32B and GPT-3.5 achieve 29 and 32 exact matches, respectively, a very small difference. Thus, we conclude that the influence of data leakage is minor.

6.3 The Potential of APRMCTS on SWE-bench

In addition to Defects4J, we also evaluate APRMCTS on SWE-bench Lite jimenez2023swe, a defect dataset composed of GitHub issues. We use the open-source Qwen3-Coder-480B as the base model. Following the Defects4J configuration, we set the patch size to 16 and score patches by test reports and patch content. To isolate the patch search capability of APRMCTS, we directly use the test cases and perfect localization provided in the dataset. As shown in Table 10, APRMCTS helps Qwen3-Coder-480B fix 51 more bugs than the vanilla LLM and 35 more bugs than ChatRepair. In addition, APRMCTS outperforms recent approaches such as KGCompass yang2025enhancing and OpenHands wang2025openhands. In future work, APRMCTS can be integrated with advanced fault localization and test generation tools zhang2025improving; zhang2025improvingtosem; zhang2024testbench to form agent-based frameworks with powerful repair capabilities.

Table 10: Results on SWE-bench Lite test.

SWE System	Base Model	Resolved	%Resolved	Date
Refact.ai Agent	NA	180	60%	2025-06-25
SWE-agent yang2024swe	Claude-4 Sonnet	170	56.67%	2025-05-26
APRMCTS (Ours)	Qwen3-Coder-480B	164	54.67%	2025-08-30
KGCompass yang2025enhancing	Claude-3.5 Sonnet	138	46%	2025-06-19
ChatRepair	Qwen3-Coder-480B	129	43%	2025-08-30
OpenHands wang2025openhands	Claude-3.5 Sonnet	125	41.67%	2024-10-25
Vanilla LLMs	Qwen3-Coder-480B	113	37.67%	2025-08-30

7 Threats to Validity

Internal Threat. The main internal threat is the potential for data leakage. To address this, we assess the impact of data leakage through three approaches: analyzing the training data of open-source models, including more benchmarks, and counting the LLM-generated patches that are identical to the developer patches. Thus, we are confident that the influence of data leakage is minor.

External Threat. The main external threat to validity is that the performance of APRMCTS may not generalize to other datasets. To mitigate this, we evaluate APRMCTS on both repository-level bugs (e.g., Defects4J) and competition-level bugs (e.g., ConDefects). Moreover, APRMCTS is agnostic to bug types and programming languages. Therefore, we believe this threat has a minimal impact on our conclusions, and APRMCTS has the potential to handle more complex and diverse bugs.

8 Conclusion

In this paper, we introduce APRMCTS, which employs iterative tree search to improve LLM-based APR. APRMCTS adopts the following strategies: (1) it incorporates MCTS into the patch search process to enhance efficiency and effectiveness, and (2) it performs a global evaluation of explored patches to avoid falling into local optima. Our experiments on 835 bugs from Defects4J demonstrate that APRMCTS can fix a total of 201 bugs, outperforming ten state-of-the-art baselines. We further demonstrate APRMCTS’s multi-lingual and multi-type bug fixing ability on ConDefects-Python. Compared to existing LLM-based APR tools, APRMCTS is faster and reduces monetary costs by over 50%.