0% found this document useful (0 votes)
84 views14 pages

2248 Closing The Evaluation Gap

This document discusses the limitations of using a single Large Language Model (LLM) as a judge for generating critiques of outputs, particularly in code generation tasks. It proposes that ensembling multiple LLM judges can produce more reliable and optimal critiques, leading to improved performance in iterative inference-time processes. The authors provide empirical evidence supporting their claims and emphasize the importance of thoughtful design in creating ensemble LLM-judge protocols.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
84 views14 pages

2248 Closing The Evaluation Gap

This document discusses the limitations of using a single Large Language Model (LLM) as a judge for generating critiques of outputs, particularly in code generation tasks. It proposes that ensembling multiple LLM judges can produce more reliable and optimal critiques, leading to improved performance in iterative inference-time processes. The authors provide empirical evidence supporting their claims and emphasize the importance of thoughtful design in creating ensemble LLM-judge protocols.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Closing the Evaluation Gap: Ensembling LLM-Judges Generates More

Reliable Inference-Time Reference-Free Critiques

Anonymous ACL submission

Abstract 2022; Kaplan et al., 2020). Thus, inference-time 042


approaches, such as prompt optimization, refine 043
001 LLM-as-a-Judge allows for efficient and scal- the output without updating the model with an end- 044
002 able natural language evaluations of complex
to-end feedback loop. 045
003 generated outputs, such as code, without the
004 need for a ground-truth reference. These eval- To achieve this inference-time refinement, we 046

005 uation protocols have become a crucial part need a natural language evaluation, or critique as 047
006 for inference-time refinement approaches like part of the feedback loop to direct the update direc- 048
007 prompt optimization. However, an important tion of the next output (Cheng et al., 2023; Wang 049
008 question arises of whether a pre-trained LLM et al., 2023; Zhou et al., 2022; Yuksekgonul et al., 050
009 can generate a reliable evaluation of the out- 2024). For example, a critique such as “this code 051
010 put. In this work, we derive an interesting,
has a logical error...” catches errors and can be used 052
011 insightful result showing that a single LLM-
012 based judge is insufficient in generating an op-
as part of the prompt for updating the generated 053

013 timal critique. We then provide a solution by code in the next iteration. Thus, critiques are a 054
014 demonstrating that aggregating multiple LLM- fundamental component of iterative inference-time 055
015 generated evaluations can better model the opti- improvement. 056
016 mal critique. We empirically show the merits of However, obtaining critiques is a challenge. Hu- 057
017 ensembling multiple LLM-judges via prompt man evaluators are expensive, time-consuming, and 058
018 optimization experiments for code generation. may require domain experts. Therefore, LLM-as- 059
019 Ensembling judges leads to up to a ∼ 9% in-
a-Judge (LLM-judges) has been utilized to verify 060
020 crease in solved coding problems over using a
021 single-judge. We perform ablations utilizing the LLM output automatically (Verga et al., 2024; 061

022 different aggregation methods and diverse eval- He et al., 2024; Kim et al., 2024). While an LLM 062
023 uation instructions, emphasizing the non-trivial can judge efficiently and automatically, for com- 063
024 design of ensembling LLM-judges to suggest plex generative tasks such as code generation, they 064
025 further research. We provide anonmyzied can output incorrect critiques (Stroebl et al., 2024). 065
026 code: https://anonymous.4open.science/ Many prior work assumes that the LLM-judge has 066
027 r/ensemble_eval-891B/ReadMe.md
reference information, such as unit test results or 067
even a ground-truth solution (He et al., 2024; Yuk- 068
028 1 Introduction
sekgonul et al., 2024). This information is often 069
029 Large Language Models (LLMs) (Achiam et al., unavailable (Chen et al., 2024; Nguyen et al., 2024) 070
030 2023; Touvron et al., 2023) have demonstrated ever- Therefore, we must rely on point-wise, reference- 071
031 increasing performance in various tasks such as free critique that is generated from an LLM-judge 072
032 code generation, document summarization, and im- that is only given the output. A natural question 073
033 age captioning (Chen et al., 2024; Gulwani, 2010; arises for this practical, yet challenging scenario: 074
034 Basyal and Sanghvi, 2023; Chen et al., 2022). With Does a reference-free LLM judge generate an opti- 075
035 the broad applicability of LLMs in society, reliably mal critique for iterative refinement? 076
036 evaluating LLM outputs during inference time is In this work, we tackle this question by deriv- 077
037 a pressing matter. For example, although LLMs ing an interesting result demonstrating that using 078
038 in deployment can output realistic code, the code a single reference-free LLM-based judge cannot 079
039 may not necessarily run as intended (Stroebl et al., perfectly model the unknown, oracle critique, cre- 080
040 2024). Training or fine-tuning an LLM for ev- ating a suboptimality gap in critique. This gap can 081
041 ery task has high computational cost (Zan et al., cause issues where a ground-truth solution is not 082

1
Figure 1: On the left, generated zero-shot output from GPT-4o for a LeetCodeHard problem that fails all test cases.
On the right, critiques were generated from a single judge (red) and from multiple judges (green). The ensemble of
judges detects errors that the single judge misses. The readability judge of the ensemble gives more suggestions for
improvement while the single-judge approach gives just one.

Multiple LLM Point-Wise/ Theoretical Prompt Optimization


Reference Critique(s)
Judges Reference-Free Insights Experiments
Madaan et al. (2024) ✗ ✓ ✓ ✗ ✓
Shinn et al. (2024) ✗ ✗ ✓ ✗ ✓
Verga et al. (2024) ✓ ✓ ✗ ✗ ✗
Xu et al. (2024) ✓ ✗ ✗ ✓ ✗
Kim et al. (2024)* ✓ ✓ ✓ ✗ ✗
He et al. (2024) ✓ ✗ ✓ ✗ ✓
Yuksekgonul et al. (2024) ✗ ✗ ✓ ✗ ✓
Badshah and Sajjad (2024) ✓ ✗ ✓ ✗ ✗
This Work ✓ ✓ ✓ ✓ ✓

Table 1: Summary of key elements of our work compared to recent relevant work. Our work is the first to use
multiple LLM natural language evaluators for prompt optimization without reference information and provide
theoretical insights. *Kim et al. (2024) does have prompt optimization experiments but does not use multiple judges
for those experiments.

083 available, such as code generation. Incorrect cri- able reference-free critique. 097
084 tiques can lead to negative update directions for While multiple LLM judges have been analyzed 098
085 inference-time processes. To mitigate the impact in other work (cf. Table 1), to the best of our knowl- 099
086 of the suboptimality gap, we prove that an ensem- edge, we are the first to include both a theoretical 100
087 ble of reference-free LLM evaluators decreases it. motivation for multiple reference-free LLMs to 101
088 We empirically validate this insight extensively by generating critiques and an empirical validation of 102
089 utilizing multiple LLM-judges in a prompt opti- the theory via prompt optimization experiments. 103
090 mization process for code generation. We explore Prior work that introduces multiple LLM judges 104
091 different aggregation methods, concatenation and such as Verga et al. (2024) do not generate critiques 105
092 summarization, and utilizing diverse LLM-judges to be used for inference-time improvement meth- 106
093 where each one judges the output on distinct cri- ods. Here, we show the direct benefits of ensemble 107
094 teria. These empirical results emphasize that de- judges that generate critiques as part of a LLM 108
095 signing the ensemble LLM-judge protocol is non- feedback loop. 109
096 trivial and warrants further research for more reli- Our contributions are summarized as follows: 110

2
111 • Theoretical Motivation for Ensemble LLM- evaluators reduces a evaluation suboptimality gap 161
112 Judges: We propose a novel formulation for and provide results with various aggregation meth- 162
113 the prompt optimization task that specifically ods. Concurrent work, CRISPO by He et al. (2024), 163
114 highlights the suboptimality gap in critiques also looks at multiple evaluators for prompt opti- 164
115 for a single LLM-judge. With this formula- mization for tasks with multiple criteria like we do. 165
116 tion, we prove that increasing the number of However, they rely on access to reference informa- 166
117 LLM judges reduces this gap with a linear tion to generate a critique. Here, we remove that 167
118 additivity assumption. access that may not be available and only input the 168
AI output to the evaluators. 169
119 • Empirical Performance Over Single-
Natural Language Critiques and Prompt Op- 170
120 Evaluation Approach: We thoroughly test
timization. Many prior works have studied us- 171
121 this method via prompt optimization pipeline
ing critiques for prompt optimization. Madaan 172
122 for code generation on three benchmarks. We
et al. (2024) was one of the first works to propose a 173
123 show up to ∼ 9% in coding problems solved
prompt iterative feedback loop for refining LLMs, 174
124 over a single-judge approach.
and Pryzant et al. (2023) established prompt gradi- 175
125 • Extensive Study of Evaluation Design: We ents, or Textual Gradients, as feedback to an AI sys- 176
126 provide multiple studies that demonstrate that tem. Concurrent work, CRISPO by He et al. (2024), 177
127 the choice of aggregation method, the number also looks at multiple evaluators for prompt opti- 178
128 of judges, and the combination of criteria can mization for tasks with multiple criteria. Prompt 179
129 significantly affect performance, emphasizing reinforcement learning with natural language cri- 180
130 that the design of an ensemble of LLM-judges tiques has also been used to improve LLM-based 181
131 is non-trivial. systems (Shinn et al., 2024; Feng et al., 2024). Due 182
to the abstract nature of raw text, theoreicall 183
132 2 Related Works
133 LLM-as-a-Judge. LLM-as-a-Judge (Zheng et al., 3 Problem Formulation 184

134 2023; Li et al., 2024), has been growing in inter- Reference-Free LLM for Generating Critiques. 185
135 est due to the ability of LLMs to evaluate large Let y be the output for a given task and y ∗ be the 186
136 outputs like text (Sellam et al., 2020; Kocmi and optimal response. For code generation, y ∗ would 187
137 Federmann, 2023) quickly and to align with human be a functionally correct, readable, and efficient so- 188
138 preferences. Prior work has also studied finetuning lution code snippet to the problem and y would be a 189
139 LLMs to be judges (Zhou et al., 2022; Xiong et al., snippet attempting to solve the problem. We would 190
140 2024). Ankner et al. (2024) used LLM-generated want y to match y ∗ as closely as possible. Mathe- 191
141 critiques to augment the scalar reward from a re- matically, we can write the optimization problem, 192
142 ward model. Li et al. (2023) used discussion be-
143 tween multiple LLMs to select a strong LLM-judge arg min l(y ∗ , y), (1) 193
y
144 for question-answering. Strong LLM judges have
145 been shown to generalize across tasks (Huang et al., where l is an objective function to capture the close- 194
146 2024). Weak LLM evaluators have been used to ness of sampled response y to the ground truth y ∗ . 195
147 judge the debate between two stronger LLMs (Ken- l is akin to a loss function in machine learning. 196
148 ton et al., 2024). l(y ∗ , y) is a natural language critique c of y com- 197
149 Ensemble LLM-Judges. Verga et al. (2024) paring it to y ∗ . We use the terms “critique” and 198
150 showed a panel of smaller LLM judges can pro- “loss” interchangeably as prior literature has estab- 199
151 vide numeric scores correlating to human judgment lished the analogy (Yuksekgonul et al., 2024). 200
152 than a single larger LLM model can. Similarly, Limitations and Challenges in Critiques. In 201
153 (Kim et al., 2024) has shown repeated sampling an ideal setting, if we had access to y ∗ as a ground- 202
154 of evaluations can also correlate to human judge- truth label for a supervised loss (Tiwari, 2022), then 203
155 ment better. (Badshah and Sajjad, 2024) used mul- we can achieve the optimal performance. How- 204
156 tiple reference-guided LLM judges for question- ever, in practice, they are hard to obtain or simply 205
157 answering. Other work has used multiple LLM- unknown for many tasks such as code generation 206
158 judges for iterative fine-tuning (Xu et al., 2024; (Chen et al., 2024). Therefore, a direct comparison 207
159 Agrawal et al., 2024). While for prompt optimiza- to an optimal output y ∗ and the resulting calcula- 208
160 tion, we theoretically characterize that increasing tion of c are both infeasible. Current SoTA work 209

3
210 instead sample an evaluation c from an evaluation set of diverse judges for x, y. We then define
211 LLM policy conditioned by the response output the sub-optimality metric, ∆Π Cri-sub-opt , as
212 y and prompt x as c ∼ π(·|x, y). Let us denote
213 πc = π(·|x, y) for notation simplicity. When pa- ∆Π Cri-sub-opt = Ec∼πc∗ (·|x,y) [c]
214 rameterizing πc with an LLM, it is known as the
− E{ck ∼πk (·|x,y)}K [g(c1 , · · · , cK )] .
215 LLM-judge. Ideally, we would like the evaluation c k=1

216 of y to be l(y ∗ , y). More specifically, let us assume (3) 250


217 the existence of an optimal LLM-judge denoted
∆ΠCri-sub-opt is difference between the expected value 251
218 by πc∗ , sampling from which will give us samples
of the critique under the optimal unknown critique 252
219 of the true loss function l(y ∗ , y). However, LLM-
distribution, and the expected function g which 253
220 judges tend to fail with no reference (Stureborg
maps the K different critiques to one. In practice, 254
221 et al., 2024). Figure 1 demonstrates this with an
g can be seen as an aggregation function such as 255
222 example LLM-judge letting an error go undetected.
concatenation. For the following theorem, we pro- 256
223 A Single Reference-Free LLM-Judge Outputs
vide the following assumption. 257
224 Suboptimal Critques. As πc∗ is unavailable as
225 discussed before, current SoTA methods sample Assumption 1. g is a linear function.
226 the critique loss from a single evaluator as c ∼ πc .
227 Now, we know that in the majority of the scenarios, If we had access to the optimal evaluator πc∗ , 258

228 πe will not be the true evaluator policy πc∗ . We now we would have been able to get the ground-truth 259

229 define the suboptimality between πc and πc∗ . critique c∗ = l(y ∗ , y) to perform the prompt op- 260
timization. However, in place of that, we have 261
Definition 1. Let c = l(ŷ, y), where ŷ is an a set of evaluators Π = (π1 , π2 · · · πK ) and 262
implicit approximation of y ∗ from πc . Under g(c1 , c2 · · · cK ) is the aggregation function to com- 263
this scenario, we define the suboptimality gap bine the critiques. We now present the follow- 264
in the critique of prior SoTA as, ing theorem to relate the number of critiques to 265
∆ΠCri-sub-opt . 266
π
∆ Cri-sub-opt
= Ec∗ ∼π∗ (·|x,y) [c∗ ] − Ec∼π(·|x,y) [c] Theorem 1. Let dTV denote the total varia-
tion
PK distance between two distributions and let
≤ |cmax |dTV (πc∗ (·|x, y), π(·|x, y)). (2)
k=1 αk = 1. Assuming all pairs π1 , π2 ∈ Π
230 In this definition, we first expand upon the are independent of one another,
231 sub-optimality in the critique and then upper- K
232 bound using the total variation distance (Sripe- ∆Π ≤ |c|max dTV (πc∗ ,
X
αk πk ). (4)
Cri-sub-opt
233 rumbudur et al., 2009). We see that the term k=1
234 dTV (πc∗ (·|x, y), π(·|x, y)) is fixed and it cannot be
235 improved once we have the evaluator π. This result Proof. First, we characterize the 267

236 shows the hardness of a single evaluator reaching sub-optimality of our proposed cri- 268
tique method as ∆ = Ec ∼πc [c ] − ∗
237 πc∗ due to this constant gap and it will only reduce ∗ ∗ 269

238 if our current LLM evaluator is near-optimal which Ec1 ∼π1 (·|x,y),c2 ∼π2 (·|x,y)···πK [g(c1 , c2 , c3 · · · cK )]. 270

239 is not true in the majority of the scenarios. Note that if ∆ is zero, we have the optimal critique. 271

240 An Ensemble of Reference-free LLM-Judges Thus, we want ∆ to be as low as possible. For 272

241 Better Models the Optimal Critique. Our key notation simplicity of the expression, we will keep 273

242 idea is to utilize multiple critiques. The thought to two evaluators without loss of generality. We 274

243 that multiple LLM-judges would work better than provide a version with K evaluators in Appendix 275

244 one sounds intuitive but a naive introduction of A.1. 276

245 multiple evaluators does not work in practice. ∆ = Ec∗ ∼πc∗ [c∗ ]
246 We start our theoretical justification by defining
247 the sub-optimality metric to measure the critique − Ec1 ∼π1 (·|x,y),c2 ∼π2 (·|x,y) [g(c1 , c2 )]
248 performance between πc∗ and Π as = Ec∗ ∼πc∗ [c∗ ] − Ec∼πd (·|x,y) [c] +
| {z } 277
∆1
249 Definition 2. Let Π = {πk (·|x, y)}K
k=1 be the
Ec∼πd (·|x,y) [c] − Ec1 ∼π1 ,c2 ∼π2 [g(c1 , c2 )],
| {z }
∆2

4
278 where we add and subtract the terms Ec∼πd (·|x,y) , We aim to sample a y ∼ π(·|x∗ ) by finding an 324
279 with πd = απ1 + (1 − α)π2 (0 < α < 1) and input prompt x∗ corresponding to x prompt such 325
280 then separate the two terms as ∆1 , ∆2 . We next that y is closer to the optimal response y ∗ . For 326
281 individually analyze the terms ∆1 , ∆2 . code generation, πθ would be the LLM generator; 327
282 We can now bound ∆1 as, x would be the input prompt; y is the generated 328
code; and the y ∗ here would be a code snippet that 329
283 ∆1 = Ec∗ ∼πc∗ [c∗ ] − Ec∼πd (·|x,y) [c] is a functionally correct, readable, and efficient 330
∗ ∗
284 ≤ |c |dTV (π , πd ) solution to the problem. Mathematically, we can 331
∗ ∗ write, 332
285 = |c |dTV (π , απ1 + (1 − α)π2 ),

286 where we use the property of integral probability x∗ = arg min Ey∼πθ (·|x) [l(y ∗ , y)]. (6) 333
x∈X
287 metric to bound ∆1 as the total variation distance
288 between the optimal critique policy and the mixture Iterative prompt optimization. Given an initial 334
289 critique policy. Next, we proceed to ∆2 , prompt x1 , we perform an iterative prompt opti- 335
mization method to find x∗ as follows. For each 336
290 ∆2 =Ec∼πd (·|x,y) [c] − iteration t = 1 to T , we start by (i) sampling yt ∼ 337
291 Ec1 ∼π1 (·|x,y),c2 ∼π2 (·|x,y) [g(c1 , c2 )] πθ (·|xt ), (ii) evaluate the response yt to obtain cri- 338
tique ct = l(y ∗ , yt ), and then finally (iii) generate 339
292 =Ec∼πd (·|x,y) [c] −
the next prompt xt+1 ∼ π(·|yt , ct , xt ). Recent 340
293 Ec1 ∼π1 (·|x,y),c2 ∼π2 (·|x,y) [αc1 + (1 − α)c2 ] work by Yuksekgonul et al. (2024) decompose step 341

294 =Ec∗ ∼πd (·|x,y) [c ] − (iii) into two separate steps and (iii.a) first generate 342

295 αEc1 ∼π1 (·|x,y) [c1 ] − (1 − α)Ec2 ∼π2 (·|x,y) [c2 ] the feedback ft ∼ π(·|yt , ct , xt ), and then (iii.b) 343
generate the next prompt xt+1 ∼ π(·|yt , ft , xt ). 344
296 =0, (5) For simplicity, we use the same variable π for all 345

297 where we expand upon the definition of ∆2 and LLM policies because the outputs are dependent 346

298 use Assumption 1 on the aggregation function. Un- on the input variables the policy is conditioned on, 347

299 der this assumption, the two terms cancel out with so the same LLM model can be utilized. 348

300 the final result ∆2 = 0. Combining both terms The success of this method is heavily dependent 349

301 concluded the proof. This bound indicates that on step (ii), obtaining the LLM-generated critiques. 350

302 the sub-optimality in critique can be expressed as A suboptimal critique can hinder the optimization 351

303 the total variation distance between the optimal process. We now show that in the reference-free 352

304 evaluator and the available mixture of evaluators. case, using an ensemble of LLM-judges will pro- 353

305 We know from Blei et al. (2003); Nguyen et al. vide faster prompt optimization and improved LLM 354

306 (2016) that as we increase the number of mixture system output. 355

307 components and diversity amongst the components 4 Experiments and Results 356
308 increase, it can approximate any distribution under
309 certain assumptions. Code Generation Experiments: We test the mer- 357
310 Our theoretical finding is for a one-shot cri- its of our ensemble judge approach via the code 358
311 tique generation. In the following section, we will generation task because of its practicalness and 359
312 discuss how to introduce them into the iterative its multiple plausible criteria (e.g., correctness, ef- 360
313 prompt optimization pipeline. Our idea is that at ficiency). Here, the LLM generator is given a 361
314 any iteration, aggregating multiple critiques will code prompt and must produce a code snippet 362
315 better model the unknown, optimal critique for the that passes the unit tests for that prompt. This 363
316 current output, thus leading to faster improvement code generation task is a form of instance opti- 364
317 than using a single LLM-judge. mization (Yuksekgonul et al., 2024), whereby the 365
optimization variable, the input prompt, is defined 366
318 3.1 Prompt Optimization with Ensemble as xt+1 := (yt , ft ). y0 , f0 are empty strings. We 367
319 LLM-Judges provide empirical results showing that prompt op- 368
320 Optimal Prompt Search. Let π(·|x) be the LLM timization with an ensemble of judges achieves 369
321 system parameterized by fixed LLM policy that higher success in test cases than single-judge-based 370
322 samples an output response y ∼ π(·|x) given an optimization. Experiments were run on an Apple 371
323 input prompt x ∈ X from the set of prompts X . M1 Pro and macOS 14.5. 372

5
Figure 2: Completion Rate (CR) over 10 iterations Readability Points (RP) for code generation on LeetCodeHard.
Over 10 iterations for each coding problem, increasing the number of judges significantly increases the functional
correctness, and having 2 judges greatly increases the RP. The line plot shows the average over the 3 trials with
a 95% confidence interval. However, for readability, continual increase does not continuously improve RP. This
shows empirically that increasing judges does not monotonically improve all aspects of task.

Judge Method Agg CR (%) RP


Self-Refine (Baseline) – 70.0 ± 4.08 52.3
Vanilla Feedback Loop (Baseline) – 65.0 ± 14.72 52.8
1 Judge (Baseline) – 71.67 ± 2.36 52.6
6 Judges – All Criteria C 80.0 ± 4.08 54.4
6 Judges – All Criteria Sum 71.67 ± 8.5 57.7
6 Judges – All Criteria Sel 78.33 ± 6.24 44.8
6 Judges – One Criterion Each C 78.33 ± 2.36 37.0
6 Judges – One Criterion Each Sum 76.67 ± 6.24 29.2
6 Judges – One Criterion Each Sel 75.0 ± 4.08 19.2
Table 2: The Completion Rate (CR) and Readability Points (RP) over LeetCodeHard comparing various ensemble
evaluation methods against inference-time improvement baselines. Ensemble methods consistently outperform
baselines in terms of CR and the two highest-ranking methods in terms of readability are ensemble. The difference
in CR and RP between ensemble methods emphasizes the non-trivial nature of designing the ensemble evaluation
protocol.

373 Implementation Details: We use TextGrad experiments and ablations. Across all trials for 390
374 from (Yuksekgonul et al., 2024) to implement the both methods, we use the same initial generated 391
375 prompt optimization pipeline. We chose TextGrad code for a given problem so both critique protocols 392
376 because it separates the critique and feedback into can judge the same code in the initial iteration. We 393
377 two separate LLM calls, making it better to analyze share the critique system prompt for both methods 394
378 the critique module in isolation. In TextGrad, the in Appendix A.2. Because we want a diversity 395
379 system prompt that generates the initial code, pinit , of critiques, we set the temperature of all LLM- 396
380 is different from the system prompt that updates the judge call to be 1. We ablate on the judge call 397
381 code in the following refinement iterations pupdate . temperature in the Appendix. All other LLM calls 398
382 At t = 0, pinit specifies to the LLM that it is a code in the Textgrad pipeline with call temperature set 399
383 generator while the pupdate from 1 ≤ t ≤ T speci- to 0 similar to Yuksekgonul et al. (2024). For all 400
384 fies that it generates a new version yt+1 given the experiments, the top_p = 0.99. 401
385 current code yt and the feedback ft . The transition
386 from pinit to pupdate is explicitly programmed and Criteria for Critiquing Code. The set of cri- 402

387 not caused by the optimization process. tique criteria we used for this task are as follows: 403
syntax errors, logic errors, correctness, readabil- 404
388 LLM Setup Details: We use GPT-4o for all ity, runtime, and code redundancy. The following 405
389 LLM calls. In the Appendix, we provide additional results are based on utilizing all these roles. We 406

6
Criteria Judge Method Agg CR (%)
Single (Baseline) – 66.67 ± 4.71
Ensemble - All Criteria C 78.33 ± 2.36
Ensemble - One Criterion Each C 71.67 ± 8.5
Correctness, Logic, Readability,
Ensemble - All Criteria Sum 66.67 ± 8.5
Redundancy, Runtime, Syntax Ensemble - One Criterion Each Sum 75.0 ± 4.08
Ensemble - All Criteria Sel 75.0 ± 8.16
Ensemble - One Criterion Each Sel 73.33 ± 4.71
Single (Baseline) – 66.67 ± 2.36
Ensemble - All Criteria C 71.67 ± 4.71
Ensemble - One Criterion Each C 76.67 ± 4.71
Readability, Redundancy, Runtime Ensemble - All Criteria Sum 73.33 ± 6.24
Ensemble - One Criterion Each Sum 70.00 ± 7.07
Ensemble - All Criteria Sel 70.00 ± 4.08
Ensemble - One Criterion Each Sel 71.67 ± 6.24
Single (Baseline) – 78.33 ± 6.24
Ensemble - All Criteria C 73.33 ± 2.36
Ensemble - One Criterion Each C 76.67 ± 2.36
Correctness, Logic, Syntax Ensemble - All Criteria Sum 78.33 ± 2.36
Ensemble - One Criterion Each Sum 75.00 ± 7.07
Ensemble - All Criteria Sel 78.33 ± 6.24
Ensemble - One Criterion Each Sel 75.00 ± 4.08
Single (Baseline) – 76.67 ± 4.71
Ensemble - All Criteria C 72.5 ± 7.50
Ensemble - One Criterion Each C 70.0 ± 7.07
Logic, Readability Ensemble - All Criteria Sum 75.00 ± 10.80
Ensemble - One Criterion Each Sum 75.00 ± 7.07
Ensemble - All Criteria Sel 80.0 ± 4.08
Ensemble - One Criterion Each Sel 70.00 ± 7.07

Table 3: Utilzing Different Roles Affects CR: This table summarizes the CR and RP for the various evaluation
methods given different combinations of roles. We report the mean and standard deviation of 3 trials for CR. We
use 10 problem of LeetCodeHard for 4 iterations each.

407 chose three roles that correlate to maximizing the the output the most, modeling a max operator on 432
408 number of passed test cases: correctness, logic, and the critiques. 433
409 syntax. We specifically chose these three to incor- Baselines: For baselines other than a single- 434
410 porate an overall correctness role with two more judge approach, we chose Self-Refine (Madaan 435
411 specific roles. et al., 2024) where the LLM code generator it- 436
412 Ensemble Design: One decision in design is eratively reflects and updates on its own output. 437
413 to give the separate LLM calls different criteria to We implement this by having a consistent system 438
414 judge the output. In all criteria, we specify to each prompt throughout all the LLM calls and only 439
415 LLM judge call that it should generate a critique of changing the user prompts. We also compared 440
416 the output based on all the criteria. Effectively, we with a vanilla feedback loop, where there is a sep- 441
417 are doing repeated sampling of the LLM-judge. In arate feedback LLM call but there is no LLM call 442
418 one criterion each, we give each judge a single cri- for explicitly critique generation. Please see the 443
419 terion to focus its judgment. Once we have gener- Appendix for more details on the prompts. 444
420 ated the critiques from all the judge LLM calls, we Metrics for Code: For correctness, we report 445
421 aggregate them. We experiment with three different the Completion Rate (CR), the percentage of 446
422 aggregation methods. 1) String concatenation (C): coding problems with all test cases passed (Yuk- 447
423 a form of addition for string objects that maintains sekgonul et al., 2024). Since we are focused on 448
424 the semantic meaning of the individual critiques; the effect of the evaluation protocol, we report the 449
425 we chose concatenation to model a linear function best-performing code generated in the optimiza- 450
426 for g with uniform weights α. 2) Summarization tion process after the initial zero-shot generation. 451
427 (Sum): another LLM to take in the critiques to Specifically, if a generated snippet at any iteration 452
428 give a final response; summarization is analogous after the initial generation passes all test cases, that 453
429 to applying a non-linear g aggregation method to problem is considered completed. For readability, 454
430 the critiques. 3) Selection (Sel): an LLM selects we take the code snippet of the last iteration of 455
431 the one critique that it believes will help improve each method we are comparing and ask a panel of 456

7
457 LLM-judges, GPT-4o, GPT-4o-mini, GPT-4-turbo, Criteria Order CR (%)
458 and GPT-4, to rank their readability of the code Correctness, Logic, Readability,
71.67 ± 8.50
459 snippets. We then calculate the Borda Count for Redundancy, Runtime, Syntax
460 each method. The Borda Count for a method is Redundancy, Logic, Correctness,
73.33 ± 2.36
461 the number of methods that rank below it. For ex- Runtime, Readability, Syntax
462 ample, a method that is the highest ranked out of Syntax, Runtime, Redundancy,
75.00 ± 7.07
463 four methods gets 3 points. For each method, we Readability, Logic, Correctness
464 sum all of the Borda Counts across all problems.
Table 4: Impact of Criteria Order CR from 10 prob-
465 To normalize across experiments that have varying lems of LeetCodeHard for 4 iterations. We used LLM
466 sets of methods, we divide the total Borda Count judges with separate criteria with concatenation. Crite-
467 by the number of methods. We call this value the ria Order specifies the order the LLM judges, affecting
468 Readability Points (RP). the resulting concatenated critique string.
469 Dataset. We use the LeetCodeHard (Shinn et al.,
470 2024) dataset containing a set of coding problem
471 prompts and multiple unit tests for each problem to Combination and Order of Evaluation Crite- 509
472 evaluate the generated code. We use 20 LeetCode- ria Affects Optimization Performance. In Table 510
473 Hard (Shinn et al., 2024) dataset problems with an 6, we analyze the effect the different combinations 511
474 average of 2 − 3 unit tests per problem. We with- of evaluation criteria have on the CR over Leet- 512
475 hold giving any of the evaluators of either method CodeHard. Similar studies have been performed 513
476 any information on unit tests to simulate the sce- with finetuning using diverse reward models (Rame 514
477 nario where unit tests may be unavailable to help et al., 2024). Surprisingly and counter-intuitively, 515
478 judge (Chen et al., 2024). Please see the Appendix we see some methods increase in correctness when 516
479 where we provide additional results on Humaneval the three criteria for correctness are removed. We 517
480 (Chen et al., 2021) and EvoEval benchmarks (Xia do see an overall increase in methods when the 518
481 et al., 2024). non-correctness criterion are removed, suggesting 519
482 How does increasing judges help empirically? the LLM judges can better focus on analyzing the 520
483 We plot the performance over 3 trials on LeetCode- functionality of the code. Because concatenating 521
484 Hard in Figure 2. In this experiment, we give the strings is not commutative like adding scalar num- 522
485 judge LLMs for all approaches all the criteria to bers. in Table 4, we provide an ablation where we 523
486 critique the output and we use concatenation to change the order of the criteria in the judge system 524
487 aggregate. For functional correctness, ensembling prompt. In this experiment, we used judges with 525
488 judges achieve higher CR rates than a single judge. one criterion each. Thus, changing the order of the 526
489 Furthermore, while using a single judge achieves call changes the concatenated string. We do not see 527
490 similar RP to using 6 judges, it has significantly a significant change in performance between the 528
491 less than RP using 2 evaluators. These results em- orderings suggesting that using ensemble judges is 529
492 pirically show that increasing judges can improve unaffected by the order of critiques. 530
493 code in both aspects but not necessarily monotoni-
494 cally.
495 How Do Design Choices for Ensemble Im-
5 Conclusion 531
496 pact Performance? In Table 2 we report the
497 mean and standard deviation for CR. We see that
498 the lowest-performing ensemble method, 6 judges In this work, we tackle reference-free LLM-judges 532
499 with all criteria with summarization, achieves the for generating natural language critiques. Our key 533
500 same mean CR as the highest-performing baseline, insight was that aggregating multiple generated cri- 534
501 single-judge. Showing the superiority of ensem- tiques reduces the suboptimality gap in evaluations 535
502 bling over baselines for correctness. There is a 9% for a given output. We theoretically motivate en- 536
503 difference in mean CR between ensemble methods. semble LLM-judges and empirically validate the 537
504 For readability, the two highest-ranked methods are paradigm with extensive prompt optimization ex- 538
505 ensemble methods. All other ensemble methods periments in code generation. We also provide 539
506 fall below the baselines in the rankings. This dif- ablations such as on the diversity of roles, role com- 540
507 ference in CR and RP between ensemble methods binations, and evaluation temperature, consistently 541
508 highlights the importance of design. demonstrating the need for multiple evaluators. 542

8
543 Limitations and Further Work Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed 592
Elhoseiny. 2022. Visualgpt: Data-efficient adapta- 593
544 We only empirically study our approach in code tion of pretrained language models for image caption- 594
545 generation. Further work could extend this evalu- ing. In Proceedings of the IEEE/CVF Conference 595
546 ation approach to other tasks that require multiple on Computer Vision and Pattern Recognition, pages 596
18030–18040. 597
547 criteria like molecule optimization or text gener-
548 ation. In terms of system complexity, we only Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin 598
549 study multiple evaluators for AI systems compris- Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing 599
Gao, Jindong Wang, et al. 2024. A survey on evalu- 600
550 ing a single LLM-based agent, and using a com- ating large language models in code generation tasks. 601
551 pound system with multiple elements such as a arXiv preprint arXiv:2408.16498. 602
552 web search agent (Agentic AI system) could be
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming 603
553 interesting. Another aspect of the work that can be Yuan, Henrique Ponde De Oliveira Pinto, Jared Ka- 604
554 explored further is weighting the different LLM- plan, Harri Edwards, Yuri Burda, Nicholas Joseph, 605
555 based evaluations. We gave uniform weighting via Greg Brockman, et al. 2021. Evaluating large 606
556 concatenation. However, further work could try language models trained on code. arXiv preprint 607
arXiv:2107.03374. 608
557 and adaptively change the weighting as the out-
558 put progresses, representing the need to change the Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning 609
559 focus of evaluation over time. Another research di- Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. 610
2023. Black-Box Prompt Optimization: Aligning 611
560 rection involves removing the linearity assumption Large Language Models without Model Training. 612
561 on g. arXiv e-prints, arXiv:2311.04155. 613

562 Acknowledgements Xidong Feng, Ziyu Wan, Mengyue Yang, Ziyan Wang, 614
Girish A Koushik, Yali Du, Ying Wen, and Jun 615
563 ChatGPT (4o) was used to help with coding exper- Wang. 2024. Natural language reinforcement learn- 616
ing. CoRR. 617
564 iments.
Sumit Gulwani. 2010. Dimensions in program synthe- 618
sis. In Proceedings of the 12th international ACM 619
565 References SIGPLAN symposium on Principles and practice of 620
declarative programming, pages 13–24. 621
566 Josh Achiam, Steven Adler, Sandhini Agarwal, Lama
567 Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, 622
568 Diogo Almeida, Janko Altenschmidt, Sam Altman, Yi Zhang, Sundararajan Srinivasan, and Katrin Kirch- 623
569 Shyamal Anadkat, et al. 2023. Gpt-4 technical report. hoff. 2024. Crispo: Multi-aspect critique-suggestion- 624
570 arXiv preprint arXiv:2303.08774. guided automatic prompt optimization for text gener- 625
ation. arXiv preprint arXiv:2410.02748. 626
571 Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao
Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and 627
572 Deng, Anirudh Satheesh, John Langford, and Furong
Tiejun Zhao. 2024. An empirical study of llm- 628
573 Huang. 2024. Ensemw2s: Can an ensemble of llms
as-a-judge for llm evaluation: Fine-tuned judge 629
574 be leveraged to obtain a stronger llm? arXiv preprint
models are task-specific classifiers. arXiv preprint 630
575 arXiv:2410.04571.
arXiv:2403.02839. 631
576 Zachary Ankner, Mansheej Paul, Brandon Cui, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B 632
577 Jonathan D Chang, and Prithviraj Ammanabrolu. Brown, Benjamin Chess, Rewon Child, Scott Gray, 633
578 2024. Critique-out-loud reward models. arXiv Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. 634
579 preprint arXiv:2408.11791. Scaling laws for neural language models. arXiv 635
preprint arXiv:2001.08361. 636
580 Sher Badshah and Hassan Sajjad. 2024. Reference-
581 guided verdict: Llms-as-judges in automatic Zachary Kenton, Noah Y Siegel, János Kramár, 637
582 evaluation of free-form text. arXiv preprint Jonah Brown-Cohen, Samuel Albanie, Jannis Bu- 638
583 arXiv:2408.09235. lian, Rishabh Agarwal, David Lindner, Yunhao Tang, 639
Noah D Goodman, et al. 2024. On scalable oversight 640
584 Lochan Basyal and Mihir Sanghvi. 2023. Text with weak llms judging strong llms. arXiv preprint 641
585 summarization using large language models: a arXiv:2407.04622. 642
586 comparative study of mpt-7b-instruct, falcon-7b-
587 instruct, and openai chat-gpt models. arXiv preprint Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne 643
588 arXiv:2310.10449. Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin 644
Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, et al. 645
589 David M Blei, Andrew Y Ng, and Michael I Jordan. 2024. The biggen bench: A principled benchmark 646
590 2003. Latent dirichlet allocation. Journal of machine for fine-grained evaluation of language models with 647
591 Learning research, 3(Jan):993–1022. language models. arXiv preprint arXiv:2406.05761. 648

9
649 Tom Kocmi and Christian Federmann. 2023. Large lan- Benedikt Stroebl, Sayash Kapoor, and Arvind 705
650 guage models are state-of-the-art evaluators of trans- Narayanan. 2024. Inference scaling Flaws: The 706
651 lation quality. In Proceedings of the 24th Annual limits of llm resampling with imperfect verifiers. 707
652 Conference of the European Association for Machine Preprint, arXiv:2411.17501. 708
653 Translation, pages 193–203, Tampere, Finland. Euro-
654 pean Association for Machine Translation. Rickard Stureborg, Dimitris Alikaniotis, and Yoshi 709
Suhara. 2024. Large language models are in- 710
655 Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad consistent and biased evaluators. arXiv preprint 711
656 Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhat- arXiv:2405.01724. 712
657 tacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu,
658 et al. 2024. From generation to judgment: Opportuni- Ashish Tiwari. 2022. Chapter 2 - supervised learn- 713
659 ties and challenges of llm-as-a-judge. arXiv preprint ing: From theory to applications. In Rajiv Pandey, 714
660 arXiv:2411.16594. Sunil Kumar Khatri, Neeraj kumar Singh, and Parul 715
Verma, editors, Artificial Intelligence and Machine 716
661 Ruosen Li, Teerth Patel, and Xinya Du. 2023. Prd: Peer Learning for EDGE Computing, pages 23–32. Aca- 717
662 rank and discussion improve large language model demic Press. 718
663 based evaluations. arXiv preprint arXiv:2307.02762.
664 Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier 719
665 Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Martinet, Marie-Anne Lachaux, Timothée Lacroix, 720
666 Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Baptiste Rozière, Naman Goyal, Eric Hambro, 721
667 et al. 2024. Self-refine: Iterative refinement with Faisal Azhar, et al. 2023. Llama: Open and effi- 722
668 self-feedback. Advances in Neural Information Pro- cient foundation language models. arXiv preprint 723
669 cessing Systems, 36. arXiv:2302.13971. 724

670 Hien D Nguyen, Luke R Lloyd-Jones, and Geoffrey J Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yix- 725
671 McLachlan. 2016. A universal approximation theo- uan Su, Aleksandra Piktus, Arkady Arkhangorodsky, 726
672 rem for mixture-of-experts models. Neural computa- Minjie Xu, Naomi White, and Patrick Lewis. 2024. 727
673 tion, 28(12):2585–2593. Replacing judges with juries: Evaluating llm gen- 728
erations with a panel of diverse models. Preprint, 729
674 Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, and arXiv:2404.18796. 730
675 Junhua Ding. 2024. A comparative study of quality
676 evaluation methods for text summarization. arXiv Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, 731
677 preprint arXiv:2407.00747. Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. 732
Xing, and Zhiting Hu. 2023. PromptAgent: 733
678 Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chen- Strategic Planning with Language Models Enables 734
679 guang Zhu, and Michael Zeng. 2023. Automatic Expert-level Prompt Optimization. arXiv e-prints, 735
680 prompt optimization with" gradient descent" and arXiv:2310.16427. 736
681 beam search. arXiv preprint arXiv:2305.03495.
Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 737
682 Alexandre Rame, Guillaume Couairon, Corentin 2024. Top leaderboard ranking = top coding pro- 738
683 Dancette, Jean-Baptiste Gaya, Mustafa Shukor, ficiency, always? evoeval: Evolving coding bench- 739
684 Laure Soulier, and Matthieu Cord. 2024. Rewarded marks via llm. arXiv preprint. 740
685 soups: towards pareto-optimal alignment by inter-
686 polating weights fine-tuned on diverse rewards. Ad- Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, 741
687 vances in Neural Information Processing Systems, Haoqi Fan, Quanquan Gu, Heng Huang, and Chun- 742
688 36. yuan Li. 2024. Llava-critic: Learning to evaluate mul- 743
689 Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. timodal models. arXiv preprint arXiv:2410.02712. 744
690 BLEURT: Learning robust metrics for text genera-
691 tion. In Proceedings of the 58th Annual Meeting of Tengyu Xu, Eryk Helenowski, Karthik Abinav 745
692 the Association for Computational Linguistics, pages Sankararaman, Di Jin, Kaiyan Peng, Eric Han, Shao- 746
693 7881–7892, Online. Association for Computational liang Nie, Chen Zhu, Hejia Zhang, Wenxuan Zhou, 747
694 Linguistics. et al. 2024. The perfect blend: Redefining rlhf with 748
mixture of judges. arXiv preprint arXiv:2409.20370. 749
695 Noah Shinn, Federico Cassano, Ashwin Gopinath,
696 Karthik Narasimhan, and Shunyu Yao. 2024. Re- Mert Yuksekgonul, Federico Bianchi, Joseph Boen, 750
697 flexion: Language agents with verbal reinforcement Sheng Liu, Zhi Huang, Carlos Guestrin, and James 751
698 learning. Advances in Neural Information Process- Zou. 2024. Textgrad: Automatic “differentiation” via 752
699 ing Systems, 36. text. arXiv preprint arXiv:2406.07496. 753

700 Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie 754
701 Gretton, Bernhard Schölkopf, and Gert RG Lanck- Lu, Bingchao Wu, Bei Guan, Yongji Wang, and 755
702 riet. 2009. On integral probability metrics,\phi- Jian-Guang Lou. 2022. Large language mod- 756
703 divergences and binary classification. arXiv preprint els meet nl2code: A survey. arXiv preprint 757
704 arXiv:0901.2698. arXiv:2212.09420. 758

10
759 Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
760 Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
761 Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
762 Joseph E. Gonzalez, and Ion Stoica. 2023. Judg-
763 ing llm-as-a-judge with mt-bench and chatbot arena.
764 Preprint, arXiv:2306.05685.
765 Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han,
766 Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy
767 Ba. 2022. Large language models are human-level
768 prompt engineers. arXiv preprint arXiv:2211.01910.

11
769 A Appendix A.2 Judge System Prompt 804

770 A.1 Proof of Theorem 1 Extended to K We provide the judge system prompt in Figure 3. 805

771 evaluators For Single-Eval the system prompt is given to only 806
one LLM call and all the roles utilized are listed 807
772 We present the proof of Theorem 4 generalized to together in [INSERT UTILIZED CRITERIA]. For 808
773 K evaluators. ensemble with separate criteria, each evaluator gets 809
774 Proof. First, we characterize the
775 sub-optimality of our proposed cri- one specified in [INSERT UTILIZED CRITERIA]. 810
776 tique method as ∆ = Ec∗ ∼πc∗ [c∗ ] −
777 Ec1 ∼π1 (·|x,y),c2 ∼π2 (·|x,y)···πK [g(c1 , c2 , c3 · · · cK )]. A.3 Baseline Details 811

778 Note that if ∆ is zero, we have the optimal critique. Here are the details for the two baselines that do 812
779 Thus, we want ∆ to be as low as possible. not incorporate a separate evaluation protocol. 813
∗ Self-Refine: In self-refine (Madaan et al., 2024), 814
∆ = Ec∗ ∼πc∗ [c ]
the system prompt p is constant throughout the 815
− Ec1 ∼π1 (·|x,y),c2 ∼π2 (·|x,y)···πK [g(c1 , c2 , ..., cK )]
initial generation, feedback, and update stages. 816
= Ec∗ ∼πc∗ [c∗ ] − Ec∼πd (·|x,y) [c] +
780 | {z } During the feedback and update stages, the user 817
∆1
prompts is modified to specify that an output is 818
Ec∼πd (·|x,y) [c] − Ec1 ∼π1 (·|x,y),···πK [g(c1 , c2 , ..., cK )],
| {z } already given and the LLM must either now self- 819
∆2
reflect to generate feedback or must use both the 820

781 where we add and subtract the terms Ec∼πd (·|x,y) , output and feedback to generate and update the re- 821

with πd =
PK PK sponse. We provide the feedback and user prompts 822
782 i=1 αi πi ( i=1 α = 1) and then
783 separate the two terms as ∆1 , ∆2 . We next individ- in Figure 4. 823

784 ually analyze the terms ∆1 , ∆2 . Vanilla Feedback Loop: A separate LLM pro- 824
vides feedback to the LLM generator. The system 825
785 We can now bound ∆1 as,
prompt for the update generation is different than 826

786

∆1 = Ec∗ ∼πc∗ [c ] − Ec∼πd (·|x,y) [c] the one for the initial generation. 827
We provide addtional results with HumanEval 828
787 ≤ |c∗ |dTV (π ∗ , πd ) (Chen et al., 2021) and EvoEval (Xia et al., 2024) 829
K
X benchmarks. HumanEval is a standard code gener- 830
788 = |c∗ |dTV (π ∗ , αi πi ), ation benchmark and EvoEval (we specifically use 831
i=1 EvoEval-Difficult) is a more recent one that adapts 832

789 where we use the property of integral probability the questions of HumanEval to have more addi- 833

790 metric to bound ∆1 as the total variation distance tional constraints and requirements. We see that 834

791 between the optimal critique policy and the mixture for the harder benchmark of EvoEval, the benefits 835

792 critique policy. Next, we proceed to ∆2 , of ensembling LLM judges is more clear. 836

793 ∆2 =Ec∼πd (·|x,y) [c] −


794 Ec1 ∼π1 (·|x,y),··· ,cK ∼πK (·|x,y) [g(c1 , · · · , cK )]
795 =Ec∼πd (·|x,y) [c] −
"K #
X
796 Ec1 ∼π1 (·|x,y),··· ,cK ∼πK (·|x,y) αi ci
i=1
K
X

797 =Ec∗ ∼πd (·|x,y) [c ] − αi Eci ∼πi (·|x,y) [ci ]
i=1
798 =0, (7)

799 where we expand upon the definition of ∆2 and


800 use Assumption 1 on the aggregation function. Un-
801 der this assumption, the two terms cancel out with
802 the final result ∆2 = 0. Combining both terms
803 concluded the proof.

12
Figure 3: Judge System Prompt.

Figure 4: System Prompts for Vanilla Feedback Loop (Left) and User Prompts for Self-Refine and Vanilla Feedback
Loop (Right)

Judge Method Agg EvoEval CR (%) HumanEval CR (%)


Vanilla Feedback Loop (Baseline) – 40.0 ± 16.33 90.0 ± 0.0
Self-Refine (Baseline) – 33.33 ± 9.43 90.0 ± 0.0

1 Judge (Baseline) – 50.0 ± 0.0 90.0 ± 0.0


6 Judges – All Criteria C 40.0 ± 8.16 86.67 ± 4.71
6 Judges – All Criteria Sum 50.0 ± 0.0 86.67 ± 4.71
6 Judges – All Criteria Sel 44.44 ± 7.86 86.67 ± 4.71
6 Judges – One Criterion Each C 50.0 ± 8.16 90.0 ± 0.0
6 Judges – One Criterion Each Sum 50.0 ± 8.16 86.67 ± 4.71
6 Judges – One Criterion Each Sel 60.0 ± 8.16 82.59 ± 5.32
Table 5: The CR over HumanEval and EvoEval-Difficult comparing various ensemble evaluation methods against
inference-time improvement baselines. Ensemble methods consistently outperform baselines CR and the two highest
ranking methods in terms of readability are ensemble. The difference in CR and RP between ensemble methods
emphasize the non-trivial nature of designing the ensemble evaluation protocol. We use 10 questions from each
dataset and use 4 iterations of prompt optimization per question.

13
Judge Temperature Judge Method CR (%)
Single (Baseline) 76.67 ± 4.71
0 Ensemble - All Criteria 71.67 ± 4.71
Ensemble - One Criterion Each 73.33 ± 4.71
Single (Baseline) 71.67 ± 4.71
0.25 Ensemble - All Criteria 73.33 ± 6.24
Ensemble - One Criterion Each 71.67 ± 4.71
Single (Baseline) 75.0 ± 4.08
0.50 Ensemble - All Criteria 73.33 ± 6.24
Ensemble - One Criterion Each 78.33 ± 2.36
Single (Baseline) 75.0 ± 4.08
0.75 Ensemble - All Criteria 76.67 ± 4.71
Ensemble - One Criterion Each 71.67 ± 8.5
Single (Baseline) 66.67 ± 4.71
1 Ensemble - All Criteria 78.33 ± 2.36
Ensemble - One Criterion Each 71.67 ± 8.50
Table 6: Temperature Ablation: This table summarizes the CR for the various evaluation methods given different
combinations of roles. We report the mean and standard deviation of 3 trials for CR.

837 A.4 Ensemble Methods Outpeform wrong one. 861


838 Single-Judge Method with Incorrect
839 Judge. Judge Method Aggregation CR (%)
Single – 68.33 ± 6.24
6 - All Criteria C 75.0 ± 4.08
840 To highlight the robustness of ensemble judges to
6 - All Criteria Sum 75.0 ± 4.08
841 incorrect evaluations, we introduce an adversarial 6 - All Criteria Sel 78.33 ± 4.71
842 evaluator. For ensemble, methods with separate 6 - Separate Criteria C 76.67 ± 8.5
843 evaluation instructions, we specify in the system 6 - Separate Criteria Sum 76.67 ± 6.24
844 prompt of the correctness judge to always generate 6 - Separate Criteria Sel 71.67 ± 11.79
845 a critique stating that the code solution works. Sim-
846 ilarly, for methods where the system prompt has all Table 7: CR on LeetCodeHard with one purposefully
incorrect judge that always says the code is correct.
847 the criteria, we specify to output a critique claiming
Unsurprisingly, ensemble still outperforms single.
848 that code works when discussing correctness. We
849 repeat the same experiment but attack the readabil-
850 ity criteria by instructing the evaluation protocols
851 to generate a critique stating that the code is read-
852 able. We run all prompt optimization processes for
853 4 iterations.
854 The results of both experiments are shown in
855 Table 7. In both experiments, ensemble methods
856 still outperforms the single-judge approach in terms
857 of CR, with the worst-performing ensemble method
858 having at least the same mean CR as a single-judge.
859 We believe this results is intuitively because having
860 multiple critiques will lessen the influence of one

14

You might also like