This repository provides the code for the paper Training-free LLM Verification via Recycling Few-shot Examples, along with the model responses and data used in our experiments so that the results can be reproduced.
- Create a new conda environment with Python 3.10:

  ```bash
  conda create -n my_env python=3.10
  ```

- Activate the environment and install dependencies:

  ```bash
  conda activate my_env
  pip install -r requirements.txt
  ```

- Install the LaTeX-to-SymPy converter:

  ```bash
  cd math500/latex2sympy2
  pip install -e .
  ```
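To verify the converter installation, you can try parsing a simple LaTeX expression. This is only a sanity-check sketch and assumes the local package exposes the same `latex2sympy` function as the upstream `latex2sympy2` project:

```python
# Sanity check for the LaTeX-to-SymPy converter.
# Assumes the upstream latex2sympy2 API, which exposes a `latex2sympy` function.
from latex2sympy2 import latex2sympy

expr = latex2sympy(r"\frac{1}{2} + \frac{1}{3}")
print(expr)          # parsed SymPy expression
print(expr.evalf())  # numeric value, roughly 0.8333
```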
Most shell scripts inside `sh/` (e.g., `likelihood_all_gpt.sh`, `response_all_gpt.sh`) include `#SBATCH` directives and are meant to be submitted to a Slurm scheduler.

If you're running on a single-node machine, in Docker, or on a cloud provider without Slurm (e.g., Vast.ai), refer to the example script without Slurm in the same folder and write a new one based on it, for example:

```bash
bash sh/vast_likelihood_all_gpt.sh
```

We ran our experiments primarily on A6000 GPUs.
- Open and run `generate.ipynb`.
- For LLaMA models, use `lm-eval-harness`. `parquet.ipynb` creates the prompts for LLaMA and saves them in Parquet format.
- Accuracy can be checked by running the `.ipynb` file in each task folder.
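As an illustration of what the Parquet step produces, the sketch below reads a prompt file with pandas. The file path and the `prompt` column name are assumptions for illustration and may differ from what `parquet.ipynb` actually writes:

```python
# Sketch: inspect a prompt file produced by parquet.ipynb.
# The path and the "prompt" column are assumptions, not the repo's exact schema.
import pandas as pd

df = pd.read_parquet("math500/prompts.parquet")
print(df.columns.tolist())  # check which fields are actually stored
print(df.iloc[0])           # look at one prompt record
```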
- Run the likelihood script:

  ```bash
  ./sh/likelihood_all_gpt.sh
  ```

- Ensure you have set the correct `model_name`, `input_dir`, and `output_dir` variables in the script.
- Running this script creates an `all_likelihoods.json` file in `output_dir`.
- This file is used to calculate the `backward consistency` score.
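If you want to inspect the output before moving on, a minimal sketch is shown below. It only assumes that `all_likelihoods.json` is a standard JSON file in `output_dir`; the exact keys and nesting are not documented here and may differ:

```python
# Sketch: peek at the likelihood output. The fields printed depend on the actual
# structure of all_likelihoods.json, which is an assumption in this example.
import json

with open("output_dir/all_likelihoods.json") as f:
    likelihoods = json.load(f)

print(type(likelihoods))
# Show the first entry, whether the top level is a list or a dict.
first = likelihoods[0] if isinstance(likelihoods, list) else next(iter(likelihoods.items()))
print(first)
```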
- Execute the response-based baseline script:

  ```bash
  ./sh/response_all_gpt.sh
  ```

- Verify the same configuration variables (`model_name`, `input_dir`, `output_dir`) in this script as well.
- Running this script creates `{task}_few_few.jsonl` and `{task}_few_zero.jsonl` files in `output_dir`.
- The `{task}_few_few.jsonl` file is used to calculate the `forward confidence` score, and the `{task}_few_zero.jsonl` file is used to calculate the `direct` score.
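Both outputs are JSONL files (one JSON object per line), so a small generic loader is enough to inspect them. In the sketch below, `math500` is used as a hypothetical task name; substitute the task you actually ran:

```python
# Sketch: load a JSONL output file line by line.
# "math500" and the output_dir path are example values, not fixed names.
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

few_few = load_jsonl("output_dir/math500_few_few.jsonl")    # forward confidence inputs
few_zero = load_jsonl("output_dir/math500_few_zero.jsonl")  # direct score inputs
print(len(few_few), len(few_zero))
```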
- After likelihood computation, update the `"is_correct"` key in the `all_likelihoods.json` file.
- Helper notebooks are available in each task folder (e.g., `math500/math500.ipynb`) to guide this update.
- Look for the `update_predictions_with_is_correct` function in those notebooks.
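Conceptually, this step marks each prediction with whether it matches the gold answer. The sketch below is a simplified, hypothetical stand-in for `update_predictions_with_is_correct`; the real function in each task notebook uses task-specific answer matching, and the `prediction`/`answer` keys and list layout here are assumptions:

```python
# Hypothetical, simplified version of the is_correct update.
# The record layout ("prediction", "answer" keys, list of dicts) is an assumption.
import json

def update_is_correct(path, is_equal=lambda pred, gold: pred.strip() == gold.strip()):
    with open(path) as f:
        data = json.load(f)
    for record in data:
        record["is_correct"] = is_equal(record["prediction"], record["answer"])
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

update_is_correct("output_dir/all_likelihoods.json")
```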
- Use `check.ipynb` to run our method on the updated likelihood data.
- You can check `direct_score`, `forward_score`, `backward_score`, and `referi`, which represents our final score.
- You can also see the `no_replace`-related metrics; see Appendix B of our paper for details.
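As a rough illustration of how per-candidate scores can be combined to select a response, the sketch below assumes each candidate carries `direct_score`, `forward_score`, and `backward_score` with higher values being better, and uses equal weights. The actual combination used for `referi` is defined in `check.ipynb` and in the paper, so treat this only as an illustration:

```python
# Illustrative only: pick the candidate with the best combined score.
# Equal weighting is an assumption; check.ipynb defines the actual referi score.
def select_best(candidates, weights=(1.0, 1.0, 1.0)):
    w_d, w_f, w_b = weights
    def combined(c):
        return (w_d * c["direct_score"]
                + w_f * c["forward_score"]
                + w_b * c["backward_score"])
    return max(candidates, key=combined)

candidates = [
    {"text": "answer A", "direct_score": 0.2, "forward_score": 0.5, "backward_score": 0.4},
    {"text": "answer B", "direct_score": 0.3, "forward_score": 0.6, "backward_score": 0.7},
]
print(select_best(candidates)["text"])  # -> "answer B"
```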
- `cot_wp.ipynb`: an implementation of the paper Chain-of-Thought Reasoning Without Prompting.
  - Requires the `{task}_few_few.jsonl` file generated in Step 3.
- `USC`: this baseline is based on Universal Self-Consistency for Large Language Model Generation.
- `LEAP`: this baseline is based on In-Context Principle Learning from Mistakes.
- The `baselines/` folder contains the following:
  - `response_likelihood_*.py`: calculates the forward and direct scores required by our metric.
We adapted the original implementations from the reference repositories of each benchmark as listed below.
If you find this work useful for your research, please cite our paper:
@article{lee2025training,
title={Training-free LLM Verification via Recycling Few-shot Examples},
author={Lee, Dongseok and Hong, Jimyung and Kim, Dongyoung and Kim, Jaehyung},
journal={arXiv preprint arXiv:2506.17251},
year={2025}
}