- Introduction
- Quick Start
- Evaluating OSS Model
- Evaluating API Model
- Viewing Results
- Additional Help
- Updating Leaderboard
We develop a systematic personalized data synthesis framework and construct PTBench, the first benchmark for personalized tool invocation, enabling a comprehensive evaluation of models' ability to invoke tools based on user information.
See the live leaderboard at PTBench
# Create a new Conda environment with Python 3.9
conda create -n PTBench python=3.9
conda activate PTBench
# Clone the PTBench repository
git clone --depth 1 https://github.com/hyfshadow/PTBench.git
# Change into the PTBench directory
cd PTBench
# Install the required packages
pip install -r requirements.txt
Set up your config in config.yaml. If you want to change more specific parameters, such as the temperature, you can edit the code in src/predict_oss.py.
Model | Template |
---|---|
Qwen2.5 | qwen |
Llama 3 | llama3 |
Mistral | mistral |
xLAM | xlam |
Hammer | hammer |
DeepSeek-R1 (Distill) | deepseek3 |
You can also add models that are not listed here; see Additional Help for details.
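For illustration, the model settings in config.yaml might look roughly like the sketch below. The key names here are assumptions (only output_dir is mentioned elsewhere in this README), so check config.yaml itself for the actual fields.

```yaml
# Hypothetical sketch only - consult config.yaml for the real key names.
model_name_or_path: Qwen/Qwen2.5-7B-Instruct   # local path or Hugging Face model ID
template: qwen                                 # a template name from the table above
output_dir: ./results                          # where predictions and metrics are written
```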
python run.py --type oss
Set up your API key and base URL in config.yaml. If you want to change more specific parameters, such as the temperature, you can edit the code in src/predict_api.py.
ATTENTION: Only the OpenAI API is supported.
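As a rough illustration, the API settings in config.yaml might look like the sketch below. The key names (api_key, base_url, model) are assumptions, so check config.yaml for the actual fields.

```yaml
# Hypothetical sketch only - consult config.yaml for the real key names.
api_key: sk-...                        # your OpenAI API key
base_url: https://api.openai.com/v1   # or another OpenAI-compatible endpoint
model: gpt-4o                          # model name served by the endpoint
output_dir: ./results                  # where predictions and metrics are written
```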
python run.py --type api
You can find your results in the output_dir you specified. The results are divided into three parts (untrained-user, trained-user, and overall), each containing accuracy and error-analysis results. For more specific details, you can read our paper.
You can add the template in src/template.py. If your model's template is similar to Qwen's, differing only in the tokenizer's special tokens, you can use the given prompt directly. Otherwise, you will need to define your own return_prompt() function and use it in register_template(), as is done for the xLAM template.
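As an illustration, a custom template might look roughly like the sketch below. The argument names, special tokens, and the register_template() call shown here are assumptions, so mirror the existing templates (for example, the xLAM one) in src/template.py rather than copying this verbatim.

```python
# Hypothetical sketch: the real signatures of return_prompt() and
# register_template() are defined in src/template.py and may differ.

def my_return_prompt(system_prompt: str, user_query: str) -> str:
    """Build the prompt string in the chat format your model expects."""
    # Replace the placeholder markers below with your tokenizer's special tokens.
    return (
        f"<|system|>\n{system_prompt}\n"
        f"<|user|>\n{user_query}\n"
        f"<|assistant|>\n"
    )

# Register the template under the name you will reference in config.yaml,
# mirroring how the xLAM template is registered:
# register_template("my_model", my_return_prompt)
```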
You can change the answer format in src/template.py to better fit your model. You should also change how the answer is parsed in src/parser.py. A second parsing routine, ast_parse_2(), is already provided for answers of the form {'platform': platform_name, 'functions': [{'name': func1_name, 'parameters': {param1_name: param1_value, param2...}}, func2...]}.
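For reference, a minimal sketch of that kind of parsing is shown below. The helper name parse_answer() is hypothetical; the actual logic lives in ast_parse_2() in src/parser.py.

```python
import ast

def parse_answer(text: str):
    """Hypothetical sketch of parsing an answer of the form
    {'platform': ..., 'functions': [{'name': ..., 'parameters': {...}}, ...]}."""
    # Keep only the outermost dict literal in case the model adds extra prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no dict literal found in the model output")
    answer = ast.literal_eval(text[start:end + 1])

    platform = answer.get("platform")
    calls = [(f.get("name"), f.get("parameters", {})) for f in answer.get("functions", [])]
    return platform, calls
```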
If you want your results to appear on the leaderboard, send an email to [email protected]. Your email should include your model's public name on Hugging Face and your evaluation result .csv files.