
Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs


🔔 Code • 📃 Paper • 🤗 Dataset

Abstract

Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora—the "fossil fuel" of modern AI—as predicted by scaling laws. However, the diminishing supply of high-quality, human-annotated data, especially in specialized domains, demands a shift toward synthetic data as a new energy source for further advancements. In this paper, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structural Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on medical benchmarks demonstrate that our SENATOR agent effectively supplements the pretraining corpus by injecting missing domain-specific information, leading to significant performance gains in models such as Llama-3 and Qwen2. Our findings highlight the potential of synthetic data as the "new energy" for LLMs, paving the way for more efficient and scalable strategies to sustain and enhance model performance.
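As a rough illustration of the kind of quantity the SE metric builds on, the one-dimensional structural entropy of an undirected graph can be computed directly from its degree sequence. This is a minimal sketch of the general definition, not the repository's implementation:

```python
import math

def structural_entropy(degrees):
    """One-dimensional structural entropy of an undirected graph,
    given its degree sequence: H = -sum_i (d_i / 2m) * log2(d_i / 2m),
    where d_i is the degree of node i and m the number of edges."""
    two_m = sum(degrees)  # the degree sum equals 2 * number of edges
    return -sum((d / two_m) * math.log2(d / two_m) for d in degrees if d > 0)

# 4-cycle: every node has degree 2, so each term is (1/4) * log2(4)
print(structural_entropy([2, 2, 2, 2]))  # -> 2.0
```

Higher-dimensional structural entropy, as used with encoding trees in the paper's setting, refines this quantity over a hierarchical partition of the graph.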

Figure 1: The overall framework of SENATOR

Distribution Analysis


Prepare Environments

```shell
git clone https://github.com/weiyifan1023/senator.git
cd senator
conda create -n senator python=3.10.9
conda activate senator
pip install -r requirements.txt
```

Quick Start

1. Data preparation

The seed entities of the SPOKE KG are derived from the KG_RAG project.

Download the instruction-tuning dataset from the PMC-LLaMA paper.

Place the entire ./data/benchmark_data folder under the repository root.

Preprocess your datasets to SFT format by running:

```shell
cd llm_rlhf/step1_supervised_finetuning/train_scripts
python preprocessing.py
```
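The exact schema is produced by preprocessing.py; as a hypothetical illustration only, a typical SFT record pairs a prompt with a target response, one JSON object per example:

```python
import json

def to_sft_record(question, answer):
    """Hypothetical SFT record layout (illustrative, not the repo's schema):
    pair each question with its reference answer as prompt/response fields."""
    return {"prompt": question, "response": answer}

record = to_sft_record(
    "Which organ is primarily affected in hepatitis?",
    "The liver.",
)
print(json.dumps(record))
```

Check the output of preprocessing.py for the field names the training scripts actually expect.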

2. MCTS for Knowledge Deficiency and Synthetic Data Generation

Here, we initialize the subgraph for MCTS and explore maximum-entropy paths (you can customize the search depth):

```shell
python prompt_based_generation/MedLLMs/gen_synthetic_data.py
```
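For readers unfamiliar with MCTS, its core selection step balances exploitation against exploration via the UCT rule. This is a sketch of the generic technique, not SENATOR's code; the node fields and the exploration constant `c` are illustrative assumptions:

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing UCT = value/visits + c*sqrt(ln(N)/visits),
    where N is the total visit count across siblings. Unvisited children
    get an infinite score so they are always expanded first."""
    total = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # force exploration of unvisited nodes
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore

    return max(children, key=score)

children = [
    {"visits": 10, "value": 6.0},
    {"visits": 3, "value": 2.5},
    {"visits": 0, "value": 0.0},
]
print(uct_select(children))  # the unvisited child is selected first
```

In SENATOR, the reward guiding this search is derived from structural entropy along KG paths, steering the tree toward regions where the model's knowledge is most uncertain.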

3. Deficiency Knowledge Repair (SFT)

We take Qwen2-7B as an example; update the relevant paths in run_qwen.sh, then execute:

```shell
cd llm_rlhf/step1_supervised_finetuning
bash train_scripts/qwen2/run_qwen.sh
```

4. Evaluation

To evaluate on the MedQA, MedMCQA, and PubMedQA datasets, run:

```shell
cd prompt_based_generation/MedLLMs
python eval_medical_qa.py
```

Citation

If you find our work inspiring or use it in your research, please cite our paper:

```bibtex
@article{wei2025structural,
  title={Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs},
  author={Wei, Yifan and Yu, Xiaoyan and Pan, Tengfei and Li, Angsheng and Du, Li},
  journal={arXiv preprint arXiv:2505.07184},
  year={2025}
}
```

Acknowledgements

Thanks to the authors of KG-RAG and DAMe for releasing their code for retrieving the SPOKE KG and evaluating SE on the graph. Much of this codebase is adapted from their work.

Contact

[email protected] and [email protected]
