Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora—the "fossil fuel" of modern AI—as predicted by scaling laws. However, the diminishing supply of high-quality, human-annotated data, especially in specialized domains, demands a shift toward synthetic data as a new energy source for further advancements. In this paper, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structural Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on medical benchmarks demonstrate that our SENATOR agent effectively supplements the pretraining corpus by injecting missing domain-specific information, leading to significant performance gains in models such as Llama-3 and Qwen2. Our findings highlight the potential of synthetic data as the "new energy" for LLMs, paving the way for more efficient and scalable strategies to sustain and enhance model performance.
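To make the core loop concrete, below is a minimal toy sketch in Python (not the released implementation): it scores knowledge-graph paths with a simplified one-dimensional structural entropy term and uses plain random rollouts in place of full MCTS with selection and backpropagation. The example graph, seed entity, and function names are illustrative assumptions.

import math
import random
import networkx as nx

def node_entropy(graph, node):
    # Contribution of one node to the one-dimensional structural entropy:
    # -(d_v / 2m) * log2(d_v / 2m), with d_v the node degree and m the edge count.
    two_m = 2.0 * graph.number_of_edges()
    p = graph.degree(node) / two_m
    return -p * math.log2(p) if p > 0 else 0.0

def path_entropy(graph, path):
    # Score a path by summing the entropy contributions of the nodes it visits.
    return sum(node_entropy(graph, v) for v in path)

def explore_max_entropy_path(graph, seed, depth=3, n_rollouts=50):
    # Monte Carlo rollouts from a seed entity; keep the highest-entropy path
    # as a proxy for the region where the model is most likely knowledge-deficient.
    best_path, best_score = [seed], 0.0
    for _ in range(n_rollouts):
        path, current = [seed], seed
        for _ in range(depth):
            neighbors = [v for v in graph.neighbors(current) if v not in path]
            if not neighbors:
                break
            current = random.choice(neighbors)
            path.append(current)
        score = path_entropy(graph, path)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

if __name__ == "__main__":
    g = nx.karate_club_graph()  # stand-in for a SPOKE subgraph
    path, score = explore_max_entropy_path(g, seed=0, depth=4)
    print(path, round(score, 3))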
Figure 1: The overall framework of SENATOR
git clone https://github.com/weiyifan1023/senator.git
cd senator
conda create -n senator python=3.10.9
conda activate senator
pip install -r requirements.txt
The seed entities of the SPOKE KG are derived from the KG_RAG project.
Download the instruction-tuning dataset from the PMC-LLaMA paper.
Place the entire ./data/benchmark_data folder under the repository root.
Preprocess your datasets to SFT format by running:
cd llm_rlhf/step1_supervised_finetuning/train_scripts
python preprocessing.py
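For reference, here is a minimal sketch of the kind of conversion this step performs: raw QA records are turned into prompt/response pairs for supervised fine-tuning. The field names and file paths below are illustrative assumptions, not necessarily the exact schema used by preprocessing.py.

import json

def to_sft_record(sample):
    # Build one instruction-style training pair from a raw QA sample.
    prompt = (f"Question: {sample['question']}\n"
              f"Options: {', '.join(sample['options'])}\nAnswer:")
    return {"prompt": prompt, "response": sample["answer"]}

def convert(in_path="raw_qa.json", out_path="sft_data.jsonl"):
    # Read a list of QA samples and write one JSON line per SFT record.
    with open(in_path) as f:
        samples = json.load(f)
    with open(out_path, "w") as f:
        for s in samples:
            f.write(json.dumps(to_sft_record(s)) + "\n")

if __name__ == "__main__":
    convert()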
Here, we initialize the subgraph for MCTS and explore the maximum-entropy paths. (You can customize the search depth.)
python prompt_based_generation/MedLLMs/gen_synthetic_data.py
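Conceptually, the selected maximum-entropy path is then turned into synthetic training data. The short sketch below only verbalizes the triples along a path into a prompt-style record; the relation attribute, template, and field names are illustrative stand-ins, not the repository's actual generation scheme.

def path_to_triples(graph, path):
    # Collect (head, relation, tail) triples along a path; assumes each edge
    # carries a "relation" attribute (an assumption about the KG schema).
    return [(h, graph[h][t].get("relation", "related_to"), t)
            for h, t in zip(path, path[1:])]

def triples_to_sample(triples):
    # Verbalize the triples into a single synthetic SFT record.
    facts = "; ".join(f"{h} {r} {t}" for h, r, t in triples)
    return {"prompt": f"State the facts connecting {triples[0][0]} and {triples[-1][2]}.",
            "response": facts}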
We take Qwen2-7B as an example: replace the relevant paths in run_qwen.sh, and then execute:
cd llm_rlhf/step1_supervised_finetuning
bash train_scripts/qwen2/run_qwen.sh
To evaluate on the MedQA, MedMCQA, and PubMedQA datasets, run:
cd prompt_based_generation/MedLLMs
python eval_medical_qa.py
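As a quick sanity check of the metric, the sketch below computes multiple-choice accuracy by extracting an option letter from each model output. The output format and extraction rule are simplifying assumptions and may differ from what eval_medical_qa.py does.

import re

def extract_choice(text):
    # Take the first standalone option letter A-E in the model output.
    m = re.search(r"\b([A-E])\b", text)
    return m.group(1) if m else None

def accuracy(predictions, golds):
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, golds))
    return correct / len(golds)

if __name__ == "__main__":
    preds = ["The answer is B.", "C", "Answer: A"]
    golds = ["B", "C", "D"]
    print(f"accuracy = {accuracy(preds, golds):.3f}")  # 0.667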
If you find our work inspiring and use it in your research, please cite our paper:
@article{wei2025structural,
title={Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs},
author={Wei, Yifan and Yu, Xiaoyan and Pan, Tengfei and Li, Angsheng and Du, Li},
journal={arXiv preprint arXiv:2505.07184},
year={2025}
}
Thanks to the authors of KG-RAG and DAMe for releasing their code for retrieving the SPOKE KG and evaluating SE on the graph. Much of this codebase has been adapted from their implementations.