Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning

Wang, Shaobo; Jin, Xiangqi; Wang, Ziming; Wang, Jize; Zhang, Jiajun; Li, Kaixin; Wen, Zichen; Li, Zhong; He, Conghui; Hu, Xuming; Zhang, Linfeng

arXiv:2505.12212 (cs)

[Submitted on 18 May 2025 (v1), last revised 1 Jun 2025 (this version, v3)]

Abstract: Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup. The code is available at this https URL.
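To make the idea of training-free selection via few-shot in-context learning more concrete, the following is a minimal sketch of one possible scoring loop. It is not the paper's algorithm: the abstract does not specify the procedure, and the attention-based weighting it mentions is omitted here. The names `select_by_icl_score`, `score_fn`, and `toy_score`, along with all parameter values, are hypothetical. In a real setting `score_fn` would prompt the model to be fine-tuned with the demonstrations and score its answer on the query; here it is replaced by a toy stand-in so the example runs without a model.

```python
import random
from collections import defaultdict


def select_by_icl_score(pool, score_fn, n_demo=2, n_rounds=50, keep_ratio=0.1, seed=0):
    """Rank pool items by the average score of the few-shot prompts they appear in.

    Illustrative only: each round samples demonstrations and a held-out query
    from the pool, asks score_fn how well the demonstrations help on the query,
    and credits that score to every demonstration used.
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    indices = list(range(len(pool)))

    for _ in range(n_rounds):
        demo_idx = rng.sample(indices, n_demo)
        query_idx = rng.choice([i for i in indices if i not in demo_idx])
        score = score_fn([pool[i] for i in demo_idx], pool[query_idx])
        for i in demo_idx:
            totals[i] += score
            counts[i] += 1

    # Items never sampled as demonstrations default to a score of 0.0.
    avg = {i: totals[i] / counts[i] for i in counts}
    ranked = sorted(indices, key=lambda i: avg.get(i, 0.0), reverse=True)
    k = max(1, int(len(pool) * keep_ratio))
    return ranked[:k]


# Toy pool: items 3 and 7 are marked as helpful demonstrations (an assumption
# made purely so the stand-in scorer has something to measure).
pool = [{"text": f"example {i}", "helpful": i in (3, 7)} for i in range(10)]


def toy_score(demos, query):
    # Hypothetical stand-in for "model accuracy on the query given these demos".
    return sum(d["helpful"] for d in demos) / len(demos)


selected = select_by_icl_score(pool, toy_score, n_rounds=200, keep_ratio=0.2)
print(sorted(selected))
```

The selected indices would then define the subset used for fine-tuning. Because no scoring model is trained, the cost of such a loop is dominated by the number of scoring calls (`n_rounds`).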

Comments:	Accepted by ACL 2025 main, 18 pages, 8 figures, 6 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2505.12212 [cs.CL]
	(or arXiv:2505.12212v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2505.12212

Submission history

From: Shaobo Wang
[v1] Sun, 18 May 2025 03:10:00 UTC (611 KB)
[v2] Thu, 22 May 2025 06:07:23 UTC (611 KB)
[v3] Sun, 1 Jun 2025 05:57:00 UTC (614 KB)
