Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers

Solatorio, Aivin V.; Macalaba, Rafael; Liounis, James

arXiv:2502.10263 (cs)

[Submitted on 14 Feb 2025]

Abstract: Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.
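
The inference step described in the abstract, where a lightweight classifier screens passages before the more expensive extraction model runs, can be illustrated with a minimal sketch. This is not the authors' implementation: `extract_with_prefilter`, `toy_score`, and `toy_extract` are hypothetical names, and the keyword scorer and regular expression below are toy stand-ins for the ModernBERT-based classifier and the fine-tuned Phi-3.5-mini model, which are not reproduced here.

```python
import re
from typing import Callable, Dict, Iterable, List


def extract_with_prefilter(
    passages: Iterable[str],
    has_mention_score: Callable[[str], float],
    extract_mentions: Callable[[str], List[str]],
    threshold: float = 0.5,
) -> Dict[str, object]:
    """Run the expensive extractor only on passages the cheap filter flags."""
    mentions: Dict[str, List[str]] = {}
    skipped = 0
    for passage in passages:
        # Cheap screening step: passages scoring below the threshold are skipped.
        if has_mention_score(passage) < threshold:
            skipped += 1
            continue
        # Expensive step: only reached for passages that passed the filter.
        found = extract_mentions(passage)
        if found:
            mentions[passage] = found
    return {"mentions": mentions, "skipped": skipped}


def toy_score(passage: str) -> float:
    """Hypothetical stand-in for a classifier: 1.0 if a cue word is present."""
    cues = ("survey", "dataset", "census")
    return 1.0 if any(cue in passage.lower() for cue in cues) else 0.0


def toy_extract(passage: str) -> List[str]:
    """Hypothetical stand-in for an extraction model: capitalized phrases
    ending in Survey, Census, or Dataset."""
    return re.findall(r"(?:[A-Z][\w-]*\s)+(?:Survey|Census|Dataset)", passage)


if __name__ == "__main__":
    sample = [
        "We analyze the Living Standards Measurement Survey.",
        "The model converges after ten epochs.",
    ]
    print(extract_with_prefilter(sample, toy_score, toy_extract))
```

In a sketch like this, the `threshold` parameter controls the trade-off the abstract alludes to: a lower value lets more passages through to the extractor, which favors recall at the cost of more computation.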

Comments:	Project GitHub repository at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:2502.10263 [cs.CL]
	(or arXiv:2502.10263v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.10263

Submission history

From: Aivin Solatorio
[v1] Fri, 14 Feb 2025 16:16:02 UTC (1,302 KB)
