Generating Realistic Tabular Data with Large Language Models

Nguyen, Dang; Gupta, Sunil; Do, Kien; Nguyen, Thin; Venkatesh, Svetha

arXiv:2410.21717 (cs)

[Submitted on 29 Oct 2024]

Abstract: While most generative models have achieved strong results in image generation, few have been developed for tabular data generation. Recently, owing to the success of large language models (LLMs) in diverse tasks, they have also been used for tabular data generation. However, these methods do not capture the correct correlation between the features and the target variable, which hinders their application in downstream predictive tasks. To address this problem, we propose an LLM-based method with three important improvements that correctly capture the ground-truth feature-class correlation in the real data. First, we propose a novel permutation strategy for the input data in the fine-tuning phase. Second, we propose a feature-conditional sampling approach to generate synthetic samples. Finally, we generate the labels by constructing prompts based on the generated samples to query our fine-tuned LLM. Our extensive experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks. It also produces highly realistic synthetic samples in terms of quality and diversity. More importantly, classifiers trained on our synthetic data can compete with classifiers trained on the original data on half of the benchmark datasets, which is a significant achievement in tabular data generation.
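The three steps named in the abstract can be illustrated with a rough sketch. The abstract does not give the details, so everything below is an assumption made for illustration: the "feature is value" text encoding, the choice to shuffle feature order while keeping the target last, starting generation from a randomly chosen feature, and all function names (`serialize_row`, `feature_conditional_prompt`, `label_prompt`) are hypothetical and are not taken from the paper. No LLM is called; the code only builds the text that such a model might be fine-tuned on or prompted with.

```python
import random


def serialize_row(row, target, rng):
    """Encode one table row as text (assumed format: 'name is value').

    Assumption: features are shuffled, the target stays at the end.
    """
    features = [name for name in row if name != target]
    rng.shuffle(features)
    parts = [f"{name} is {row[name]}" for name in features]
    parts.append(f"{target} is {row[target]}")
    return ", ".join(parts)


def feature_conditional_prompt(row, target, rng):
    """Start a generation prompt from one randomly chosen feature.

    Assumption: the conditioning value comes from a real row and the
    target is never used as the starting point.
    """
    name = rng.choice([n for n in row if n != target])
    return f"{name} is {row[name]},"


def label_prompt(sample, target):
    """Build a prompt from a generated sample that asks for its label."""
    parts = [f"{n} is {v}" for n, v in sample.items() if n != target]
    return ", ".join(parts) + f", {target} is"


# Hypothetical example row; column names and values are made up.
rng = random.Random(0)
row = {"age": 39, "education": "Bachelors", "income": ">50K"}

print(serialize_row(row, "income", rng))
print(feature_conditional_prompt(row, "income", rng))
print(label_prompt({"age": 39, "education": "Bachelors"}, "income"))
# last line prints: age is 39, education is Bachelors, income is
```

In this sketch the label is always the final token span of a serialized row, so a prompt ending in "income is" asks the model to complete the label given all generated features. Whether the paper uses exactly this layout is not stated in the abstract.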

Comments:	To appear at ICDM 2024
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2410.21717 [cs.LG]
	(or arXiv:2410.21717v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.21717

Submission history

From: Dang Nguyen
[v1] Tue, 29 Oct 2024 04:14:32 UTC (772 KB)
