Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework

Sidorenko, Andrey; Platzer, Michael; Scriminaci, Mario; Tiwald, Paul

arXiv:2504.01908 (cs)

[Submitted on 2 Apr 2025]

Abstract: Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy. The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques. The code of the framework is available at this https URL.
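To make the idea of a holdout-based nearest-neighbor check more concrete, here is a minimal sketch. It is not the authors' implementation, and the abstract does not give the framework's metric definitions. The function names `nearest_distance` and `closer_to_training_share` are hypothetical, and the sketch assumes purely numeric records compared by Euclidean distance. For each synthetic record it asks whether the closest training record is nearer than the closest holdout record. A share near 0.5 suggests the synthetic data is no closer to the training data than to unseen data, while a share near 1.0 hints at memorization.

```python
import math
import random


def nearest_distance(record, reference):
    """Euclidean distance from one record to its closest row in `reference`."""
    return min(math.dist(record, other) for other in reference)


def closer_to_training_share(synthetic, training, holdout):
    """Share of synthetic records that lie closer to training than to holdout.

    Ties count as one half, so a generator that treats training and holdout
    data alike scores about 0.5. This scoring rule is an assumption made for
    illustration only.
    """
    score = 0.0
    for record in synthetic:
        d_trn = nearest_distance(record, training)
        d_hol = nearest_distance(record, holdout)
        if d_trn < d_hol:
            score += 1.0
        elif d_trn == d_hol:
            score += 0.5
    return score / len(synthetic)


def gaussian_rows(rng, n_rows, n_cols):
    """Toy numeric table: n_rows tuples of standard normal values."""
    return [tuple(rng.gauss(0.0, 1.0) for _ in range(n_cols)) for _ in range(n_rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    training = gaussian_rows(rng, 200, 3)
    holdout = gaussian_rows(rng, 200, 3)

    # Fresh samples from the same distribution: expected share close to 0.5.
    independent = gaussian_rows(rng, 200, 3)
    # Near-copies of training rows: expected share close to 1.0.
    copied = [tuple(v + rng.gauss(0.0, 1e-3) for v in row) for row in training]

    print("independent:", closer_to_training_share(independent, training, holdout))
    print("near-copies:", closer_to_training_share(copied, training, holdout))
```

The brute-force search is quadratic in the number of rows, which is fine for a toy example. Real tables with mixed data types would need an encoding or embedding step before distances are meaningful.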

Comments:	16 pages, 7 figures, 1 table
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.01908 [cs.LG]
	(or arXiv:2504.01908v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.01908

Submission history

From: Andrey Sidorenko
[v1] Wed, 2 Apr 2025 17:10:30 UTC (1,387 KB)
