TabDiff: a Mixed-type Diffusion Model
for Tabular Data Generation

Juntong Shi2†, Minkai Xu1†, Harper Hua1†, Hengrui Zhang3†,
Stefano Ermon1, Jure Leskovec1
1Stanford University 2University of Southern California 3University of Illinois Chicago
Corresponding author. 🖂 [email protected]. Equal contribution
Abstract

Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a mixed-type stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.

1 Introduction

Tabular data is ubiquitous in databases, and developing effective generative models for it is a fundamental problem in many data processing and analysis tasks, ranging from training data augmentation (Fonseca & Bacao, 2023) and data privacy protection (Assefa et al., 2021; Hernandez et al., 2022) to missing value imputation (You et al., 2020; Zheng & Charoenphakdee, 2022). With versatile synthetic tabular data that shares the same format and statistical properties as the existing dataset, we can completely replace real data in a workflow or supplement it to enhance its utility, which makes the data easier to share and use. The ability to anonymize data and enlarge sample size without compromising overall data quality makes such synthetic data valuable across data science. Unlike image data, which comprises pure continuous pixel values with local spatial correlations, or text data, which comprises tokens that share the same dictionary space, tabular data features have much more complex and varied distributions (Xu et al., 2019; Borisov et al., 2023), making it challenging to learn joint probabilities across multiple columns. More specifically, such inherent heterogeneity leads to obstacles in two respects: 1) typical tabular data often contains mixed data types, i.e., continuous (e.g., numerical features) and discrete (e.g., categorical features) variables; 2) within the same feature type, features do not share exactly the same data properties because of the different meanings they represent, resulting in different column-wise marginal distributions (even after normalizing them into the same value ranges).

In recent years, numerous deep generative models have been proposed for tabular data generation, including autoregressive models (Borisov et al., 2023), VAEs (Liu et al., 2023), and GANs (Xu et al., 2019). Though they have notably improved generation quality compared to traditional machine learning generation techniques such as resampling (Chawla et al., 2002), the generated data quality is still far from satisfactory due to limited model capacity. Recently, with the rapid progress in diffusion models (Song & Ermon, 2019; Ho et al., 2020; Rombach et al., 2022), researchers have been actively exploring extending this powerful framework to tabular data (Kim et al., 2022; Kotelnikov et al., 2023; Zhang et al., 2024). For example, Zheng & Charoenphakdee (2022) and Zhang et al. (2024) transform all features into a latent continuous space via various encoding techniques and apply Gaussian diffusion there, while Kotelnikov et al. (2023) and Lee et al. (2023) combine discrete-time continuous and discrete diffusion processes (Austin et al., 2021) to deal with numerical and categorical features separately. However, prior methods are limited to sub-optimal performance by additional encoding overhead or imperfect discrete-time diffusion modeling, and none of them considers the feature-wise distribution heterogeneity issue in a mixed-type framework.

In this paper, we present TabDiff, a novel and principled mixed-type diffusion framework for tabular data generation. TabDiff perturbs numerical and categorical features with a joint diffusion process, and learns a single model to simultaneously denoise all modalities. Our key innovation is the development of mixed-type feature-wise learnable diffusion processes to counteract the high heterogeneity across different feature distributions. Such feature-specific learnable noise schedules enable the model to optimally allocate its capacity to different features in the training phase. In addition, they encourage the model to capture the inherent correlations during sampling, since the model can denoise different features in a flexible order based on the learned schedule. We parameterize TabDiff with a transformer operating on different input types and optimize the entire framework efficiently in an end-to-end fashion. The framework is trained with a continuous-time limit of the evidence lower bound. To reduce the decoding error during denoising sampling, we design a mixed-type stochastic sampler that automatically corrects the accumulated decoding error during sampling. Finally, we highlight that TabDiff can also be applied to conditional generation tasks such as missing column imputation, and we further introduce a classifier-free guidance technique to improve the conditional generation quality.

TabDiff enjoys several notable advantages: 1) our model learns the joint distribution in the original data space with an expressive continuous-time diffusion framework; 2) the framework is sensitive to varying feature marginal distributions and can adaptively reason about feature-specific information and pair-wise correlations. We conduct comprehensive experiments to evaluate TabDiff against state-of-the-art methods across seven widely adopted tabular synthesis benchmarks. Results show that TabDiff consistently outperforms previous methods over eight distinct evaluation metrics, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations, suggesting superior generative capacity on mixed-type tabular data.

Figure 1: A high-level overview of TabDiff. TabDiff operates by normalizing numerical columns and converting categorical columns into one-hot vectors with an extra [MASK] class. Joint forward diffusion processes are applied to all modalities with each column’s noise rate controlled by learnable schedules. New samples are generated via the reverse process, with the denoising network gradually denoising $\mathbf{x}_1$ into $\mathbf{x}_0$ and then applying the inverse transform to recover the original format.

2 Method

2.1 Overview

Notation. For a given mixed-type tabular dataset $\mathcal{T}$, we denote the number of numerical features and categorical features as $M_{\rm num}$ and $M_{\rm cat}$, respectively. The dataset is represented as a collection of data entries $\mathcal{T} = \{\mathbf{x}\} = \{[\mathbf{x}^{\rm num}, \mathbf{x}^{\rm cat}]\}$, where each entry $\mathbf{x}$ is a concatenated vector consisting of its numerical features $\mathbf{x}^{\rm num}$ and categorical features $\mathbf{x}^{\rm cat}$. We represent the numerical features as an $M_{\rm num}$-dimensional vector $\mathbf{x}^{\rm num} \in \mathbb{R}^{M_{\rm num}}$ and denote the $i$-th feature as $(\mathbf{x}^{\rm num})_i \in \mathbb{R}$. We represent each categorical column with $C_j$ finite categories as a one-hot column vector $(\mathbf{x}^{\rm cat})_j \in \{0,1\}^{(C_j+1)}$, with an extra dimension dedicated to the [MASK] state. The $(C_j+1)$-th category corresponds to the special [MASK] state and we use $\mathbf{m} \in \{0,1\}^{K}$ as the one-hot vector for it. In addition, we define ${\rm cat}(\cdot; \bm{\pi})$ as the categorical distribution over $K$ classes with probabilities given by $\bm{\pi} \in \Delta^{K}$, where $\Delta^{K}$ is the $K$-simplex.
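The representation described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper names `encode_row` and `mask_vector` and the example values are hypothetical and are not taken from the TabDiff code.

```python
import numpy as np

def encode_row(x_num, cat_indices, num_categories):
    """Encode one table row (hypothetical helper, for illustration only).

    x_num:          numerical feature values, length M_num
    cat_indices:    integer category index for each categorical column
    num_categories: C_j for each categorical column
    Returns the numerical vector and one one-hot vector of length C_j + 1 per column.
    """
    x_num = np.asarray(x_num, dtype=np.float64)
    one_hots = []
    for idx, c in zip(cat_indices, num_categories):
        v = np.zeros(c + 1, dtype=np.int64)  # the last slot is reserved for [MASK]
        v[idx] = 1
        one_hots.append(v)
    return x_num, one_hots

def mask_vector(c):
    """One-hot vector m for the [MASK] state of a column with c real categories."""
    m = np.zeros(c + 1, dtype=np.int64)
    m[-1] = 1
    return m

# A row with two numerical columns and two categorical columns (C_1 = 3, C_2 = 2).
x_num, x_cat = encode_row([0.3, -1.2], cat_indices=[2, 0], num_categories=[3, 2])
```

Here `x_cat[0]` has length 4 and `x_cat[1]` has length 3. A clean data row never activates the final [MASK] slot, which is only reached through the forward masking process.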

Different from common data types such as images and text, developing generative models for tabular data is challenging as the distribution is determined by mixed-type data. We therefore propose TabDiff, a unified generative model for modeling the joint distribution $p(\mathbf{x})$ using a continuous-time diffusion framework. TabDiff can learn the distribution from finite samples and generate faithful, diverse, and novel samples unconditionally. We provide a high-level overview in Figure 1, which includes a forward diffusion process and a reverse generative process, both defined in continuous time. The diffusion process gradually adds noise to data, and the generative process learns to recover the data from the prior noise distribution with neural networks parameterized by $\theta$. In the following sections, we elaborate on how we develop the unified diffusion framework with learnable noise schedules and perform training and sampling in practice.

2.2 Mixed-type Diffusion Framework

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are likelihood-based generative models that learn the data distribution via forward and reverse Markov processes. Our goal is to develop a principled diffusion model for generating mixed-type tabular data that faithfully mimics the statistical distribution of the real dataset. Our framework TabDiff is designed to operate directly on the data space and naturally handle each tabular column in its built-in datatype. TabDiff is built on a hybrid forward process that gradually injects noise into numerical and categorical data types separately, with different diffusion schedules $\bm{\sigma}^{\rm num}$ and $\bm{\sigma}^{\rm cat}$. Let $\{\mathbf{x}_t : t \in [0,1]\}$ denote a sequence of data in the diffusion process indexed by a continuous time variable $t \in [0,1]$, where $\mathbf{x}_0 \sim p_0$ are i.i.d. samples from the real data distribution and $\mathbf{x}_1 \sim p_1$ are pure noise from the prior distribution. The hybrid forward diffusion process can then be represented as:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = q\left(\mathbf{x}_t^{\rm num} \mid \mathbf{x}_0^{\rm num}, \bm{\sigma}^{\rm num}(t)\right) \cdot q\left(\mathbf{x}_t^{\rm cat} \mid \mathbf{x}_0^{\rm cat}, \bm{\sigma}^{\rm cat}(t)\right). \tag{1}$$

Then the true reverse process can be represented as the joint posterior:

$$q(\mathbf{x}_s \mid \mathbf{x}_t, \mathbf{x}_0) = q(\mathbf{x}_s^{\rm num} \mid \mathbf{x}_t, \mathbf{x}_0) \cdot q(\mathbf{x}_s^{\rm cat} \mid \mathbf{x}_t, \mathbf{x}_0), \tag{2}$$

where $s$ and $t$ are two arbitrary timesteps satisfying $0 < s < t < 1$. We aim to learn a denoising model $p_\theta(\mathbf{x}_s \mid \mathbf{x}_t)$ to match the true posterior. In the following, we discuss the detailed formulations of the diffusion processes for continuous and categorical features separately. To enhance clarity, we omit the superscripts on $\mathbf{x}^{\rm num}$ and $\mathbf{x}^{\rm cat}$ when they are unnecessary for understanding.

Gaussian Diffusion for Numerical Features. In this paper, we model the forward diffusion for continuous features $\mathbf{x}^{\rm num}$ as a stochastic differential equation (SDE) ${\rm d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,{\rm d}t + g(t)\,{\rm d}\mathbf{w}$, with $\mathbf{f}(\cdot, t): \mathbb{R}^{M_{\rm num}} \rightarrow \mathbb{R}^{M_{\rm num}}$ being the drift coefficient, $g(\cdot): \mathbb{R} \rightarrow \mathbb{R}$ being the diffusion coefficient, and $\mathbf{w}$ being the standard Wiener process (Song et al., 2021; Karras et al., 2022). The reverse generation process solves the probability flow ordinary differential equation (ODE) ${\rm d}\mathbf{x} = \bigl[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\bigr]\,{\rm d}t$, where $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the score function of $p_t(\mathbf{x})$. In this paper, we use the Variance Exploding formulation with $\mathbf{f}(\cdot, t) = \mathbf{0}$ and $g(t) = \sqrt{2\,[\frac{d}{dt}\bm{\sigma}^{\rm num}(t)]\,\bm{\sigma}^{\rm num}(t)}$, which yields the forward process:

$$\mathbf{x}_t^{\rm num} = \mathbf{x}_0^{\rm num} + \bm{\sigma}^{\rm num}(t)\,\bm{\epsilon}, \quad \bm{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{M_{\rm num}}). \tag{3}$$
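The forward perturbation in Equation 3 can be sketched as follows. The linear per-column schedule `sigma_num` and its `sigma_max` values are hypothetical placeholders chosen for illustration; they are not the schedule used by TabDiff.

```python
import numpy as np

def sigma_num(t, sigma_max):
    """Hypothetical per-column schedule sigma_i(t) = sigma_max_i * t (an assumption)."""
    return np.asarray(sigma_max, dtype=np.float64) * t

def forward_numerical(x0, t, sigma_max, rng):
    """Sample x_t = x_0 + sigma(t) * eps with eps ~ N(0, I), as in Equation 3."""
    eps = rng.standard_normal(x0.shape)
    return x0 + sigma_num(t, sigma_max) * eps  # sigma broadcasts over the batch axis

rng = np.random.default_rng(0)
x0 = np.tile([1.0, -2.0], (100_000, 1))  # many copies of one row with two numerical columns
xt = forward_numerical(x0, t=0.5, sigma_max=[1.0, 4.0], rng=rng)

# The per-column standard deviation of the added noise should be close to sigma(0.5).
empirical_std = (xt - x0).std(axis=0)
```

Because each column has its own noise level, the two columns end up perturbed by different amounts at the same time $t$, which is the feature-wise behavior the framework relies on.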

The reverse process can then be formulated accordingly as:

$${\rm d}\mathbf{x}^{\rm num} = -\Bigl[\frac{d}{dt}\bm{\sigma}^{\rm num}(t)\Bigr]\,\bm{\sigma}^{\rm num}(t)\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}^{\rm num})\,{\rm d}t. \tag{4}$$
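As a short check of how the forward and reverse expressions follow from the stated choice of $\mathbf{f}$ and $g$, consider a single feature with scalar schedule $\sigma(t)$. This sketch assumes $\sigma(0) = 0$, which the text does not state explicitly. With zero drift, $x_t = x_0 + \int_0^t g(\tau)\,{\rm d}w_\tau$, so:

$$
\begin{aligned}
\mathrm{Var}[x_t \mid x_0] &= \int_0^t g(\tau)^2\,{\rm d}\tau = \int_0^t 2\,\sigma'(\tau)\,\sigma(\tau)\,{\rm d}\tau = \sigma(t)^2 - \sigma(0)^2 = \sigma(t)^2, \\
{\rm d}x &= \Bigl[f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\Bigr]{\rm d}t = -\sigma'(t)\,\sigma(t)\,\nabla_x \log p_t(x)\,{\rm d}t .
\end{aligned}
$$

The first line gives the Gaussian perturbation $x_t = x_0 + \sigma(t)\,\epsilon$, and the second is the probability flow ODE with $f = 0$ and $g(t)^2 = 2\,\sigma'(t)\,\sigma(t)$ substituted in.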

In TabDiff, we train the diffusion model $\bm{\mu}_\theta$ to jointly denoise the numerical and categorical features. We use $\bm{\mu}_\theta^{\rm num}$ to denote the numerical part of the denoising model output, and train the model by optimizing the denoising loss:

$$\mathcal{L}_{\rm num}(\theta, \rho) = \mathbb{E}_{\mathbf{x}_0 \sim p(\mathbf{x}_0)}\, \mathbb{E}_{t \sim U[0,1]}\, \mathbb{E}_{\bm{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left\| \bm{\mu}_\theta^{\rm num}(\mathbf{x}_t, t) - \bm{\epsilon} \right\|_2^2. \tag{5}$$
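A Monte Carlo estimate of the denoising loss in Equation 5 might look like the sketch below. The linear schedule, the lower bound `t_min` on sampled times (used only to avoid a zero noise level), and the two toy denoisers are assumptions made for illustration; the actual model is the transformer described in the paper.

```python
import numpy as np

def sigma_num(t, sigma_max):
    """Hypothetical linear schedule, one sigma_max per numerical column (an assumption)."""
    return t[:, None] * sigma_max[None, :]

def num_loss(denoiser, x0, sigma_max, rng, t_min=1e-3):
    """Monte Carlo estimate of E || mu_theta(x_t, t) - eps ||_2^2 over one batch."""
    t = rng.uniform(t_min, 1.0, size=x0.shape[0])  # one time per row; t_min is an assumption
    eps = rng.standard_normal(x0.shape)
    xt = x0 + sigma_num(t, sigma_max) * eps        # forward process, Equation 3
    pred = denoiser(xt, t)                         # predicted noise, same shape as eps
    return float(np.mean(np.sum((pred - eps) ** 2, axis=1)))

sigma_max = np.array([1.0, 4.0])
x0 = np.zeros((50_000, 2))  # toy data: every row equals zero

# Toy denoisers. The oracle is exact only because the toy data is a point mass at zero.
zero_denoiser = lambda xt, t: np.zeros_like(xt)
oracle_denoiser = lambda xt, t: xt / sigma_num(t, sigma_max)

loss_zero = num_loss(zero_denoiser, x0, sigma_max, np.random.default_rng(0))
loss_oracle = num_loss(oracle_denoiser, x0, sigma_max, np.random.default_rng(0))
```

Predicting zero noise leaves a loss near $M_{\rm num} = 2$ (the expected squared norm of $\bm{\epsilon}$), while the oracle drives the loss to numerical zero, which gives a quick sanity check for an implementation of this objective.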

Masked Diffusion for Categorical Features. For categorical features, we take inspiration from the recent progress on discrete state-space diffusion for language modeling (Austin et al., 2021; Shi et al., 2024; Sahoo et al., 2024). The forward diffusion process is defined as a masking (absorbing) process that smoothly interpolates between the data distribution ${\rm cat}(\cdot; \mathbf{x})$ and the target distribution ${\rm cat}(\cdot; \mathbf{m})$, where all probability mass is assigned to the [MASK] state:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = {\rm cat}(\mathbf{x}_t; \alpha_t \mathbf{x}_0 + (1 - \alpha_t)\mathbf{m}). \tag{6}$$

Here $\alpha_t \in [0,1]$ is a strictly decreasing function of $t$, with $\alpha_0 \approx 1$ and $\alpha_1 \approx 0$. It represents the probability that the real data $\mathbf{x}_0$ remains unmasked at time $t$, so $\mathbf{x}_0$ is masked with probability $1 - \alpha_t$. By time $t = 1$, all inputs are masked with probability $1 - \alpha_1 \approx 1$. In practice this schedule is parameterized by $\alpha_t = \exp(-\bm{\sigma}^{\rm cat}(t))$, where $\bm{\sigma}^{\rm cat}(t): [0,1] \to \mathbb{R}^{+}$ is a strictly increasing function. This forward process entails the step transition probabilities $q(\mathbf{x}_t \mid \mathbf{x}_s) = {\rm cat}(\mathbf{x}_t; \alpha_{t \mid s}\mathbf{x}_s + (1 - \alpha_{t \mid s})\mathbf{m})$, where $\alpha_{t \mid s} = \alpha_t / \alpha_s$. Intuitively, this transition means that at each diffusion step, the data is perturbed to the [MASK] state with probability $(1 - \alpha_{t \mid s})$ and, once perturbed, remains there until $t = 1$.
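The masking process and its transition property can be illustrated with the sketch below, which works on integer category indices rather than one-hot vectors. The log-linear schedule `sigma_cat` and the constant `EPS` are hypothetical choices that satisfy the stated requirements (strictly increasing, non-negative); they are not the learnable schedule introduced later in the paper.

```python
import numpy as np

EPS = 1e-3  # hypothetical floor so that alpha_1 = EPS is close to, but not exactly, 0

def sigma_cat(t):
    """Hypothetical log-linear schedule: strictly increasing and non-negative on [0, 1]."""
    return -np.log1p(-(1.0 - EPS) * t)

def alpha(t):
    """alpha_t = exp(-sigma_cat(t)), which here equals 1 - (1 - EPS) * t."""
    return np.exp(-sigma_cat(t))

def mask_step(x_idx, keep_prob, mask_idx, rng):
    """Keep each unmasked entry with probability keep_prob; masked entries stay masked."""
    keep = rng.random(x_idx.shape) < keep_prob
    return np.where((x_idx != mask_idx) & keep, x_idx, mask_idx)

rng = np.random.default_rng(0)
C, MASK = 3, 3                        # three real categories; index 3 plays the role of [MASK]
x0 = rng.integers(0, C, size=200_000)
s, t = 0.3, 0.7

# Direct sampling from q(x_t | x_0) versus going through x_s with alpha_{t|s} = alpha_t / alpha_s.
x_t_direct = mask_step(x0, alpha(t), MASK, rng)
x_s = mask_step(x0, alpha(s), MASK, rng)
x_t_two_step = mask_step(x_s, alpha(t) / alpha(s), MASK, rng)

frac_direct = np.mean(x_t_direct == MASK)
frac_two_step = np.mean(x_t_two_step == MASK)
```

Both masked fractions should be close to $1 - \alpha_t$, which reflects that composing the step transitions reproduces the marginal in Equation 6.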

Similar to numerical features, in the reverse denoising process for categorical ones, the diffusion model $\bm{\mu}_\theta$ aims to progressively unmask each column from the ‘masked’ state. The true posterior distribution conditioned on $\mathbf{x}_0$ has the closed form:

$$q(\mathbf{x}_s \mid \mathbf{x}_t, \mathbf{x}_0) = \begin{cases} {\rm cat}(\mathbf{x}_s; \mathbf{x}_t) & \mathbf{x}_t \neq \mathbf{m}, \\ {\rm cat}\left(\mathbf{x}_s; \dfrac{(1 - \alpha_s)\mathbf{m} + (\alpha_s - \alpha_t)\mathbf{x}_0}{1 - \alpha_t}\right) & \mathbf{x}_t = \mathbf{m}. \end{cases} \tag{7}$$
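For a single categorical column, the posterior in Equation 7 reduces to a small probability vector, as the following sketch shows. The function name `posterior_probs` and the numeric values are illustrative assumptions; vectors are one-hot with the [MASK] state in the last position.

```python
import numpy as np

def posterior_probs(xt, x0, alpha_s, alpha_t):
    """Probability vector of q(x_s | x_t, x_0) for one column, following Equation 7."""
    m = np.zeros_like(x0, dtype=float)
    m[-1] = 1.0  # one-hot vector of the [MASK] state
    if not np.array_equal(xt, m):
        # An unmasked column is carried over unchanged.
        return np.asarray(xt, dtype=float)
    # A masked column either stays masked or is restored to x_0.
    return ((1.0 - alpha_s) * m + (alpha_s - alpha_t) * x0) / (1.0 - alpha_t)

x0 = np.array([0.0, 1.0, 0.0, 0.0])  # true category is index 1 (C = 3 plus [MASK])
m = np.array([0.0, 0.0, 0.0, 1.0])

p_masked = posterior_probs(m, x0, alpha_s=0.8, alpha_t=0.5)     # approximately [0, 0.6, 0, 0.4]
p_unmasked = posterior_probs(x0, x0, alpha_s=0.8, alpha_t=0.5)  # equals x0
```

With $\alpha_s = 0.8$ and $\alpha_t = 0.5$, the masked column is restored with probability $(\alpha_s - \alpha_t)/(1 - \alpha_t) = 0.6$ and stays masked with probability $(1 - \alpha_s)/(1 - \alpha_t) = 0.4$.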

We introduce the denoising network $\bm{\mu}^{\rm cat}_\theta(\mathbf{x}_t, t): C \times [0,1] \to \Delta^{C}$ to estimate $\mathbf{x}_0$, through which we can approximate the unknown true posterior as:

$$p_\theta(\mathbf{x}_s^{\rm cat} \mid \mathbf{x}_t^{\rm cat}) = \begin{cases} {\rm cat}(\mathbf{x}_s^{\rm cat}; \mathbf{x}_t^{\rm cat}) & \mathbf{x}_t^{\rm cat} \neq \mathbf{m}, \\ {\rm cat}\left(\mathbf{x}_s^{\rm cat}; \dfrac{(1 - \alpha_s)\mathbf{m} + (\alpha_s - \alpha_t)\bm{\mu}_\theta^{\rm cat}(\mathbf{x}_t, t)}{1 - \alpha_t}\right) & \mathbf{x}_t^{\rm cat} = \mathbf{m}, \end{cases} \tag{8}$$

which implies that at each reverse step, a masked column has probability $(\alpha_s - \alpha_t)/(1 - \alpha_t)$ of being recovered as $\mathbf{x}_0$, and once recovered, $\mathbf{x}_t$ stays fixed for the remainder of the process. Prior work (Kingma et al., 2021; Shi et al., 2024) has shown that increasing the discretization resolution yields a tighter evidence lower bound (ELBO). Therefore, we optimize the likelihood bound $\mathcal{L}_{\rm cat}$ in the continuous-time limit:

$$\mathcal{L}_{\rm cat}(\theta, k) = \mathbb{E}_q \int_{t=0}^{t=1} \frac{\alpha'_t}{1 - \alpha_t}\, \mathbbm{1}_{\{\mathbf{x}_t = \mathbf{m}\}} \log \langle \bm{\mu}_\theta^{\rm cat}(\mathbf{x}_t, t), \mathbf{x}_0^{\rm cat} \rangle \, dt, \tag{9}$$

where $\alpha^{\prime}_{t}$ is the first-order derivative of $\alpha_{t}$.
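As a concrete reading of the two quantities above, the following minimal Python sketch evaluates the reverse-step recovery probability $(\alpha_{s}-\alpha_{t})/(1-\alpha_{t})$ and the integrand of Equation 9 for a single categorical feature. The function names and the numeric values of $\alpha$ are illustrative assumptions, not taken from the TabDiff code release.

```python
import numpy as np

def recover_prob(alpha_s, alpha_t):
    """Probability that a masked entry is recovered when stepping from t to s < t."""
    return (alpha_s - alpha_t) / (1.0 - alpha_t)

def cat_integrand(probs, x0_index, is_masked, alpha_t, dalpha_t):
    """Integrand of Eq. 9 for one feature: weight * 1{x_t = m} * log <mu, x_0>.

    probs    : predicted distribution mu_theta^cat(x_t, t) over categories
    x0_index : index of the true category (x_0 is one-hot)
    dalpha_t : derivative alpha'_t (non-positive when alpha_t decreases in t)
    """
    if not is_masked:  # the indicator is zero for entries that are not masked
        return 0.0
    return dalpha_t / (1.0 - alpha_t) * np.log(probs[x0_index])

# Illustrative numbers (assumed): alpha_s = 0.6, alpha_t = 0.5, alpha'_t = -1.
p_recover = recover_prob(0.6, 0.5)
term = cat_integrand(np.array([0.7, 0.2, 0.1]), 0, True, 0.5, -1.0)
```

When $\alpha_{t}$ decreases in $t$, both the weight and the log-probability are non-positive, so each term is non-negative and shrinks to zero as the model assigns more mass to the true category.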

2.3 Training with Adaptively Learnable Noise Schedules

Algorithm 1 Training
1:  repeat
2:     Sample $\mathbf{x}_{0}\sim p_{0}(\mathbf{x})$
3:     Sample $t\sim U(0,1)$
4:     Sample $\bm{\epsilon}_{\text{num}}\sim\mathcal{N}(0,\mathbf{I}_{M_{\text{num}}})$
5:     $\mathbf{x}_{t}^{\rm num}=\mathbf{x}_{0}^{\text{num}}+\bm{\sigma}^{\text{num}}(t)\,\bm{\epsilon}_{\text{num}}$
6:     $\bm{\alpha}_{t}=\exp(-\bm{\sigma}^{\rm cat}(t))$
7:     Sample $\mathbf{x}_{t}^{\rm cat}\sim q(\mathbf{x}_{t}|\mathbf{x}_{0},\bm{\alpha}_{t})$ (Equation 6)
8:     $\mathbf{x}_{t}=[\mathbf{x}_{t}^{\rm num},\mathbf{x}_{t}^{\rm cat}]$
9:     Take gradient descent step on $\nabla_{\theta,\rho,k}\mathcal{L}_{\textsc{TabDiff}}$
10:  until converged

Tabular data is inherently heterogeneous: it mixes numerical and categorical data types, and feature distributions differ even within each data type. Therefore, unlike pixels, which share a similar distribution across the three RGB channels, or word tokens, which share the exact same vocabulary space, each column (feature) of a table has its own marginal distribution, which requires the model to amortize its capacity adaptively across different features. We therefore propose to learn a separate, fine-grained noise schedule for each feature. To balance the flexibility and robustness of the learnable noise schedules, we design two function families: the power-mean schedule for numerical features and the log-linear schedule for categorical features.

Power-mean schedule for numerical features. For the numerical noise schedule $\bm{\sigma}^{\text{num}}(t)$ in Equation 3, we define $\bm{\sigma}^{\text{num}}(t)=[\sigma_{\rho_{i}}^{\text{num}}(t)]$, with $\rho_{i}$ being a learnable parameter for each numerical feature. For all $i\in\{1,\cdots,M_{\text{num}}\}$, we define $\sigma_{\rho_{i}}^{\text{num}}(t)$ as:

$$\sigma_{\rho_{i}}^{\text{num}}(t)=\left(\sigma_{\text{min}}^{\frac{1}{\rho_{i}}}+t\left(\sigma_{\text{max}}^{\frac{1}{\rho_{i}}}-\sigma_{\text{min}}^{\frac{1}{\rho_{i}}}\right)\right)^{\rho_{i}}. \tag{10}$$
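The power-mean schedule in Equation 10 can be sketched in a few lines of NumPy. The function name and the default values of `sigma_min` and `sigma_max` below are assumptions made for illustration; the paper fixes shared endpoints but this section does not state their values.

```python
import numpy as np

def power_mean_sigma(t, rho, sigma_min=0.002, sigma_max=80.0):
    """Eq. 10: per-feature numerical noise level at time t in [0, 1].

    t   : scalar or array of shape (B,)
    rho : array of shape (M_num,), one learnable exponent per feature
    sigma_min / sigma_max are assumed defaults, not values from the paper.
    Returns an array of shape (M_num,) for scalar t, or (B, M_num) otherwise.
    """
    t = np.asarray(t, dtype=float)[..., None]
    rho = np.asarray(rho, dtype=float)
    lo = sigma_min ** (1.0 / rho)
    hi = sigma_max ** (1.0 / rho)
    return (lo + t * (hi - lo)) ** rho

rho = np.array([1.0, 3.0, 7.0])   # hypothetical per-feature exponents
ts = np.linspace(0.0, 1.0, 5)
sig = power_mean_sigma(ts, rho)   # shape (5, 3)
```

With $\rho_{i}=1$ the schedule is a linear interpolation between the two endpoints; larger $\rho_{i}$ keeps the noise level lower at intermediate times while leaving the endpoints unchanged, which is the per-feature freedom the learnable parameter provides.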

Log-linear schedule for categorical features. Similarly, for the categorical noise schedule $\bm{\sigma}^{\rm cat}(t)$ in Section 2.2, we define $\bm{\sigma}^{\rm cat}(t)=[\sigma_{k_{j}}^{\rm cat}(t)]$, with $k_{j}$ being a learnable parameter for each categorical feature. For all $j\in\{1,\cdots,M_{\rm cat}\}$, we specify $\sigma_{k_{j}}^{\rm cat}(t)$ through the corresponding $\alpha_{k_{j}}^{\rm cat}(t)=\exp(-\sigma_{k_{j}}^{\rm cat}(t))$ (line 6 of Algorithm 1) as:

$$\alpha_{k_{j}}^{\rm cat}(t)=1-t^{k_{j}}. \tag{11}$$
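A short sketch of the categorical schedule follows, assuming the exponent in Equation 11 is the per-feature parameter $k_{j}$ and using the relation $\bm{\alpha}_{t}=\exp(-\bm{\sigma}^{\rm cat}(t))$ from Algorithm 1. The function names are hypothetical. Under this form, $\alpha^{\prime}_{t}=-k_{j}t^{k_{j}-1}$ and $1-\alpha_{t}=t^{k_{j}}$, so the weight $\alpha^{\prime}_{t}/(1-\alpha_{t})$ in Equation 9 reduces to $-k_{j}/t$.

```python
import numpy as np

def alpha_cat(t, k):
    """Eq. 11: probability that a categorical entry is still unmasked at time t."""
    t = np.asarray(t, dtype=float)[..., None]
    return 1.0 - t ** np.asarray(k, dtype=float)

def sigma_cat(t, k):
    """Noise level implied by alpha = exp(-sigma); diverges as t approaches 1."""
    return -np.log(alpha_cat(t, k))

def cat_loss_weight(t, k):
    """alpha'_t / (1 - alpha_t) for Eq. 11, which simplifies to -k / t (t > 0)."""
    t = np.asarray(t, dtype=float)[..., None]
    return -np.asarray(k, dtype=float) / t

k = np.array([1.0, 2.0])        # hypothetical per-feature exponents
a = alpha_cat(0.5, k)           # [0.5, 0.75]
w = cat_loss_weight(0.5, k)     # [-2., -4.]
```

A larger $k_{j}$ keeps a feature unmasked for longer in the forward process and, by the simplification above, gives its masked entries a proportionally larger weight in the loss.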

In practice, we fix the same initial and final noise levels across all numerical features so that $\sigma_{i}^{\text{num}}(0)=\sigma_{\text{min}}$ and $\sigma_{i}^{\text{num}}(1)=\sigma_{\text{max}}$ for all $i\in\{1,\cdots,M_{\text{num}}\}$. We similarly bound the initial and final noise levels for the categorical features, as detailed in Section B.1. This constrains the freedom of the schedules and thus stabilizes training.

Joint objective function. We update the $M_{\text{num}}+M_{\rm cat}$ parameters $\rho_{1},\cdots,\rho_{M_{\text{num}}}$ and $k_{1},\cdots,k_{M_{\rm cat}}$ via backpropagation, without needing to modify the loss function. Consolidating $\mathcal{L}_{\text{num}}$ and $\mathcal{L}_{\rm cat}$, we have the total loss $\mathcal{L}$ with two weight terms $\lambda_{\text{num}}$ and $\lambda_{\rm cat}$ as:

$$\begin{aligned}&\mathcal{L}_{\textsc{TabDiff}}(\theta,\rho,k)=\lambda_{\text{num}}\mathcal{L}_{\text{num}}(\theta,\rho)+\lambda_{\rm cat}\mathcal{L}_{\rm cat}(\theta,k)\\&\quad=\mathbb{E}_{t\sim U(0,1)}\mathbb{E}_{(\mathbf{x}_{t},\mathbf{x}_{0})\sim q(\mathbf{x}_{t},\mathbf{x}_{0})}\left(\lambda_{\text{num}}\left\|\bm{\mu}_{\theta}^{\text{num}}(\mathbf{x}_{t},t)-\bm{\epsilon}\right\|_{2}^{2}+\frac{\lambda_{\rm cat}\,\alpha^{\prime}_{t}}{1-\alpha_{t}}\mathbbm{1}_{\{\mathbf{x}_{t}=\mathbf{m}\}}\log\langle\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_{t},t),\mathbf{x}_{0}^{\rm cat}\rangle\right).\end{aligned} \tag{12}$$

With the forward process defined in Equation 3 and Equation 6, we present the detailed training procedure in Algorithm 1. Here, we sample a continuous time step $t$ from a uniform distribution $U(0,1)$ and then perturb numerical and categorical features with their respective noise schedules based on this same time index. Then, we input the concatenated $\mathbf{x}_{t}^{\rm num}$ and $\mathbf{x}_{t}^{\rm cat}$ into the model and take a gradient step on the joint loss function defined in Equation 12.
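The perturbation part of this procedure (lines 4 to 8 of Algorithm 1) can be sketched as below. The sketch assumes that Equation 6 is the usual absorbing-state kernel, in which each categorical entry keeps its value with probability $\alpha_{t}$ and is otherwise replaced by the mask state; the integer code `MASK`, the function name, and the endpoint values `sigma_min` and `sigma_max` are likewise illustrative assumptions. The model forward pass and the gradient step are omitted.

```python
import numpy as np

MASK = -1  # hypothetical integer code for the mask state m

def perturb(x0_num, x0_cat, t, rho, k, rng, sigma_min=0.002, sigma_max=80.0):
    """Forward-perturb a batch at per-row times t (Algorithm 1, lines 4-8).

    x0_num : (B, M_num) floats; x0_cat : (B, M_cat) integer category codes
    t      : (B,) times in [0, 1]; rho : (M_num,); k : (M_cat,)
    """
    t = np.asarray(t, dtype=float)[:, None]
    rho = np.asarray(rho, dtype=float)
    k = np.asarray(k, dtype=float)
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    sigma_num = (lo + t * (hi - lo)) ** rho      # Eq. 10, shape (B, M_num)
    alpha = 1.0 - t ** k                         # Eq. 11, shape (B, M_cat)
    eps = rng.standard_normal(x0_num.shape)
    xt_num = x0_num + sigma_num * eps            # Gaussian perturbation
    keep = rng.random(x0_cat.shape) < alpha      # keep with probability alpha_t
    xt_cat = np.where(keep, x0_cat, MASK)        # otherwise replace by the mask
    return xt_num, xt_cat, eps

rng = np.random.default_rng(0)
x0_num = rng.standard_normal((4, 2))
x0_cat = rng.integers(0, 3, size=(4, 3))
t = rng.uniform(size=4)
xt_num, xt_cat, eps = perturb(x0_num, x0_cat, t, [3.0, 7.0], [1.0, 1.0, 2.0], rng)
```

Both feature types are perturbed with the same time index per row, while each column follows its own schedule through `rho` and `k`.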

2.4 Sampling with Backward Stochastic Sampler

Algorithm 2 Sampling
1:  Sample $\mathbf{x}_{T}^{\text{num}}\sim\mathcal{N}(0,\mathbf{I}_{M_{\text{num}}})$, $\mathbf{x}_{T}^{\rm cat}=\bm{m}$
2:  for $t=T$ to $1$ do
3:      $t^{+}\leftarrow t+\gamma_{t}t,\ \gamma_{t}=1/T$
        $\triangleright$ Numerical forward perturbation:
4:      Sample $\bm{\epsilon}^{\text{num}}\sim\mathcal{N}(0,\mathbf{I}_{M_{\text{num}}})$
5:      $\mathbf{x}_{t^{+}}^{\text{num}}\leftarrow\mathbf{x}_{t}^{\text{num}}+\sqrt{\bm{\sigma}^{\text{num}}(t^{+})^{2}-\bm{\sigma}^{\text{num}}(t)^{2}}\,\bm{\epsilon}^{\text{num}}$
        $\triangleright$ Categorical forward perturbation:
6:      Sample $\mathbf{x}_{t^{+}}^{\rm cat}\sim q\left(\mathbf{x}_{t^{+}}^{\rm cat}|\mathbf{x}_{t}^{\rm cat},\,1-\bm{\alpha}_{t^{+}}/\bm{\alpha}_{t}\right)$ (Equation 6)
        $\triangleright$ Concatenate:
7:      $\mathbf{x}_{t^{+}}=[\mathbf{x}_{t^{+}}^{\text{num}},\mathbf{x}_{t^{+}}^{\rm cat}]$
        $\triangleright$ Numerical backward ODE:
8:      $d\mathbf{x}^{\text{num}}=(\mathbf{x}_{t^{+}}^{\text{num}}-\mu_{\theta}^{\rm num}(\mathbf{x}_{t^{+}},t^{+}))/\bm{\sigma}^{\text{num}}(t^{+})$
9:      $\mathbf{x}_{t-1}^{\text{num}}\leftarrow\mathbf{x}_{t^{+}}^{\text{num}}+(\bm{\sigma}^{\text{num}}(t-1)-\bm{\sigma}^{\text{num}}(t^{+}))\,d\mathbf{x}^{\text{num}}$
        $\triangleright$ Categorical backward sampling:
10:      Sample $\mathbf{x}_{t-1}^{\rm cat}\sim p_{\theta}(\mathbf{x}_{t-1}^{\rm cat}|\mathbf{x}_{t^{+}}^{\rm cat},\mu_{\theta}^{\rm cat}(\mathbf{x}_{t^{+}},t^{+}))$ (Equation 8)
11:  end for
12:  return  $\mathbf{x}_{0}^{\text{num}},\mathbf{x}_{0}^{\rm cat}$

One notable property of the joint sampling process is that a categorical feature, once decoded, is not updated again during sampling (see Equation 8). However, as tabular data are highly structured with complicated inter-column correlations, we expect the model to correct its errors during sampling. To this end, we introduce a novel stochastic sampler that restarts the backward process with an additional forward process at each denoising step. Related work on continuous diffusion (Karras et al., 2022; Xu et al., 2023) has shown that incorporating such stochasticity can yield better generation quality. We extend this intuition to both numerical and categorical features in tabular generation. At each sampling step $t$, we first increase the current time step $t$ by a small increment to $t^{+}=t+\gamma_{t}t$ according to a factor $\gamma_{t}$, and then perform the intermediate forward sampling from $t$ to $t^{+}$ with the joint diffusion process in Equations 3 and 6. From the increased-noise sample $\mathbf{x}_{t^{+}}$, we solve the ODE backward for $\mathbf{x}^{\rm num}$ and $\mathbf{x}^{\rm cat}$ from $t^{+}$ to $t-1$, respectively, with a single update. This framework enables self-correction by randomly perturbing decoded features in the forward step. We summarize the sampling framework in Algorithm 2, and provide the ablation study for the stochastic sampler in Section 4.4. We also provide an illustrative example of the sampling process in Appendix C.
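To make one iteration of the loop in Algorithm 2 concrete, here is a minimal NumPy sketch of a single stochastic step for one table row. It rests on several assumptions that are not fixed by this section: times are normalized to $[0,1]$ and the target time is written `s` instead of $t-1$; one scalar schedule is shared per feature type; `denoise_num` and `predict_cat` are hypothetical stand-ins for the two model heads; the categorical backward move unmasks a masked entry with the recovery probability $(\alpha_{s}-\alpha_{t^{+}})/(1-\alpha_{t^{+}})$; and `MASK` is an arbitrary integer code for the mask state. The sketch also requires $\alpha(t)>0$ at the starting time.

```python
import numpy as np

MASK = -1  # hypothetical integer code for the mask state m

def stochastic_step(x_num, x_cat, t, s, gamma, sigma, alpha,
                    denoise_num, predict_cat, rng):
    """One step from time t down to s < t, with times normalized to [0, 1]."""
    t_plus = min(t + gamma * t, 1.0)  # clamping to 1 is an assumption of this sketch

    # Forward perturbation from t to t_plus (Algorithm 2, lines 4-7).
    extra = np.sqrt(max(sigma(t_plus) ** 2 - sigma(t) ** 2, 0.0))
    xp_num = x_num + extra * rng.standard_normal(x_num.shape)
    remask = rng.random(x_cat.shape) < 1.0 - alpha(t_plus) / alpha(t)
    xp_cat = np.where(remask, MASK, x_cat)

    # Both heads are evaluated on the re-noised sample x_{t+}.
    x_hat = denoise_num(xp_num, xp_cat, t_plus)
    probs = predict_cat(xp_num, xp_cat, t_plus)   # shape (M_cat, C)

    # Numerical backward update (lines 8-9): a single Euler step.
    d = (xp_num - x_hat) / sigma(t_plus)
    new_num = xp_num + (sigma(s) - sigma(t_plus)) * d

    # Categorical backward update (line 10): only masked entries may change.
    p_unmask = (alpha(s) - alpha(t_plus)) / (1.0 - alpha(t_plus))
    unmask = (xp_cat == MASK) & (rng.random(xp_cat.shape) < p_unmask)
    draws = np.array([rng.choice(len(p), p=p) for p in probs])
    new_cat = np.where(unmask, draws, xp_cat)
    return new_num, new_cat

# Toy usage with placeholder schedules and model heads (all assumed).
rng = np.random.default_rng(0)
sigma = lambda u: u
alpha = lambda u: 1.0 - u
denoise = lambda xn, xc, u: np.zeros_like(xn)
predict = lambda xn, xc, u: np.full((3, 4), 0.25)
x_num = rng.standard_normal(2)
x_cat = np.array([MASK, 2, MASK])
new_num, new_cat = stochastic_step(x_num, x_cat, 0.5, 0.0, 0.1,
                                   sigma, alpha, denoise, predict, rng)
```

The re-masking in the forward half is what lets an already decoded category be revisited: without it, the backward update alone would never change an unmasked entry.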

2.5 Classifier-free Guidance Conditional Generation

TabDiff can also be extended to a conditional generative model, which is important in many tasks such as missing value imputation. Let $\mathbf{y}=\{[\mathbf{y}^{\rm num},\mathbf{y}^{\rm cat}]\}$ be the collection of provided properties in tabular data, containing both categorical and numerical features, and let $\mathbf{x}$ denote the missing features of interest in this section. Imputation means that we want to predict $\mathbf{x}=\{[\mathbf{x}^{\rm num},\mathbf{x}^{\rm cat}]\}$ conditioned on $\mathbf{y}$. TabDiff can be freely extended to conditional generation by conducting denoising sampling only for $\mathbf{x}_{t}$, while keeping the other given features $\mathbf{y}_{t}$ fixed as $\mathbf{y}$.

Previous works on diffusion models (Dhariwal & Nichol, 2021) show that conditional generation quality can be further improved with a guidance classifier/regressor $p(\mathbf{y}\mid\mathbf{x})$. However, training the guidance classifier becomes challenging when $\mathbf{x}$ is a high-dimensional discrete object, and existing methods typically handle this by relaxing $\mathbf{x}$ to be continuous (Vignac et al., 2023). Inspired by the classifier-free guidance (CFG) framework (Ho & Salimans, 2022) developed for continuous diffusion, we propose a unified CFG framework that eliminates the need for a classifier and handles mixed-type $\mathbf{x}$ and $\mathbf{y}$ effectively. The guided conditional sample distribution is given by $\tilde{p}_{\theta}(\mathbf{x}_{t}|\mathbf{y})\propto p_{\theta}(\mathbf{x}_{t}|\mathbf{y})\,p_{\theta}(\mathbf{y}|\mathbf{x}_{t})^{\omega}$, where $\omega>0$ controls the strength of the guidance. Applying Bayes' rule, we get

$$\tilde{p}_{\theta}(\mathbf{x}_{t}|\mathbf{y})\propto p_{\theta}(\mathbf{x}_{t}|\mathbf{y})\,p_{\theta}(\mathbf{y}|\mathbf{x}_{t})^{\omega}=p_{\theta}(\mathbf{x}_{t}|\mathbf{y})\left(\frac{p_{\theta}(\mathbf{x}_{t}|\mathbf{y})\,p(\mathbf{y})}{p_{\theta}(\mathbf{x}_{t})}\right)^{\omega}=\frac{p_{\theta}(\mathbf{x}_{t}|\mathbf{y})^{\omega+1}}{p_{\theta}(\mathbf{x}_{t})^{\omega}}\,p(\mathbf{y})^{\omega}. \tag{13}$$

We drop $p(\mathbf{y})$ because it does not depend on $\theta$. Taking the logarithm of the probabilities, we obtain

$$\log\tilde{p}_{\theta}(\mathbf{x}_{t}|\mathbf{y})=(1+\omega)\log p_{\theta}(\mathbf{x}_{t}|\mathbf{y})-\omega\log p_{\theta}(\mathbf{x}_{t}), \tag{14}$$

which implies the following changes in the sampling steps. For the numerical features, $\bm{\mu}_{\theta}^{\text{num}}(\mathbf{x}_{t},t)$ is replaced by the interpolation of the conditional and unconditional estimates (Ho & Salimans, 2022):

$$\tilde{\bm{\mu}}_{\theta}^{\text{num}}(\mathbf{x}_{t},\mathbf{y},t)=(1+\omega)\,\bm{\mu}_{\theta}^{\text{num}}(\mathbf{x}_{t},\mathbf{y},t)-\omega\,\bm{\mu}_{\theta}^{\text{num}}(\mathbf{x}_{t},t). \tag{15}$$
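In code, Equation 15 is a one-line combination of two network outputs. The sketch below uses made-up vectors in place of the conditional and unconditional estimates; the function name and the values are illustrative assumptions.

```python
import numpy as np

def guided_num(mu_cond, mu_uncond, omega):
    """Eq. 15: move from the unconditional estimate past the conditional one."""
    return (1.0 + omega) * mu_cond - omega * mu_uncond

mu_cond = np.array([1.0, -0.5])     # stand-in for mu_theta^num(x_t, y, t)
mu_uncond = np.array([0.5, 0.0])    # stand-in for mu_theta^num(x_t, t)
guided = guided_num(mu_cond, mu_uncond, omega=2.0)   # [2.0, -1.5]
```

Setting `omega=0` returns the conditional estimate unchanged, and larger values push the estimate further along the direction in which conditioning on $\mathbf{y}$ changed it.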

And for the categorical features, we instead predict $x_{0}$ with $\tilde{p}_{\theta}(\mathbf{x}_{s}^{\rm cat}|\mathbf{x}_{t},\mathbf{y})$, satisfying

$$\log\tilde{p}_{\theta}(\mathbf{x}_{s}^{\rm cat}|\mathbf{x}_{t},\mathbf{y})=(1+\omega)\log p_{\theta}(\mathbf{x}_{s}^{\rm cat}|\mathbf{x}_{t},\mathbf{y})-\omega\log p_{\theta}(\mathbf{x}_{s}^{\rm cat}|\mathbf{x}_{t}). \tag{16}$$
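For categorical features the combination in Equation 16 is applied to log-probabilities. The sketch below additionally renormalizes the result so that it can be sampled from; that renormalization step, the function name, and the example probabilities are assumptions of this sketch, and the inputs are assumed to be strictly positive.

```python
import numpy as np

def guided_cat(p_cond, p_uncond, omega):
    """Eq. 16 in log space, followed by renormalization (assumed here)."""
    logits = (1.0 + omega) * np.log(p_cond) - omega * np.log(p_uncond)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

p_cond = np.array([0.6, 0.4])     # stand-in for p_theta(x_s^cat | x_t, y)
p_uncond = np.array([0.5, 0.5])   # stand-in for p_theta(x_s^cat | x_t)
p_guided = guided_cat(p_cond, p_uncond, omega=1.0)
```

With these numbers the unnormalized weights are $0.6^{2}/0.5=0.72$ and $0.4^{2}/0.5=0.32$, so the category favored by the conditional model gains mass relative to `p_cond`.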

Under the missing value imputation task, our target columns are $\mathbf{x}$, and the remaining columns constitute $\mathbf{y}$. Implementing CFG becomes very lightweight: the guided probability reuses the original unconditional model trained over all table columns as the conditional model, and requires only one additional small model for the unconditional probabilities over the missing columns. We provide empirical results for CFG sampling in Section 4.3 and implementation details in Section B.2.
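The two guidance rules in Equations 15 and 16 can be illustrated with a small sketch. It is not the released implementation: the function names and the plain-list inputs are hypothetical, and renormalizing the guided categorical distribution over the categories is an assumption made here so that the output is a valid probability vector.

```python
import math

def guided_mean(mu_cond, mu_uncond, omega):
    """Eq. (15): (1 + omega) * mu_cond - omega * mu_uncond, element-wise."""
    return [(1 + omega) * c - omega * u for c, u in zip(mu_cond, mu_uncond)]

def guided_categorical(p_cond, p_uncond, omega, eps=1e-12):
    """Eq. (16) in log space, then renormalized (assumption) to sum to one."""
    logits = [
        (1 + omega) * math.log(max(c, eps)) - omega * math.log(max(u, eps))
        for c, u in zip(p_cond, p_uncond)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# omega = 0 recovers the conditional prediction; omega > 0 pushes the
# prediction further away from the unconditional one.
mean_guided = guided_mean([1.0], [0.5], omega=0.6)
probs_guided = guided_categorical([0.7, 0.3], [0.5, 0.5], omega=0.6)
```

With $\omega=0$ both functions return the conditional prediction unchanged, which matches the reading of $\omega=0$ as sampling without guidance.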

3 Related Work

Recent studies have developed different generative models for tabular data, including VAE-based methods such as TVAE (Xu et al., 2019) and GOGGLE (Liu et al., 2023), and GAN (Generative Adversarial Network)-based methods such as CTGAN (Xu et al., 2019) and TableGAN (Park et al., 2018). These methods usually lack sufficient model expressivity for complicated data distributions. Recently, diffusion models have shown powerful generative ability for diverse data types and have thus been adopted by many tabular generation methods. Kotelnikov et al. (2023) and Lee et al. (2023) designed separate discrete-time diffusion processes (Austin et al., 2021) for numerical and categorical features. However, they built their diffusion processes on discrete time steps, which have been proven to yield a looser ELBO estimation and thus lead to sub-optimal generation quality (Song et al., 2021; Kingma et al., 2021). To tackle this problem caused by the limited discretization of diffusion processes and move to a continuous-time framework, Zheng & Charoenphakdee (2022) and Zhang et al. (2024) transform features into a latent continuous space via various encoding techniques, since advanced diffusion models are mainly designed for continuous random variables with Gaussian perturbation and thus cannot directly handle tabular data. However, it has been shown that these solutions either are trapped with sub-optimal performance due to encoding overhead or cannot capture complex co-occurrence patterns of different modalities because of the indirect modeling and low model capacity. The concurrent work of Mueller et al. (2024) also proposed feature-wise diffusion schedules, but the model still relies on encoding to a continuous latent space with a Gaussian diffusion framework. In summary, none of the existing methods has explored the powerful mixed-type diffusion framework in the continuous-time limit or explicitly tackled the feature-wise heterogeneity issue in the mixed-type diffusion process.

4 Experiments

We evaluate TabDiff by comparing it to various generative models across multiple datasets and metrics, ranging from data fidelity and privacy to downstream task performance. Furthermore, we conduct ablation studies to investigate the effectiveness of each component of TabDiff, e.g., the learnable noise schedules.

4.1 Experimental Setups

Datasets. We conduct experiments on seven real-world tabular datasets – Adult, Default, Shoppers, Magic, Beijing, News, and Diabetes – each containing both numerical and categorical attributes. In addition, each dataset has an inherent machine-learning task, either classification or regression. Detailed profiles of the datasets are presented in Section A.1.

Baselines. We compare the proposed TabDiff with eight popular synthetic tabular data generation methods that are categorized into four groups: 1) GAN-based method: CTGAN (Xu et al., 2019); 2) VAE-based methods: TVAE (Xu et al., 2019) and GOGGLE (Liu et al., 2023); 3) Autoregressive Language Model: GReaT (Borisov et al., 2023); 4) Diffusion-based methods: STaSy (Kim et al., 2023), CoDi (Lee et al., 2023), TabDDPM (Kotelnikov et al., 2023) and TabSyn (Zhang et al., 2024).

Evaluation Methods. Following the previous methods (Zhang et al., 2024), we evaluate the quality of the synthetic data using eight distinct metrics categorized into three groups – 1) Fidelity: Shape, Trend, $\alpha$-Precision, $\beta$-Recall, and Detection assess how well the synthetic data can faithfully recover the ground-truth data distribution; 2) Downstream tasks: Machine learning efficiency and missing value imputation reveal the models’ potential to power downstream tasks; 3) Privacy: The Distance to Closest Records (DCR) score evaluates the level of privacy protection by measuring how closely the synthetic data resembles the training data. We provide a detailed introduction of all these metrics in Section A.2.

Implementation Details. All reported experimental results are averaged over 20 randomly sampled synthetic datasets generated by the best-validated models. Additional implementation details, such as the hardware/software information and hyperparameter settings, are in Appendix D.

4.2 Data Fidelity and Privacy

Shape and Trend. We first evaluate the fidelity of synthetic data using the Shape and Trend metrics. Shape measures the synthetic data’s ability to capture each single column’s marginal density, while Trend assesses its capacity to replicate the correlation between different columns in the real data.

The detailed results for Shape and Trend metrics, measured across each dataset separately, are presented in Tables 1 and 2, respectively. On the Shape metric, TabDiff outperforms all baselines on five out of seven datasets and surpasses the current state-of-the-art method TabSyn by an average of 13.3%. This demonstrates TabDiff’s superior performance in maintaining the marginal distribution of individual attributes across various datasets. Regarding the Trend metric, TabDiff consistently outperforms all baselines and surpasses TabSyn by 22.6%. This significant improvement suggests that TabDiff is substantially better at capturing column-column relationships than previous methods. Notably, TabDiff maintains strong performance on Diabetes, a larger, more categorical-heavy dataset, surpassing the most competitive baseline by over 35% on both Shape and Trend. This exceptional performance thus demonstrates our model’s capacity to model datasets with higher dimensionality and discrete features.
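To make the two metrics concrete, the sketch below shows one plausible way to score a numerical column (Shape) and a pair of numerical columns (Trend). The exact definitions used in the paper are given in Section A.2 and may differ: the Kolmogorov-Smirnov statistic and the halved Pearson-correlation difference used here are assumptions for illustration, the function names are hypothetical, and categorical columns would need a different distance.

```python
from bisect import bisect_right
from math import sqrt

def shape_error(real_col, syn_col):
    """Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    real_sorted, syn_sorted = sorted(real_col), sorted(syn_col)
    grid = sorted(set(real_sorted) | set(syn_sorted))
    return max(
        abs(bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(syn_sorted, x) / len(syn_sorted))
        for x in grid
    )

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def trend_error(real_a, real_b, syn_a, syn_b):
    """Half the absolute difference between real and synthetic correlations."""
    return abs(pearson(real_a, real_b) - pearson(syn_a, syn_b)) / 2

# A synthetic column shifted by two units, and a synthetic pair whose
# correlation has the opposite sign of the real pair.
shifted = shape_error([1, 2, 3, 4], [3, 4, 5, 6])
flipped = trend_error([1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 3, 4], [8, 6, 4, 2])
```

Both quantities lie in [0, 1], with 0 meaning that the synthetic data matches the real data on that column or column pair; a table-level error rate would then average these values over all columns or pairs.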

Additional Fidelity Metrics. We further evaluate fidelity using the $\alpha$-Precision, $\beta$-Recall, and C2ST scores. On average, TabDiff outperforms other methods on all three metrics. We present the results for these three additional fidelity metrics in Section E.1.

Data Privacy. The ability to protect privacy is another important factor when evaluating synthetic data, since we wish the synthetic data to be uniformly sampled from the data distribution manifold rather than being copied (or slightly modified) from each individual real data example. In this section, we use the Distance to Closest Records (DCR) score metric (Zhang et al., 2024), which measures the probability that a synthetic example’s nearest neighbor is from the holdout set versus the training set.

Due to space limits, the explanations for the additional fidelity metrics and data privacy metrics, along with the corresponding experiments, are deferred to Sections A.2 and E.
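The DCR score described above can be sketched as follows. This is only an illustration under stated assumptions: rows are taken to be already encoded as numeric vectors, the L1 distance is an arbitrary choice, ties are counted as "not closer to training", and `dcr_score` is a hypothetical name rather than a function from the paper's code.

```python
def dcr_score(synthetic, train, holdout):
    """Fraction of synthetic rows whose closest real record lies in the training set."""
    def nearest_distance(row, table):
        # L1 distance to the closest record of `table` (assumed distance).
        return min(sum(abs(a - b) for a, b in zip(row, other)) for other in table)

    closer_to_train = sum(
        nearest_distance(row, train) < nearest_distance(row, holdout)
        for row in synthetic
    )
    return closer_to_train / len(synthetic)

train = [[0.0, 0.0], [1.0, 1.0]]
holdout = [[10.0, 10.0], [9.0, 9.0]]
balanced = dcr_score([[0.1, 0.0], [9.9, 10.0]], train, holdout)
memorized = dcr_score([[0.0, 0.0], [1.0, 1.0]], train, holdout)
```

Assuming the training and holdout sets have the same size, a value close to 0.5 would suggest that synthetic rows are no closer to the training data than to unseen data, whereas a value near 1 would suggest that training records are being copied.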

4.3 Performance on Downstream Tasks

Table 1: Performance comparison on the error rates (%) of Shape.

| Method | Adult | Default | Shoppers | Magic | Beijing | News | Diabetes | Average |
|---|---|---|---|---|---|---|---|---|
| CTGAN | 16.84±0.03 | 16.83±0.04 | 21.15±0.10 | 9.81±0.08 | 21.39±0.05 | 16.09±0.02 | 9.82±0.08 | 15.99 |
| TVAE | 14.22±0.08 | 10.17±0.05 | 24.51±0.06 | 8.25±0.06 | 19.16±0.06 | 16.62±0.03 | 18.86±0.13 | 15.97 |
| GOGGLE | 16.97 | 17.02 | 22.33 | 1.90 | 16.93 | 25.32 | 24.92 | 17.91 |
| GReaT | 12.12±0.04 | 19.94±0.06 | 14.51±0.12 | 16.16±0.09 | 8.25±0.12 | OOM | OOM | 14.20 |
| STaSy | 11.29±0.06 | 5.77±0.06 | 9.37±0.09 | 6.29±0.13 | 6.71±0.03 | 6.89±0.03 | OOM | 7.72 |
| CoDi | 21.38±0.06 | 15.77±0.07 | 31.84±0.05 | 11.56±0.26 | 16.94±0.02 | 32.27±0.04 | 21.13±0.25 | 21.55 |
| TabDDPM | 1.75±0.03 | 1.57±0.08 | 2.72±0.13 | 1.01±0.09 | 1.30±0.03 | 78.75±0.01 | 31.44±0.05 | 16.93 |
| TabSyn$^1$ | 0.81±0.05 | **1.01±0.08** | 1.44±0.07 | 1.03±0.14 | 1.26±0.05 | **2.06±0.04** | 1.85±0.02 | 1.35 |
| TabDiff | **0.63±0.05** | 1.24±0.07 | **1.28±0.09** | **0.78±0.08** | **1.03±0.05** | 2.35±0.03 | **0.89±0.23** | **1.17** |
| Improv. | 22.2%↓ | 0.0%↓ | 11.11%↓ | 14.29%↓ | 18.25%↓ | 0%↓ | 46.39%↓ | 13.3%↓ |

$^1$ TabSyn’s performance is obtained via our reproduction. The results of other baselines, except on Diabetes, are taken from Zhang et al. (2024). The OOM entries are explained in Appendix D.

Table 2: Performance comparison on the error rates (%) of Trend.

| Method | Adult | Default | Shoppers | Magic | Beijing | News | Diabetes | Average |
|---|---|---|---|---|---|---|---|---|
| CTGAN | 20.23±1.20 | 26.95±0.93 | 13.08±0.16 | 7.00±0.19 | 22.95±0.08 | 5.37±0.05 | 18.95±0.34 | 16.36 |
| TVAE | 14.15±0.88 | 19.50±0.95 | 18.67±0.38 | 5.82±0.49 | 18.01±0.08 | 6.17±0.09 | 32.74±0.26 | 16.44 |
| GOGGLE | 45.29 | 21.94 | 23.90 | 9.47 | 45.94 | 23.19 | 27.56 | 28.18 |
| GReaT | 17.59±0.22 | 70.02±0.12 | 45.16±0.18 | 10.23±0.40 | 59.60±0.55 | OOM | OOM | 44.24 |
| STaSy | 14.51±0.25 | 5.96±0.26 | 8.49±0.15 | 6.61±0.53 | 8.00±0.10 | 3.07±0.04 | OOM | 7.77 |
| CoDi | 22.49±0.08 | 68.41±0.05 | 17.78±0.11 | 6.53±0.25 | 7.07±0.15 | 11.10±0.01 | 29.21±0.12 | 23.21 |
| TabDDPM | 3.01±0.25 | 4.89±0.10 | 6.61±0.16 | 1.70±0.22 | 2.71±0.09 | 13.16±0.11 | 51.54±0.05 | 11.95 |
| TabSyn | 1.93±0.07 | 2.81±0.48 | 2.13±0.10 | 0.88±0.18 | 3.13±0.34 | 1.52±0.03 | 3.90±0.04 | 2.33 |
| TabDiff | **1.49±0.16** | **2.55±0.75** | **1.74±0.08** | **0.76±0.12** | **2.59±0.15** | **1.28±0.04** | **2.20±0.16** | **1.80** |
| Improv. | 22.8%↓ | 9.3%↓ | 18.3%↓ | 13.6%↓ | 4.4%↓ | 15.8%↓ | 37.3%↓ | 22.6%↓ |

Table 3: Evaluation of MLE (Machine Learning Efficiency): AUC and RMSE are used for classification and regression tasks, respectively.

| Methods | Adult (AUC ↑) | Default (AUC ↑) | Shoppers (AUC ↑) | Magic (AUC ↑) | Beijing (RMSE ↓) | News$^1$ (RMSE ↓) | Diabetes (AUC ↑) | Average Gap (%) |
|---|---|---|---|---|---|---|---|---|
| Real | .927±.000 | .770±.005 | .926±.001 | .946±.001 | .423±.003 | .842±.002 | .704±.002 | 0.0 |
| CTGAN | .886±.002 | .696±.005 | .875±.009 | .855±.006 | .902±.019 | .880±.016 | .569±.004 | 23.7 |
| TVAE | .878±.004 | .724±.005 | .871±.006 | .887±.003 | .770±.011 | 1.01±.016 | .594±.009 | 20.2 |
| GOGGLE | .778±.012 | .584±.005 | .658±.052 | .654±.024 | 1.09±.025 | .877±.002 | .475±.008 | 42.1 |
| GReaT | .913±.003 | .755±.006 | .902±.005 | .888±.008 | .653±.013 | OOM | OOM | 13.3 |
| STaSy | .906±.001 | .752±.006 | .914±.005 | .934±.003 | .656±.014 | .871±.002 | OOM | 10.9 |
| CoDi | .871±.006 | .525±.006 | .865±.006 | .932±.003 | .818±.021 | 1.21±.005 | .505±.004 | 30.2 |
| TabDDPM | .907±.001 | .758±.004 | .918±.005 | .935±.003 | .592±.011 | 4.86±3.04 | .521±.008 | 11.95 |
| TabSyn | .909±.001 | **.763±.002** | .914±.004 | **.937±.002** | .580±.009 | **.862±.024** | .684±.002 | 6.78 |
| TabDiff | **.912±.002** | **.763±.005** | **.921±.004** | .936±.003 | **.555±.013** | .866±.021 | **.689±.016** | **5.76** |

Machine Learning Efficiency. A key advantage of high-quality synthetic data is its ability to serve as an anonymized proxy for real datasets and power effective learning on downstream tasks such as classification and regression. We measure the synthetic table’s capacity to support downstream task learning via Machine Learning Efficiency (MLE). Following established protocols (Kim et al., 2023; Lee et al., 2023; Xu et al., 2019), we first split the real dataset into training and test sets, then train the given generative model on the real training set. Subsequently, we sample a synthetic dataset of equal size to the real training set from the models and use it to train an XGBoost Classifier or XGBoost Regressor (Chen & Guestrin, 2016). Finally, we evaluate these machine learning models against the real test set to calculate the AUC score and RMSE for classification and regression tasks, respectively.
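The protocol above (split the real data, fit the generator on the real training split, train a downstream model on synthetic data only, then test on real data) can be sketched in a few lines. Everything model-specific here is a placeholder and an assumption: `bootstrap_generator` merely resamples training rows in place of a real generative model such as TabDiff, and `mean_direction_classifier` is a toy one-feature scorer standing in for the XGBoost classifier used in the paper.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mle_auc(real_rows, fit_generator, fit_classifier, test_frac=0.2, seed=0):
    """Train on synthetic data, evaluate on the held-out real test split."""
    rng = random.Random(seed)
    rows = list(real_rows)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_frac)
    test, train = rows[:n_test], rows[n_test:]   # 1) split the real data
    sample = fit_generator(train, rng)           # 2) fit generator on real train
    synthetic = sample(len(train))               # 3) same size as real train
    predict = fit_classifier(synthetic)          # 4) downstream model sees synthetic only
    return auc([predict(x) for x, _ in test], [y for _, y in test])

def bootstrap_generator(train, rng):
    # Placeholder generator (assumption): resamples training rows with replacement.
    return lambda n: [rng.choice(train) for _ in range(n)]

def mean_direction_classifier(rows):
    # Placeholder classifier (assumption): scores along the class-mean direction.
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    direction = sum(pos) / len(pos) - sum(neg) / len(neg)
    return lambda x: direction * x

# Toy one-feature binary dataset: rows are (feature, label) pairs.
data_rng = random.Random(1)
real = [(data_rng.gauss(2.0 * y, 1.0), y) for y in (0, 1) for _ in range(200)]
score = mle_auc(real, bootstrap_generator, mean_direction_classifier)
```

For regression datasets the same skeleton applies with RMSE replacing AUC. Keeping the generator and the downstream model as plain callables makes it clear that the real test split is never seen by either of them.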

According to the MLE results presented in Table 3, TabDiff consistently achieves the best or second-best performance across all datasets, with the highest average performance outperforming the most competitive baseline TabSyn by 15.0%. This demonstrates our method’s competitive capacity to capture and replicate key features of the real data that are most relevant to learning downstream machine learning tasks. However, while TabDiff shows strong performance on MLE, we observe that methods with varying performance on data fidelity metrics might have very close MLE scores. This suggests that the MLE score evaluated under the current setting may not be a reliable indicator of data quality. Therefore, we complement MLE with additional quality metrics in Appendix E, which better highlight the superior performance of TabDiff.
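The "Average Gap" column of Table 3 is not defined in this section. Assuming it is the mean, over the seven datasets, of the relative distance between a method's score and the score obtained with real data, the short computation below recovers the reported values for TabSyn and TabDiff from the per-dataset means, as well as the 15.0% relative improvement. The dictionary names are illustrative only.

```python
# Per-dataset mean scores copied from Table 3 (AUC or RMSE, depending on the dataset).
REAL = {"Adult": .927, "Default": .770, "Shoppers": .926, "Magic": .946,
        "Beijing": .423, "News": .842, "Diabetes": .704}
TABSYN = {"Adult": .909, "Default": .763, "Shoppers": .914, "Magic": .937,
          "Beijing": .580, "News": .862, "Diabetes": .684}
TABDIFF = {"Adult": .912, "Default": .763, "Shoppers": .921, "Magic": .936,
           "Beijing": .555, "News": .866, "Diabetes": .689}

def average_gap(method, real=REAL):
    """Mean relative gap to the real-data score, in percent (assumed definition)."""
    # The absolute value handles both AUC (higher is better) and RMSE (lower is better).
    gaps = [abs(method[name] - real[name]) / real[name] for name in real]
    return 100 * sum(gaps) / len(gaps)

tabsyn_gap = average_gap(TABSYN)
tabdiff_gap = average_gap(TABDIFF)
relative_improvement = 100 * (tabsyn_gap - tabdiff_gap) / tabsyn_gap
print(round(tabsyn_gap, 2), round(tabdiff_gap, 2), round(relative_improvement, 1))
```

Under this assumed definition, the Beijing regression task contributes by far the largest share of both averages, since its relative RMSE gap exceeds 30% for both methods while the remaining gaps stay below 3%.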

Table 4: Performance of TabDiff in the Missing Value Imputation task. We draw a direct comparison to the generative approach employed by TabSyn, with the performance of XGBoost classifiers/regressors included as a reference.

| Methods | Adult (AUC ↑) | Default (AUC ↑) | Shoppers (AUC ↑) | Magic (AUC ↑) | Beijing (RMSE ↓) | News (RMSE ↓) | Diabetes (AUC ↑) | Avg. Improv. (%) |
|---|---|---|---|---|---|---|---|---|
| Predicted by XGBoost | 92.7 | 77.0 | 92.6 | 94.6 | 0.423 | 0.842 | 70.4 | 0.0 |
| Impute with TabSyn | 93.1 | 86.7 | **96.5** | 91.3 | **0.386** | 0.818 | 66.6 | 4.99 |
| Impute with TabDiff + CFG ($\omega=0.0$) | 92.5 | 91.6 | 95.7 | 92.5 | 0.424 | 0.828 | 66.0 | 3.76 |
| Impute with TabDiff + CFG ($\omega=0.6$) | **93.2** | **91.7** | 96.4 | **93.0** | 0.414 | **0.815** | **66.9** | **5.60** |

Missing Value Imputation. We further evaluate TabDiff’s conditional generation capacity through the Missing Value Imputation task. Following the approach in Zhang et al. (2024), we treat the inherent classification/regression task of each dataset as an imputation task. Specifically, for each table, we train generative models on the training set to generate the target column while conditioning on the remaining columns. The imputation performance is measured by the model’s accuracy in recovering the target column of the test set. Implementing classifier-free guidance (CFG) for this task is straightforward. We approximate the conditional model using the unconditioned TabDiff trained on all columns from the previous unconditional generation tasks. For the unconditional model, we train TabDiff on the target column with a significantly smaller denoising network. Detailed implementation is provided in Appendix D, and results are presented in Table 4.

As demonstrated, TabDiff achieves higher imputation accuracy than TabSyn on five out of seven datasets, with an average improvement of 5.60% over the non-generative XGBoost classifier. This indicates TabDiff’s superior capacity for conditional tabular data generation. Moreover, we empirically demonstrate the efficacy of our CFG framework by showing that the model consistently performs better with $\omega=0.6$ compared to $\omega=0.0$ (which is equivalent to TabDiff without CFG).

4.4 Ablation Studies

Table 5: Ablation Studies on the stochastic sampler and learnable noise schedules.

| Method | Shape | Trend |
|---|---|---|
| TabSyn | 1.35 | 2.33 |
| TabDiff-Fix.+Det. | 1.39 | 2.29 |
| TabDiff-Fix.+Sto. | 1.20 | 1.93 |
| TabDiff-Learn.+Det. | 1.24 | 1.92 |
| TabDiff-Learn.+Sto. | **1.17** | **1.80** |

Figure 2: The adaptively learnable noise schedules reduce training loss.

Stochastic Sampler. We conduct ablation studies to assess the effectiveness of the stochastic sampler, discussed in Section 2.4. The results are presented in Table 5. We use ‘Det.’ and ‘Sto.’ as abbreviations for deterministic and stochastic samplers. The deterministic sampler refers to the conventional diffusion backward process described in Song et al. (2021); Karras et al. (2022), consisting of a series of deterministic ODE steps. According to Table 5, under both fixed and learnable noise schedules, TabDiff with the stochastic sampler consistently outperforms the deterministic sampler on the fidelity metrics Shape and Trend. This confirms the efficacy of additional stochasticity in reducing decoding errors during backward diffusion sampling.

Adaptively Learnable Noise Schedule. Next, we perform an ablation study to evaluate the effectiveness of our adaptively learnable noise schedules. We compare the model with learnable schedules against the model with non-learnable noise schedules, where the noise parameters for numerical features are fixed to $\rho_{i}\equiv 7,\ \forall i$ in Equation 10 and those for categorical features are fixed to $k_{j}\equiv 1,\ \forall j$ in Equation 11. We refer to these models as ‘Learn.’ and ‘Fix.’, respectively. According to the results in Table 5, the learnable noise schedules substantially improve performance, particularly in Trend, regardless of whether the stochastic sampler is enabled. Furthermore, we closely examine the training process of both models on the Adult dataset by plotting their training loss curves in Figure 2. According to the figure, the learnable schedules (orange curves) significantly reduce both numerical and categorical losses in Equation 12.

4.5 Visualizations of Synthetic Data

Figure 3: Visualization of the marginal densities of the generated data in comparison to the real data. Top and Middle: individual numerical column; Bottom: individual categorical column.
Figure 4: Pair-wise correlation heatmaps. Values represent the error rate (the lighter, the better).

We present a comprehensive set of visualizations to compare single-column marginal distributions and pairwise correlations across four models (CoDi, TabDDPM, TabSyn, and our TabDiff) and four distinct datasets: Adult, Beijing, Magic, and Shoppers. In Figure 3, we provide 1-dimensional kernel density estimation (KDE) curves for chosen numerical features, alongside histograms for a chosen categorical feature. According to the figures, the density of TabDiff’s samples matches most closely with that of the real data, highlighting TabDiff’s ability to capture the original distribution patterns. Furthermore, in Figure 4, we include correlation heatmaps that show the correlation error rate for each pair of columns. These heatmaps consistently demonstrate that TabDiff achieves the closest match to real correlation scores, highlighting its superior ability to capture the column-wise correlation of the real data.

5 Conclusion

In this paper, we have introduced TabDiff, a mixed-type diffusion framework for generating high-quality synthetic data. TabDiff uses a hybrid diffusion process to handle numerical and categorical features in their native formats. To address the disparate distributions of features and their interrelationships, we further introduced several key innovations, including learnable column-wise noise schedules and the stochastic sampler. We conducted extensive experiments using a diverse set of datasets and metrics, comprehensively comparing TabDiff with existing approaches. The results demonstrate TabDiff’s superior capacity in learning the original data distribution and generating faithful and diverse synthetic data to power downstream tasks.

Acknowledgment

We gratefully acknowledge the support of NSF under Nos. OAC-1835598 (CINES), CCF-1918940 (Expeditions), DMS-2327709 (IHBEM), IIS-2403318 (III); Stanford Data Applications Initiative, Wu Tsai Neurosciences Institute, Stanford Institute for Human-Centered AI, Chan Zuckerberg Initiative, Amazon, Genentech, GSK, Hitachi, SAP, and UCB. We also gratefully acknowledge the support of ARO (W911NF-21-1-0125), ONR (N00014-23-1-2159), and Chan Zuckerberg Biohub. Minkai Xu thanks the generous support of Sequoia Capital Stanford Graduate Fellowship.

References

  • Alaa et al. (2022) Ahmed Alaa, Boris Van Breugel, Evgeny S Saveliev, and Mihaela van der Schaar. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, pp. 290–306. PMLR, 2022.
  • Assefa et al. (2021) Samuel A. Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E. Tillman, Prashant Reddy, and Manuela Veloso. Generating synthetic data in finance: Opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, ICAIF ’20. Association for Computing Machinery, 2021. ISBN 9781450375849.
  • Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
  • Borisov et al. (2023) Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic tabular data generators. In The Eleventh International Conference on Learning Representations, 2023.
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  • Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021.
  • Fonseca & Bacao (2023) Joao Fonseca and Fernando Bacao. Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10(1):115, 2023.
  • Hernandez et al. (2022) Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Synthetic data generation for tabular health records: A systematic review. Neurocomputing, 493:28–45, 2022.
  • Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp.  6840–6851, 2020.
  • Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp.  26565–26577, 2022.
  • Kim et al. (2022) Jayoung Kim, Chaejeong Lee, Yehjin Shin, Sewon Park, Minjung Kim, Noseong Park, and Jihoon Cho. Sos: Score-based oversampling for tabular data. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  762–772, 2022.
  • Kim et al. (2023) Jayoung Kim, Chaejeong Lee, and Noseong Park. Stasy: Score-based tabular data synthesis. In The Eleventh International Conference on Learning Representations, 2023.
  • Kingma et al. (2021) Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • Kotelnikov et al. (2023) Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pp.  17564–17579. PMLR, 2023.
  • Lee et al. (2023) Chaejeong Lee, Jayoung Kim, and Noseong Park. Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. In International Conference on Machine Learning, pp.  18940–18956. PMLR, 2023.
  • Liu et al. (2023) Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. Goggle: Generative modelling for tabular data by learning relational structure. In The Eleventh International Conference on Learning Representations, 2023.
  • Mueller et al. (2024) Markus Mueller, Kathrin Gruber, and Dennis Fok. Continuous diffusion for mixed-type tabular data, 2024. URL https://arxiv.org/abs/2312.10431.
  • Park et al. (2018) Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. Data synthesis based on generative adversarial networks. Proceedings of the VLDB Endowment, 11(10), 2018.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Sahoo et al. (2024) Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024.
  • Shi et al. (2024) Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  • Song & Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations, 2021.
  • Vignac et al. (2023) Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2023.
  • Xu et al. (2019) Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional gan. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.  7335–7345, 2019.
  • Xu et al. (2023) Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes. Advances in Neural Information Processing Systems, 36:76806–76838, 2023.
  • You et al. (2020) Jiaxuan You, Xiaobai Ma, Yi Ding, Mykel J Kochenderfer, and Jure Leskovec. Handling missing data with graph representation learning. Advances in Neural Information Processing Systems, 33:19075–19087, 2020.
  • Zhang et al. (2024) Hengrui Zhang, Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with score-based diffusion in latent space. In The Twelfth International Conference on Learning Representations, 2024.
  • Zheng & Charoenphakdee (2022) Shuhan Zheng and Nontawat Charoenphakdee. Diffusion models for missing value imputation in tabular data. arXiv preprint arXiv:2210.17128, 2022.

Appendix A Detailed Experiment Setups

A.1 Datasets

We use seven tabular datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/datasets): Adult, Default, Shoppers, Magic, Beijing, News, and Diabetes. Each dataset is associated with a machine-learning task: classification for Adult, Default, Magic, Shoppers, and Diabetes, and regression for Beijing and News. The statistics of the datasets are presented in Table 6.

Table 6: Statistics of datasets. # Num stands for the number of numerical columns, and # Cat stands for the number of categorical columns. # Max Cat stands for the number of categories of the categorical column with the most categories.

| Dataset | # Rows | # Num | # Cat | # Max Cat | # Train | # Validation | # Test | Task |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Adult | 48,842 | 6 | 9 | 42 | 28,943 | 3,618 | 16,281 | Classification |
| Default | 30,000 | 14 | 11 | 11 | 24,000 | 3,000 | 3,000 | Classification |
| Shoppers | 12,330 | 10 | 8 | 20 | 9,864 | 1,233 | 1,233 | Classification |
| Magic | 19,019 | 10 | 1 | 2 | 15,215 | 1,902 | 1,902 | Classification |
| Beijing | 43,824 | 7 | 5 | 31 | 35,058 | 4,383 | 4,383 | Regression |
| News | 39,644 | 46 | 2 | 7 | 31,714 | 3,965 | 3,965 | Regression |
| Diabetes | 101,766 | 9 | 27 | 716 | 61,059 | 20,353 | 20,354 | Classification |

A.2 Metrics

A.2.1 Shape and Trend

Shape and Trend are metrics from SDMetrics (https://docs.sdv.dev/sdmetrics). They measure column-wise density estimation performance and pair-wise column correlation estimation performance, respectively. Shape uses the Kolmogorov-Smirnov Test (KST) for numerical columns and the Total Variation Distance (TVD) for categorical columns to quantify column-wise density estimation. Trend uses Pearson correlation for numerical columns and contingency similarity for categorical columns to quantify pair-wise correlation.

Shape. Kolmogorov-Smirnov Test (KST): Given two (continuous) distributions $p_r(x)$ and $p_s(x)$ ($r$ denotes real and $s$ denotes synthetic), KST quantifies the distance between the two distributions as the supremum of the discrepancy between the two corresponding Cumulative Distribution Functions (CDFs):

$$\mathrm{KST} = \sup_{x} |F_r(x) - F_s(x)|, \tag{17}$$

where $F_r(x)$ and $F_s(x)$ are the CDFs of $p_r(x)$ and $p_s(x)$, respectively:

$$F(x) = \int_{-\infty}^{x} p(x)\,\mathrm{d}x. \tag{18}$$
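As a rough illustration of Equations 17 and 18, the statistic can be estimated from two finite samples by replacing the true CDFs with empirical ones. The sketch below is a minimal pure-Python version under that assumption; the function name `ks_statistic` is hypothetical and this is not the SDMetrics implementation.

```python
from bisect import bisect_right


def ks_statistic(real, synth):
    """Estimate sup_x |F_r(x) - F_s(x)| using empirical CDFs of two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synth)
    n_r, n_s = len(real_sorted), len(synth_sorted)
    best = 0.0
    # Empirical CDFs are step functions that only change at sample points,
    # so it is enough to evaluate the gap at every observed value.
    for x in real_sorted + synth_sorted:
        f_r = bisect_right(real_sorted, x) / n_r    # fraction of real values <= x
        f_s = bisect_right(synth_sorted, x) / n_s   # fraction of synthetic values <= x
        best = max(best, abs(f_r - f_s))
    return best


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value of 0 means the two empirical CDFs coincide, and a value of 1 means the two samples do not overlap at all.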

Total Variation Distance: TVD computes the frequency of each category value and expresses it as a probability. The TVD score is then half the sum of the absolute differences between the real and synthetic probabilities over all categories:

$$\mathrm{TVD} = \frac{1}{2}\sum_{\omega\in\Omega} |R(\omega) - S(\omega)|, \tag{19}$$

where $\omega$ ranges over the set $\Omega$ of all possible categories in a column, and $R(\cdot)$ and $S(\cdot)$ denote the real and synthetic frequencies of these categories, respectively.
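Equation 19 can be sketched in a few lines for one categorical column. The version below is a minimal illustration that assumes each column is given as a plain list of category labels; the function name `total_variation_distance` is hypothetical.

```python
from collections import Counter


def total_variation_distance(real, synth):
    """Half the summed absolute gap between real and synthetic category frequencies."""
    real_counts = Counter(real)
    synth_counts = Counter(synth)
    # Take the union so categories missing from one side count with frequency 0.
    categories = set(real_counts) | set(synth_counts)
    n_r, n_s = len(real), len(synth)
    return 0.5 * sum(
        abs(real_counts[c] / n_r - synth_counts[c] / n_s) for c in categories
    )


print(total_variation_distance(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Categories that appear in only one of the two columns contribute their full frequency to the distance.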

Trend. Pearson Correlation Coefficient: The Pearson correlation coefficient measures whether two continuous distributions are linearly correlated and is computed as:

$$\rho_{x,y} = \frac{\mathrm{Cov}(x,y)}{\sigma_x \sigma_y}, \tag{20}$$

where $x$ and $y$ are two continuous columns, $\mathrm{Cov}$ is the covariance, and $\sigma$ is the standard deviation.

Then, the performance of correlation estimation is measured by the average difference between the real data’s correlations and the synthetic data’s correlations:

$$\text{Pearson Score} = \frac{1}{2}\mathbb{E}_{x,y} |\rho^{R}(x,y) - \rho^{S}(x,y)|, \tag{21}$$

where $\rho^{R}(x,y)$ and $\rho^{S}(x,y)$ denote the Pearson correlation coefficients between column $x$ and column $y$ in the real data and the synthetic data, respectively. As $\rho \in [-1,1]$, the average difference is divided by $2$ so that the score falls in the range $[0,1]$; the smaller the score, the better the estimation.
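To make Equations 20 and 21 concrete, the sketch below computes the score over all pairs of numerical columns. It assumes each table is a dict mapping column names to equal-length lists of numbers and that no column is constant (a constant column would make the denominator of Equation 20 zero); the names `pearson` and `pearson_score` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # The 1/n factors of covariance and standard deviations cancel in the ratio.
    return cov / (norm_x * norm_y)


def pearson_score(real_cols, synth_cols):
    """Half the mean absolute gap between real and synthetic pairwise correlations."""
    pairs = list(combinations(sorted(real_cols), 2))
    gaps = [
        abs(pearson(real_cols[a], real_cols[b]) - pearson(synth_cols[a], synth_cols[b]))
        for a, b in pairs
    ]
    return 0.5 * sum(gaps) / len(gaps)


real = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]}
synth = {"a": [1, 2, 3, 4], "b": [8, 6, 4, 2]}
print(pearson_score(real, synth))
```

In this toy input the real columns are perfectly positively correlated and the synthetic ones perfectly negatively correlated, which is the worst case the score can report.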

Contingency similarity: For a pair of categorical columns $A$ and $B$, the contingency similarity score computes the difference between the contingency tables using the Total Variation Distance. The process is summarized by the formula below:

$$\text{Contingency Score} = \frac{1}{2}\sum_{\alpha\in A}\sum_{\beta\in B} |R_{\alpha,\beta} - S_{\alpha,\beta}|, \tag{22}$$

where $\alpha$ and $\beta$ range over all possible categories in column $A$ and column $B$, respectively, and $R_{\alpha,\beta}$ and $S_{\alpha,\beta}$ are the joint frequencies of $\alpha$ and $\beta$ in the real data and the synthetic data, respectively.
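Equation 22 is the Total Variation Distance applied to joint rather than marginal frequencies. The sketch below assumes the two columns of each table are given as aligned lists of labels; the function name `contingency_score` is hypothetical.

```python
from collections import Counter


def contingency_score(real_a, real_b, synth_a, synth_b):
    """Half the summed absolute gap between real and synthetic joint frequencies."""
    real_joint = Counter(zip(real_a, real_b))      # counts of (alpha, beta) cells
    synth_joint = Counter(zip(synth_a, synth_b))
    cells = set(real_joint) | set(synth_joint)     # cells absent on one side count as 0
    n_r, n_s = len(real_a), len(synth_a)
    return 0.5 * sum(
        abs(real_joint[cell] / n_r - synth_joint[cell] / n_s) for cell in cells
    )


real_a, real_b = ["x", "x", "y", "y"], ["u", "v", "u", "v"]
synth_a, synth_b = ["x", "x", "y", "y"], ["u", "u", "v", "v"]
print(contingency_score(real_a, real_b, synth_a, synth_b))
```

In this toy input both tables have identical marginal frequencies for each column, yet the score is nonzero because the synthetic table ties the two columns together while the real table does not. This is the kind of pair-wise discrepancy that a column-wise metric such as TVD cannot detect.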

A.2.2 $\alpha$-Precision and $\beta$-Recall

Following Liu et al. (2023) and Alaa et al. (2022), we adopt $\alpha$-Precision and $\beta$-Recall, proposed in Alaa et al. (2022), two sample-level metrics that quantify how faithful the synthetic data is. In general, $\alpha$-Precision evaluates the fidelity of the synthetic data, i.e., whether each synthetic example comes from the real-data distribution, while $\beta$-Recall evaluates the coverage of the synthetic data, i.e., whether the synthetic data covers the entire distribution of the real data (in other words, whether each real data sample is close to the synthetic data).

A.2.3 Detection

Detection measures the difficulty of distinguishing the synthetic data from the real data when they are mixed. We use the classifier two-sample test (C2ST) implemented by SDMetrics, where a logistic regression model plays the role of the detector.

A.2.4 Machine Learning Efficiency

In Machine Learning Efficiency (MLE), each dataset is first split into real training and testing sets. The generative models are trained on the real training set. After training, a synthetic set of the same size is sampled.

The performance of synthetic data on MLE tasks is evaluated based on the divergence of test scores using the real and synthetic training data. Therefore, we first train the machine learning model on the real training set, which is split into training and validation sets with an 8:1 ratio. The classifier/regressor is trained on the training set, and the optimal hyperparameter setting is selected according to the performance on the validation set. After the optimal hyperparameter setting is obtained, the corresponding classifier/regressor is retrained on the training set and evaluated on the real testing set. The performance of synthetic data is obtained in the same way.
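The protocol above can be sketched as a small model-agnostic routine. Everything here is an illustrative assumption rather than the evaluation code used in the paper: the helper names `split_train_val` and `mle_test_score`, the shuffling seed, and the `fit(rows, hp)` and `score(model, rows)` callables are hypothetical. The sketch also assumes that "retrained on the training set" refers to the 8-part training split and that a higher score is better.

```python
import random


def split_train_val(rows, ratio=(8, 1), seed=0):
    """Shuffle rows and split them into training and validation parts (8:1 by default)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * ratio[0] // sum(ratio)
    return rows[:cut], rows[cut:]


def mle_test_score(train_rows, test_rows, candidates, fit, score):
    """Pick hyperparameters on the validation split, retrain, and score on the real test set.

    fit(rows, hp) returns a model; score(model, rows) returns a number (higher is better).
    """
    sub_train, val = split_train_val(train_rows)
    best_hp = max(candidates, key=lambda hp: score(fit(sub_train, hp), val))
    model = fit(sub_train, best_hp)   # retrain with the selected setting
    return score(model, test_rows)    # test_rows is always the real testing set
```

Calling `mle_test_score` once with the real training rows and once with the synthetic rows, against the same real test rows, gives the two test scores whose gap the MLE metric reports.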

Appendix B Method Details

B.1 Adaptively Learnable Noise Schedules

For numerical stability, we need to bound $\sigma_{\text{min}}$ and $\sigma_{\text{max}}$ within $(0,1)$. As shown in Equation 10, our formulation of the power-mean noise schedule bounds the noise level between $\sigma_{\text{min}}$ and $\sigma_{\text{max}}$. To make sure that the noise level for categorical features is also bounded, we linearly map $t$ to the interval $[\delta, 1-\delta]$, thus recasting Equation 11 into

$$\sigma_{k_j}^{\rm cat}(t) = -\log\left(1 - \left((1-\delta)\cdot t^{k_j} + \delta\right)\right). \tag{23}$$
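A minimal sketch of Equation 23 as written, to show how the learnable exponent $k_j$ shapes the schedule of one categorical column. The function name `sigma_cat` and the default value of `delta` are assumptions for illustration, and the sketch is only meant to be evaluated for $t \in [0, 1)$.

```python
import math


def sigma_cat(t, k, delta=1e-3):
    """Evaluate Eq. 23: -log(1 - ((1 - delta) * t**k + delta)) for t in [0, 1)."""
    return -math.log(1.0 - ((1.0 - delta) * t ** k + delta))


# A larger exponent k keeps the noise level lower for longer at the same t.
for k in (0.5, 1.0, 2.0):
    print(k, [round(sigma_cat(t, k), 4) for t in (0.0, 0.25, 0.5, 0.75)])
```

At $t = 0$ the value is $-\log(1-\delta)$, a small positive number, so the schedule never starts at exactly zero noise.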

B.2 Classifier-free Guidance

In this section, we elaborate on how to implement our classifier-free guided conditional generation.

Simple way to compute $\tilde{p}_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y})$. We first show that, under our simple masked diffusion framework, the guided posterior probability for categorical columns, $\tilde{p}_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y})$, can be computed by directly interpolating the model’s raw estimates of $\mathbf{x}_0$, i.e., $\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,\mathbf{y},t)$ and $\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,t)$.

If $\mathbf{x}_t$ is already unmasked (i.e., $\mathbf{x}_t \neq \mathbf{m}$), we remain at the current state as before. Otherwise, we compute the posterior according to Equation 16. Note that all operations below are performed element-wise.

$$
\begin{aligned}
\log\tilde{p}_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y}) &= (1+\omega)\log p_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y}) - \omega\log p_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t). \\
\tilde{p}_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y}) &= \frac{p_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y})^{\omega+1}}{p_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t)^{\omega}} \\
&= \frac{\left(\frac{(1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,\mathbf{y},t)}{1-\alpha_t}\right)^{\omega+1}}{\left(\frac{(1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,t)}{1-\alpha_t}\right)^{\omega}} \\
&= \frac{\left((1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,\mathbf{y},t)\right)^{\omega+1}}{\left((1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,t)\right)^{\omega}}\,\frac{1}{1-\alpha_t}
\end{aligned}
$$

Since $\bm{\mu}_{\theta}^{\rm cat}$ and $\mathbf{m}$ must have zero probability mass in every dimension where the other can have positive mass, the power of the sum decomposes into the sum of the powers:

$$
\begin{aligned}
&= \frac{\left((1-\alpha_s)\mathbf{m}\right)^{\omega+1} + \left((\alpha_s-\alpha_t)\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,\mathbf{y},t)\right)^{\omega+1}}{\left((1-\alpha_s)\mathbf{m}\right)^{\omega} + \left((\alpha_s-\alpha_t)\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,t)\right)^{\omega}}\,\frac{1}{1-\alpha_t}
\end{aligned}
$$

By the same property, we can perform the division for $\mathbf{m}$ and $\bm{\mu}_{\theta}^{\rm cat}$ separately:

$$
\begin{aligned}
&= \left((1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\frac{\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,\mathbf{y},t)^{\omega+1}}{\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,t)^{\omega}}\right)\frac{1}{1-\alpha_t} \\
&= \frac{(1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\exp\bigl((1+\omega)\log\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,\mathbf{y},t) - \omega\log\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_t,t)\bigr)}{1-\alpha_t}
\end{aligned}
$$

Therefore, we can formulate $\tilde{p}_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y})$ as

$$
\tilde{p}_{\theta}(\mathbf{x}_s^{\rm cat}|\mathbf{x}_t,\mathbf{y}) =
\begin{cases}
{\rm cat}(\mathbf{x}_s^{\rm cat}; \mathbf{x}_t^{\rm cat}) & \mathbf{x}_t^{\rm cat} \neq \mathbf{m}, \\
{\rm cat}\left(\mathbf{x}_s^{\rm cat}; \frac{(1-\alpha_s)\mathbf{m} + (\alpha_s-\alpha_t)\tilde{\bm{\mu}}_{\theta}^{\rm cat}(\mathbf{x}_t,t)}{1-\alpha_t}\right) & \mathbf{x}_t = \mathbf{m},
\end{cases}
$$

where $\tilde{\bm{\mu}}_{\theta}^{\rm cat}(\mathbf{x}_{t},t)$ can be simply computed as the interpolation of $\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_{t},\mathbf{y},t)$ and $\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_{t},t)$:

$$
\log\tilde{\bm{\mu}}_{\theta}^{\rm cat}(\mathbf{x}_{t},t) = (1+\omega)\log\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_{t},\mathbf{y},t) - \omega\log\bm{\mu}_{\theta}^{\rm cat}(\mathbf{x}_{t},t)
$$
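As a rough illustration of the log-space combination above, the following NumPy sketch mixes a conditional and an unconditional category distribution with guidance weight $\omega$. The function name `guided_categorical_probs`, the `eps` stabilizer, and the final renormalization are assumptions made for this example, not details taken from the TabDiff code.

```python
import numpy as np

def guided_categorical_probs(probs_cond, probs_uncond, omega, eps=1e-12):
    """Combine conditional and unconditional category probabilities in log space.

    Implements log mu_tilde = (1 + omega) * log mu_cond - omega * log mu_uncond,
    then renormalizes so the result is a valid distribution (an assumption here).
    """
    log_cond = np.log(np.asarray(probs_cond, dtype=float) + eps)
    log_uncond = np.log(np.asarray(probs_uncond, dtype=float) + eps)
    log_guided = (1.0 + omega) * log_cond - omega * log_uncond
    # Subtract the maximum before exponentiating for numerical stability.
    log_guided = log_guided - log_guided.max(axis=-1, keepdims=True)
    guided = np.exp(log_guided)
    return guided / guided.sum(axis=-1, keepdims=True)

# Illustrative probabilities over three categories (made-up values).
probs_cond = np.array([0.6, 0.3, 0.1])
probs_uncond = np.array([0.4, 0.4, 0.2])
print(guided_categorical_probs(probs_cond, probs_uncond, omega=0.0))
print(guided_categorical_probs(probs_cond, probs_uncond, omega=1.0))
```

With $\omega = 0$ the guided distribution reduces to the conditional one; larger $\omega$ pushes mass toward categories that the conditional model favors more strongly than the unconditional model.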

Appendix C Detailed Illustrations of Training and Sampling Processes

Training details. Algorithm 1 outlines the training procedure for our hybrid diffusion model that jointly processes numerical and categorical variables. At each training iteration, we begin by sampling an initial data point $\mathbf{x}_{0}$ from the training data distribution $p(\mathbf{x}_{0})$ and a timestep $t$ uniformly from $[0,1]$. Then, we perform the forward diffusion step. Each numerical feature is perturbed with Gaussian noise whose intensity is determined by $\bm{\sigma}^{\text{num}}(t)$; each categorical feature is flipped into the mask token $\mathbf{m}$ with probability $1-\bm{\alpha}_{t}$ (i.e., eq. 6). The noised numerical and categorical components are concatenated to form a noisy version $\mathbf{x}_{t}$ of the table row. Lastly, we compute the training objective $\mathcal{L}_{\textsc{TabDiff}}$ and perform gradient descent on the model parameters $\theta$, $\rho$, and $k$.
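The forward perturbation in this training step can be sketched as follows. This is a minimal illustration under stated assumptions: `sigma_num` and `alpha` are placeholder schedules (simple linear forms, not the learnable schedules used by TabDiff), `MASK` is a hypothetical integer id for the mask token, and $\alpha_t$ is read as the probability that a categorical entry is still unmasked at time $t$, which matches the unmasking probability $\frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t}$ used during sampling.

```python
import numpy as np

MASK = -1  # hypothetical integer id standing in for the mask token m

def sigma_num(t, sigma_min=0.002, sigma_max=80.0):
    # Placeholder noise schedule (assumption): linear in t between sigma_min and sigma_max.
    return sigma_min + (sigma_max - sigma_min) * t

def alpha(t):
    # Placeholder masking schedule (assumption): probability of remaining unmasked at time t.
    return 1.0 - t

def forward_perturb(x_num, x_cat, t, rng):
    """Noise numerical features and mask categorical features at time t."""
    x_num_t = x_num + sigma_num(t) * rng.standard_normal(x_num.shape)
    # Each categorical entry is masked independently with probability 1 - alpha(t).
    masked = rng.random(x_cat.shape) < (1.0 - alpha(t))
    x_cat_t = np.where(masked, MASK, x_cat)
    return x_num_t, x_cat_t

rng = np.random.default_rng(0)
x_num = np.array([[0.3, -1.2], [0.8, 0.1]])   # two numerical columns
x_cat = np.array([[2, 0], [1, 1]])            # two categorical columns (integer codes)
t = rng.uniform(0.0, 1.0)                     # timestep sampled uniformly from [0, 1]
x_num_t, x_cat_t = forward_perturb(x_num, x_cat, t, rng)
print(t, x_num_t, x_cat_t, sep="\n")
```

The noisy row $\mathbf{x}_t$ passed to the denoising network would then be formed from `x_num_t` and `x_cat_t`.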

Sampling details. Here, we present a vivid visual example in Figure 5 to illustrate the backward sampling process described in Algorithm 2. Our example demonstrates generating a table with two numerical columns (movie duration and IMDB rating) and two categorical columns (genre and awards status). Each row represents an independent sample.

First, at $t=1.0$ (the first frame), the numerical features are initialized with Gaussian noise, and all categorical components are masked. The algorithm then iterates backward through time steps from $t=1.0$ to $0.0$.

At each timestep $t$, we first perform the forward stochastic perturbation step (the yellow section of Algorithm 2), the core of our stochastic sampler. All features are first perturbed forward to $t^{+}$, a slightly noisier state, following the same process as the forward step during training. While this step is not explicitly depicted in our visualization in Figure 5, it implies that, for instance, during the transition at the third frame, the unmasked “None” entry could be stochastically flipped back to the [MASK] state. This would then allow the model to re-predict the value, potentially yielding a different result (e.g., “Won”) than “None” in the subsequent frame.

After the stochastic perturbation, we perform the denoising/unmasking step (the blue section of Algorithm 2). For numerical features, we denoise to $\mathbf{x}_{t-1}^{\text{num}}$ by solving an ODE. The update delta is determined by the normalized difference, $d\mathbf{x}^{\text{num}}$, between the current state and the model’s prediction, scaled by the decrease in noise levels. For categorical variables, we perform the unmasking step. Intuitively, if the column is already unmasked, we stay in the current state. This is demonstrated in Figure 5, where the “None” entry persists once it has been flipped. If it is still masked, we flip the mask token to some valid value of that column with a certain probability ($\frac{\alpha_{t-1}-\alpha_{t}}{1-\alpha_{t}}$) that increases as sampling proceeds. We choose which unmasked token to move to based on the model’s predicted probability $\mu_{\theta}^{\rm cat}(x_{t},t)$ over all possible categories of the column.
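The categorical unmasking rule described above can be sketched for a single column as follows. This is an illustrative NumPy version, not the authors' implementation: `MASK` is a hypothetical integer id for the mask token, `probs` stands in for the model's predicted category probabilities, and `alpha_t`, `alpha_s` denote $\alpha_t$ and $\alpha_{t-1}$.

```python
import numpy as np

MASK = -1  # hypothetical integer id standing in for the mask token

def unmask_step(x_cat_t, probs, alpha_t, alpha_s, rng):
    """One reverse step for a categorical column.

    x_cat_t: (n,) integer codes, with MASK marking masked entries.
    probs:   (n, K) predicted probabilities over the K categories.
    Requires alpha_t < 1 so that the unmasking probability is well defined.
    """
    p_unmask = (alpha_s - alpha_t) / (1.0 - alpha_t)
    is_masked = x_cat_t == MASK
    # Only entries that are still masked may flip; unmasked entries stay as they are.
    flip = is_masked & (rng.random(x_cat_t.shape) < p_unmask)
    # Sample a category per row by inverse-CDF sampling from probs.
    cdf = np.cumsum(probs, axis=-1)
    u = rng.random((x_cat_t.shape[0], 1))
    sampled = np.minimum((u >= cdf).sum(axis=-1), probs.shape[-1] - 1)
    return np.where(flip, sampled, x_cat_t)

rng = np.random.default_rng(0)
x_cat_t = np.array([MASK, 2, MASK])
probs = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
print(unmask_step(x_cat_t, probs, alpha_t=0.2, alpha_s=0.6, rng=rng))
```

Because already-unmasked entries are never resampled here, a value such as “None” can only change if the preceding stochastic perturbation step has flipped it back to the mask state.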

Figure 5: A vivid visualization of TabDiff’s generation process.

Appendix D Implementation Details

We perform our experiment on an Nvidia RTX A4000 GPU with 16G memory and implement TabDiff with PyTorch.

Data preprocessing. The raw tabular datasets usually contain missing entries. Thus, in the first phase of preprocessing, we fill in these missing values in the same way as existing works (Kotelnikov et al., 2023; Zhang et al., 2024): numerical missing values are replaced by the column average, and categorical missing values are treated as a new category. Moreover, the diverse range of numerical features typically leads to more difficult and unstable training. To counter this, we then transform the numerical values with the QuantileTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), and recover the original values using its inverse during sampling.
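A minimal sketch of this preprocessing pipeline is given below, using NumPy and scikit-learn. The helper names, the `"__missing__"` token, the toy column values, and the choice of `output_distribution="normal"` are assumptions for illustration; the text does not specify which QuantileTransformer settings were used.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def impute_numerical(col):
    # Replace missing numerical values (NaN) with the column average.
    col = np.asarray(col, dtype=float)
    return np.where(np.isnan(col), np.nanmean(col), col)

def impute_categorical(col, missing_token="__missing__"):
    # Treat missing categorical values (None) as a new category.
    return [missing_token if v is None else v for v in col]

# Toy columns (made-up values).
age = impute_numerical([23.0, np.nan, 41.0, 35.0, 58.0])
genre = impute_categorical(["drama", None, "comedy", "drama", None])

# Map the numerical column through its empirical quantiles.
qt = QuantileTransformer(
    n_quantiles=len(age),            # cannot exceed the number of samples
    output_distribution="normal",    # assumed setting, not stated in the text
    random_state=0,
)
age_t = qt.fit_transform(age.reshape(-1, 1))

# During sampling, generated values would be mapped back with the inverse.
age_back = qt.inverse_transform(age_t).ravel()
print(age, genre, age_t.ravel(), age_back, sep="\n")
```

The inverse transform is what lets generated values be reported in the original units of each column.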

Data splits. For datasets other than Diabetes, we follow the exact same split as Zhang et al. (2024). Each dataset is split into the “real” and “test” sets. For the unconditional generation task (on which data fidelity is evaluated) and the imputation task, the models are trained on the “real” set and evaluated on the “real” set. For the MLE task, the “real” set is further split into a training and validation set, and the “test” set is used for testing. Finally, for the data privacy measure DCR, the original dataset is equally split into two halves, with one being treated as the training set and the other as the holdout set.

For Diabetes, we split it into train, validation, and test sets with a ratio of 0.6/0.2/0.2 for the MLE task. The training and test sets are regarded as the “real” and “test” sets for the unconditional generation and imputation tasks. For DCR, we apply an equal split.

Architecture of the denoising network. In our implementation, we project each column individually to a $d$-dimensional vector using a linear layer, ensuring that all columns are treated with the same importance. We set the embedding size $d$ to 4, matching the size used in Zhang et al. (2024). We then process these projected vectors with a two-layer transformer, appending positional encodings at the end. The transformed vectors are then concatenated and passed through a five-layer MLP conditioned on the time embedding. Finally, the output is obtained by sequentially applying another transformer followed by a projection layer that recovers the original dimensions. Our denoising network has a number of parameters comparable to those used in TabDDPM (Kotelnikov et al., 2023) and TabSyn (Zhang et al., 2024), as our shared MLP model accounts for most of the parameters.

Hyperparameter settings. TabDiff employs the same hyperparameter setting for all datasets. We train our models for 8,000 epochs with the Adam optimizer. The training and sampling batch sizes are set to 4,096 and 10,000, respectively. Regarding the hyperparameters in TabDiff, the values $\sigma_{\text{min}}$ and $\sigma_{\text{max}}$ are set to $0.002$ and $80.0$, referencing the optimal setting in Karras et al. (2022), and $\delta$ is set to $1\mathrm{e}{-3}$. For the loss weightings, we fix $\lambda_{\rm cat}$ to $1.0$ and linearly decay $\lambda_{\rm num}$ from $1.0$ to $0.0$ as training proceeds.
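The loss-weight schedule described here can be sketched as below. The function names, the choice of decaying per training step (rather than per epoch), and the weighted-sum form of the total loss are assumptions for illustration only.

```python
def loss_weights(step, total_steps, lambda_cat=1.0,
                 lambda_num_start=1.0, lambda_num_end=0.0):
    """Return (lambda_num, lambda_cat) at a given point in training.

    lambda_cat stays fixed; lambda_num decays linearly from start to end.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    lambda_num = lambda_num_start + (lambda_num_end - lambda_num_start) * frac
    return lambda_num, lambda_cat

def weighted_loss(loss_num, loss_cat, step, total_steps):
    # Assumed combination: a weighted sum of the numerical and categorical losses.
    lambda_num, lambda_cat = loss_weights(step, total_steps)
    return lambda_num * loss_num + lambda_cat * loss_cat

for step in (0, 4000, 8000):
    print(step, loss_weights(step, total_steps=8000))
```

Under this schedule the numerical term dominates less and less as training proceeds, while the categorical term keeps a constant weight.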

During inference, we select the checkpoint with the lowest training loss. We observe that our model achieves the superior performance reported in the paper with as few as 50 discretization steps ($T=50$).

Details on OOMs in experiment result tables:

  1. GOGGLE sets a fixed random seed during sampling in its official code, and we follow it for consistency.

  2. GReaT cannot be applied to News because of its maximum length limit.

  3. STaSy runs out of memory on Diabetes, which has high-cardinality categorical columns.

  4. TabDDPM cannot produce meaningful content on the News and Diabetes datasets.

Imputation. As mentioned in Section 4.3, we obtain the unconditional model of the target column by training TabDiff on it with a smaller denoising network. For this network, we keep the same architecture but reduce the number of MLP layers to one.

Appendix E Detailed Experiments Results

In the following sections, we discuss the results on $\alpha$-precision, $\beta$-recall, detection score (C2ST), and DCR in detail.

E.1 Additional Fidelity Metrics

$\bm{\alpha}$-precision. We first evaluate TabDiff on the $\alpha$-Precision score, a metric that measures the quality of synthetic data. Higher scores indicate the synthetic data is more faithful to the real data. We present the results across all seven datasets in Table 7. TabDiff achieves the best or second-best performance on all datasets. Specifically, TabDiff ranks first with an average $\alpha$-Precision score of 98.22, surpassing all other baseline methods.

$\bm{\beta}$-recall. Next, we compare TabDiff to the baselines on the $\beta$-Recall scores, which evaluate the extent to which synthetic data covers the real data distribution. The results are presented in Table 8, with a higher score reflecting more comprehensive coverage of the real data’s feature space. TabDiff consistently outperforms or matches the top-performing baselines, achieving the highest average $\beta$-Recall score of 49.40. This indicates that the generated data spans a broad range of the real distribution. Though some baseline methods attain higher scores on specific datasets, they fail to demonstrate competitive performance on $\alpha$-Precision, as models have to trade off fine-grained details in order to capture a broader range of features.

Overall, TabDiff maintains a balance between broad data coverage and preserving fine-grained details. This balance highlights TabDiff’s capability in generating synthetic data that faithfully captures both the breadth and depth of the original data distribution.

Detection Score (C2ST). Lastly, we assess the fidelity of synthetic data using the C2ST test, which evaluates how difficult it is to distinguish the synthetic data from the real data. The results are shown in Table 9, where a higher score indicates better fidelity. TabDiff achieves the best performance on five of seven datasets, outperforming the most competitive baseline model by $6.89\%$ on average. Notably, TabDiff excels on Diabetes, which contains numerous high-cardinality categorical features (as indicated by # Max Cat in Table 6), showcasing its ability to generate high-quality categorical data. These results, therefore, demonstrate TabDiff’s capacity to generate synthetic data that closely resembles the real data.

Table 7: Comparison of $\alpha$-Precision scores. Bold Face highlights the best score for each dataset. Higher scores reflect better performance.
| Methods | Adult | Default | Shoppers | Magic | Beijing | News | Diabetes | Average | Ranking |
|---|---|---|---|---|---|---|---|---|---|
| CTGAN | 77.74±0.15 | 62.08±0.08 | 76.97±0.39 | 86.90±0.22 | 96.27±0.14 | 96.96±0.17 | 79.89±0.10 | 82.40 | 5 |
| TVAE | 98.17±0.17 | 85.57±0.34 | 58.19±0.26 | 86.19±0.48 | 97.20±0.10 | 86.41±0.17 | 19.24±0.15 | 75.85 | 7 |
| GOGGLE | 50.68 | 68.89 | 86.95 | 90.88 | 88.81 | 86.41 | 23.09 | 70.81 | 9 |
| GReaT | 55.79±0.03 | 85.90±0.17 | 78.88±0.13 | 85.46±0.54 | **98.32±0.22** | OOM | OOM | 80.87 | 6 |
| STaSy | 82.87±0.26 | 90.48±0.11 | 89.65±0.25 | 86.56±0.19 | 89.16±0.12 | 94.76±0.33 | OOM | 88.91 | 3 |
| CoDi | 77.58±0.45 | 82.38±0.15 | 94.95±0.35 | 85.01±0.36 | 98.13±0.38 | 87.15±0.12 | 64.80±0.53 | 84.29 | 4 |
| TabDDPM | 96.36±0.20 | 97.59±0.36 | 88.55±0.68 | 98.59±0.17 | 97.93±0.30 | 0.00±0.00 | 28.35±0.11 | 72.48 | 8 |
| TabSyn | **99.39±0.18** | **98.65±0.23** | 98.36±0.52 | 99.42±0.28 | 97.51±0.24 | 95.05±0.30 | **96.61±0.24** | 97.86 | 2 |
| TabDiff | 99.02±0.20 | 98.49±0.28 | **99.11±0.34** | **99.47±0.21** | 98.06±0.24 | **97.36±0.17** | 95.69±0.19 | **98.22** | 1 |
Table 8: Comparison of $\beta$-Recall scores. Bold Face highlights the best score for each dataset. Higher scores reflect better results.
| Methods | Adult | Default | Shoppers | Magic | Beijing | News | Diabetes | Average | Ranking |
|---|---|---|---|---|---|---|---|---|---|
| CTGAN | 30.80±0.20 | 18.22±0.17 | 31.80±0.350 | 11.75±0.20 | 34.80±0.10 | 24.97±0.29 | 9.42±0.26 | 23.11 | 8 |
| TVAE | 38.87±0.31 | 23.13±0.11 | 19.78±0.10 | 32.44±0.35 | 28.45±0.08 | 29.66±0.21 | 4.92±0.13 | 25.32 | 7 |
| GOGGLE | 8.80 | 14.38 | 9.79 | 9.88 | 19.87 | 2.03 | 3.74 | 9.78 | 9 |
| GReaT | 49.12±0.18 | 42.04±0.19 | 44.90±0.17 | 34.91±0.28 | 43.34±0.31 | OOM | OOM | 43.34 | 3 |
| STaSy | 29.21±0.34 | 39.31±0.39 | 37.24±0.45 | 53.97±0.57 | 54.79±0.18 | 39.42±0.32 | OOM | 42.32 | 4 |
| CoDi | 9.20±0.15 | 19.94±0.22 | 20.82±0.23 | 50.56±0.31 | 52.19±0.12 | 34.40±0.31 | 2.70±0.06 | 27.12 | 6 |
| TabDDPM | 47.05±0.25 | 47.83±0.35 | 47.79±0.25 | **48.46±0.42** | 56.92±0.13 | 0.00±0.00 | 0.03±0.01 | 35.44 | 5 |
| TabSyn | 47.92±0.23 | 46.45±0.35 | 49.10±0.60 | 48.03±0.50 | 59.15±0.22 | **43.01±0.28** | 33.72±0.16 | 46.77 | 2 |
| TabDiff | **51.64±0.20** | **51.09±0.25** | **49.75±0.64** | 48.01±0.31 | **59.63±0.23** | 42.10±0.32 | **41.74±0.17** | **49.40** | 1 |
Table 9: Detection score (C2ST) using logistic regression classifier. Higher scores reflect superior performance.
| Method | Adult | Default | Shoppers | Magic | Beijing | News | Diabetes | Average |
|---|---|---|---|---|---|---|---|---|
| CTGAN | 0.5949 | 0.4875 | 0.7488 | 0.6728 | 0.7531 | 0.6947 | 0.5593 | 0.6444 |
| TVAE | 0.6315 | 0.6547 | 0.2962 | 0.7706 | 0.8659 | 0.4076 | 0.0487 | 0.5250 |
| GOGGLE | 0.1114 | 0.5163 | 0.1418 | 0.9526 | 0.4779 | 0.0745 | 0.0912 | 0.3380 |
| GReaT | 0.5376 | 0.4710 | 0.4285 | 0.4326 | 0.6893 | OOM | OOM | 0.5118 |
| STaSy | 0.4054 | 0.6814 | 0.5482 | 0.6939 | 0.7922 | 0.5287 | OOM | 0.6083 |
| CoDi | 0.2077 | 0.4595 | 0.2784 | 0.7206 | 0.7177 | 0.0201 | 0.0008 | 0.3435 |
| TabDDPM | 0.9755 | 0.9712 | 0.8349 | **0.9998** | 0.9513 | 0.0002 | 0.1980 | 0.7044 |
| TabSyn | 0.9910 | **0.9826** | 0.9662 | 0.9960 | 0.9528 | 0.9255 | 0.5953 | 0.9156 |
| TabDiff | **0.9950** | 0.9774 | **0.9843** | 0.9989 | **0.9781** | **0.9308** | **0.9865** | **0.9787** |
| Improv. | 0.40%↓ | 0.0%↓ | 1.87%↓ | 0.0%↓ | 2.66%↓ | 0.57%↓ | 65.71%↓ | 6.89%↓ |
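The “Improv.” row appears to be the relative gain of TabDiff over the strongest baseline on each dataset. The short check below, a sketch under that assumption, reproduces the reported percentages from the TabDiff and TabSyn rows of Table 9 for the datasets where TabDiff leads (TabSyn is the strongest baseline on each of them); the 0.0% entries for Default and Magic seem to correspond to datasets where a baseline leads. The helper name `relative_improvement` is introduced here for illustration.

```python
def relative_improvement(ours, best_baseline):
    # Relative gain in percent over the strongest baseline score.
    return 100.0 * (ours - best_baseline) / best_baseline

# Scores copied from Table 9.
tabdiff = {"Adult": 0.9950, "Shoppers": 0.9843, "Beijing": 0.9781,
           "News": 0.9308, "Diabetes": 0.9865, "Average": 0.9787}
tabsyn = {"Adult": 0.9910, "Shoppers": 0.9662, "Beijing": 0.9528,
          "News": 0.9255, "Diabetes": 0.5953, "Average": 0.9156}

improvements = {name: relative_improvement(tabdiff[name], tabsyn[name])
                for name in tabdiff}
for name, value in improvements.items():
    print(f"{name}: {value:.2f}%")
```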

E.2 Data Privacy.

Table 10 shows the DCR scores across all datasets. The DCR score represents the probability that a generated data sample is more similar to the training set than to the test set, with a score closer to 50% being ideal, as it indicates a balance between the similarity to training and test distributions. Across the datasets, TabDiff consistently achieves DCR scores near 50%, highlighting its ability to generalize well while maintaining fidelity to the original data distribution.
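As a rough sketch of how such a score could be computed, the snippet below finds, for each synthetic row, its closest record in the training half and in the holdout half, and reports the fraction of rows that are closer to the training half. The Euclidean metric, the strict tie handling, and the function names are assumptions for illustration; the actual evaluation code may differ.

```python
import numpy as np

def min_distances(queries, reference):
    # Distance from each query row to its closest reference row (Euclidean, assumed).
    diff = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def dcr_score(synthetic, train, holdout):
    """Fraction of synthetic rows whose closest record lies in the training set."""
    d_train = min_distances(synthetic, train)
    d_holdout = min_distances(synthetic, holdout)
    return float(np.mean(d_train < d_holdout))

# Toy example (made-up points): one synthetic row near each half.
train = np.array([[0.0, 0.0]])
holdout = np.array([[10.0, 10.0]])
synthetic = np.array([[1.0, 1.0], [9.0, 9.0]])
print(dcr_score(synthetic, train, holdout))
```

A generator that memorizes training rows would push this fraction toward 1, whereas a score near 0.5 suggests the samples are no closer to the training half than to unseen data.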

Table 10: DCR score, which represents the probability that a generated data sample is more similar to the training set than to the test set. A score closer to $50\%$ is preferable. Bold Face highlights the best score for each dataset.
| Method | Adult | Default | Shoppers | Beijing | News | Diabetes |
|---|---|---|---|---|---|---|
| STaSy | 50.33%±0.19 | 50.23%±0.09 | 51.53%±0.16 | 50.59%±0.29 | 50.59%±0.14 | OOM |
| CoDi | 49.92%±0.18 | 51.82%±0.26 | 51.06%±0.18 | 50.87%±0.11 | 50.79%±0.23 | 51.12%±0.19 |
| TabDDPM | 51.14%±0.18 | 52.15%±0.20 | 63.23%±0.25 | 80.11%±2.68 | 79.31%±0.29 | 37.76%±0.23 |
| TabSyn | 50.94%±0.17 | 51.20%±0.18 | 52.90%±0.22 | 50.37%±0.13 | 50.85%±0.33 | 50.62%±0.28 |
| TabDiff | 50.10%±0.32 | 51.11%±0.36 | 50.24%±0.62 | 50.50%±0.36 | 51.04%±0.32 | 50.43%±0.18 |

Appendix F Study of Model Efficiency and Robustness

In this section, we present a thorough analysis of TabDiff’s efficiency and robustness. For efficiency, we compare the training and sampling speed of TabDiff against other competitive baseline methods (TabDDPM, TabSyn) using four different metrics. For robustness, we first explore the tradeoff between discretization steps (i.e., efficiency) and sample quality in diffusion-based models. We then examine the detailed Shape and Trend error rates to see whether models are biased towards learning particular columns of the datasets. Lastly, we discuss the robustness issues we found with the competitive baselines. We use the representative Adult dataset, which contains a balanced number of numerical and categorical columns, throughout the experiment. Our results show that, among the competitive methods, TabDiff is not only the most effective but also the most robust and highly efficient. Below, we present a detailed analysis.

F.1 Efficiency

Training Time. We first measure the training time of each method. The epoch lengths are set to the default configuration of each method, and the validation frequencies are set to the same value of once every 200 epochs. Our measurements are presented in the first column of Table 11, with the entry for TabSyn split into VAE training time + diffusion training time.

We see that all three methods take a comparable time to complete one training run, with TabSyn being $10\%$ faster than TabDiff and TabDiff being $20\%$ faster than TabDDPM. The current gap between TabDiff and TabSyn is likely due to TabDiff’s slightly deeper network architecture compared to TabSyn. We believe that the architecture of TabDiff can be further optimized to improve efficiency, and we leave this as future work.

Nevertheless, it is also important to consider model robustness when assessing efficiency. As highlighted in Section F.2, the training process for TabSyn is notably unstable due to the difficulty in training VAEs, often requiring multiple retraining runs to produce a model capable of generating samples comparable in quality to TabDiff. On the other hand, TabDDPM fails to scale to more complicated datasets, as shown by its poor performance on News and Diabetes. Thus, when taking into account training robustness, TabDiff is the most robust and efficient model among all competitive methods.

Training Convergence. Next, we assess training convergence by evaluating the quality of samples generated from intermediate checkpoints during the training process. Figure 6 plots our results. The curves show that TabDiff converges faster than the other methods, as TabDiff produces higher-quality samples after the same number of passes over the dataset (i.e., at the same epoch).

Number of Function Evaluations (NFE). For sampling, we first theoretically analyze the number of network calls involved in a single diffusion step (i.e., the denoising step from $x_{t}$ to $x_{t-1}$) for each method. The numbers are shown in the third column of Table 11. TabDiff and TabSyn involve an extra network call because they employ the second-order correction trick introduced in Karras et al. (2022).

Sampling Time. We empirically measure the time to generate the same number of samples as the test set (32,561 samples for Adult). The numbers are presented in the second column of Table 11. According to them, TabDiff and TabSyn sample $\sim 10\times$ faster than TabDDPM. This result is expected, as both TabDiff and TabSyn, by default, use 20 times fewer sampling steps than TabDDPM, while making twice as many network calls per step. TabDiff’s slightly longer sampling time is also attributed to the evaluation of its deeper network. We believe this can be optimized in future work.
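The expected speedup follows from a simple count of network calls. As a worked sketch, let $N$ denote the number of sampling steps used by TabDDPM (a symbol introduced here for illustration), with one network call per step, while TabDiff and TabSyn use $N/20$ steps with two calls each:

```latex
\[
\frac{\text{network calls (TabDDPM)}}{\text{network calls (TabDiff, TabSyn)}}
  = \frac{N \times 1}{\tfrac{N}{20} \times 2} = 10 .
\]
% Measured ratios from Table 11 (sampling time in seconds):
\[
\frac{125.1}{15.2} \approx 8.2 \quad (\text{TabDiff}), \qquad
\frac{125.1}{8.8} \approx 14.2 \quad (\text{TabSyn}) .
\]
```

The measured ratios bracket the theoretical factor of 10, which is consistent with per-call cost differing across the three networks.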

| Method | Train.T (min) | Sample.T (sec) | NFE |
|---|---|---|---|
| TabDDPM | 112 | 125.1±0.01 | 1 |
| TabSyn | 45 + 40 = 85 | 8.8±0.005 | 2 |
| TabDiff | 94 | 15.2±0.007 | 2 |
Table 11: Model Efficiency.
Figure 6: Model Convergence speed.

F.2 Robustness

Controlling the quality-efficiency trade-off through discretization steps. One advantage of continuous-time diffusion models (which currently include TabDiff and TabSyn) is the ability to sample with an arbitrary number of discretization steps, allowing them to flexibly trade off sample quality against sampling efficiency. We conduct an experiment that compares how TabDiff and TabSyn perform when sampled with different numbers of discretization steps. The results and their visualizations are presented in Table 12 and Figure 7.

Our results show that TabDiff consistently achieves higher sample quality across all levels of discretization steps. Notably, when the number of steps is reduced to just 5 (requiring only one second for sampling), TabSyn fails to generate meaningful content, as indicated by an error rate approaching $50\%$. In contrast, TabDiff continues to produce valid and coherent content even under these highly constrained conditions.
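The general mechanism behind this trade-off can be sketched on a toy problem: a continuous-time process can be integrated on any time grid, and a coarser grid costs fewer evaluations but incurs a larger discretization error. The code below is only an illustration under stated assumptions. The uniform grid, the Euler integrator, and the drift `toy_drift` are hypothetical stand-ins, not the schedule or network used by TabDiff or TabSyn, and the numbers it prints are unrelated to those in Table 12.

```python
import math


def time_grid(num_steps, t_start=1.0, t_end=0.0):
    """Uniform grid of num_steps + 1 time points from t_start down to t_end."""
    return [t_start + (t_end - t_start) * i / num_steps for i in range(num_steps + 1)]


def toy_drift(x, t):
    """Hypothetical stand-in for a learned denoising direction: dx/dt = x."""
    return x


def integrate_euler(drift, x, num_steps):
    """Integrate from t = 1 to t = 0 with one drift evaluation per step."""
    ts = time_grid(num_steps)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * drift(x, t)
    return x


# Exact solution of dx/dt = x integrated backward from t = 1 to t = 0.
x_start = 1.0
exact = x_start * math.exp(-1.0)

step_counts = [5, 10, 25, 50, 100]
errors = [abs(integrate_euler(toy_drift, x_start, n) - exact) for n in step_counts]

for n, err in zip(step_counts, errors):
    print(f"steps={n:3d}  discretization error={err:.5f}")
```

In this toy setting the error keeps shrinking as steps are added, whereas the quality of a learned model typically saturates once the discretization error falls below the model's own approximation error, which matches the plateau from 25 steps onward in Table 12.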

| Steps | TabSyn Shape | TabSyn Trend | TabDiff Shape | TabDiff Trend |
|---|---|---|---|---|
| 5 | 34.09 | 49.30 | 12.51 | 22.15 |
| 10 | 1.99 | 3.92 | 1.55 | 3.36 |
| 25 | 0.84 | 1.96 | 0.62 | 1.50 |
| 50 | 0.81 | 1.95 | 0.63 | 1.49 |
| 100 | 0.82 | 1.94 | 0.64 | 1.53 |
Table 12: Ablate sample steps.
Figure 7: Visualize the sample step ablation

Evaluating column-wise learning bias. To clarify what we mean by “optimal allocation of model capacity across heterogeneous features,” we analyze how generation quality varies across features. The Shape and Trend errors presented in Table 5 are averaged over all columns and column pairs. Here, using the representative Adult dataset, we visualize the errors for each column and column pair in Figures 8 and 9, and report the normalized standard deviations in Table 13 to quantify the variation in errors. All results show that TabDiff with learnable schedules not only achieves lower average error but also exhibits more consistent errors across columns. This consistency indicates that learnable schedules help the model balance its capacity across different dimensions of the data, improving its ability to handle heterogeneous distributions.
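As a rough illustration of how such a consistency measure can be computed, the sketch below assumes that "normalized standard deviation" means the standard deviation of the per-column errors divided by their mean (the coefficient of variation). That definition, the use of the population standard deviation, the helper name `normalized_std`, and the per-column error values are all assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def normalized_std(errors):
    """Standard deviation of per-column errors divided by their mean.

    Assumed definition: population std / mean, i.e. the coefficient of variation.
    """
    mu = mean(errors)
    if mu == 0:
        raise ValueError("mean error is zero; normalized std is undefined")
    return pstdev(errors) / mu


# Hypothetical per-column Shape errors for two models (not real results).
balanced_errors = [0.60, 0.70, 0.60, 0.65]     # similar error on every column
unbalanced_errors = [0.10, 0.20, 0.30, 2.00]   # one column dominates the error

print("balanced:  ", round(normalized_std(balanced_errors), 3))
print("unbalanced:", round(normalized_std(unbalanced_errors), 3))
```

Dividing by the mean makes the measure comparable between models whose average errors differ, so a lower value reflects more evenly distributed errors rather than simply smaller ones.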

| Method | Shape | Trend | Shape Std. | Trend Std. |
|---|---|---|---|---|
| TabSyn | 0.81 | 1.93 | 1.01 | 0.88 |
| TabDiff-Fixed | 0.74 | 1.73 | 0.42 | 0.75 |
| TabDiff-Learn. | **0.63** | **1.49** | **0.29** | **0.64** |
Table 13: Evaluating column-wise learning bias.
Figure 8: TabDiff with learnable schedules has a more balanced Shape performance.
Figure 9: TabDiff with learnable schedules has a more balanced Trend performance

Robustness issues of baselines. We identify several robustness issues with the competitive baseline methods. Specifically, TabDDPM struggles to scale to larger datasets, and TabSyn’s performance depends heavily on the training quality of its VAE, which can vary significantly across runs.

TabDDPM: As shown by the results in Tables 1 and 2, TabDDPM performs poorly on larger datasets such as News and Diabetes because it fails to generate meaningful samples: as Figure 11 shows, the numerical columns of TabDDPM’s Diabetes samples all collapse to the minimal and maximal values of their domains. After examining the training logs, we found that this poor generation performance may be due to an explosion of the training loss, as shown in Figure 10.

TabSyn: TabSyn is another competitive tabular generation model whose performance, to the best of our knowledge, is the closest to that of TabDiff. However, this method has a limitation: as mentioned in Zhang et al. (2024), the quality of the VAE has a significant impact on TabSyn’s performance, as it is the only component that recovers the original data shape. When reproducing TabSyn’s results, we observed that sample quality varies significantly across training runs. For poorly performing runs, we attempted to retrain the diffusion model and even increased the number of training epochs, but these efforts did not improve the results. This confirms that the issue lies with the VAE.

To further illustrate this, we present Shape and Trend results averaged across 10 random training runs in Table 14. (Note: in the paper, we follow the convention of previous works and report results based on 20 different runs of the same checkpoint, and for TabSyn we report the result of the best reproduction run.) These additional results demonstrate that TabDiff achieves significantly more consistent performance across training runs, as shown by its smaller standard deviation in every setting and its lower average error in all but one (Shape on Default).

| Method | Adult | Default | Beijing |
|---|---|---|---|
| TabSyn (Shape) | 1.31 **± 0.64** | **1.17** **± 0.21** | 2.69 **± 1.44** |
| TabSyn (Trend) | 2.73 **± 0.98** | 5.05 **± 2.22** | 5.05 **± 1.88** |
| TabDiff (Shape) | **0.65** ± 0.08 | 1.19 ± 0.12 | **1.07** ± 0.6 |
| TabDiff (Trend) | **1.47** ± 0.18 | **2.46** ± 0.62 | **2.61** ± 0.20 |
Table 14: Models’ consistency across different training runs.
Figure 10: TabDDPM failed to converge on Diabetes
Figure 11: TabDDPM failed to produce meaningful samples on Diabetes