Excited to share our new preprint: “Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality”
Training AI models now often requires enormous datasets, computation, and energy, with serious financial and environmental costs. A key challenge is reducing data without losing the essential information it carries.
With my supervisors Nick Whiteley, Robert Allison, and Tom Lovett, we introduce Bilateral Distribution Compression (BDC), a framework that compresses datasets in both sample size and dimensionality (i.e. bilaterally) while preserving their underlying distribution.
- Traditional distribution compression (e.g. Kernel Herding) shrinks datasets by sample size only.
- Dimensionality reduction (e.g. PCA, autoencoders) shrinks features but leaves sample counts unchanged.
- Our method does both, with linear scaling in dataset size and dimension.
A key idea is to break a hard problem into two easier steps, using three versions of the Maximum Mean Discrepancy (MMD):
- Reconstruction MMD (RMMD): checks if an autoencoder’s reconstructions preserve the distribution of the data (not just pointwise distances).
- Encoded MMD (EMMD): checks if the compressed latent set represents the encoded data well in latent space.
- Decoded MMD (DMMD): the final goal, checks if the decoded compressed set still looks like the original dataset.
The intuition: if the autoencoder preserves the data distribution (low RMMD) and the latent compressed set matches the encoded data (low EMMD), then the decoded compressed set must also match the original distribution (low DMMD).
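To make the three objectives concrete, here is a minimal sketch (not from the paper) of the standard biased empirical MMD² estimator with an RBF kernel; RMMD, EMMD and DMMD are all this quantity applied to different pairs of point sets. The kernel choice and bandwidth are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Pairwise squared Euclidean distances, then a Gaussian kernel.
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    # Biased estimate of MMD^2 between the empirical distributions of X and Y.
    k_xx = rbf_kernel(X, X, bandwidth).mean()
    k_yy = rbf_kernel(Y, Y, bandwidth).mean()
    k_xy = rbf_kernel(X, Y, bandwidth).mean()
    return k_xx + k_yy - 2 * k_xy

# For example, DMMD would compare the original data X with the decoded
# compressed set: mmd2(X, decoder(Z_compressed)) -- decoder and Z_compressed
# are hypothetical names standing in for the trained decoder and latent coreset.
```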
In practice, BDC achieves substantial reductions, exceeding 99.99% compression on some tasks, while matching or outperforming existing compression approaches across a range of regression, classification, and clustering problems, all at a fraction of the cost.
Preprint: https://lnkd.in/e4EEdKqj
A big thank you to my supervisors for helping bring this idea to life in such a short time frame!