Multi-Source Music Generation with Latent Diffusion

Xu, Zhongweiyang; Dutta, Debottam; Wei, Yu-Lin; Choudhury, Romit Roy

arXiv:2409.06190 (eess)

[Submitted on 10 Sep 2024 (v1), last revised 15 Oct 2024 (this version, v3)]

Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g. piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually coherent music sources that are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a Multi-Source Latent Diffusion Model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent." The source latents are concatenated, and our diffusion model learns this joint latent space. This approach significantly enhances both total and partial generation of music by leveraging the VAE's latent compression and noise robustness. The compressed source latent also facilitates more efficient generation. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than direct music mixture modeling. Code and models are available at this https URL. Demos are available at this https URL.
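The data flow described in the abstract (one shared encoder applied to every source, source latents concatenated into a joint latent, decoded sources summed into a mixture) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the authors' implementation: `toy_encode` and `toy_decode` are hypothetical stand-ins for the trained VAE (here simple average pooling and frame repetition), the `hop` compression factor and the source order are arbitrary, and the diffusion model that would operate on the joint latent is omitted entirely.

```python
import numpy as np

# Source names taken from the abstract's example; the order is an assumption.
SOURCES = ("bass", "drums", "guitar", "piano")

def toy_encode(wave, hop=4):
    """Hypothetical stand-in for the shared VAE encoder.

    Average-pools the waveform by `hop` and returns a latent of
    shape (1, T // hop), i.e. one latent channel per source.
    """
    return wave.reshape(-1, hop).mean(axis=1)[None, :]

def toy_decode(latent, hop=4):
    """Hypothetical stand-in for the VAE decoder: repeat each latent frame."""
    return np.repeat(latent[0], hop)

def joint_latent(stems, hop=4):
    # The same encoder is applied to every source; the per-source latents
    # are concatenated along the channel axis to form the joint latent.
    return np.concatenate([toy_encode(stems[name], hop) for name in SOURCES], axis=0)

def mix_from_latent(z, hop=4):
    # Split the joint latent back into source latents, decode each one
    # separately, and sum the decoded sources to obtain the mixture.
    return sum(toy_decode(z[i:i + 1], hop) for i in range(len(SOURCES)))

rng = np.random.default_rng(0)
stems = {name: rng.standard_normal(16) for name in SOURCES}

z = joint_latent(stems)        # joint latent: (number of sources, latent frames)
mixture = mix_from_latent(z)   # mixture waveform with the original length
```

In this sketch a generative model would be trained on arrays shaped like `z`; partial generation would then correspond to fixing some rows of `z` (the given sources) and generating the remaining rows.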

Comments:	ICASSP 2025 in Submission
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2409.06190 [eess.AS]
	(or arXiv:2409.06190v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2409.06190

Submission history

From: Zhongweiyang Xu
[v1] Tue, 10 Sep 2024 03:41:10 UTC (1,537 KB)
[v2] Fri, 13 Sep 2024 05:01:02 UTC (1,538 KB)
[v3] Tue, 15 Oct 2024 19:17:33 UTC (1,538 KB)
