Multi-Source Music Generation with Latent Diffusion

Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign

Abstract

Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually coherent music sources that are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a Multi-Source Latent Diffusion Model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source’s unique characteristics in a “source latent”. The source latents are concatenated, and our diffusion model learns this joint latent space. This approach significantly enhances the total and partial generation of music by leveraging the VAE’s latent compression and noise-robustness. The compressed source latent also facilitates more efficient generation. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than direct music mixture modeling. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.

Index Terms:
Music Generation, Latent Diffusion

I Introduction

Generative models show impressive performance not only in language and image modeling [1, 2, 3], but also show promising results in music generation. Music generation models usually fall into two categories: 1) Auto-regressive models and 2) Diffusion models.

For auto-regressive models, WaveNet [4] directly models scalar-quantized waveform samples, which allows for generating small musical fragments. However, due to the sample-level auto-regression, WaveNet has low sampling efficiency. One way to improve efficiency is to encode waveform samples to a discrete latent representation (tokens) with a much lower time resolution. These tokenizers [5, 6, 7] are usually variations of VQ-VAE [8], but are usually trained with perceptual adversarial loss [6, 7, 9, 10]. Then, the auto-regressive transformer models the sequence of tokens achieving higher efficiency. Among these models, JukeBox [5] allows music generation conditioned on lyrics. More recently, text-to-music generation has shown significant progress based on this framework [11, 12].

On the other hand, diffusion-based music generation models also hold great potential, where the diffusion model is learned on some intermediate representation. Noise2Music [13] uses diffusion models to generate an intermediate representation, either downsampled waveforms or Mel-spectrogram features, and then decodes the intermediate representation to a music waveform with a cascader or vocoder. Following the success of latent diffusion in image generation [14], music generation has taken the same path. DiffSound [15] first trains a spectrogram VQ-VAE tokenizer as the intermediate representation, and then uses a discrete diffusion model to model the token sequence. The works in [16, 17, 18, 19] use the continuous latent of a spectrogram-domain (variational) autoencoder as the intermediate representation for diffusion, while [20, 21] use the latent of a waveform-domain VAE as the diffusion target. Moûsai [22] proposes a spectrogram encoder learned by diffusion magnitude autoencoding (DMAE) and then trains another diffusion model on the encoder’s latent. Further, with a waveform-domain VAE, StableAudio2 [23] achieves full-song generation by modeling the VAE latent with diffusion.

Although rapid progress has been made in music generation, most models directly generate the whole music piece, which is a mixture of individual sources. However, the individual sources cannot be disentangled from the mixture. Ideally, a music generation method should be able to generate the individual music sources that together form a piece of music, similar to a human’s music composition process. This would make the music more interpretable and controllable (e.g., the piano can be made louder than the drums). To solve this problem, one class of methods directly models musical notes or MIDI representations in a multi-track manner [24, 25]. However, the generated notes or MIDI sequences need to be later decoded to a single waveform using synthesizers. The other line of research learns to model several music tracks directly. StemGen [26] uses a masked language model on Encodec tokens to generate any single instrument source given a music context. [27] uses a latent diffusion model to generate bass accompaniments conditioned on mixtures, while SingSong [28] generates background accompaniments given the vocal source. Most recently, MSDM [29] has been proposed to simultaneously model four instrument sources (piano, drums, bass, guitar) with a single waveform-domain diffusion model, and GMSDI [30] has generalized MSDM by training text-conditioned diffusion models, allowing adaptation to any music dataset using text descriptions. The closest work to ours is multi-track MusicLDM [31, 32], which is concurrent with this paper, although some implementation differences exist.

Figure 1: An Overview of the proposed MSLDM framework.

In this paper, we propose to jointly model four different instrumental sources (piano, drums, bass, guitar) with a single multi-source latent diffusion model (MSLDM), as shown in Fig. 1. We first train a shared SourceVAE on all the sources to perceptually compress the source audio, and then use this VAE’s encoder to extract the latent feature of each source. We then apply diffusion to model the generation of the latents of the sources. We claim that 1) with the VAE compressing the source audio into a compact latent, the diffusion model can better capture semantic and sequential information such as melodies and the harmony between sources, and 2) modeling individual sources is better than directly modeling mixtures. Our results in both subjective human evaluation and objective FAD scores validate these claims.

II Models

Fig. 1 shows the training and inference pipeline of our model. Like any latent diffusion model, our model involves two blocks: (1) SourceVAE, which is trained like a VAE but with an adversarial loss for perceptual compression, and (2) a diffusion model that simultaneously models all the sources’ latents concatenated together as one latent. Inference includes two sub-tasks as well: (1) total generation, which allows unconditional generation of all instrumental sources at the same time, and (2) partial generation, which allows the generation of companion sources given any combination of instrumental sources (e.g., generating bass and guitar to accompany given piano and drums).

II-A SourceVAE

The SourceVAE aims to compress waveform-domain instrumental sources into a compact latent space, while still ensuring perceptually indistinguishable reconstruction. This is usually achieved by adversarial training with carefully designed discriminators. We borrow this training framework and model architecture from the DAC [10] neural audio codec. DAC is a state-of-the-art waveform-domain neural audio codec trained with both reconstruction and adversarial losses. It can encode, quantize, and decode audio with superior quality. However, we want our latent to be noise-robust and continuous, so we remove the vector quantization module, constrain the intermediate latent size, and add a small KL-divergence loss term as used in vanilla VAEs [33]. The KL-divergence loss ensures a noise-robust latent space. For the encoder and decoder, we use DAC’s default 24kHz model but with a new latent size of $C=80$. Given an instrumental source $s\in\mathbb{R}^{N}$ with $N$ samples, the encoder encodes the waveform to a posterior $\Psi_{\text{enc}}(s)=\mathcal{N}(\cdot|\mu_{z}(s),\Sigma_{z}(s))$, where $\mu_{z}(s)\in\mathbb{R}^{C\times\frac{N}{D}}$ is the posterior mean of the latent and $\Sigma_{z}(s)$ is the corresponding posterior covariance. $D$ is the time-domain downsampling factor of the encoder, which is $320$ in DAC. Then any $z\sim\mathcal{N}(\cdot|\mu_{z}(s),\Sigma_{z}(s))$, processed by the decoder $\psi_{\text{dec}}$, should reconstruct $s$ with good quality. To extract latent features, we take the posterior mean $z_{s}=\mu_{z}(s)$. During training, the loss is as shown below:

$$L_{\text{SourceVAE}}=\lambda_{1}L_{\text{Mel}}+\lambda_{2}L_{\text{feature}}+\lambda_{3}L_{\text{adversarial}}+\lambda_{4}L_{\text{KL}} \tag{1}$$

where $L_{\text{Mel}}$, $L_{\text{feature}}$, and $L_{\text{adversarial}}$ are the Mel-reconstruction loss, feature matching loss, and adversarial loss, respectively, as in the DAC training framework. We set $\lambda_{1}=15$, $\lambda_{2}=2$, $\lambda_{3}=1$ according to DAC. $L_{\text{KL}}$ is the KL-divergence between the VAE-encoded posterior and $\mathcal{N}(0,I)$, and we set $\lambda_{4}=10$. The implementation details are available in our source code.
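As a minimal sketch of how the terms of Eq. (1) combine, the snippet below computes the KL term in closed form and forms the weighted sum. It assumes a diagonal posterior covariance parameterized by a log-variance and a KL summed over all latent entries; neither detail is stated in the text, and the function names and the scalar loss values are purely illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over all latent entries.

    Assumption: diagonal covariance and sum reduction (not specified in the paper).
    """
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def source_vae_loss(l_mel, l_feature, l_adversarial, l_kl,
                    lambdas=(15.0, 2.0, 1.0, 10.0)):
    """Weighted sum of Eq. (1) using the weights quoted in the text."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_mel + l2 * l_feature + l3 * l_adversarial + l4 * l_kl

# Hypothetical posterior: C = 80 channels and an illustrative 75 latent frames.
mu = np.zeros((80, 75))
logvar = np.zeros((80, 75))
kl = kl_to_standard_normal(mu, logvar)        # posterior equals the prior here
total = source_vae_loss(1.0, 0.5, 0.2, kl)    # placeholder values for the DAC loss terms
```

In practice the Mel, feature-matching, and adversarial terms come from the DAC training code; only their weighting is illustrated here.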

II-B Multi-Source Latent Diffusion

In our setup, assume any music piece $x\in\mathbb{R}^{N}$ is a mixture of $K$ instrumental sources: $x=\sum_{k=1}^{K}s_{k}$, where the sources $S=(s_{1},s_{2},\ldots,s_{K})\in\mathbb{R}^{K\times N}$ are coherently added together to form the musical mixture $x$. Our goal is to sample from the distribution of $S$ to get multi-source music. Instead of directly modeling the generation of $S$ as in [29], we propose to model the generation of $Z_{S}=(z_{s_{1}},z_{s_{2}},\ldots,z_{s_{K}})\in\mathbb{R}^{K\times C\times\frac{N}{D}}$, where $z_{s_{i}}\in\mathbb{R}^{C\times\frac{N}{D}}$ is the SourceVAE’s latent of $s_{i}\in\mathbb{R}^{N}$. To simplify the notation, we drop the subscript $s$ for the latent, so we are modeling the generation of $Z=(z_{1},z_{2},\ldots,z_{K})\in\mathbb{R}^{K\times C\times\frac{N}{D}}$.

We model the generation of $Z=(z_{1},z_{2},\ldots,z_{K})$ with a score-based diffusion model [34]. Following EDM [35], with the diffusion schedule $\sigma(t)=t$, the forward diffusion process is defined by:

$$dZ(t)=-\sigma(t)\nabla_{Z(t)}\log p(Z(t))\,dt \tag{2}$$

where $Z(t)\sim\mathcal{N}(Z(0),\sigma^{2}(t)I)$ and $Z(0)=Z$. Then with an ODE solver, we sample $Z$ by solving the backward process:

$$dZ(t)=\sigma(t)\nabla_{Z(t)}\log p(Z(t))\,dt \tag{3}$$

We approximate the score $\nabla_{Z(t)}\log p(Z(t))$ with a neural network $S^{\theta}(Z(t),\sigma(t))$ and train it with the score matching loss following the practice in EDM [35], i.e., $\sigma_{data}=0.4$, $p_{train}(\sigma)=\mathrm{Uniform}(0,3)$.
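The following sketch illustrates the noising step $Z(t)\sim\mathcal{N}(Z(0),\sigma^{2}(t)I)$ and the standard EDM relation between a denoiser output $D(z;\sigma)$ and the score, $(D(z;\sigma)-z)/\sigma^{2}$. Whether the authors' network predicts a denoised latent or the score directly is not stated here, so this is an assumption; the Gaussian toy data and the closed-form denoiser stand in for the real latents and the trained U-Net.

```python
import numpy as np

SIGMA_DATA = 0.4  # value quoted in the text

def sample_training_sigma(rng, low=0.0, high=3.0):
    """Draw a noise level from p_train(sigma) = Uniform(0, 3)."""
    return rng.uniform(low, high)

def add_noise(z0, sigma, rng):
    """Sample Z(t) ~ N(Z(0), sigma^2 I)."""
    return z0 + sigma * rng.standard_normal(z0.shape)

def score_from_denoiser(z_noisy, denoised, sigma):
    """Score implied by a denoiser output D(z; sigma): (D - z) / sigma^2."""
    return (denoised - z_noisy) / sigma**2

def toy_optimal_denoiser(z_noisy, sigma, sigma_data=SIGMA_DATA):
    """Closed-form denoiser when clean data is N(0, sigma_data^2 I) (toy assumption)."""
    return z_noisy * sigma_data**2 / (sigma_data**2 + sigma**2)

rng = np.random.default_rng(0)
z0 = SIGMA_DATA * rng.standard_normal((4, 80, 16))   # toy stand-in for Z, shape (K, C, frames)
sigma = sample_training_sigma(rng)
z_t = add_noise(z0, sigma, rng)
score = score_from_denoiser(z_t, toy_optimal_denoiser(z_t, sigma), sigma)
```

For this toy distribution the result equals the exact score $-z/(\sigma_{data}^{2}+\sigma^{2})$, which makes the helper easy to sanity-check before swapping in a learned model.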

II-C Inference

The inference pipeline is marked by the dashed objects in Fig. 1. During sampling, we use the same sampler as MSDM [29]. The sampler is an Euler-method-based ODE solver that integrates Eq. 3 with some stochasticity controlled by the parameter $s_{\text{churn}}$, as proposed in EDM [35]. We use $\sigma_{min}=0.01$, $\sigma_{max}=3$, $\rho=7$, $s_{churn}=20$, $n_{steps}=150$, also following the configuration in EDM.
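The parameters $\sigma_{min}$, $\sigma_{max}$, $\rho$, and $n_{steps}$ suggest the $\rho$-spaced noise schedule from EDM. The sketch below computes that schedule with the quoted values; it assumes the sampler uses exactly this discretization, which should be checked against the released code, and the function name is illustrative.

```python
import numpy as np

def karras_sigmas(n_steps=150, sigma_min=0.01, sigma_max=3.0, rho=7.0):
    """EDM rho-schedule: interpolate linearly in sigma^(1/rho), then raise to rho."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

sigmas = karras_sigmas()   # decreasing noise levels from sigma_max to sigma_min
```

A larger $\rho$ concentrates more steps near $\sigma_{min}$, where fine detail is resolved.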

II-C1 Total Generation

The total generation inference is straightforward. Starting from randomly sampled white noise $Z(T)\sim\mathcal{N}(\cdot|0,\sigma_{max}^{2}I)$, the diffusion sampling process gradually transforms $Z(T)$ to $Z(0)$. Then the instrumental source latents $z_{1},z_{2},\ldots,z_{K}$ are extracted from $Z(0)$ and decoded independently by the SourceVAE decoder to get the generated source waveforms $\{s_{i}\in\mathbb{R}^{N}\,|\,s_{i}=\psi_{\text{dec}}(z_{i}),\,i\in[1,2,\ldots,K]\}$. These generated sources can then be added to form the music mixture $x=\sum_{k=1}^{K}s_{k}$.
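A minimal sketch of total generation as a deterministic Euler integration is shown below. It omits the $s_{\text{churn}}$ stochasticity, replaces the trained network $S^{\theta}$ with the exact score of a toy Gaussian, and stops before the SourceVAE decoder; all function names and the frame count are illustrative assumptions.

```python
import numpy as np

SIGMA_DATA = 0.4

def karras_sigmas(n_steps, sigma_min, sigma_max, rho):
    """EDM rho-schedule of decreasing noise levels."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    return (hi + ramp * (lo - hi)) ** rho

def toy_score(z, sigma):
    """Exact score of N(0, (sigma_data^2 + sigma^2) I); stands in for S^theta."""
    return -z / (SIGMA_DATA**2 + sigma**2)

def euler_sample(score_fn, z_init, sigmas):
    """Deterministic Euler integration from sigmas[0] down to sigmas[-1] (no churn)."""
    z = z_init
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        dz_dsigma = -sigma * score_fn(z, sigma)    # ODE slope with sigma(t) = t
        z = z + (sigma_next - sigma) * dz_dsigma   # sigma_next < sigma, so z moves along +score
    return z

K, C, T = 4, 80, 32                                # T is an illustrative number of latent frames
rng = np.random.default_rng(0)
sigmas = karras_sigmas(150, 0.01, 3.0, 7.0)
z_T = sigmas[0] * rng.standard_normal((K, C, T))   # Z(T) ~ N(0, sigma_max^2 I)
z_0 = euler_sample(toy_score, z_T, sigmas)
source_latents = [z_0[k] for k in range(K)]        # each would be passed to the SourceVAE decoder
```

Because the noise level decreases at every step, the update moves along the positive score direction, matching the sign convention of Eq. 3.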

II-C2 Partial Generation

Partial generation is the task of generating complementary sources given some existing ones to condition on. Assume a subset of instruments is given by the indices $I\subset\{1,2,\ldots,K\}$, and the corresponding given sources are denoted by $S_{I}=\{s_{i}\}_{i\in I}$. Then the complementary sources to generate are indexed by $\bar{I}=\{1,\ldots,K\}\backslash I$, which means the sources to generate are $S_{\bar{I}}=\{s_{i}\}_{i\in\bar{I}}$. Since our diffusion model works in the latent domain, we first use SourceVAE to encode each source in $S_{I}$ to the latent $Z_{I}=\{z_{i}\,|\,z_{i}=\mu_{z}(s_{i}),\,i\in I\}$. The task is to generate $Z_{\bar{I}}$ conditioned on $Z_{I}$, so we need the conditional score $\nabla_{Z_{\bar{I}}(t)}\log p(Z_{\bar{I}}(t)|Z_{I}(t))$ for sampling. Following diffusion-based imputation [34] and MSDM [29], we can estimate the conditional score by:

\begin{align}
\nabla_{Z_{\bar{I}}(t)}\log p(Z_{\bar{I}}(t)|Z_{I}(t)) &\approx \nabla_{Z_{\bar{I}}(t)}\log p([Z_{\bar{I}}(t),\hat{Z}_{I}(t)]) \tag{4} \\
&\approx S^{\theta}([Z_{\bar{I}}(t),\hat{Z}_{I}(t)],\sigma(t)) \tag{5}
\end{align}

where $\hat{Z}_{I}(t)$ is sampled from $\mathcal{N}(\cdot|Z_{I}(0),\sigma^{2}(t)I)$. Then using Eq. 5, we can use the sampling method in Sec. II-C to solve the following ODE (similar to Eq. 3), initialized from $Z_{\bar{I}}(T)\sim\mathcal{N}(\cdot|0,\sigma_{max}^{2}I)$:

$$dZ_{\bar{I}}(t)=\sigma(t)S^{\theta}([Z_{\bar{I}}(t),\hat{Z}_{I}(t)],\sigma(t))\,dt \tag{6}$$

With $Z_{\bar{I}}(0)$ sampled, the partially generated sources $S_{\bar{I}}$ can be decoded by the SourceVAE decoder: $S_{\bar{I}}=\{s_{i}\in\mathbb{R}^{N}\,|\,s_{i}=\psi_{\text{dec}}(z_{i}),\,z_{i}\in Z_{\bar{I}}\}$. Then all the conditional sources and partially generated sources are added to form the final music piece $x$.
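The imputation-style procedure of Eqs. 4 to 6 can be sketched as follows: at every step the given latents are re-noised to the current noise level, the joint score is evaluated, and only the missing sources are updated. This is a simplified deterministic version under stated assumptions: 0-based source indices, a geometric noise schedule, and a toy score with no cross-source coupling, so it shows the control flow rather than any musical coherence.

```python
import numpy as np

SIGMA_DATA = 0.4

def toy_score(z, sigma):
    """Exact score of an isotropic Gaussian; stands in for the trained joint network."""
    return -z / (SIGMA_DATA**2 + sigma**2)

def partial_generate(score_fn, z_given, given_idx, num_sources, sigmas, rng):
    """Generate the missing source latents while re-noising the given ones each step."""
    given_idx = list(given_idx)
    missing_idx = [k for k in range(num_sources) if k not in given_idx]
    C, T = z_given.shape[1:]
    z = np.zeros((num_sources, C, T))
    z[missing_idx] = sigmas[0] * rng.standard_normal((len(missing_idx), C, T))
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Z_hat_I(t) ~ N(Z_I(0), sigma^2 I): noisy copy of the conditioning latents
        z[given_idx] = z_given + sigma * rng.standard_normal(z_given.shape)
        score = score_fn(z, sigma)                                   # joint score over all sources
        step = (sigma_next - sigma) * (-sigma * score[missing_idx])  # Euler step, Eq. 6
        z[missing_idx] = z[missing_idx] + step                       # update missing sources only
    return z[missing_idx], missing_idx

rng = np.random.default_rng(0)
sigmas = np.geomspace(3.0, 0.01, 150)                     # simple decreasing schedule for the demo
z_given = SIGMA_DATA * rng.standard_normal((2, 80, 16))   # latents of two hypothetical given sources
z_missing, missing_idx = partial_generate(toy_score, z_given, [0, 1], 4, sigmas, rng)
```

With a trained joint model, the score of the missing sources depends on the re-noised given sources, which is what steers the generated tracks toward the conditioning ones.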

III Experiments and Dataset

III-A Dataset

We use the same dataset as MSDM [29], namely the slakh2100 music dataset [36]. slakh2100 is a MIDI-synthesized music dataset with 145 hours of music, containing both the mixed music and the individual tracks labeled with instrument class. As in MSDM, we use the $K=4$ main tracks (piano, drums, bass, and guitar) for multi-source modeling. For a fair comparison, we use the identical sampling rate of 22,050 Hz.

III-B SourceVAE

The SourceVAE mentioned in Sec. II-A is a 1D-CNN-based encoder-decoder architecture coupled with a DAC loss and a KL-divergence loss. The encoder and decoder both follow the final setup of the DAC architecture for 24kHz. The intermediate latent dimension for SourceVAE is set to $C=80$ and the encoder has a downsampling rate of $D=320$ for the temporal dimension. For training, we use a batch size of 28 and train on one-second-long single-instrument segments for 100k steps. All other SourceVAE training configurations are the same as in the DAC paper, with code available at https://github.com/descriptinc/descript-audio-codec.

III-C Latent Diffusion and Unet Architecture

As mentioned in Sec. II-B, the score estimation network $S^{\theta}(Z(t),\sigma(t))$ learns a function mapping as shown below:

$$S^{\theta}(Z(t),\sigma(t)):\mathbb{R}^{K\times C\times\frac{N}{D}}\times\mathbb{R}\rightarrow\mathbb{R}^{K\times C\times\frac{N}{D}} \tag{7}$$

To accommodate the 1D-Unet architecture used in MSDM, we concatenate the $K$ source latents channel-wise, so the new channel dimension becomes $KC$, and the input to the 1D-Unet is $Z^{\prime}(t)\in\mathbb{R}^{KC\times\frac{N}{D}}$, which is a reshaped version of $Z(t)$. $KC$ is treated as the channel dimension and $\frac{N}{D}$ is treated as the temporal dimension for the Unet. In our setup, $K=4$, $C=80$, $D=320$. We train our diffusion model on segments with $N=327672$ samples (about 15 seconds). For the 1D-Unet architecture, similar to MSDM, we adapt the architecture used in Moûsai [22] with some modifications. We set the input channel dimension to $KC=320$ and then experiment with two different configurations. We call one model MSLDM and the larger one MSLDM-Large. MSLDM contains 6 nested U-Net blocks with increasing channels [1024, 2048, 4096, 4096, 4096, 4096]. The downsampling factors for the blocks are [1,1,2,1,1,2]. Self-attention is used in all blocks except the first one. 12 attention heads are used, and each head is 64-dimensional. For MSLDM-Large, the Unet contains 8 layers whose output channel dimensions are [1024, 2048, 4096, 4096, 4096, 4096, 4096, 4096]. The corresponding downsampling factors are [1,1,2,1,1,2,2,2]. All blocks except the first one contain self-attention, and each attention block contains 12 attention heads that are each 128-dimensional.
The Unet and diffusion code setup are adapted from audio-diffusion-pytorch. Similar to MSDM, we train the diffusion model with a batch size of 16 and a learning rate of 2e-5 for 400k steps.
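The channel-wise concatenation amounts to a reshape between the $(K, C, N/D)$ and $(KC, N/D)$ layouts. The sketch below illustrates this with NumPy; the frame count of 1024 and the assumption that source $k$ occupies channels $kC$ to $(k+1)C-1$ are illustrative and may differ from the released code.

```python
import numpy as np

K, C, T = 4, 80, 1024                  # T plays the role of N / D; 1024 is an illustrative value
Z = np.random.default_rng(0).standard_normal((K, C, T))

Z_unet = Z.reshape(K * C, T)           # channel-wise concatenation -> shape (320, 1024)
Z_back = Z_unet.reshape(K, C, T)       # split the U-Net output back into K source latents

second_source = Z_unet[C:2 * C]        # channels 80..159 hold the latent of source index 1
```

Because the reshape is lossless, each source latent can be sliced out again after sampling and decoded on its own.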

All the model training is performed on a single RTX A6000 GPU with 48GB of VRAM. Further details can be found in our code.

IV Evaluation Metrics and Results

We evaluate two tasks: (1) Total generation and (2) Partial generation as mentioned in Sec. II-C. For both tasks, we evaluate the FAD score [37] and human subjective score with a listening test. We compare our performance against 3 baselines.

IV-A Baseline Models

The first baseline is the MSDM [29] model. To show the effectiveness of modeling sources instead of mixtures, we design another baseline called MixLDM, where the latent diffusion model directly models the latent of the music mixture instead of the instrumental sources. This is the more common practice in diffusion-based audio/music generation. For training this model, we first train a MixtureVAE, which is the same as SourceVAE except that it is trained on mixture music. Its latent size is $C=320$, so that the diffusion model’s input has the same dimension as in MSLDM. To verify that our model generates sources that are mutually coherent (i.e., in harmony with each other), we design a third baseline called ISLDM (Independent Source Latent Diffusion Model), where we train four independent diffusion models on the four instruments’ latents, respectively, so each model can generate one single instrument. For each single-source model, the input channel size becomes $C=80$, and all the other training parameters are the same as for MSLDM. Since the single-source models are independent of each other, they cannot generate mutually coherent sources. The number of parameters of each model and its inference time to generate one 12-second mixture on a single RTX A6000 GPU are listed in Table I.

TABLE I: Model parameters and inference time for generating a 12-second music mixture.

Model                # Parameters (M)   Inference time (s)
MSDM                 405                7.92
MixLDM               364                5.44
ISLDM                364 × 4            5.44 × 4
MSLDM (ours)         364                5.44
MSLDM-Large (ours)   1654               7.44
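To make the "× 4" entries for ISLDM concrete, the following sketch expands them and relates each inference time to the 12-second clip length. It assumes the four ISLDM models are run one after another, which is how we read the 5.44 × 4 entry. The real-time factor (inference time divided by clip duration) is a quantity introduced here for illustration and is not reported in the paper.

```python
CLIP_SECONDS = 12.0

# (parameters in millions, inference time in seconds), copied from Table I.
TABLE_I = {
    "MSDM": (405, 7.92),
    "MixLDM": (364, 5.44),
    "ISLDM": (364 * 4, 5.44 * 4),  # four single-source models, run sequentially
    "MSLDM": (364, 5.44),
    "MSLDM-Large": (1654, 7.44),
}


def real_time_factor(inference_seconds: float, clip_seconds: float = CLIP_SECONDS) -> float:
    """Values below 1 mean generation is faster than playback."""
    return inference_seconds / clip_seconds


for name, (params_m, seconds) in TABLE_I.items():
    rtf = real_time_factor(seconds)
    print(f"{name:12s} {params_m:5d} M  {seconds:6.2f} s  RTF = {rtf:.2f}")
```

Under this reading, ISLDM is the only setup whose total inference time exceeds the length of the generated clip.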
TABLE II: Sub-FAD (lower is better) for partial generation, reported for every combination of generated sources (B: Bass, D: Drums, G: Guitar, P: Piano). Each column names the sources to be generated; the remaining sources are given as conditioning. For example, BD means generating bass and drums conditioned on guitar and piano.

Model         B     D     G     P     BD    BG    BP    DG    DP    GP    BDG   BDP   BGP   DGP   Overall
MSDM          0.23  0.75  0.18  0.49  1.75  0.75  1.40  1.30  1.40  1.77  3.13  2.92  5.54  3.51  1.79
ISLDM         0.30  1.41  0.75  0.42  1.52  1.14  0.76  1.56  1.76  1.33  1.85  2.03  1.78  2.17  1.34
MSLDM         0.24  1.27  0.38  0.32  1.22  0.81  0.64  1.00  0.92  0.98  1.44  1.57  1.48  1.43  0.98
MSLDM-Large   0.14  0.51  0.23  0.41  0.56  0.49  0.61  0.59  0.66  1.05  0.81  1.01  1.40  1.25  0.70

Figure 2: Total Generation FAD.

IV-B Fréchet Audio Distance (FAD)

We use the Fréchet Audio Distance (FAD) [37] computed on VGGish features [38] as the objective metric for both total and partial generation. Following MSDM [29], each FAD computation uses 200 music segments of about 12 seconds each. The reference set consists of 200 12-second segments randomly sampled from the slakh2100 test set.
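For reference, FAD as defined in [37] fits a multivariate Gaussian to the embeddings of each set and takes the Fréchet distance between the two Gaussians. The symbols below are introduced here for illustration: $(\mu_r, \Sigma_r)$ and $(\mu_e, \Sigma_e)$ denote the mean and covariance of the VGGish embeddings of the reference set and the evaluated set, respectively.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_e \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_e - 2\,(\Sigma_r \Sigma_e)^{1/2} \right)
```

The distance is zero when the two embedding distributions share the same mean and covariance, so a lower FAD indicates generated audio that is statistically closer to the reference set.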

Figure 3: Subjective listening test result.

For total generation, we evaluate each generated instrumental source (piano, drums, bass, guitar) as well as the summed mixture; MixLDM can only be evaluated on the mixture. The FAD results are reported in Fig. 2. For single-source generation, ISLDM generally has the lowest FAD, showing the best generation performance. This is reasonable because each independent diffusion model only needs to model a single source. Our proposed MSLDM and MSLDM-Large also show promising results, beating the MSDM baseline by a large margin, and the drums generated by MSLDM-Large even achieve a lower FAD than those of ISLDM. For mixture generation, MixLDM exhibits the worst performance, which supports our belief that modeling the mixture directly is much harder than modeling sources. ISLDM, MSLDM, and MSLDM-Large all outperform MSDM, demonstrating the efficacy of latent diffusion in modeling the harmony in realistic music mixtures. Moreover, MSLDM and MSLDM-Large achieve a lower mixture FAD than ISLDM, showing that our model successfully generates sources in harmony. Although MSLDM performs worse than ISLDM on every individual instrument, it has the lower FAD once the sources are summed into a mixture, which indicates that it is able to model inter-source harmony.

For partial generation, we use the sub-FAD as the metric, similar to MSDM. The sub-FAD is the FAD between a reference set and an evaluated set, where the reference set contains real music mixtures and the evaluated set contains mixtures formed by summing the partially generated sources with the originally given sources. For the sub-FAD calculation, we again sample 200 12-second music segments from the slakh2100 test set, this time ensuring that every segment contains all four instruments. The sub-FAD results for all models and all partial generation setups are shown in Table II. In this task, ISLDM generates only the target sources, using its source-independent models. Overall, MSLDM-Large and MSLDM perform much better than ISLDM and MSDM, with overall sub-FADs below 1 (0.70 and 0.98). Interestingly, ISLDM's overall sub-FAD is lower than that of MSDM, implying that even though ISLDM cannot generate sources that are coherent with the given ones, it models single-source melody much better than MSDM. Across the individual setups, MSLDM has a lower sub-FAD than ISLDM in every case and a lower sub-FAD than MSDM in all but four; MSDM retains the lower sub-FAD when generating bass, drums, or guitar alone and in the BG setup.
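The comparison above can be reproduced directly from Table II. The sketch below enumerates the 14 partial generation setups (every non-empty proper subset of the four sources) and compares models setup by setup. The helper names are hypothetical. Treating the "Overall" column as the unweighted mean of the 14 setups is an inference from the numbers rather than something stated in the text; the means agree with the reported column to within rounding.

```python
from itertools import combinations

SOURCES = "BDGP"  # B: Bass, D: Drums, G: Guitar, P: Piano

# Every non-empty proper subset of the sources, in the column order of Table II.
SETUPS = ["".join(c) for k in (1, 2, 3) for c in combinations(SOURCES, k)]

# Sub-FAD values copied from Table II (lower is better).
SUB_FAD = {
    "MSDM": [0.23, 0.75, 0.18, 0.49, 1.75, 0.75, 1.40, 1.30, 1.40, 1.77, 3.13, 2.92, 5.54, 3.51],
    "ISLDM": [0.30, 1.41, 0.75, 0.42, 1.52, 1.14, 0.76, 1.56, 1.76, 1.33, 1.85, 2.03, 1.78, 2.17],
    "MSLDM": [0.24, 1.27, 0.38, 0.32, 1.22, 0.81, 0.64, 1.00, 0.92, 0.98, 1.44, 1.57, 1.48, 1.43],
    "MSLDM-Large": [0.14, 0.51, 0.23, 0.41, 0.56, 0.49, 0.61, 0.59, 0.66, 1.05, 0.81, 1.01, 1.40, 1.25],
}


def given_sources(setup: str) -> str:
    """Sources used as conditioning when `setup` names the generated ones."""
    return "".join(s for s in SOURCES if s not in setup)


def mean(values: list) -> float:
    return sum(values) / len(values)


def setups_won_by(model: str, rival: str) -> list:
    """Setups in which `model` has a strictly lower sub-FAD than `rival`."""
    return [s for s, a, b in zip(SETUPS, SUB_FAD[model], SUB_FAD[rival]) if a < b]


print("BD is conditioned on:", given_sources("BD"))
for name, values in SUB_FAD.items():
    print(f"{name:12s} mean sub-FAD = {mean(values):.2f}")
print("MSLDM beats ISLDM in:", setups_won_by("MSLDM", "ISLDM"))
print("MSLDM beats MSDM in:", setups_won_by("MSLDM", "MSDM"))
```

Listing per-setup wins rather than only the mean makes it visible where a model with the better overall score still trails on individual setups.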

IV-C Subjective Listening Test

To complement the objective metrics, we also design subjective listening tests with human evaluation. We follow the exact test design in MSDM [29] for total and partial generation.

Total Generation: Each model generates 10 music mixtures of 12 seconds each. For each sample in the test, the participant is asked to rate its ‘quality’ and ‘coherence’ on a scale from 1 to 10 (higher is better). ‘Quality’ corresponds to how realistic the music is (with white noise considered the least realistic), and ‘coherence’ corresponds to how mutually coherent the sources inside the mixture are. The evaluation scores are shown in the upper half of Fig. 3. MSLDM and MSLDM-Large lead the baselines by a large margin in both quality and coherence. MSLDM-Large also scores higher than MSLDM, showing that a larger model size results in better performance. ISLDM's quality is better than that of MSDM, but its coherence is poor because all the sources are generated independently. MixLDM has the worst quality and coherence, showing the difficulty of directly modeling the music mixture. In general, users find that our MSLDM model is consistently better than MSDM, ISLDM, and MixLDM in both quality and coherence.

Partial Generation: We randomly draw 10 samples from the test set. For each sample, the target source types (e.g., generating piano and drums given the other two instruments) are randomly chosen. The models are then used to generate the specified source types for each sample. We give the participants three music segments (the music to condition on, the partially generated music, and the synthesized mixture) and ask them to rate ‘coherence’ and ‘density’ on a scale from 1 to 10. ‘Density’ corresponds to how dense (or sparse) the generated instruments are within the 12-second segment (10 means all target instruments are present for the whole segment). The results are shown in the lower half of Fig. 3. We observe that MSLDM and MSLDM-Large obtain the highest coherence scores, achieving the best performance in generating mutually coherent sources that match the given ones. Again, ISLDM shows the worst coherence because its sources are generated independently. For density, MSLDM, MSLDM-Large, and ISLDM all score highly, showing the ability to generate rich accompaniments. MSDM, however, shows the worst density because it often generates short, simplistic music segments, or even empty sources at times.

Overall, both the objective metrics and the subjective listening tests show that MSLDM outperforms MSDM, ISLDM, and MixLDM in generating multi-source consistent music. Compared with MSDM, MSLDM generates denser, more melodic, and more harmonious music. The comparison with MixLDM shows that modeling sources is much easier than directly modeling music mixtures. The comparison with ISLDM shows that our model is able to capture inter-source dependency, i.e., the harmony between different musical instruments.

V Conclusion and Future Work

In this paper, we propose a multi-source latent diffusion model (MSLDM) that jointly models the generation of multiple instruments. Both objective and subjective metrics show improved results on the tasks of total and partial generation, implying that MSLDM is able to efficiently model melody and inter-source relations. Future work includes extending MSLDM to weakly-supervised music separation and generalizing the model to more instruments.

References

  • [1] OpenAI, “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [2] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever, “Zero-shot text-to-image generation,” 2021.
  • [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei, “Language models are few-shot learners,” 2020.
  • [4] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” 2016.
  • [5] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever, “Jukebox: A generative model for music,” 2020.
  • [6] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” 2022.
  • [7] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” 2021.
  • [8] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu, “Neural discrete representation learning,” 2018.
  • [9] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” 2020.
  • [10] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar, “High-fidelity audio compression with improved rvqgan,” 2023.
  • [11] Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank, “Musiclm: Generating music from text,” 2023.
  • [12] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez, “Simple and controllable music generation,” 2024.
  • [13] Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, and Wei Han, “Noise2music: Text-conditioned music generation with diffusion models,” 2023.
  • [14] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” 2022.
  • [15] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu, “Diffsound: Discrete diffusion model for text-to-sound generation,” 2023.
  • [16] Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao, “Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,” 2023.
  • [17] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,” 2023.
  • [18] Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,” 2024.
  • [19] Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” 2023.
  • [20] Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons, “Stable audio open,” 2024.
  • [21] Peike Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, and Alex Wang, “Jen-1: Text-guided universal music generation with omnidirectional diffusion models,” 2023.
  • [22] Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Schölkopf, “Moûsai: Text-to-music generation with long-context latent diffusion,” 2023.
  • [23] Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons, “Long-form music generation with latent diffusion,” 2024.
  • [24] Gautam Mittal, Jesse Engel, Curtis Hawthorne, and Ian Simon, “Symbolic music generation with diffusion models,” 2021.
  • [25] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang, “Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” 2017.
  • [26] Julian D. Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler, Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, and Duc Le, “Stemgen: A music generation model that listens,” 2024.
  • [27] Marco Pasini, Maarten Grachten, and Stefan Lattner, “Bass accompaniment generation via latent diffusion,” 2024.
  • [28] Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, and Jesse Engel, “Singsong: Generating musical accompaniments from singing,” 2023.
  • [29] Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodolà, “Multi-source diffusion models for simultaneous music generation and separation,” 2024.
  • [30] Emilian Postolache, Giorgio Mariani, Luca Cosmo, Emmanouil Benetos, and Emanuele Rodolà, “Generalized multi-source inference for text conditioned music diffusion models,” 2024.
  • [31] Tornike Karchkhadze, Mohammad Rasool Izadi, Ke Chen, Gerard Assayag, and Shlomo Dubnov, “Multi-track musicldm: Towards versatile music generation with latent diffusion model,” 2024.
  • [32] Tornike Karchkhadze, Mohammad Rasool Izadi, and Shlomo Dubnov, “Simultaneous music separation and generation using multi-track latent diffusion models,” 2024.
  • [33] Diederik P Kingma and Max Welling, “Auto-encoding variational bayes,” 2022.
  • [34] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differential equations,” 2021.
  • [35] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine, “Elucidating the design space of diffusion-based generative models,” 2022.
  • [36] Ethan Manilow, Gordon Wichern, Prem Seetharaman, and Jonathan Le Roux, “Cutting music source separation some slakh: A dataset to study the impact of training data quality and quantity,” 2019.
  • [37] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” 2019.
  • [38] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson, “Cnn architectures for large-scale audio classification,” 2017.