MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers

Ali Boudaghi, Hadi Zare Corresponding author: Ali Boudaghi (email: [email protected]).
Abstract

Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining—thus lacking true zero-shot capability.

Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, the first zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.

I Introduction

The landscape of audio generation has shifted dramatically in recent years. Text-to-music systems now allow users to compose entire musical pieces from simple textual descriptions, powered by advances in diffusion models and transformer architectures [forsgren2022riffusion, liu2023audioldm, huang2023noise2music, Aufussion, accomontage, musicldm, lin2023content, musecoco, copet2024simple, melechovsky2023mustango, textcond]. While impressive, these systems are still primarily designed for creation from scratch. In contrast, real-world music practice often revolves around editing: refining a performance, altering instrumentation, or adapting an existing recording into a new style. For musicians, producers, and casual creators alike, the ability to reshape existing audio is often more valuable than generating entirely new material.

Music editing, however, is fundamentally more difficult than generation. It requires the model to balance two competing goals: applying the requested modification faithfully, and preserving the rich details of the input recording that should remain unchanged. This trade-off is especially challenging when dealing with expressive, polyphonic, or multi-instrumental recordings. Existing research has attempted to address editing through supervised datasets of paired “before” and “after” examples [m2ugen, instructme, audit], or through zero-shot latent manipulations in diffusion models [musicmagus, zeta, transplayer]. Yet most methods remain restricted by their limitation to specific editing tasks, operate mainly on model-generated music rather than arbitrary recordings, and often require very precise prompts to succeed [musicmagus, transplayer]. These limitations hinder their use in flexible, user-friendly creative workflows. Recent works also show that diffusion models can be effective for audio restoration tasks, such as equalization and bandwidth extension [moliner2024diffusion_equalizer].

At the same time, a parallel line of research has introduced rectified flow models [lee2024improvingtrainingrectifiedflows, liu2022flowstraightfastlearning], which reformulate diffusion as a more direct flow between noise and data distributions. Rectified flows enable efficient and stable generation, and have recently been realized at scale through the Flux family of diffusion transformers [tang2024hart, xie2024sana, yang2024cogvideox, peebles2023scalable]. Flux Plays Music in particular demonstrated the power of this approach for text-to-music generation. In computer vision, the work of Taming Rectified Flow for Inversion and Editing [fluxplaysmusic] showed that RF models also support accurate inversion and robust editing, but these ideas have not yet been applied to music. This raises an intriguing question: can the strengths of rectified flow be used not just for generating music, but for editing real recordings in a practical, zero-shot fashion?

I-A Our Approach

In this paper, we introduce a framework for zero-shot music editing based on rectified flow models. Our approach is motivated by recent progress in improving rectified flow (RF) inversion and editing. RF-Solver [tamingrectifiedflow] addresses the reconstruction problem by formulating a more precise sampler for solving the RF ODE, reducing error accumulation during inversion and thereby yielding more faithful reconstructions. Building on this, RF-Edit [tamingrectifiedflow] extends the idea to practical image and video editing: it stabilizes edits by storing and re-injecting the V (value) feature in the self-attention layers of the source, which preserves structure while allowing targeted modifications. Inspired by these advances, we adapt the principles of RF-Solver and RF-Edit to the audio domain.

Specifically, we leverage a Flux-style diffusion transformer originally trained for text-to-music generation [fluxplaysmusic], and extend its capability to real-audio editing through an inversion procedure that maps raw recordings into the rectified-flow latent space. Within this space, targeted manipulations—such as timbre transfer between instruments—can be performed before decoding the results back into high-fidelity music audio.

Our design deliberately avoids additional training: the entire pipeline works in a zero-shot setting. This choice offers several concrete advantages over prior editing approaches:

  1. 1.

    Zero-shot editing: no fine-tuning, paired data, or supervision is required.

  2. 2.

    Real-audio compatibility: the method accepts arbitrary recordings as inputs, not just outputs generated by the model itself.

  3. 3.

    Instrument-agnostic timbre transfer: edits are not tied to a fixed instrument vocabulary, allowing flexible cross-instrument transformations.

  4. 4.

    Accessible prompting: coarse or natural descriptions suffice, removing the need for carefully engineered text prompts.

  5. 5.

    Efficient inversion and generation: our method performs both inversion and editing in only 25 diffusion steps, whereas other models typically require between 50 and 200 steps to achieve comparable results.

I-B Contributions

In summary, our work introduces a new perspective on music editing:

  • We present the first framework for rectified-flow based editing of real music recordings.

  • We demonstrate a zero-shot pipeline that performs timbre transfer, genre transfer and other edits without any re-training.

  • We highlight the practical advantages of this approach: generality across instruments and genres, compatibility with real audio, user-friendly interaction, and fast inversion and generation.

  • Through experiments on diverse datasets and metrics, we show that our method maintains fidelity to the input recording while applying edits with high transferability.

By extending rectified flow beyond generation into editing, we reveal its potential as a foundation for flexible, high-quality, and accessible tools for music creation and transformation.

II Related Work

II-A Text-to-Music Generation

Text-to-music generation has seen rapid advances with the rise of diffusion and transformer-based models. Early approaches relied on autoregressive language models applied to audio data [agostinelli2023musiclm, copet2024simple, musicgenstem, inspiremusic]. Autoregressive models are advantageous due to their strong temporal coherence and ability to capture long-range dependencies in sequential data. However, they often suffer from error accumulation during sampling and can be computationally expensive for generating long sequences.

More recently, diffusion-based audio models including Riffusion [forsgren2022riffusion], AudioLDM [audioldm2], DiffRhythm [diffrythm], Möusai [mousai], and Tango [tango] have enabled high-quality audio synthesis directly from text prompts. Diffusion models excel in producing realistic, high-fidelity audio and are more robust to error propagation compared to autoregressive methods. On the downside, they typically require lengthy iterative denoising steps, which makes inference slower and more resource-intensive.

Recently, hybrid approaches that combine the strengths of both paradigms have emerged. Models such as Auffusion [Aufussion] and MagNet [magnet] integrate the fidelity and robustness of diffusion with the sequential modeling capacity of autoregressive transformers, offering a promising direction for efficient and controllable text-to-music generation.

Control signals such as melody, chord progression, or rhythm have further improved conditioning and user controllability [ditto, musicontrolnet, cocomulla]. While these methods highlight the creative potential of large-scale generative models, they primarily focus on unconditional or text-conditioned generation, not editing.

More recently, rectified flow (RF) has emerged as an alternative to classical diffusion for music generation [musflow, fluxplaysmusic] and editing [melodyflow]. By reformulating the denoising process into a continuous deterministic flow, RF enables faster and more stable text-to-music synthesis while preserving fine temporal and timbral details. This deterministic nature also makes RF particularly suitable for downstream tasks such as inversion and editing, laying the foundation for the approach we develop in this work.

II-B Music Editing

Editing tasks for diffusion models are critical in practical music production but remain less explored compared to generation. Existing approaches typically follow two main directions. The first involves retraining or fine-tuning certain pretrained components of the model [adapter]. While effective in specific cases, these methods are limited because each type of edit requires a new round of fine-tuning. This process can be both computationally expensive and constrained by the scarcity of suitable training data.

The second direction leverages pretrained generative models in a zero-shot fashion. For example, MusicMagus [musicmagus] demonstrated zero-shot editing by manipulating the latent semantics of diffusion models. However, such methods often remain restricted to editing music generated by the model itself, with performance dropping significantly on real-world audio inputs. Moreover, many existing systems rely on complex and precise prompt engineering, which creates a barrier for non-expert users.

In this work, we introduce MusRec, a zero-shot framework built on pretrained rectified flow models for music editing. MusRec injects the self-attention features of the source music from a diffusion transformer directly into the editing process. Unlike prior approaches, it can operate effectively on real-world audio and generalizes across a wide range of editing tasks. Moreover, MusRec removes the need for prompt engineering during both reversal and editing, making music editing more accessible and practical.

Refer to caption
Figure 1: The source audio is first inverted into noise and then denoised to generate the edited audio. During denoising, the self-attention operations within the single blocks are modified according to their corresponding inversion steps. Note that the architecture comprises multiple single and double blocks, although only one of each is illustrated for clarity.

III Background

III-A Rectified Flow

Let π0\pi_{0} and π1\pi_{1} denote two distributions on d\mathbb{R}^{d} (in generative modeling π0\pi_{0} is a simple prior such as 𝒩(0,I)\mathcal{N}(0,I) and π1\pi_{1} is the data distribution). Rectified flow [liu2022flowstraightfastlearning] constructs intermediate states by coupling samples (z0,z1)(z_{0},z_{1}) drawn from some joint coupling of π0\pi_{0} and π1\pi_{1} and then defining a linear interpolation in time. Concretely, for t[0,1]t\in[0,1] we set

zt=αtz0+βtz1,z_{t}\;=\;\alpha_{t}\,z_{0}\;+\;\beta_{t}\,z_{1}, (1)

where αt,βt\alpha_{t},\beta_{t} are scalar schedules satisfying α0=1,β0=0\alpha_{0}=1,\beta_{0}=0 and α1=0,β1=1\alpha_{1}=0,\beta_{1}=1. The canonical rectified-flow choice is αt=1t,βt=t\alpha_{t}=1-t,\ \beta_{t}=t, yielding the straight path between z0z_{0} and z1z_{1}.

Differentiating (1) with respect to tt gives the target instantaneous velocity along the interpolation:

z˙t=α˙tz0+β˙tz1.\dot{z}_{t}\;=\;\dot{\alpha}_{t}\,z_{0}\;+\;\dot{\beta}_{t}\,z_{1}. (2)

Under the canonical schedule αt=1t,βt=t\alpha_{t}=1-t,\beta_{t}=t, the RHS of (2) is constant in tt, z˙t=z1z0\dot{z}_{t}=z_{1}-z_{0}, which is the defining “straight-line” property of rectified flow.

Rectified flow parameterizes a time-dependent velocity field vθ(z,t)v_{\theta}(z,t) (typically a neural network) and defines the generative dynamics by the ODE

dz(t)dt=vθ(z(t),t),\frac{dz(t)}{dt}\;=\;v_{\theta}\bigl(z(t),t\bigr), (3)

with the goal that trajectories of (3) when initialized from z0π0z_{0}\sim\pi_{0} transport mass to match π1\pi_{1}. Training proceeds by velocity matching: for (z0,z1)(z_{0},z_{1}) drawn from the chosen coupling and t𝒰[0,1]t\sim\mathcal{U}[0,1], minimize the mean squared error between the network velocity and the target derivative,

(θ)=𝔼(z0,z1),t[vθ(zt,t)z˙t2],\mathcal{L}(\theta)\;=\;\mathbb{E}_{(z_{0},z_{1}),\,t}\!\bigl[\;\big\|v_{\theta}(z_{t},t)\;-\;\dot{z}_{t}\big\|^{2}\;\bigr], (4)

where ztz_{t} is given by (1) and z˙t\dot{z}_{t} by (2). For the canonical linear schedule z˙t=z1z0\dot{z}_{t}=z_{1}-z_{0} and (4) reduces to matching the network output to the constant straight-line velocity.

In practice sampling (and inversion) integrate the learned ODE (3) numerically. A standard explicit Euler discretization on a partition 0=t0<t1<<tK=10=t_{0}<t_{1}<\dots<t_{K}=1 yields the familiar update

zk+1=zk+(tk+1tk)vθ(zk,tk),z_{k+1}\;=\;z_{k}\;+\;(t_{k+1}-t_{k})\,v_{\theta}(z_{k},t_{k}), (5)

and higher-order integrators may be used in place of (5) to reduce discretization error. The rectified-flow design aims to make trajectories as straight (low-curvature) as possible so that coarse discretizations (small KK) suffice for high-quality sampling; nevertheless, numerical integration error accumulates across steps and is the principal source of inversion/reconstruction error that RF-Solver later targets.

III-B Inversion

The goal of inversion is to recover the latent representation of observed data—such as images or audio—by reversing the generative dynamics. In diffusion models, one of the earliest and most widely adopted techniques is DDIM inversion [songscore, ddim]. This method reconstructs the latent by progressively injecting noise predicted by the model at each forward step. Although this strategy succeeds in producing approximate reconstructions, it is inherently sensitive to discretization, since numerical integration introduces cumulative error across the trajectory. As a result, the final recovered signal may diverge from the original input. To mitigate this issue, several works [nti, stsl, elarabawy2022direct, negprompt] have explored improved inversion procedures. These approaches differ in implementation, yet all remain constrained by the underlying assumptions of diffusion-based dynamics.

In contrast, research on inversion within rectified flow models is still at an early stage. For instance, RF-Prior [yang2024text] applies score distillation to backtrack data into the latent domain, but the reliance on repeated optimization steps makes it computationally demanding. Another direction, proposed by [rout2024rfinversion], augments the system with an additional vector field conditioned on the input, which provides improved reconstructions. Nevertheless, this approach does not fundamentally resolve the inaccuracies inherent in the rectified flow’s native vector field. Consequently, the effectiveness of current techniques remains limited when applied to downstream tasks that require both high-fidelity reconstruction and stable editing.

RF-Solver [tamingrectifiedflow] addresses this issue from a different angle by directly reducing the numerical errors associated with the rectified flow vector field. Instead of modifying the conditioning strategy or relying on optimization-heavy procedures, RF-Solver reformulates the rectified flow ODE using a variation-of-constants decomposition. This separates the system into linear and nonlinear components, with the nonlinear residual approximated through a high-order Taylor expansion. Such a treatment provides a substantially more accurate approximation of the trajectory during both forward sampling and reverse inversion. Importantly, the method is training-free and thus applicable to any pretrained rectified flow model. In practice, RF-Solver significantly enhances inversion fidelity, yielding reconstructions that more closely preserve input details, while simultaneously improving editability and generation quality compared to existing solvers.

Beyond RF-Solver, a few recent studies have also explored alternative numerical schemes to further improve the inversion process in rectified flow models. FireFlow [fireflow] proposes a second-order integration approach that delivers accurate inversions with noticeably fewer function evaluations, striking a practical balance between computational efficiency and reconstruction quality. Similarly, ABM-Solver [abmsolver] adopts an Adams–Bashforth–Moulton predictor–corrector method with adaptive step sizing, which helps maintain stability and produces more consistent edits across different cases. Although these methods were developed independently, they share the same motivation of making rectified flow inversion more reliable and efficient. They represent promising directions for future research and are worth attention as potential complements to solver-based approaches like RF-Solver.

III-C FLUX that Plays Music

Recent work has extended rectified flow models beyond vision and into the audio domain. The system FLUX that Plays Music [fluxplaysmusic] adapts the FLUX rectified flow transformer to text-to-music generation by operating in a latent mel-spectrogram space, demonstrating the versatility of rectified flow architectures across modalities.

The framework first converts raw waveforms into mel-spectrograms, which are then compressed into a lower-dimensional latent space through a variational autoencoder (VAE). All generative operations occur in this latent domain. Textual conditioning is provided through pretrained encoders such as T5 [chung2024scaling], which produces embeddings that capture semantic content, and CLAP [clap], which produces embeddings aligned with audio, capturing both semantic content and audio-relevant attributes from prompts. Within the transformer backbone, generation alternates between double-stream and single-stream processing. Double-stream blocks handle text embeddings and music latents in parallel, with cross-attention allowing textual instructions to influence musical structure. Single-stream blocks then merge the two modalities, concatenating token-level features so that text and audio information can interact more directly. In addition, coarse-level features—such as global prompt vectors or temporal embeddings—are injected via modulation mechanisms that rescale hidden states.

At inference time, sampling begins from Gaussian noise m(0)m(0), which is transported forward under rectified flow dynamics to produce a latent m(1)m(1). This latent is decoded into a mel-spectrogram by the VAE decoder and finally rendered into an audible waveform by a vocoder. Due to the straightened transport trajectories of rectified flow, FluxMusic requires fewer integration steps than comparable diffusion-based text-to-audio models, thereby achieving faster generation.

Despite these advantages, the system is not without limitations. In practice, the generated music does not always faithfully reflect the input prompt: in particular, genres or styles that are underrepresented in the training distribution often lead to outputs that diverge from the intended semantics. This limitation arises from the generative model itself, which struggles to generalize to musical contexts it has not been adequately trained on. Ensuring robust prompt alignment therefore remains a significant challenge in text-to-music generation with rectified flow models.

IV Methodology

Our approach enables controlled editing of audio through a rectified flow–based generative framework. The process begins by encoding the source audio into a latent representation, which is then inverted into noise. During the subsequent denoising stage, the model reconstructs and edits the audio by modifying the self-attention operations according to the corresponding inversion steps. The editing is guided by a new text prompt, allowing semantic transformation of the source content while preserving its rhythmic and structural characteristics. An overview of the entire process is illustrated in Figure 1.

IV-A Encoder

We adopt the pretrained variational autoencoder (VAE) from AudioLDM2 [liu2023audioldm] as our audio encoder. Given an input waveform, it is first converted into a mel-spectrogram representation through a TacotronSTFT-based frontend, which captures both spectral and temporal structure. The resulting features are then encoded into a compact latent representation by the VAE, effectively compressing high-dimensional audio information into a semantically meaningful latent space. This latent space serves as the foundation for downstream generative and editing tasks, enabling the model to operate on a continuous, information-rich representation of sound.

IV-B RF-Solver

The standard Rectified Flow (RF) sampler exhibits strong generative performance but struggles with inversion and reconstruction tasks due to cumulative errors at each timestep. These errors originate from the approximate solution of the rectified flow ordinary differential equation (ODE), which in prior work is estimated using a first-order Euler discretization [liu2022rectified]. To address this limitation, the RF-Solver method [tamingrectifiedflow] introduces a higher-order numerical scheme that provides a more accurate ODE approximation.

Starting from the continuous form of the rectified flow,

d𝒁tdt=𝒗θ(𝒁t,t),\frac{d\boldsymbol{Z}_{t}}{dt}=\boldsymbol{v}_{\theta}(\boldsymbol{Z}_{t},t), (6)

the method applies a Taylor expansion of 𝒗θ(𝒁τ,τ)\boldsymbol{v}_{\theta}(\boldsymbol{Z}_{\tau},\tau) around timestep tit_{i} and integrates it analytically, leading to the following nn-th order approximation:

𝒁ti1=𝒁ti+k=0n1(ti1ti)k+1(k+1)!𝒗θ(k)(𝒁ti,ti)+𝒪(hin+1),\boldsymbol{Z}_{t_{i-1}}=\boldsymbol{Z}_{t_{i}}+\sum_{k=0}^{n-1}\frac{(t_{i-1}-t_{i})^{k+1}}{(k+1)!}\boldsymbol{v}_{\theta}^{(k)}(\boldsymbol{Z}_{t_{i}},t_{i})+\mathcal{O}(h_{i}^{n+1}), (7)

where 𝒗θ(k)\boldsymbol{v}_{\theta}^{(k)} denotes the kk-th order time derivative of 𝒗θ\boldsymbol{v}_{\theta}, and hi=ti1tih_{i}=t_{i-1}-t_{i}.

In practice, the authors find that a second-order approximation (n=2n=2) effectively mitigates reconstruction errors. The resulting update rule, termed RF-Solver, is:

𝒁ti1=𝒁ti\displaystyle\boldsymbol{Z}_{t_{i-1}}=\boldsymbol{Z}_{t_{i}} +(ti1ti)𝒗θ(𝒁ti,ti)\displaystyle+(t_{i-1}-t_{i})\,\boldsymbol{v}_{\theta}(\boldsymbol{Z}_{t_{i}},t_{i})
+12(ti1ti)2𝒗θ(1)(𝒁ti,ti).\displaystyle+\tfrac{1}{2}(t_{i-1}-t_{i})^{2}\,\boldsymbol{v}_{\theta}^{(1)}(\boldsymbol{Z}_{t_{i}},t_{i}). (8)

Since 𝒗θ(1)\boldsymbol{v}_{\theta}^{(1)} cannot be derived analytically, it is estimated numerically via finite differences:

𝒗θ(1)(𝒁ti,ti)=𝒗θ(𝒁ti+Δt,ti+Δt)𝒗θ(𝒁ti,ti)Δt,\boldsymbol{v}_{\theta}^{(1)}(\boldsymbol{Z}_{t_{i}},t_{i})=\frac{\boldsymbol{v}_{\theta}(\boldsymbol{Z}_{t_{i}+\Delta t},t_{i}+\Delta t)-\boldsymbol{v}_{\theta}(\boldsymbol{Z}_{t_{i}},t_{i})}{\Delta t}, (9)

where Δt\Delta t is a small perturbation (set to 0.010.01 in practice).

This second-order solver substantially reduces the local ODE error from 𝒪(hi2)\mathcal{O}(h_{i}^{2}) to 𝒪(hi3)\mathcal{O}(h_{i}^{3}), enabling more accurate inversion and reconstruction.

IV-C Attention Feature Replacement Strategies

To explore the effect of feature-level guidance during the denoising process, we further experimented with several attention feature replacement strategies within the diffusion transformer architecture. The goal of these experiments is to investigate how reusing intermediate representations from the inversion stage can improve controllability and structure preservation in the generated outputs. Inspired by prior work on attention-level feature reuse in rectified flow models [tamingrectifiedflow], We designed three approaches that modify the self-attention operation of the velocity prediction network 𝒗θ\boldsymbol{v}_{\theta} during the denoising process. In our setup, we focus exclusively on the single transformer blocks, as they integrate information from both the source content and the conditioning input through unified modulation. While the double blocks in the underlying architecture process text and audio features separately, the single blocks concatenate these modalities, making them more suitable for controlled feature sharing. This design choice enables the model to leverage joint representations effectively, thereby enhancing its ability to preserve the structural and semantic characteristics of the source sample during generation.

During inversion, we cache the intermediate key and value tensors, {𝒦~tkm}\{\mathcal{\widetilde{K}}_{t_{k}}^{m}\} and {𝒱~tkm}\{\mathcal{\widetilde{V}}_{t_{k}}^{m}\}, from the self-attention modules in the last MM transformer blocks across the final nn timesteps:

𝑭~tkm=Attention(𝒬~tkm,𝒦~tkm,𝒱~tkm),\boldsymbol{\widetilde{F}}^{m}_{t_{k}}=\text{Attention}(\mathcal{\widetilde{Q}}_{t_{k}}^{m},\mathcal{\widetilde{K}}_{t_{k}}^{m},\mathcal{\widetilde{V}}_{t_{k}}^{m}), (10)

where m{1,,M}m\in\{1,\dots,M\} indexes the transformer blocks and k{Nn,,N}k\in\{N-n,\dots,N\} denotes the inversion timesteps. These features encode localized semantic and structural cues from the source sample.

In the denoising phase, we replace the standard self-attention mechanism

𝑭tkm=Attention(𝒬tkm,𝒦tkm,𝒱tkm)\boldsymbol{F}_{t_{k}}^{m}=\text{Attention}(\mathcal{Q}_{t_{k}}^{m},\mathcal{K}_{t_{k}}^{m},\mathcal{V}_{t_{k}}^{m}) (11)

with modified formulations that inject the cached inversion features according to three strategies:

  • (1) Value Replacement: Replace only the value tensor with its cached counterpart,

    𝑭tkm=Attention(𝒬tkm,𝒦tkm,𝒱~tkm),\boldsymbol{F}_{t_{k}}^{m\prime}=\text{Attention}(\mathcal{Q}_{t_{k}}^{m},\mathcal{K}_{t_{k}}^{m},\mathcal{\widetilde{V}}_{t_{k}}^{m}), (12)

    allowing the denoising process to reuse localized feature representations while maintaining the original attention distribution [crossattentioncontrol, plugandplay].

  • (2) Key Replacement: Replace only the key tensor with the cached key,

    𝑭tkm=Attention(𝒬tkm,𝒦~tkm,𝒱tkm),\boldsymbol{F}_{t_{k}}^{m\prime}=\text{Attention}(\mathcal{Q}_{t_{k}}^{m},\mathcal{\widetilde{K}}_{t_{k}}^{m},\mathcal{V}_{t_{k}}^{m}), (13)

    emphasizing structural correspondence between the inversion and denoising phases [manifoldrepresentationkeyvision].

  • (3) Key–Value Replacement: Replace both the key and value tensors simultaneously,

    𝑭tkm=Attention(𝒬tkm,𝒦~tkm,𝒱~tkm),\boldsymbol{F}_{t_{k}}^{m\prime}=\text{Attention}(\mathcal{Q}_{t_{k}}^{m},\mathcal{\widetilde{K}}_{t_{k}}^{m},\mathcal{\widetilde{V}}_{t_{k}}^{m}), (14)

    effectively aligning both the attention map and the feature content with the inversion trajectory.

Through these experiments, we analyze how different forms of attention-level feature reuse influence reconstruction fidelity, edit consistency, and semantic controllability in generative tasks. This investigation provides insight into the role of cross-attention dynamics in rectified flow–based generation and their applicability to complex modalities such as music and audio.

IV-D Classifier-Free Guidance

In rectified flow models, classifier-free guidance (CFG) is applied by interpolating between conditional and unconditional velocity fields to modulate the strength of conditioning. Formally, given the conditional velocity field vθ(xt,y,t)v_{\theta}(x_{t},y,t) and the unconditional one vθ(xt,,t)v_{\theta}(x_{t},\varnothing,t), the guided velocity v^θ\hat{v}_{\theta} is defined as:

v^θ(xt,y,t)=vθ(xt,,t)+s(vθ(xt,y,t)vθ(xt,,t)),\hat{v}_{\theta}(x_{t},y,t)=v_{\theta}(x_{t},\varnothing,t)+s\big(v_{\theta}(x_{t},y,t)-v_{\theta}(x_{t},\varnothing,t)\big),

where ss denotes the CFG scale controlling the influence of the conditioning signal yy. Higher ss values amplify the semantic conditioning, whereas lower values prioritize fidelity to the source. In our experiments, we employed classifier-free guidance (CFG) to control the conditioning strength during both the inversion and denoising processes. The base model, Flux that Plays Music [fluxplaysmusic], was trained with a fixed negative prompt of ”low quality, gentle”. Consequently, we adopted the same negative prompt across all our experiments to maintain consistency with the model’s training distribution. Attempts to introduce alternative negative prompts resulted in degraded performance. For instance, during timbre transfer, we used ”A recording of target instrument” and ”A recording of source instrument” as positive and negative prompts, respectively, during the denoising following the approach proposed in [adapter]. However, this configuration failed to yield meaningful results, likely due to the model’s reliance on its original negative prompt during training.

For the inversion process, we set the CFG scale to 11, whereas the model’s default value during generation is 77. Using a high CFG value (e.g., 77) in inversion was observed to push latent representations into regions of the latent space that are difficult to guide back to meaningful and attribute-aligned states, thereby hindering effective editing while preserving melody and rhythm. Conversely, during the denoising stage, we increased the CFG scale to 2020 to strongly emphasize the new conditioning prompt, ensuring that the model incorporated the desired semantic changes while maintaining musical coherence.

Refer to caption
Figure 2: Transferability–fidelity trade-off effects of injection steps and IB (injection block) count on the timbre transfer task. The diagram shows the results of injecting the value (VV) components of the attention mechanism into generation, i.e., how VV-injection affects fidelity and transferability of the edited audio. For results of injecting the key (KK) components or both key and value (K+VK+V), and for all related results of genre transfer, see the Appendix.
TABLE I: The objective evaluation results on the timbre transfer.
Model Type CLAP \uparrow Chroma \uparrow CLAP+Chroma Avg. \uparrow CQT-1 PCC \uparrow FAD \downarrow
MusicGen Supervised .220 .757 .489 .274 5.320
AudioLDM2 Zero-shot .235 .820 .527 .563 3.574
Zeta Zero-shot .224 .813 .518 .560 5.693
FluxMusic Zero-shot .220 .756 .488 .464 5.403
MusRec K Injection(ours) Zero-shot .262 .718 .490 .366 7.018
MusRec KV Injection(ours) Zero-shot .237 .851 .543 .600 4.265
MusRec V Injection(ours) Zero-shot .236 .843 .535 .583 4.605
TABLE II: The objective evaluation results on the genre transfer.
Model Type CLAP \uparrow Chroma \uparrow CLAP+Chroma Avg. \uparrow CQT-1 PCC \uparrow FAD \downarrow
MusicGen Supervised .454 .754 .604 .129 9.790
AudioLDM2 Zero-shot .585 .698 .641 .153 8.782
Zeta Zero-shot .531 .762 .646 .315 7.158
FluxMusic Zero-shot .524 .771 .647 .354 6.774
MusRec K Injection(ours) Zero-shot .547 .754 .650 .225 11.398
MusRec KV Injection(ours) Zero-shot .545 .797 .671 .424 5.433
MusRec V Injection(ours) Zero-shot .537 .799 .668 .433 5.662
TABLE III: The subjective evaluation results on the timbre transfer.
Model MOS-T \uparrow MOS-P \uparrow Overall \uparrow
AudioLDM2 3.10 3.33 3.21
MusicGen 3.33 2.62 2.98
Zeta 3.57 3.57 3.57
FluxMusic 3.43 3.71 3.57
MusRec KV Injection (ours) 4.05 4.14 4.10
MusRec V Injection (ours) 3.90 4.24 4.07
MusRec K Injection (ours) 3.43 3.05 3.24
TABLE IV: The subjective evaluation results on the genre transfer.
Model MOS-T \uparrow MOS-P \uparrow Overall \uparrow
AudioLDM2 3.14 1.86 2.50
MusicGen 2.67 2.57 2.62
Zeta 2.71 3.76 3.24
FluxMusic 2.95 3.62 3.29
MusRec KV Injection (ours) 3.14 4.14 3.64
MusRec V Injection (ours) 3.14 4.19 3.67
MusRec K Injection (ours) 2.95 3.19 3.07

V Experiments

V-A Datasets

For our experiments, we curated two small yet high-quality datasets, each comprising 40 music clips collected from publicly available sources on YouTube, one designed for genre transfer and the other for timbre transfer. Each audio sample was manually selected to ensure clear instrument or genre distinction and minimal background noise. The clips were resampled to a sampling rate of 16 kHz and trimmed or segmented to a uniform duration of 10 seconds.

The timbre dataset covers a diverse set of instrumental categories, including electric guitar, flute, piano, violin, and acoustic guitar. These instruments were chosen to provide a balanced range of harmonic, percussive, and timbral characteristics, enabling a comprehensive evaluation of the model’s editing and timbre transfer capabilities. The genre dataset, on the other hand, encompasses a variety of musical styles, including pop, jazz, rock, and hip-hop, to assess the model’s effectiveness in capturing and transferring stylistic attributes across distinct genre domains.

V-B Baselines

To validate the effectiveness of our method, we compare against four strong prior models widely used for text-to-music tasks.

  • AudioLDM2 : AudioLDM2 [audioldm2] is a latent diffusion model for text-to-audio (including music and sound effects), which conditions on text embeddings produced by CLAP and Flan-T5 and uses a U-Net-style architecture with cross-attention conditioning. For editing, we follow an SDEdit-style strategy [meng2021sdedit]: we partially apply the forward diffusion (i.e., adding noise) to the input audio up to a timestep teditt_{\text{edit}}, where tedit<Tt_{\text{edit}}<T represents the noise level from which the reverse denoising process is initiated. The model then performs the reverse diffusion conditioned on the editing prompt to generate the edited audio.

  • MusicGen: As a contrasting baseline, we employ MusicGen [musicgenstem]. MusicGen is a Transformer-based text-to-audio model that generates discrete audio tokens rather than diffusion in latent space. In particular, we use the MusicGen-Melody (1.5B) variant, which allows conditioning on melody via a chromagram proxy. In our setup, we feed the edit prompt as text (yy) and condition on the chromagram of the source audio (xx), letting MusicGen generate the modified audio x~\tilde{x} under this combined conditioning.

  • ZETA (DDPM Inversion): Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion [zeta] introduces two complementary modes: ZETA and ZEUS. ZETA performs text-guided editing by inverting the diffusion process for an input audio xx and steering the denoising trajectory using a textual prompt yy. We include ZETA as a baseline to evaluate text-based audio editing performance under the DDPM inversion framework.

  • FluxMusic: FluxMusic [fluxplaysmusic] is a rectified flow transformer model for text-to-music generation. In our setup, we use the RF-solver [tamingrectifiedflow] to invert the input audio xx into its latent representation and then generate the edited music x~\tilde{x} under text conditioning yy. This approach enables semantically guided audio editing while maintaining the structural coherence of the original piece.

In addition, we considered several recent models, including MusicMagus [musicmagus] and TransPlayer [transplayer], but did not include them in our direct comparison. MusicMagus showed limited performance on our editing tasks due to its limitations on real music data and was therefore not included in the quantitative results table, while TransPlayer supports only limited edit tasks, which diverges from the objective of our study. We also examined MelodyFlow [melodyflow] and SteerMusic [steermusic]; however, at the time of writing this paper, neither model had publicly available code or checkpoints, preventing a fair evaluation. Finally, the Audio Prompt Adapter [adapter] was excluded, as its checkpoints were recently removed.

V-C Objective Metrics

We evaluate our method using two complementary objective metrics that assess both transferability and fidelity:

  • CLAP Similarity: CLAP [clap] evaluates the semantic alignment between audio and text by mapping both modalities into a shared embedding space through contrastive learning. The cosine similarity between the CLAP embeddings of x~\tilde{x} and the conditioning text yy measures how well the generated audio reflects the intended semantic meaning.

  • Chroma Similarity: To evaluate the fidelity of the generated audio, we compute the chroma similarity between the original audio xx and the edited audio x~\tilde{x}. This metric captures harmonic and rhythmic correspondence by comparing their chromagrams, extracted using the Constant-Q Transform (CQT) chroma method implemented in librosa [mcfee2015librosa]. Framewise cosine similarity between the chroma features provides a quantitative measure of how well the edited sample preserves the musical structure of the source.

  • CQT-1 PCC: The Constant-Q Transform (CQT) [brown1991cqt] represents audio on a logarithmic frequency scale, reflecting human pitch perception. We compute the Pearson Correlation Coefficient (PCC) between the CQT magnitude spectra of the original audio xx and the edited audio x~\tilde{x}. Higher values indicate stronger spectral correspondence, suggesting that harmonic and timbral structures are well preserved.

  • Fréchet Audio Distance (FAD): To assess the perceptual quality and distributional similarity of the generated audio, we compute the Fréchet Audio Distance (FAD) between the real and generated samples. Analogous to the Fréchet Inception Distance (FID) used in image generation, FAD measures the distance between two multivariate Gaussian distributions fitted to embeddings extracted from a pretrained audio feature extractor (VGGish). These embeddings capture high-level perceptual attributes such as timbre, texture, and overall audio quality. A lower FAD score indicates that the generated audio is closer in distribution to real audio, reflecting higher perceptual realism and better generative performance.

V-D Subjective Metrics

To complement the objective evaluations, we conduct a subjective assessment of the perceptual and semantic quality of the generated music. Following the ITU-T recommendations for subjective evaluation of multimedia content [itut1999video, itut1996audio], we employ two Mean Opinion Score (MOS) metrics: MOS-T and MOS-P.

MOS-T measures the perceived alignment between the generated audio and its corresponding target prompt. Participants rate, on a 5-point Likert scale, how well the musical content reflects the semantics, emotion, and style expressed in the text prompt.

MOS-P evaluates how well the edited audio x~\tilde{x} preserves the perceptual characteristics of the source audio xx, including timbre, rhythm, and overall musical structure. Higher MOS-P values indicate that the edited output maintains greater perceptual similarity to the original recording while integrating the intended edits naturally.

VI Results and Discussion

VI-A Hyperparameter Choice

We discovered that several hyperparameters significantly affect the quality of edited music. To systematically study their impact, we conducted experiments analyzing the effects of different parameters. In our setup, five hyperparameters can be tuned depending on the task and the source audio: the number of diffusion steps, the target classifier-free guidance scale, the source classifier-free guidance scale, the number of injection steps, and the injection block count (IB count).

Since jointly optimizing all five parameters would result in an impractically large search space, we focused our detailed analysis on the injection steps and IB count, while determining suitable values for target CFG, source CFG and number of steps empirically. We observed that target CFG values in the range of 15–25 generally yield the best performance across tasks, while a value of 1 for the source CFG provides stable results. Similarly, we set the number of diffusion steps to 25. Although increasing the number of steps improves performance, we chose 25 to maintain a balance between quality and computational efficiency. While the base model, FluxMusic, generates music using a default of 50 diffusion timesteps, we reduced this number to 25 to accelerate the generation process and lower computational cost.

Injection steps determine at which diffusion steps the model injects the attention mechanism information derived from the corresponding inversion steps (as illustrated in Figure 2). Increasing the number of injection steps enhances fidelity but reduces transferability. For instance, setting the injection step too high causes the model to preserve excessive acoustic details from the input, leading to edited outputs that sound similar to the original audio but may not accurately reflect the editing command.

IB count specifies after which single block within each injection step the attention injection occurs. For example, if there are nn single blocks and the IB count is mm (where m<nm<n), the injection happens after the mm-th block. As shown in Figure 2, increasing the IB count improves transferability but decreases fidelity, indicating a trade-off between these two factors.

VI-B Objective Results

We conduct an objective evaluation to quantitatively assess the performance of the proposed MusRec model on both timbre and genre transfer tasks. The evaluation relies on four key metrics: CLAP similarity, which measures semantic alignment between the generated and target audio; Chroma similarity, which reflects harmonic fidelity to the source; CQT-1 PCC, which captures spectral correlation and timbral preservation between the source and generated audio; and Fréchet Audio Distance (FAD), which estimates perceptual realism by comparing feature distributions of generated and real audio.

Table I presents the results for the timbre transfer task. Among all models, MusRec K Injection attains the highest CLAP similarity score, demonstrating the strongest semantic alignment with the conditioning prompt, followed closely by MusRec KV Injection. In terms of harmonic and spectral fidelity, as measured by Chroma similarity and CQT-1 PCC, MusRec KV Injection achieves the best performance, with MusRec V Injection ranking second in both metrics. When considering the average of Chroma and CLAP similarity, MusRec KV Injection again provides the most balanced outcome, indicating effective integration of semantic and acoustic cues. Regarding perceptual realism, assessed via FAD, AudioLDM2 yields the lowest score, while MusRec KV Injection ranks second, confirming that it maintains high perceptual quality while preserving fidelity to the source.

Table II presents the results for the genre transfer task. The overall trends are consistent with the timbre transfer evaluation. In terms of CLAP similarity, AudioLDM2 achieves the highest score, followed by MusRec K Injection, indicating strong semantic alignment with the target genre. For harmonic and spectral measures—Chroma similarity and CQT-1 PCC—MusRec V Injection performs best, with MusRec KV Injection ranking second, reflecting superior preservation of tonal and timbral characteristics. When averaging CLAP and Chroma scores, MusRec KV Injection attains the best overall balance between semantic consistency and harmonic fidelity, closely followed by MusRec V Injection. Regarding perceptual realism, as measured by FAD, MusRec KV Injection achieves the lowest score, with MusRec V Injection in second place, demonstrating high perceptual quality and effective genre adaptation.

Overall, the objective evaluation demonstrates that incorporating both key and value attention injections provides the most balanced performance across timbre and genre transfer tasks. The MusRec KV Injection variant consistently achieves strong trade-offs between semantic alignment, harmonic fidelity, and perceptual realism, while the V Injection configuration excels in preserving tonal characteristics and producing perceptually coherent outputs. In contrast, the K Injection variant favors semantic transfer, achieving higher CLAP alignment but with a modest reduction in spectral fidelity.

VI-C Subjective Results

To evaluate the perceptual quality of the generated audio, we conducted an online subjective listening test using Google Forms with 21 participants, comprising 11 professional musicians and 10 ordinary listeners without formal musical training. To ensure reliable subjective evaluation, participants were recruited voluntarily online; the slight imbalance between professional and ordinary listeners (11 vs. 10) does not affect the overall analysis, as results were averaged separately across both groups. Each participant was randomly assigned one sample for genre transfer and one for timbre transfer. For each sample, they provided two ratings on a five-point Likert scale: the Mean Opinion Score for Timbre (MOS-T), reflecting the naturalness and timbral realism of the output, and the Mean Opinion Score for Perceptual Quality (MOS-P), indicating the overall perceptual quality of the transferred audio. The summarized results are shown in Tables III and IV, while the detailed breakdown by professional and ordinary listeners is provided in the Appendix. These subjective evaluations closely follow the same trends observed in the objective metrics, further confirming the consistency and reliability of the proposed framework.

Table III presents the results for the timbre transfer task. Among all models, MusRec KV Injection achieves the highest overall perceptual and timbral quality, demonstrating excellent preservation of tonal attributes and structural coherence. MusRec V Injection also performs strongly, producing smooth and natural-sounding edits with consistent fidelity to the input recording. In contrast, MusRec K Injection prioritizes prompt adherence but with slightly reduced perceptual naturalness. All three MusRec variants outperform the baseline models, highlighting the advantage of the proposed MusRec models in generating perceptually convincing timbre transformations.

Table IV reports the subjective evaluation for the genre transfer task. Similar to the timbre transfer results, MusRec V Injection delivers the most perceptually coherent and musically natural outputs, while MusRec KV Injection achieves a strong balance between genre adaptation and fidelity to the original material. The MusRec K Injection variant again emphasizes semantic adherence at a modest cost in perceptual realism. Baseline models show comparatively weaker performance, particularly in perceptual quality, reflecting limited generalization to recordings.

Overall, the subjective results reinforce the findings of the objective evaluation: integrating both key and value conditioning leads to a balanced trade-off between text alignment and perceptual quality, while value-only conditioning excels in producing smooth and natural musical outputs. These findings validate the effectiveness of the proposed approach in achieving high-quality, zero-shot text-driven music editing on real-world audio.

VII Conclusion

In conclusion, this work presents a novel zero-shot music editing framework based on rectified flow modeling. The proposed method effectively edits the source music toward a target text prompt while preserving essential musical attributes such as timbre, melody, rhythm, and overall structural coherence. To the best of our knowledge, this is the first zero-shot music editing approach built upon rectified flow, capable of operating directly on real-world music recordings.

Although FluxMusic, the underlying base model, exhibits limited capability in faithfully following textual prompts and producing high-fidelity outputs, our results demonstrate that the proposed editing mechanism substantially improves controllability and consistency. We believe that applying this framework to future rectified-flow-based music generation models with stronger priors and higher audio quality could further enhance its performance and generalization in real-world editing scenarios.

Appendix A Appendix

A-A Additional Results on Attention Injection Variants

Refer to caption
Figure 3: Results of injecting the key (KK) components of the attention mechanism during timbre transfer task. Injecting KK leads to moderate improvements in transferability but slightly weaker fidelity compared to VV-injection, as less low-level acoustic information is preserved.
Refer to caption
Figure 4: Results of injecting both key and value (K+VK+V) components of the attention mechanism during timbre transfer task. Injecting K+VK+V tends to balance fidelity and transferability, yielding more consistent timbre adaptation while retaining semantic control.
Refer to caption
Figure 5: Results of injecting the value (V) components of the attention mechanism during genre transfer. Injecting V mainly preserves fidelity while limiting the degree of stylistic transfer, showing more stable tonal similarity across genres.
Refer to caption
Figure 6: Results of injecting the key (K) components of the attention mechanism during genre transfer. Injecting only K emphasizes structural transferability but can reduce chroma fidelity, indicating that genre cues dominate over tonal preservation.
Refer to caption
Figure 7: Results of injecting both key and value (K + V) components of the attention mechanism during genre transfer. Injecting K + V achieves a better balance between fidelity and transferability, enabling effective genre transformation while maintaining harmonic consistency.

To complement the main results presented in Figure 2, we further investigate the effects of injecting the key (K) components and the combined key and value (K + V) components of the attention mechanism during the generation process. In addition to the timbre transfer experiments, we also include the results of injecting value (V), key (K), and key + value (K + V) components for the genre transfer task, as shown here. These extended analyses demonstrate that the choice of injected components directly influences the balance between fidelity and transferability. Specifically, V-injection preserves detailed timbral and genre-specific characteristics, K-injection promotes stronger adherence to the conditioning or editing command, and K + V-injection offers a balanced compromise between the two, achieving consistent transformations while maintaining perceptual and structural coherence.

A-B Full Subjective Results

TABLE V: The full subjective evaluation results on the timbre transfer.
MOS-T mean \uparrow MOS-P mean \uparrow
Model Overall Professional Musicians Ordinary Listeners Overall Professional Musicians Ordinary Listeners
MusicGen 3.33 3.55 3.10 2.62 2.55 2.70
AudioLDM2 3.10 3.64 2.50 3.33 3.36 3.30
Zeta 3.57 3.55 3.60 3.57 3.55 3.60
FluxMusic 3.43 3.45 3.40 3.71 3.82 3.60
MusRec KV Injection (ours) 4.05 4.27 3.80 4.14 4.27 4.00
MusRec V Injection (ours) 3.90 3.82 4.00 4.24 4.27 4.20
MusRec K Injection (ours) 3.43 3.36 3.50 3.05 2.91 3.20
TABLE VI: The full subjective evaluation results on the genre transfer.
MOS-T mean \uparrow MOS-P mean \uparrow
Model Overall Professional Musicians Ordinary Listeners Overall Professional Musicians Ordinary Listeners
MusicGen 2.67 2.91 2.40 2.57 2.45 2.70
AudioLDM2 3.14 3.27 3.00 1.86 1.64 2.10
Zeta 2.71 2.45 3.00 3.76 4.18 3.30
FluxMusic 2.95 2.73 3.20 3.62 3.73 3.50
MusRec KV Injection (ours) 3.14 2.91 3.40 4.14 4.73 3.50
MusRec V Injection (ours) 3.14 2.91 3.40 4.19 4.55 3.80
MusRec K Injection (ours) 2.95 2.73 3.20 3.19 3.27 3.10

Tables V and VI present the full subjective evaluation results, including separate scores from professional musicians and ordinary listeners. The results reveal clear but complementary differences in perception between the two groups.

Professional musicians generally assigned higher MOS-T and MOS-P ratings for models that preserved timbral detail and musical structure, showing greater sensitivity to subtle artifacts or tonal imbalances. They consistently preferred MusRec KV Injection, which provided the most faithful timbral transfer in both tasks, and rated MusRec V Injection highly for its perceptual smoothness and realism. Among the baseline systems, FluxMusic and Zeta received comparatively better scores from professionals, reflecting their stronger structural consistency, while AudioLDM2 achieved a slightly higher MOS-T in genre transfer, suggesting that its tonal balance was appreciated despite its lower perceptual realism. In contrast, MusicGen was rated the lowest due to audible artifacts and weaker genre adherence.

Ordinary listeners, on the other hand, tended to favor models that maintained overall musical coherence and recognizable style, even when minor distortions were present. For this group, MusRec V Injection often received the highest perceptual ratings, as its outputs were smoother and easier to follow, while MusRec KV Injection ranked slightly lower but remained among the top performers. Ordinary listeners also rated FluxMusic and Zeta relatively well among the baselines, likely because these models produced sonically appealing and stylistically stable results, whereas AudioLDM2 and MusicGen were perceived as less consistent.

Across both listener groups, MusRec K Injection was perceived as more prompt-aligned but slightly less natural, and the baseline models were consistently rated lower, particularly by professionals. Importantly, the relative ranking of models remained consistent across both groups, confirming that the improvements achieved by the proposed MusRec variants are perceptually robust across varying levels of musical expertise.