FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Haosen Yang1,2  Adrian Bulat1  Isma Hadji1  Hai X. Pham1  Xiatian Zhu2
Georgios Tzimiropoulos1,3  Brais Martinez1

1Samsung AI Center, Cambridge, UK  2University of Surrey, UK  3Queen Mary University, UK
This work was conducted while Haosen Yang was an intern at Samsung AI Center, Cambridge, UK.
Abstract

Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.

1 Introduction

Figure 1 (panels a-d): Comparisons of 3× (3072 × 3072) image generation based on SDXL [19].

Diffusion models [22] demonstrate impressive generative power across a range of applications [29, 23, 33, 18, 20, 6, 30]. While powerful, one known shortcoming of diffusion models is their inability to seamlessly scale to higher resolutions beyond the one used during training. It is known that directly generating images at resolutions beyond the training resolution results in severe object repetition and unrealistic local patterns [1, 7, 3]. This is illustrated in Figure 1(a). While retraining diffusion models on higher-resolution images is a straightforward solution, the computational demands quickly become prohibitive. This restricts applications requiring flexible or high-resolution image generation, e.g. 4K. Therefore, adapting pre-trained diffusion models to generate high-resolution images without additional training is a topic of high interest that we tackle in this work.

Prior efforts addressing this important problem can be largely categorized into two tracks. The first set of approaches, e.g. [3, 15], propose mechanisms that improve the global structure consistency by steering the high-resolution generation using the image generated at native (i.e. training) resolution. However, the effectiveness of such mechanisms is mixed, with lingering issues like poor detail quality, inconsistent local textures, and even persisting pattern repetitions as shown in Fig. 1(b). Furthermore, these works are typically patch-based, generating one patch at a time. Concretely, this means that these methods resort to redundant and overlapping forward passes, leading to large latency overheads. The second group of approaches, e.g. [7, 11, 34], eschews patch-based generation in favor of a one-pass approach by directly altering the model architecture. This leads to faster generation, but unfortunately, it comes at the cost of image quality, as shown in Fig. 1(c).

To address the aforementioned limitations, we propose a straightforward yet effective approach that takes the best of both worlds. Our method follows the single pass generation strategy for improved latency but, like patch-based approaches, leverages the native resolution generation to steer the high-resolution one. Specifically, our method starts by generating an image at native resolution conditioned on the input text prompt. We then resort to a test-time diffuse-denoise strategy [27, 3, 8], where the high-resolution denoising stage is guided by the native resolution diffusion process. However, instead of blindly steering the high-res image toward the low-res one as done elsewhere [3, 15], we propose a Frequency Modulation (FM) module. In particular, we leverage the Fourier domain to selectively condition low-frequency components during the high-resolution image generation stage, while providing full control over high-frequency components to the denoising process.

While the FM module resolves artifacts related to global consistency, artifacts related to inconsistent local texture might still be present, i.e. finer texture generated on semantically related parts of the image might be inconsistent. To tackle this second issue, largely ignored in the literature, we propose an Attention Modulation (AM) mechanism that leverages attention maps from the denoising process at native resolution to condition the attention maps of the denoising process at high resolution. Since attention maps at native resolution encode which regions of the image are semantically related, they regularize the high-res denoising towards consistent finer texture generation. Our method, coined Frequency and Attention Modulated diffusion (FAM diffusion), combines our FM and AM modules to yield superior quality results, see Fig. 1 (d).

Our method seamlessly integrates with any latent diffusion model without additional training or architectural changes. We empirically show that our method significantly enhances the quality and efficiency of high-resolution image generation, establishing a new state-of-the-art.

2 Related Work

Diffusion models have shown impressive performance in generating creative and accurate representations given text prompts [10, 22]. While early work [22] was limited to generating relatively low-resolution images (i.e. 256×256), follow-up work showed that their performance can scale to higher resolutions, e.g. 512×512 with SD1.5 [22] and 1024×1024 with SDXL [19]. However, a major shortcoming with all these models is that generation remains limited by the resolution used at training time. Naively targeting higher train-time resolutions quickly results in prohibitive training costs and computational requirements, and the limited availability of high-resolution training data also restricts the diversity of image generation. Thus, adapting pre-trained diffusion models to generate high-resolution images without retraining has emerged as a topic of interest.

Early works [1, 14] proposed using overlapping patches at native resolution and blending the outputs to produce an image without seams. However, this leads to frequent repetitions and inconsistent global image structure. Therefore, subsequent works introduced various mechanisms to encourage global structural consistency. For instance, DemoFusion [3] proposed a patch-based generation process with mechanisms such as skip residuals and progressive upsampling, while AccDiffusion [15] used localized prompting to guide high-resolution generation and improve consistency with images generated at native resolutions. However, these methods still suffer from issues like local repetitions, and inconsistent global coherence. They also have significant latency overheads due to the running cost of multiple backward passes. To mitigate the high latencies, other works aim to generate high-resolution images in a single pass by modifying the architecture of the UNet. For example, ScaleCrafter [7] employs dilated convolutions to adjust the receptive field of convolutions in the denoising UNet. HiDiffusion [34] introduces an alternative UNet that dynamically adjusts the feature map size during the denoising process. While these approaches achieve faster generation, they often result in image distortions.

More closely related to ours are methods that have approached structural consistency from a frequency domain perspective. FouriScale [12] splits the image in the Fourier domain, then proceeds to incorporate a low-pass filtering operation and impose structural consistency with an image generated at native resolution. However, this splitting operation results in unrealistic images. HiPrompt [16] decomposes images into spatial frequency components conditioned on local and global prompts, but it often relies on redundant operations that lead to high latencies. ResMaster [25] leverages low-frequency information from the latent representation of the native image to provide desirable global semantics during the denoising process. However, it ignores the noise distribution differences between the current high-resolution denoising step and the native image in latent space. In addition, it still relies on patch-based denoising, making it inefficient. In contrast to these methods, we propose a one-pass method that does not alter the model architecture. Importantly, our method introduces a complementary novel attention modulation mechanism, which targets local structure consistency, an issue overlooked by all existing works.

3 Method

Figure 2: Overview of the FAM diffusion. (a) We first generate an image at native resolution, followed by a test-time diffuse-denoise process. We incorporate our Frequency Modulation module and Attention Modulation during high-res denoising to control global structure and fine local texture, respectively. (b) Details of the Frequency Modulation, where we use the Fourier domain to selectively condition low-frequency components during high-res denoising while leaving high-frequency components fully controllable. (c) Details of Attention Modulation, where attention maps from the native image denoising are used to correct the high-res denoising.

Figure 3: Ablation on the components of FAM diffusion. (a) Direct Inference (DI) at high resolution from noise, (b) Direct Inference from low-res latent (DI*), (c) Skip Residual (SR) from DemoFusion [3], (d) Frequency Modulation (FM), (e) Attention Modulation (AM).

In this work, we leverage pretrained latent diffusion models (LDMs), which have been extensively trained on large-scale high-quality data. Our goal is to generate images at higher resolutions than during training, without any additional finetuning or model modification. Sec. 3.1 briefly reviews the diffusion notation and the test-time diffuse-denoise strategy. In Sec. 3.2 we present our Frequency Modulated (FM) denoising approach, which is designed to improve global consistency. Finally, we introduce our Attention Modulation (AM) mechanism, which is designed to improve the consistency of the local texture and high-frequency detail, in Sec. 3.3. We provide an overview of our method in Figure 2.

3.1 Preliminaries

Latent Diffusion Models (LDM) [22]: We operate in the realm of LDMs, which first convert an image $\mathbf{x}_0$ to a latent representation $\mathbf{z}_0$ using an encoder $\mathcal{E}$ such that $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_0)$, $\mathbf{z}_0 \in \mathbb{R}^{c \times h \times w}$. During training, a Markovian diffusion process progressively adds noise to the input latent $\mathbf{z}_0$ according to a predefined schedule $\beta_t$, $t \in [1, T]$, by sampling sequentially from:

$$q(\mathbf{z}_t | \mathbf{z}_{t-1}) := \mathcal{N}\left(\mathbf{z}_t | \sqrt{1-\beta_t}\,\mathbf{z}_{t-1}, \beta_t \mathbf{I}\right) \tag{1}$$

Conversely, a trainable denoising process progressively recovers the original latent $\mathbf{z}_0$ using a noise estimator $\mathcal{Z}_\theta = (\mu_\theta, \Sigma_\theta)$ parametrized by $\theta$ by sampling from:

$$p_\theta(\mathbf{z}_{t-1} | \mathbf{z}_t) := \mathcal{N}\left(\mathbf{z}_{t-1} | \mu_\theta(\mathbf{z}_t, t), \Sigma_\theta(\mathbf{z}_t, t)\right) \tag{2}$$

During inference, an image is generated by denoising from random noise, $\mathbf{z}_T \sim \mathcal{N}(0, \mathbf{I}) \in \mathbb{R}^{c \times h \times w}$, through sequential calls to $\mathcal{Z}_\theta$. The quality of the generated image improves with the number of steps, and the process finally yields the latent representation $\mathbf{z}_0^n \in \mathbb{R}^{c \times h \times w}$, where we introduce the superscript $n$ to indicate generation at native resolution $h \times w$ (i.e. same as training resolution).

Inference-time diffuse-denoise: Our goal is to use the pretrained parametric denoiser $\mathcal{Z}_\theta$, without further finetuning, to generate $\mathbf{z}^m_0 \in \mathbb{R}^{c \times sh \times sw}$ at a higher resolution $m$, $m = sh \times sw$, where $s$ is the target resolution scaling factor. The naive approach is to directly start from random noise at the target resolution, $\mathbf{z}^m_T \sim \mathcal{N}(0, \mathbf{I}) \in \mathbb{R}^{c \times sh \times sw}$. However, this has been repeatedly shown to lead to suboptimal results, with frequent artifacts and object duplication [3, 7, 34]. This is illustrated in Fig. 3(a).

Instead, prior works proposed a test-time diffuse-denoise process [27, 3, 8]. The idea is to start from the output of the denoising process at native resolution, $\mathbf{z}^n_0$, rather than noise, which is then upsampled to the target resolution $m$ to obtain $\tilde{\mathbf{z}}^m_0 = \mathcal{U}(\mathbf{z}^n_0, s)$, where $\mathcal{U}$ denotes an upsampling function. Next, $T$ forward diffusion steps progressively add noise to the latents $\tilde{\mathbf{z}}^m_{t=1 \ldots T}$. Finally, the backward process denoises from $\tilde{\mathbf{z}}^m_T$ to yield the final output $\mathbf{z}^m_0$. Note that we use $\tilde{\mathbf{z}}$ and $\mathbf{z}$ to refer to the latents generated during diffusion and denoising respectively.
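The diffuse stage can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: nearest-neighbour repetition stands in for the upsampling function $\mathcal{U}$, the linear $\beta_t$ schedule and its endpoint values are assumptions, and the names `upsample_latent` and `diffuse` are hypothetical. Each step samples from Eq. 1, and all intermediate diffused latents are kept because the guided denoising stage needs them.

```python
import numpy as np


def upsample_latent(z_native, s):
    """Upsample a (c, h, w) latent to (c, s*h, s*w).

    Nearest-neighbour repetition is an assumption standing in for U.
    """
    return np.repeat(np.repeat(z_native, s, axis=-2), s, axis=-1)


def diffuse(z0_up, betas, rng):
    """Apply Eq. 1 step by step and return the list [z~_1, ..., z~_T]."""
    latents = []
    z = z0_up
    for beta_t in betas:
        noise = rng.standard_normal(z.shape)
        # z_t ~ N(sqrt(1 - beta_t) * z_{t-1}, beta_t * I)
        z = np.sqrt(1.0 - beta_t) * z + np.sqrt(beta_t) * noise
        latents.append(z)
    return latents


rng = np.random.default_rng(0)
z0_native = rng.standard_normal((4, 8, 8))   # toy native-resolution latent
betas = np.linspace(1e-4, 2e-2, 50)          # hypothetical linear schedule
z_tilde = diffuse(upsample_latent(z0_native, s=2), betas, rng)
```

Storing every $\tilde{\mathbf{z}}^m_t$ costs memory proportional to $T$, but it lets the denoising stage look up the diffused latent that matches its current timestep.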

While a standard denoising process as in Eq. 2 could be used, it often leads to inconsistent global structures, as shown in Fig. 3(b). Instead, the denoising process from Eq. 2 is now defined as:

$$p_\theta\left(\mathbf{z}^m_{t-1} | f_t(\tilde{\mathbf{z}}^m_t, \mathbf{z}^m_t)\right) \tag{3}$$

where $f_t(\cdot)$ is tasked with steering the denoising process and improving the consistency between the high-res and low-res images. Previous works [3, 15] define $f_t(\cdot)$ as a simple weighted linear combination of $\tilde{\mathbf{z}}^m_t$ and $\mathbf{z}^m_t$ and coin the mechanism skip residual. We show in Fig. 3(c) that this yields suboptimal results. In contrast, we propose a Frequency Modulated approach to defining $f_t(\cdot)$.
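To make the role of $f_t(\cdot)$ in Eq. 3 concrete, the loop below sketches guided denoising with the steering function and the denoiser passed in as callables. Everything here is illustrative: `guided_denoise`, `skip_residual`, and the fixed weight `w` are hypothetical names and choices, and the toy denoiser merely stands in for a call to $\mathcal{Z}_\theta$.

```python
import numpy as np


def skip_residual(w):
    """Steering function f_t as a weighted linear combination.

    A fixed weight w is an assumption; prior work may vary it with t.
    """
    def steer(t, z_diffused_t, z_t):
        return w * z_diffused_t + (1.0 - w) * z_t
    return steer


def guided_denoise(z_diffused, steer, denoise_step):
    """Run Eq. 3 from t = T down to t = 1.

    z_diffused:   list [z~_1, ..., z~_T] stored during forward diffusion
    steer:        callable (t, z~_t, z_t) -> f_t(z~_t, z_t)
    denoise_step: callable (t, z) -> sample of z_{t-1} (stand-in for Z_theta)
    """
    num_steps = len(z_diffused)
    z = z_diffused[-1]                       # start from z~_T
    for t in range(num_steps, 0, -1):
        z = denoise_step(t, steer(t, z_diffused[t - 1], z))
    return z


# Toy usage: three constant "latents" and a denoiser that halves its input.
toy_latents = [np.full((1, 2, 2), float(t)) for t in (1, 2, 3)]
z0 = guided_denoise(toy_latents, skip_residual(w=0.5), lambda t, z: 0.5 * z)
```

Because the loop only sees `steer` as a callable, a different definition of $f_t(\cdot)$ can be swapped in without touching the denoiser.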

3.2 Frequency-Modulated Denoising

The conditioning of the denoising steps through the skip residual has been shown to improve consistency between low and high-resolution images. We however observe that it lacks control over the information transferred. More specifically, the goal of the test-time diffuse-denoise process is to take the upsampled low-resolution image and to produce an output that 1) preserves the global structure, and 2) improves the texture and high-frequency details. The skip residual mechanism however steers the output towards the input indiscriminately, which serves the first objective but can negatively impact the second. It would be desirable to instead harness the global structure information from the diffused latents of the forward process, while allowing the denoising process to handle the generation of details. To this end, we appeal to the frequency domain, where global structure and finer details are captured by low- and high-frequency components, respectively [17, 28, 31], and re-define the function $f_t(\cdot)$, which controls information transfer from the forward diffusion into the denoising process, accordingly.

Let $\mathcal{K}(t)$ be a high-pass filter for timestep $t$. The function $f_t(\cdot)$ in Eq. 3 is then defined as follows:

$$f_t(\tilde{\mathbf{z}}_t^m, \mathbf{z}_t^m) = \mathrm{IDFT}_{2D}\Big(\mathcal{K}(t) \odot \mathrm{DFT}_{2D}(\mathbf{z}_t^m) + (1 - \mathcal{K}(t)) \odot \mathrm{DFT}_{2D}(\tilde{\mathbf{z}}_t^m)\Big), \tag{4}$$

where $\odot$ denotes the Hadamard product. Essentially, the high-frequency coefficients of the denoised latent $\mathbf{z}_t^m$ are combined with the low-frequency coefficients of the diffused latent $\tilde{\mathbf{z}}_t^m$, modulated by the filter $\mathcal{K}(t)$. Eq. 4 can be further reformulated in the spatial domain as below:

$$f_t(\tilde{\mathbf{z}}_t^m, \mathbf{z}_t^m) = \mathbf{z}^m_t + \kappa(t) \circledast \big(\tilde{\mathbf{z}}_t^m - \mathbf{z}^m_t\big), \tag{5}$$

where $\kappa(t) = \mathrm{IDFT}_{2D}(1 - \mathcal{K}(t)) \in \mathbb{R}^{sh \times sw}$ is a convolutional kernel, and $\circledast$ denotes the circular convolution operator. Eq. 5 shows that the frequency modulation adds a low-frequency update to the denoised latent $\mathbf{z}^m_t$ directed towards the diffused latent $\tilde{\mathbf{z}}_t^m$, thereby preserving the global structural information from the upsampled latent. Furthermore, the circular convolution with $\kappa(t)$ in Eq. 5 can be interpreted as an additional (non-learnable) convolutional layer of the UNet, effectively providing it with a global receptive field and helping generate consistent structure without modifying the UNet architecture [34, 11] or using dilated sampling [3]. The result of our FM approach is shown in Fig. 3(d). In comparison, the skip residual approach of DemoFusion, shown in Fig. 3(c), produces inconsistencies like a missing left nostril and unnaturally small eyes.
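The equivalence of Eqs. 4 and 5 can be checked numerically with a short NumPy sketch. The exact form of $\mathcal{K}(t)$ is not given in this section, so an ideal radial high-pass mask with a hypothetical `cutoff` parameter is assumed here; `high_pass_mask`, `fm_fourier`, and `fm_spatial` are illustrative names and not the authors' code.

```python
import numpy as np


def high_pass_mask(height, width, cutoff):
    """Ideal radial high-pass mask: 0 at or below cutoff, 1 above (assumed form)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return (np.sqrt(fy ** 2 + fx ** 2) > cutoff).astype(np.float64)


def fm_fourier(z_diffused, z_denoised, mask):
    """Eq. 4: high frequencies from z_denoised, low frequencies from z_diffused."""
    spectrum = (mask * np.fft.fft2(z_denoised)
                + (1.0 - mask) * np.fft.fft2(z_diffused))
    return np.fft.ifft2(spectrum).real


def fm_spatial(z_diffused, z_denoised, mask):
    """Eq. 5: z + kappa (circular conv) (z~ - z), evaluated by brute force."""
    kappa = np.fft.ifft2(1.0 - mask).real
    diff = z_diffused - z_denoised
    update = np.zeros_like(diff)
    for dy in range(kappa.shape[0]):
        for dx in range(kappa.shape[1]):
            # np.roll shifts by (dy, dx) with wrap-around, i.e. circularly.
            update += kappa[dy, dx] * np.roll(diff, (dy, dx), axis=(-2, -1))
    return z_denoised + update


rng = np.random.default_rng(0)
z_diffused = rng.standard_normal((4, 16, 16))    # toy stand-in for z~_t
z_denoised = rng.standard_normal((4, 16, 16))    # toy stand-in for z_t
mask = high_pass_mask(16, 16, cutoff=0.15)       # hypothetical cutoff
same = np.allclose(fm_fourier(z_diffused, z_denoised, mask),
                   fm_spatial(z_diffused, z_denoised, mask))
```

Setting the mask to a constant $1 - w$ at every frequency collapses Eq. 4 to $w\,\tilde{\mathbf{z}}_t^m + (1 - w)\,\mathbf{z}_t^m$, a plain weighted linear combination of the two latents. The frequency-dependent mask is therefore what separates FM from indiscriminate blending.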

3.3 Attention Modulation

While the FM module successfully maintains global structure and solves the issue of object duplication as shown in Fig. 3(d), we note that local structures can be inconsistently generated due to the discrepancy between training-time native resolution and the target inference-time high resolutions. For example, the top image in Fig. 3(d) shows a distorted mouth compared to the one at native resolution. Similarly, in the bottom example, fur texture is incorrectly generated on the shirt collar. That is, the high-frequency detail generated on the shirt collar is semantically related to one generated on the fox’s face and not to the other parts of the shirt. We hypothesize this stems from incorrect attention maps during the high-res denoising stage. This motivates us to propose our Attention Modulation (AM) approach. We take inspiration from attention swapping, a recent method to combine information from two diffusion processes in a more localized manner [13, 5, 4], and extend the idea to transfer local structural information from the denoising process at native resolution to the one at target resolution.

In particular, the attention of an input tensor $\mathbf{z}$ is computed by first projecting it linearly into a triplet of queries, keys, and values, $(Q, K, V)$, respectively. The self-attention is then computed as:

$$\mathrm{Att}(\mathbf{z}) = \mathrm{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d}}\right) V = M \cdot V \tag{6}$$

where $d$ indicates the feature dimensionality, and we refer to $M$ as the attention matrix.

In our case, we modify the self-attention at specific layers of the UNet of the high-resolution denoising process to incorporate information from the attention maps of the native resolution as:

$$\bar{M}^m = \lambda \cdot \mathcal{U}(M^n, s) + (1 - \lambda) \cdot M^m \tag{7}$$

where $M^n$ and $M^m$ are the attention matrices at native and target resolution respectively, $\lambda$ is a hyperparameter, and $\mathcal{U}$ is an $s$-times upsampling function. The new attention matrix $\bar{M}^m$ is then used instead of $M^m$ during the high-res denoising process in Eq. 6.
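A minimal single-head sketch of Eqs. 6 and 7 follows. How $\mathcal{U}$ acts on an attention matrix is not spelled out in this section, so the sketch assumes nearest-neighbour repetition over both the query and key grids, followed by row renormalisation so that each row remains a distribution. The function names, the random toy queries and keys, and the value of `lam` are all hypothetical.

```python
import numpy as np


def attention_matrix(q, k):
    """M = softmax(Q K^T / sqrt(d)) for (tokens, d) queries and keys."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def upsample_attention(m_native, h, w, s):
    """Map an (h*w, h*w) attention matrix to (s*h*s*w, s*h*s*w).

    Assumption: nearest-neighbour repetition over the query and key grids,
    then row renormalisation so each row still sums to 1.
    """
    m = m_native.reshape(h, w, h, w)
    for axis in range(4):
        m = np.repeat(m, s, axis=axis)
    m = m.reshape(s * h * s * w, s * h * s * w)
    return m / m.sum(axis=-1, keepdims=True)


def modulate_attention(m_native, m_high, h, w, s, lam):
    """Eq. 7: blend the upsampled native map with the high-res map."""
    return lam * upsample_attention(m_native, h, w, s) + (1.0 - lam) * m_high


rng = np.random.default_rng(0)
h, w, s, d = 4, 4, 2, 8
n_high = s * h * s * w
m_n = attention_matrix(rng.standard_normal((h * w, d)),
                       rng.standard_normal((h * w, d)))
m_m = attention_matrix(rng.standard_normal((n_high, d)),
                       rng.standard_normal((n_high, d)))
m_bar = modulate_attention(m_n, m_m, h, w, s, lam=0.5)  # lam chosen arbitrarily
```

Under this assumed upsampling, high-resolution tokens that fall in the same native-resolution cell inherit the same attention row from $M^n$, which is one way to read the claim that native attention maps regularize texture across semantically related regions.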

Applying our AM module at all layers of the UNet can lead to suboptimal performance due to over-regularization. We apply it instead only for layers in up-blocks of the UNet, as they are known to preserve layout information better [13]. Furthermore, we experimented with AM at various stages and found the highest benefit to be at up_block_0. Results shown in Fig. 3(e) demonstrate the benefit of the proposed AM module, particularly regarding better preservation of local structures such as the mouth and shirt collar, highlighted in yellow boxes.

4 Experiment

4.1 Experimental setup

To demonstrate the effectiveness of our approach, we pair it with a well-performing diffusion model like SDXL [19]. For completeness, we also pair our approach with the recent HiDiffusion [34], which replaces the attention mechanism of SDXL with windowed attention to improve the model latency. SDXL is trained at 1024×1024 resolution, which we refer to as 1×. We experiment with three unseen higher resolutions such that the model generates 2×2, 3×3, and 4×4 times more pixels than the training setup. In the supplementary, we also include results with various aspect ratios, e.g. 2×4, and also experiment with different variants of Stable Diffusion (SD), namely SD 1.5 [22] and SD 2.1 [22], which generate at 512×512 and 768×768 pixels respectively.

Evaluation set.

Following previous work [3, 7, 15, 11] we evaluate performance on a subset of the Laion-5B dataset [24]. Given the number of compared methods and significant computational demands associated with the task, we randomly sample 10K images from Laion-5B, which we use as our real image set, and we sample 1K captions, which we use as text prompts for the models.

Evaluation metrics.

Following prior work, we evaluate the quality and diversity of the generated images using Frechet Inception Distance (FID) [9] and Kernel Inception Distance (KID) [2], computed between the generated and real images. Since FID requires resizing images to 299×299, which negatively impacts the assessment, it is typical to adopt their patch-level variants [3, 34, 15, 11]. Specifically, we extract 10 random crops from each image before calculating FID and KID, referring to these metrics as $\mathrm{FID}_c$ and $\mathrm{KID}_c$. To further evaluate the semantic similarity between image features and text prompts, we report the CLIP score [21]. To measure the efficiency of each method, we compute latencies on a single A40 GPU.
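The crop-based protocol can be sketched as follows. The crop size is not stated in this section, so `crop_size` is a hypothetical parameter, and the Inception feature extractor is omitted: `frechet_distance` operates on precomputed feature means and covariances, using the standard closed form for the Fréchet distance between two Gaussians. The function names are illustrative.

```python
import numpy as np


def random_crops(image, num_crops, crop_size, rng):
    """Return num_crops random square crops from an (H, W, C) image."""
    height, width = image.shape[:2]
    crops = []
    for _ in range(num_crops):
        top = rng.integers(0, height - crop_size + 1)
        left = rng.integers(0, width - crop_size + 1)
        crops.append(image[top:top + crop_size, left:left + crop_size])
    return np.stack(crops)


def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    # Tr((cov1 cov2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov1 @ cov2, which are real and non-negative for
    # positive semi-definite inputs (up to numerical error).
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2)
                 - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))                 # toy stand-in for a generated image
crops = random_crops(image, num_crops=10, crop_size=64, rng=rng)
```

Cropping before feature extraction means local detail reaches the feature network at closer to its generated scale, instead of being averaged away when the full high-resolution image is resized.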

| Method | Scaling Factor | FID ↓ | KID ↓ | FID_c ↓ | KID_c ↓ | CLIP ↑ | Latency (mins) |
|---|---|---|---|---|---|---|---|
| DemoFusion [3] | 2×2 | 63.24 | 0.0084 | 36.75 | 0.0096 | 32.0 | 2.5 |
| AccDiffusion [15] | 2×2 | 59.42 | 0.0068 | 37.23 | 0.0105 | 31.69 | 2.6 |
| FouriScale* [12] | 2×2 | 78.54 | 0.0136 | 40.80 | 0.0130 | 29.8 | 2.3 |
| HiDiffusion [34] | 2×2 | 78.02 | 0.0136 | 51.41 | 0.0139 | 30.5 | 0.6 |
| HiDiffusion [34] + FAM diffusion | 2×2 | 69.61 | 0.0140 | 34.26 | 0.0084 | 32.32 | 0.8 |
| SDXL [19] | 2×2 | 59.47 | 0.0067 | 50.54 | 0.0136 | 30.6 | 0.8 |
| SDXL [19] + FAM diffusion | 2×2 | 58.91 | 0.0072 | 33.96 | 0.0080 | 32.35 | 1 |
| DemoFusion [3] | 3×3 | 68.82 | 0.0159 | 40.24 | 0.0122 | 32.0 | 8.6 |
| AccDiffusion [15] | 3×3 | 73.47 | 0.0210 | 43.64 | 0.014 | 31.50 | 10 |
| FouriScale* [12] | 3×3 | 73.57 | 0.0309 | 65.01 | 0.0357 | 28.54 | 6.2 |
| HiDiffusion [34] | 3×3 | 112.51 | 0.0325 | 68.84 | 0.021 | 28.43 | 1.5 |
| HiDiffusion [34] + FAM diffusion | 3×3 | 76.28 | 0.0007 | 36.70 | 0.010 | 32.26 | 1.8 |
| SDXL [19] | 3×3 | 78.41 | 0.0136 | 69.40 | 0.0210 | 28.44 | 2.2 |
| SDXL [19] + FAM diffusion | 3×3 | 69.25 | 0.0007 | 36.40 | 0.010 | 32.25 | 2.5 |
| DemoFusion [3] | 4×4 | 65.89 | 0.0087 | 48.44 | 0.0157 | 30.45 | 19.6 |
| AccDiffusion [15] | 4×4 | 73.97 | 0.0090 | 54.80 | 0.0187 | 30.15 | 20.5 |
| FouriScale* [12] | 4×4 | 105.24 | 0.0342 | 70.45 | 0.0223 | 27.86 | 14.7 |
| HiDiffusion [34] | 4×4 | 129.91 | 0.0483 | 156.98 | 0.0877 | 24.32 | 2.8 |
| HiDiffusion [34] + FAM diffusion | 4×4 | 59.05 | 0.0074 | 44.65 | 0.0134 | 32.31 | 3.1 |
| SDXL [19] | 4×4 | 160.10 | 0.0602 | 74.37 | 0.0242 | 26.70 | 5.4 |
| SDXL [19] + FAM diffusion | 4×4 | 58.91 | 0.0073 | 43.65 | 0.0130 | 32.33 | 6.1 |

Table 1: System-level comparisons with SDXL. * indicates inference with FreeU [26]

4.2 Main Results

Figure 4: Visualization of attention maps in the UNet: (a) low-resolution attention map, (b) high-resolution attention map, (c) attention map when using the AM module.

Figure 5: Qualitative comparison between Direct Upsampling, BSRGAN, and our method. The patches shown were cropped from a 4096×4096 resolution image. Zoom in for best view.

Figure 6: Qualitative comparison with other methods based on SDXL. Best viewed when zoomed in. * indicates inference with FreeU [26].

Figure 7: Comparison between Constant LF and Time-aware LF.

We select DemoFusion [3], AccDiffusion [15], FouriScale [11], and HiDiffusion [34] as representative state-of-the-art methods for high-resolution generation. As shown in Table 1, FAM diffusion achieves the best overall performance on $\text{FID}_c$, $\text{KID}_c$, and CLIP score in all cases. In the case of FID and KID, FAM diffusion provides substantial gains for larger scale factors, while producing similar results to DemoFusion at lower scale factors. However, these metrics heavily downsample high-resolution images before they are computed and thus do not capture finer details. This is a widely known issue for these metrics, as explained in Sec. 4.1. Finally, we note that our method adds only small latency overheads compared to direct inference at the target resolution, e.g. 0.2, 0.3, and 0.7 min at the 2×, 3×, and 4× scale factors respectively when combined with SDXL. In comparison, DemoFusion adds 14.2 min of latency over SDXL direct inference at the 4× scale factor. When compared to the frequency-based method FouriScale [11], FAM diffusion also shows notable improvements in both quality and latency. For instance, under 4K resolution image generation, it achieves 43.65 vs. 70.45 on $\text{FID}_c$ and 32.33 vs. 27.86 on CLIP score, while also being faster than FouriScale. Additionally, we observe that FAM diffusion can be seamlessly integrated into single-pass methods, such as HiDiffusion [34], to enhance performance while maintaining fast image generation, achieving an effective latency-quality trade-off. These results quantitatively validate the effectiveness of our method in improving the quality of image generation.
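The latency overheads quoted above follow directly from the latency column of Table 1, which is reported in minutes. The short sketch below reproduces that arithmetic; the dictionary layout and the helper name `overhead` are illustrative choices, and only the three methods needed for the comparison are included.

```python
# Latency in minutes, copied from Table 1 (single A40 GPU).
latency = {
    "2x": {"SDXL": 0.8, "SDXL + FAM": 1.0, "DemoFusion": 2.5},
    "3x": {"SDXL": 2.2, "SDXL + FAM": 2.5, "DemoFusion": 8.6},
    "4x": {"SDXL": 5.4, "SDXL + FAM": 6.1, "DemoFusion": 19.6},
}

def overhead(scale, method, baseline="SDXL"):
    """Extra minutes a method needs compared with direct inference."""
    return round(latency[scale][method] - latency[scale][baseline], 1)

for scale in latency:
    # Columns: scale factor, FAM overhead, DemoFusion overhead (minutes).
    print(scale, overhead(scale, "SDXL + FAM"), overhead(scale, "DemoFusion"))
```

At the 4× scale factor this gives 0.7 min for FAM diffusion against 14.2 min for DemoFusion, both measured relative to SDXL direct inference.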

In Figure 6, we present a comparison between DemoFusion, FouriScale, HiDiffusion, and FAM diffusion. We selected three complex textual prompts to highlight the image-generation capabilities of the model. For FouriScale, we used the default setting with FreeU [26]. As mentioned above, DemoFusion tends to generate repetitive content and artifacts with unreasonable local structures due to its patch-based generation approach (see for example the two small cat heads generated on the top-right image). FouriScale [11] and HiDiffusion [34] produce visually unappealing structures and extensive areas of irregular textures, which significantly degrade the overall visual quality. Additionally, we compare our method with the super-resolution approach BSRGAN [32], as shown in Figure 5. We observe that FAM diffusion effectively introduces or modifies high-frequency details that were not present in the original image, while preserving structural information, leading to more appealing and detailed images.

To further illustrate the generality of our approach, in the supplementary material we provide results of our approach in combination with SD1.5 and SD2.1.

4.3 Ablation Study

In this section, we conduct ablation studies and use SDXL with the 2×2 scale factor setting.

Effectiveness of the components in the FAM diffusion

We study the effect of the two components of FAM diffusion, Frequency-Modulated Denoising (FM) and Attention Modulation (AM). The results shown in Figure 3 indicate the following: (1) Both direct inference from random noise and direct inference from the diffused latent at native resolution generate outputs with structural distortions and repeated patterns. (2) While the Skip Residuals of DemoFusion help maintain the global structure of the image, they still produce artifacts and poor local patterns. (3) Compared to Skip Residuals, FM reduces undesirable local patterns by leveraging the low-frequency information of the image at native resolution, which provides better structural guidance. (4) Attention Modulation resolves inconsistencies between local patterns and global structure by utilizing the attention map from the native resolution, offering strong guidance on the semantic relationships among latent tokens. Overall, FM and AM effectively address structural distortions and local pattern inconsistencies in high-resolution images, highlighting the meaningful contributions of FAM diffusion.

Effectiveness of the time-aware formulation on the FM module

We show here the effect of the time-varying formulation of FM, as illustrated in Figure 7(a). Specifically, the FM module incorporates low-frequency information from the corresponding diffused latent at each step $t$. Alternatively, one can drop this time-varying nature and use the upsampled latent as a single static reference. However, this approach results in images that appear noticeably blurrier and lose finer details associated with high-frequency information, highlighting the importance of the dynamic nature of the FM module throughout the denoising process.

Analysis of Attention Modulation

To better understand the principles underlying the AM module, we visualize in Figure 4 the self-attention maps obtained with a token from the mouth region (marked with a star) as the query and all tokens as the keys and values. The resulting attention map computed using the low-resolution latent primarily encodes coarse information about the semantic relations among parts of the image, but lacks fine-grained contextual information across the entire face. In contrast, the attention maps at high resolution are more detailed, but fail to capture semantic relatedness, e.g. the mouth areas are not highlighted. After applying AM, the attention map effectively integrates local-global relationships with enhanced fine-grained detail. This analysis provides visual insights into how AM repairs inconsistencies in local patterns, contributing to more coherent global structures.

5 Conclusion

We introduced FAM diffusion, a training-free method for high-resolution image generation with diffusion models. To address issues of object repetition and structural distortion, we propose a Frequency Modulation strategy. By leveraging the Fourier domain, this method enhances guidance for high-resolution generation while avoiding the latency overheads associated with multi-patch approaches. Additionally, we propose an effective Attention Modulation mechanism to address inconsistent local texture patterns, a challenge largely overlooked in previous works. Extensive quantitative and qualitative evaluations highlight the effectiveness of our method. We further show that, contrary to previous works, our method incurs only marginal latency overheads.

References

  • Bar-Tal et al. [2023] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: fusing diffusion paths for controlled image generation. In International Conference on Machine Learning, 2023.
  • Bińkowski et al. [2018] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. International Conference on Learning Representations, 2018.
  • Du et al. [2024] Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. DemoFusion: Democratising high-resolution image generation with no $$$. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
  • Gu et al. [2023] Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, and Xin Eric Wang. Photoswap: Personalized subject swapping in images. Neural Information Processing Systems, 2023.
  • Gu et al. [2024] Jing Gu, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Yilin Wang, and Xin Eric Wang. SwapAnything: Enabling arbitrary object swapping in personalized image editing. European Conference on Computer Vision, 2024.
  • He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
  • He et al. [2024] Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In International Conference on Learning Representations, 2024.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Neural Information Processing Systems, 2017.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Neural Information Processing Systems, 2020.
  • Huang et al. [2024a] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European Conference on Computer Vision, 2024a.
  • Huang et al. [2024b] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. arXiv preprint arXiv:2403.12963, 2024b.
  • Jeong et al. [2024] Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, and Youngjung Uh. Visual style prompting with swapping self-attention. arXiv preprint arXiv:2402.12974, 2024.
  • Lee et al. [2023] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. SyncDiffusion: Coherent montage via synchronized joint diffusions. In Neural Information Processing Systems, 2023.
  • Lin et al. [2024] Zhihang Lin, Mingbao Lin, Zhao Meng, and Rongrong Ji. AccDiffusion: An accurate method for higher-resolution image generation. In European Conference on Computer Vision, 2024.
  • Liu et al. [2024] Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. HiPrompt: Tuning-free higher-resolution generation with hierarchical MLLM prompts. arXiv preprint arXiv:2409.02919, 2024.
  • Marr and Hildreth [1980] David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
  • Noroozi et al. [2024] Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. You only need one step: Fast super-resolution with stable diffusion via scale distillation. European Conference on Computer Vision, 2024.
  • Podell et al. [2024] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024.
  • Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Neural Information Processing Systems - Datasets and Benchmarks Track, 2022.
  • Shi et al. [2024] Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. ResMaster: Mastering high-resolution image generation via structural and fine-grained guidance. arXiv preprint arXiv:2406.16476, 2024.
  • Si et al. [2024] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion U-Net. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
  • Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • Wandell [1995] BA Wandell. Foundations of vision, 1995.
  • Wang et al. [2024] Wenqing Wang, Haosen Yang, Josef Kittler, and Xiatian Zhu. Single image, any face: Generalisable 3D face generation. arXiv preprint arXiv:2409.16990, 2024.
  • Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In IEEE International Conference on Computer Vision, 2023.
  • Xu et al. [2020] Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, 2021.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE International Conference on Computer Vision, 2023.
  • Zhang et al. [2024] Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. HiDiffusion: Unlocking higher-resolution creativity and efficiency in pretrained diffusion models. In European Conference on Computer Vision, 2024.

Appendix A Appendix

To complement the main content of the paper, we provide here additional details about the method in Sec. B, as well as additional quantitative and qualitative results in Sec. C.

Appendix B Additional technical details

B.1 Frequency Modulation details

Time-varying high-pass filter definition.

In our method, we rely on the frequency domain and use a high-pass filter to steer the denoising process, as described in Eq. (4). In the following, we provide the formal definition of the time-varying high-pass filter $\mathcal{K}(t)$ that we used.

The high-pass filters $\mathcal{K}(t)$ have time-varying cut-off frequencies, defined as follows:

\begin{align}
\rho(t) &= \frac{t}{T} \tag{8} \\
\tau_h(t) &= h \cdot c \cdot (1 - \rho(t)) \tag{9} \\
\tau_w(t) &= w \cdot c \cdot (1 - \rho(t)) \tag{10}
\end{align}

where $\tau_h(t)$ and $\tau_w(t)$ are the vertical and horizontal cut-off frequencies at timestep $t$, respectively. Subsequently, the mask $\mathcal{K}(t)$, which is applied on the shifted frequency spectrum centered on $(x_c, y_c)$, is defined as

\begin{equation}
\mathcal{K}(t) =
\begin{cases}
\rho(t), & \text{if } |x - x_c| < \frac{\tau_w(t)}{2} \ \&\ |y - y_c| < \frac{\tau_h(t)}{2}, \\
1, & \text{otherwise}
\end{cases}
\tag{11}
\end{equation}

The cut-off frequency grows as the denoising process progresses, while the scaling factor of the low-frequency coefficients decreases. Our frequency modulation is designed such that the guidance from the denoised latent $\tilde{\mathbf{z}}_t$ becomes more significant as $t \rightarrow 0$. In our experiments, we set $c = 0.5$.
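The definition of the mask in Eqs. (8)-(11) can be sketched in NumPy as follows. This is an illustrative construction, not the released implementation: the function name `high_pass_mask` is hypothetical, and the spectrum centre $(x_c, y_c)$ is assumed to be $(w // 2, h // 2)$, the index where `np.fft.fftshift` places the zero frequency.

```python
import numpy as np

def high_pass_mask(h, w, t, T, c=0.5):
    """Time-varying mask K(t) on an fftshift-ed h x w spectrum (Eqs. 8-11).

    Assumption: the centre (x_c, y_c) is (w // 2, h // 2), i.e. the position
    of the zero frequency after np.fft.fftshift.
    """
    rho = t / T                      # Eq. (8)
    tau_h = h * c * (1.0 - rho)      # Eq. (9)
    tau_w = w * c * (1.0 - rho)      # Eq. (10)
    y = np.arange(h)[:, None]
    x = np.arange(w)[None, :]
    low = (np.abs(x - w // 2) < tau_w / 2) & (np.abs(y - h // 2) < tau_h / 2)
    # Eq. (11): low-frequency coefficients are scaled by rho, the rest are kept.
    return np.where(low, rho, 1.0)

T = 50
for t in (T, T // 2, 0):
    K = high_pass_mask(8, 8, t, T)
    # Number of attenuated (low-frequency) coefficients at this timestep.
    print(t, int((K < 1).sum()))
```

On this 8×8 toy grid, no coefficient is attenuated at $t = T$, a single one is scaled by 0.5 at $t = T/2$, and a 3×3 block is set to zero at $t = 0$, which illustrates the growing low-frequency window described above.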

Derivation of the Frequency Modulation in time-domain.

Figure 8: Comparison of Attention Swapping and Modulation.

In the main paper, we mention that our frequency modulation introduced in Eq. (4) can be reformulated in time domain as Eq. (5) and discuss the corresponding benefits. Here, we provide a formal derivation to support the equivalence between the two formulations. For ease of presentation, we omit the timestep $t$ and resolution $m$ notations from operands.

Let $\mathbf{z} \in \mathbb{R}^{h \times w}$ be the 2D latent, and $\mathbf{Z} = DFT_{2D}(\mathbf{z}) \in \mathbb{C}^{h \times w}$ be the Fourier transform of $\mathbf{z}$. Written in matrix form,

\begin{equation}
\mathbf{Z} = W_r \mathbf{z} W_c, \tag{12}
\end{equation}

where $W_r \in \mathbb{C}^{h \times h}$ and $W_c \in \mathbb{C}^{w \times w}$ are the row- and column-wise Fourier transform matrices, respectively. Let $\mathcal{K} \in \mathbb{R}^{h \times w}$ be the high-pass filter defined in the previous section. Our proposed mixing operation in the frequency domain is formulated as below:

\begin{align*}
\hat{\mathbf{Z}} &= \mathcal{K} \odot DFT_{2D}(\mathbf{z}) + (1 - \mathcal{K}) \odot DFT_{2D}(\tilde{\mathbf{z}}) \\
&= \mathcal{K} \odot (W_r \mathbf{z} W_c) + (1 - \mathcal{K}) \odot (W_r \tilde{\mathbf{z}} W_c) \\
&= W_r \mathbf{z} W_c + (1 - \mathcal{K}) \odot \left(W_r (\tilde{\mathbf{z}} - \mathbf{z}) W_c\right)
\end{align*}

The inverse DFT of $\hat{\mathbf{Z}}$, which is the outcome of Eq. (4), is formulated as:

\begin{align*}
\hat{\mathbf{z}} &= IDFT_{2D}(\hat{\mathbf{Z}}) \\
&= W_r^{-1} \left(W_r \mathbf{z} W_c + (1 - \mathcal{K}) \odot \left(W_r (\tilde{\mathbf{z}} - \mathbf{z}) W_c\right)\right) W_c^{-1} \\
&= W_r^{-1} W_r \mathbf{z} W_c W_c^{-1} + W_r^{-1} \left((1 - \mathcal{K}) \odot (W_r (\tilde{\mathbf{z}} - \mathbf{z}) W_c)\right) W_c^{-1} \\
&= \mathbf{z} + \left(W_r^{-1} (1 - \mathcal{K}) W_c^{-1}\right) \circledast \left(W_r^{-1} W_r (\tilde{\mathbf{z}} - \mathbf{z}) W_c W_c^{-1}\right) \\
&= \mathbf{z} + k \circledast (\tilde{\mathbf{z}} - \mathbf{z}),
\end{align*}

resulting in Eq. (5) in the main paper, where $k = W_r^{-1} (1 - \mathcal{K}) W_c^{-1} = IDFT_{2D}(1 - \mathcal{K})$ is a convolutional kernel and $\circledast$ denotes a circular convolution operator.
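The equivalence derived above can be checked numerically. The sketch below is a toy verification under stated assumptions, not the authors' code: the latents are random 8×8 arrays, the mask is an arbitrary example with a low-frequency block scaled by 0.3, and `circular_conv2d` is a hypothetical helper written with explicit loops for clarity rather than speed.

```python
import numpy as np

def circular_conv2d(k, d):
    """2D circular convolution: out[i, j] = sum_{m, n} k[m, n] * d[i - m, j - n],
    with indices taken modulo the array size."""
    out = np.zeros(d.shape, dtype=complex)
    for m in range(k.shape[0]):
        for n in range(k.shape[1]):
            # np.roll shifts d so that entry [i, j] holds d[i - m, j - n].
            out += k[m, n] * np.roll(d, shift=(m, n), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
h, w = 8, 8
z = rng.standard_normal((h, w))        # latent being denoised
z_tilde = rng.standard_normal((h, w))  # guidance latent

# Toy mask defined on the centred spectrum, then moved to fft2 layout
# (zero frequency at index [0, 0]) with ifftshift.
K_centered = np.ones((h, w))
K_centered[3:6, 3:6] = 0.3
K = np.fft.ifftshift(K_centered)

# Frequency-domain mixing, following Eq. (4).
Z_hat = K * np.fft.fft2(z) + (1 - K) * np.fft.fft2(z_tilde)
z_hat_freq = np.fft.ifft2(Z_hat)

# Equivalent formulation with the kernel k = IDFT(1 - K), following Eq. (5).
k = np.fft.ifft2(1 - K)
z_hat_conv = z + circular_conv2d(k, z_tilde - z)

print(np.allclose(z_hat_freq, z_hat_conv))
```

The two results agree up to floating-point error, as expected from the convolution theorem for the discrete Fourier transform.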

B.2 Attention Modulation analysis

As mentioned in Sec. 3.3, we take inspiration from recent literature using attention swapping to control local texture. However, rather than swapping attention, we mix the two attention paths instead. In Figure 8, we compare attention swapping with our proposed attention modulation. These results clearly show the benefit of including the attention from the high-resolution path rather than directly swapping it with the low-resolution path, which avoids losing information from the high-resolution denoising path. We empirically set $\lambda$ used in Eq. (6) to 0.7.
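The difference between swapping and mixing can be illustrated with a small sketch. Eq. (6) is not reproduced in this appendix, so the form used here is an assumption: a convex combination in which $\lambda$ weights the native-resolution attention (taken as already resized to the high-resolution token grid) and $1 - \lambda$ weights the high-resolution attention. Which path $\lambda$ multiplies, and whether the mixing happens before or after the softmax, are both assumptions, and the name `modulate_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulate_attention(attn_high, attn_native_up, lam=0.7):
    """Blend two attention maps of identical shape.

    Hypothetical form of Eq. (6): lam weights the (upsampled) native-resolution
    attention, 1 - lam weights the high-resolution attention.
    """
    return lam * attn_native_up + (1.0 - lam) * attn_high

rng = np.random.default_rng(0)
n_tokens = 16
# Random stand-ins for the two attention maps (each row sums to 1).
attn_high = softmax(rng.standard_normal((n_tokens, n_tokens)))
attn_native_up = softmax(rng.standard_normal((n_tokens, n_tokens)))

mixed = modulate_attention(attn_high, attn_native_up)              # modulation
swapped = modulate_attention(attn_high, attn_native_up, lam=1.0)   # swapping

print(np.allclose(mixed.sum(axis=-1), 1.0))
```

Under this assumed form, swapping is the special case $\lambda = 1$, which discards the high-resolution attention entirely, whereas any $\lambda < 1$ keeps part of it while each row remains a valid attention distribution.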

Appendix C Additional experimental results

C.1 FAM diffusion with different SD backbones

In Table 1, we show that our method outperforms several baselines when combined with SDXL. In addition to those main results, we further combine our FAM diffusion method with various SD backbones. The quantitative results in Table 2 demonstrate that our approach combines seamlessly with different variants of SD and provides similarly large improvements in quality and image-text alignment across all experimental settings.

C.2 FAM diffusion with different aspect ratios

Thus far, we have used our method to generate high-resolution images by equally upscaling both the height and width. Here, we study the effect of using FAM diffusion to target different aspect ratios. In particular, starting from the SDXL model, we use our approach to target higher resolutions with different aspect ratios. The quantitative results in Table 3 and the qualitative results shown in Figures 9 through 11 highlight the versatility of our method, which can seamlessly adapt to various settings without compromising quality.

C.3 FAM diffusion with different conditioning terms

FAM diffusion integrates seamlessly with various LDM-based applications, such as ControlNet [33]. As shown in Figure 12, FAM diffusion combined with ControlNet [33] achieves controllable high-resolution generation, with examples showcasing the use of images and Canny edges as conditions.

| Method | Resolution Scale Factor | $\text{FID}_r$ ↓ | $\text{KID}_r$ ↓ | $\text{FID}_c$ ↓ | $\text{KID}_c$ ↓ | CLIP Score ↑ |
|---|---|---|---|---|---|---|
| SD 1.5 | 2×2 | 75.36 | 0.0122 | 43.99 | 0.0103 | 30.35 |
| SD 1.5 + FAM diffusion | 2×2 | 65.07 | 0.0087 | 34.06 | 0.0082 | 30.92 |
| SD 2.1 | 2×2 | 86.62 | 0.0163 | 53.67 | 0.0137 | 29.66 |
| SD 2.1 + FAM diffusion | 2×2 | 64.77 | 0.0084 | 38.18 | 0.0091 | 31.13 |
| SDXL | 2×2 | 59.47 | 0.0067 | 50.54 | 0.0136 | 30.6 |
| SDXL + FAM diffusion | 2×2 | 58.91 | 0.0072 | 33.96 | 0.0080 | 32.35 |
| SD 1.5 | 3×3 | 106.50 | 0.0251 | 48.92 | 0.0133 | 28.89 |
| SD 1.5 + FAM diffusion | 3×3 | 38.19 | 0.0011 | 43.99 | 0.0082 | 30.44 |
| SD 2.1 | 3×3 | 137.05 | 0.0384 | 63.91 | 0.01719 | 27.81 |
| SD 2.1 + FAM diffusion | 3×3 | 64.8 | 0.0089 | 40.49 | 0.0114 | 31.13 |
| SDXL | 3×3 | 78.41 | 0.0136 | 69.40 | 0.0210 | 28.44 |
| SDXL + FAM diffusion | 3×3 | 69.25 | 0.0007 | 36.40 | 0.0100 | 32.25 |
| SD 1.5 | 4×4 | 150.84 | 0.0474 | 55.97 | 0.0155 | 27.40 |
| SD 1.5 + FAM diffusion | 4×4 | 67.77 | 0.0086 | 40.21 | 0.0012 | 30.36 |
| SD 2.1 | 4×4 | 177.06 | 0.0645 | 69.43 | 0.019 | 26.36 |
| SD 2.1 + FAM diffusion | 4×4 | 66.32 | 0.0085 | 41.37 | 0.0018 | 31.10 |
| SDXL | 4×4 | 160.10 | 0.0602 | 74.37 | 0.0242 | 26.70 |
| SDXL + FAM diffusion | 4×4 | 58.91 | 0.0073 | 43.65 | 0.0130 | 32.33 |

Table 2: Comparison of vanilla Stable Diffusion and our FAM diffusion.

| Method | Scaling Factor | FID ↓ | KID ↓ | $\text{FID}_c$ ↓ | $\text{KID}_c$ ↓ | CLIP ↑ |
|---|---|---|---|---|---|---|
| DemoFusion [3] | 2×4 | 81.69 | 0.0112 | 54.48 | 0.0165 | 29.3 |
| AccDiffusion [15] | 2×4 | 70.42 | 0.0119 | 55.73 | 0.0205 | 29.0 |
| FouriScale* [12] | 2×4 | 71.86 | 0.0302 | 63.28 | 0.0322 | 25.8 |
| HiDiffusion [34] | 2×4 | 118.56 | 0.038 | 65.46 | 0.021 | 26.3 |
| SDXL [19] | 2×4 | 80.62 | 0.0236 | 67.46 | 0.0302 | 25.5 |
| SDXL [19] + FAM diffusion | 2×4 | 63.48 | 0.0090 | 41.44 | 0.0115 | 30.6 |

Table 3: System-level comparisons with SDXL. * indicates inference with FreeU [26].

Figure 9: Qualitative comparison with other methods based on SDXL. Best viewed when zoomed in. * indicates inference with FreeU [26]. (Continued in Fig. 10).

Figure 10: Qualitative comparison with other methods based on SDXL (continued from Fig. 9). Best viewed when zoomed in.

Figure 11: Qualitative comparison with other methods based on SDXL with arbitrary resolutions. DemoFusion is unable to handle arbitrary resolutions, therefore not included. Best viewed when zoomed in.

Figure 12: Results of FAM diffusion combined with ControlNet [33]. All images are generated at 2× (2048 × 2048). Best viewed when zoomed in.