FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Haosen Yang¹,² Adrian Bulat¹ Isma Hadji¹ Hai X. Pham¹ Xiatian Zhu²
Georgios Tzimiropoulos¹,³ Brais Martinez¹

¹Samsung AI Center, Cambridge, UK ²University of Surrey, UK ³Queen Mary University, UK
This work was conducted while Haosen Yang was an intern at Samsung AI Center, Cambridge, UK.

Abstract

Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.

1 Introduction

Diffusion models [22] demonstrate impressive generative power across a range of applications [29, 23, 33, 18, 20, 6, 30]. While powerful, one known shortcoming of diffusion models is their inability to seamlessly scale to higher resolutions beyond the one used during training. It is known that directly generating images at resolutions beyond the training resolution results in severe object repetition and unrealistic local patterns [1, 7, 3]. This is illustrated in Figure 1(a). While retraining diffusion models on higher-resolution images is a straightforward solution, the computational demands quickly become prohibitive. This restricts applications requiring flexible or high-resolution image generation, e.g. 4K. Therefore, adapting pre-trained diffusion models to generate high-resolution images without additional training is a topic of high interest that we tackle in this work.

Prior efforts addressing this important problem can be largely categorized into two tracks. The first set of approaches, e.g. [3, 15], propose mechanisms that improve the global structure consistency by steering the high-resolution generation using the image generated at native (i.e. training) resolution. However, the effectiveness of such mechanisms is mixed, with lingering issues like poor detail quality, inconsistent local textures, and even persisting pattern repetitions as shown in Figure 1(b). Furthermore, these works typically generate the image one patch at a time. Concretely, this means that these methods resort to redundant and overlapping forward passes, leading to large latency overheads. The second group of approaches, e.g. [7, 11, 34], eschews patch-based generation in favor of a one-pass approach by directly altering the model architecture. This leads to faster generation, but unfortunately, it comes at the cost of image quality, as shown in Fig. 1(c).

To address the aforementioned limitations, we propose a straightforward yet effective approach that takes the best of both worlds. Our method follows the single pass generation strategy for improved latency but, like patch-based approaches, leverages the native resolution generation to steer the high-resolution one. Specifically, our method starts by generating an image at native resolution conditioned on the input text prompt. We then resort to a test-time diffuse-denoise strategy [27, 3, 8], where the high-resolution denoising stage is guided by the native resolution diffusion process. However, instead of blindly steering the high-res image toward the low-res one as done elsewhere [3, 15], we propose a Frequency Modulation (FM) module. In particular, we leverage the Fourier domain to selectively condition low-frequency components during the high-resolution image generation stage, while providing full control over high-frequency components to the denoising process.

While the FM module resolves artifacts related to global consistency, artifacts related to inconsistent local texture might still be present, i.e. finer texture generated on semantically related parts of the image might be inconsistent. To tackle this second issue, largely ignored in the literature, we propose an Attention Modulation (AM) mechanism that leverages attention maps from the denoising process at native resolution to condition the attention maps of the denoising process at high resolution. Since attention maps at native resolution encode which regions of the image are semantically related, they regularize the high-res denoising towards consistent finer texture generation. Our method, coined Frequency and Attention Modulated diffusion (FAM diffusion), combines our FM and AM modules to yield superior quality results, see Fig. 1 (d).

Our method seamlessly integrates with any latent diffusion model without additional training or architectural changes. We empirically show that our method significantly enhances the quality and efficiency of high-resolution image generation, establishing a new state-of-the-art.

2 Related Work

Diffusion models have shown impressive performance in generating creative and accurate representations given text prompts [10, 22]. While early work [22] was limited to generating relatively low-resolution images (i.e. $256\times 256$ ), follow-up work showed that their performance can scale to higher resolutions, e.g. $512\times 512$ with SD1.5 [22] and $1024\times 1024$ with SDXL [19]. However, a major shortcoming with all these models is that generation remains limited by the resolution used at training time. Naively targeting higher train-time resolutions quickly results in prohibitive training costs and computational requirements, and the limited availability of high-resolution training data also restricts the diversity of image generation. Thus, adapting pre-trained diffusion models to generate high-resolution images without retraining has emerged as a topic of interest.

Early works [1, 14] proposed using overlapping patches at native resolution and blending the outputs to produce an image without seams. However, this leads to frequent repetitions and inconsistent global image structure. Therefore, subsequent works introduced various mechanisms to encourage global structural consistency. For instance, DemoFusion [3] proposed a patch-based generation process with mechanisms such as skip residuals and progressive upsampling, while AccDiffusion [15] used localized prompting to guide high-resolution generation and improve consistency with images generated at native resolutions. However, these methods still suffer from issues like local repetitions and poor global coherence. They also have significant latency overheads due to the cost of running multiple overlapping denoising passes. To mitigate the high latencies, other works aim to generate high-resolution images in a single pass by modifying the architecture of the UNet. For example, ScaleCrafter [7] employs dilated convolutions to adjust the receptive field of convolutions in the denoising UNet. HiDiffusion [34] introduces an alternative UNet that dynamically adjusts the feature map size during the denoising process. While these approaches achieve faster generation, they often result in image distortions.

More closely related to ours are methods that have approached structural consistency from a frequency domain perspective. FouriScale [12] splits the image in the Fourier domain, then incorporates a low-pass filtering operation and imposes structural consistency with an image generated at native resolution. However, this splitting operation results in unrealistic images. HiPrompt [16] decomposes images into spatial frequency components conditioned on local and global prompts, but it often relies on redundant operations that lead to high latencies. ResMaster [25] leverages low-frequency information from the latent representation of the native image to provide desirable global semantics during the denoising process. However, it ignores the noise distribution differences between the current high-resolution denoising step and the native image in latent space. In addition, it still relies on patch-based denoising, making it inefficient. In contrast to these methods, we propose a one-pass method that does not alter the model architecture. Importantly, our method introduces a complementary novel attention modulation mechanism, which targets local structure consistency, an issue overlooked by all existing works.

3 Method

In this work, we leverage pretrained latent diffusion models (LDMs), which have been extensively trained on large-scale high-quality data. Our goal is to generate images at higher resolutions than during training, without any additional finetuning or model modification. Sec. 3.1 briefly reviews the diffusion notation and the test-time diffuse-denoise strategy. In Sec. 3.2 we present our Frequency Modulated (FM) denoising approach, which is designed to improve global consistency. Finally, we introduce our Attention Modulation (AM) mechanism, which is designed to improve the consistency of the local texture and high-frequency detail, in Sec. 3.3. We provide an overview of our method in Figure 2.

3.1 Preliminaries

Latent Diffusion Models (LDM) [22]: We operate in the realm of LDMs, which first convert image $\mathbf{x}_{0}$ to a latent representation $\mathbf{z}_{0}$ using an encoder such that $\mathbf{z}_{0}=\mathcal{E}(\mathbf{x}_{0})$ , $\mathbf{z}_{0}\in\mathbb{R}^{c\times h\times w}$ . During training, a Markovian diffusion process progressively adds noise to the input latent $\mathbf{z}_{0}$ according to a predefined schedule $\beta_{t},t\in[1,T]$ by sampling sequentially from:

$$q(\mathbf{z}_{t}|\mathbf{z}_{t-1}):=\mathcal{N}(\mathbf{z}_{t}|\sqrt{1-\beta_{t}}\,\mathbf{z}_{t-1},\beta_{t}\mathbf{I}) \tag{1}$$

Conversely, a trainable denoising process progressively recovers the original latent $\mathbf{z}_{0}$ using a noise estimator $\mathcal{Z}_{\theta}=\left(\mu_{\theta},\Sigma_{\theta}\right)$ parametrized by $\theta$ by sampling from:

$$p_{\theta}(\mathbf{z}_{t-1}|\mathbf{z}_{t}):=\mathcal{N}\left(\mathbf{z}_{t-1}|\mu_{\theta}(\mathbf{z}_{t},t),\Sigma_{\theta}(\mathbf{z}_{t},t)\right) \tag{2}$$

During inference, an image is generated by denoising from random noise, $\mathbf{z}_{T}\sim\mathcal{N}(0,\mathbf{I})\in\mathbb{R}^{c\times h\times w}$, through sequential calls to $\mathcal{Z}_{\theta}$. The quality of the generated image improves with the number of steps to finally yield the latent representation $\mathbf{z}_{0}^{n}\in\mathbb{R}^{c\times h\times w}$, where we introduce the superscript $n$ to indicate generation at native resolution $h\times w$ (i.e. same as training resolution).

Inference-time diffuse-denoise: Our goal is to use the pretrained parametric denoiser $\mathcal{Z}_{\theta}$ , without further finetuning, to generate $\mathbf{z}^{m}_{0}\in\mathbb{R}^{c\times sh\times sw}$ at a higher resolution $m$ , $m=sh\times sw$ , where $s$ is the target resolution scaling factor. The naive approach is to directly start from random noise at the target resolution, $\mathbf{z}^{m}_{T}\sim\mathcal{N}(0,I)\in\mathbb{R}^{c\times sh\times sw}$ . However, this has been repeatedly shown to lead to suboptimal results, with frequent artifacts and object duplication [3, 7, 34]. This is illustrated in Fig. 3(a).

Instead, prior works proposed a test-time diffuse-denoise process [27, 3, 8]. The idea is to start from the output of the denoising process at native resolution, $\mathbf{z}^{n}_{0}$, rather than from noise. This output is upsampled to the target resolution $m$ to obtain $\tilde{\mathbf{z}}^{m}_{0}=\mathcal{U}(\mathbf{z}^{n}_{0},s)$, where $\mathcal{U}$ denotes an upsampling function. Next, $T$ forward diffusion steps progressively add noise to obtain the latents $\tilde{\mathbf{z}}^{m}_{t}$, $t=1\ldots T$. Finally, the backward process denoises from $\tilde{\mathbf{z}}^{m}_{T}$ to yield the final output $\mathbf{z}^{m}_{0}$. Note that we use $\tilde{\mathbf{z}}$ and $\mathbf{z}$ to refer to the latents generated during diffusion and denoising respectively.
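As a minimal sketch of the diffusion half of this strategy (not the authors' implementation), the snippet below upsamples a stand-in native latent and noises it with the closed form $q(\mathbf{z}_{t}|\mathbf{z}_{0})=\mathcal{N}(\sqrt{\bar{\alpha}_{t}}\,\mathbf{z}_{0},(1-\bar{\alpha}_{t})\mathbf{I})$ with $\bar{\alpha}_{t}=\prod_{i\leq t}(1-\beta_{i})$, which follows from Eq. 1. The nearest-neighbour upsampler, the linear $\beta_{t}$ schedule, the number of steps, and all function names are illustrative assumptions, since the text does not specify them here.

```python
import numpy as np

def upsample_nearest(z, s):
    """Nearest-neighbour upsampling of a (c, h, w) latent by an integer factor s.
    Assumed stand-in for the upsampling function U."""
    return z.repeat(s, axis=1).repeat(s, axis=2)

def diffuse(z0, t, betas, rng):
    """Sample z_t ~ q(z_t | z_0) in closed form for a 1-indexed timestep t."""
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)            # illustrative schedule (assumption)
z0_native = rng.standard_normal((4, 8, 8))    # stand-in for z_0^n
z0_up = upsample_nearest(z0_native, s=2)      # upsampled latent, shape (4, 16, 16)

# Diffused latents for t = 1..T, kept so they can later guide the denoising steps.
z_tilde = {t: diffuse(z0_up, t, betas, rng) for t in range(1, T + 1)}
```

Storing one diffused latent per timestep is what allows the steering function $f_{t}(.)$ to pair each denoising step with a reference at the matching noise level.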

While a standard denoising process as in Eq. 2 could be used, it often leads to inconsistent global structures, as shown in Fig. 3(b). Instead, the denoising process from Eq. 2 is now defined as:

$$p_{\theta}\left(\mathbf{z}^{m}_{t-1}|f_{t}(\tilde{\mathbf{z}}^{m}_{t},\mathbf{z}^{m}_{t})\right) \tag{3}$$

where $f_{t}(.)$ is tasked with steering the denoising process and improving the consistency between the high-res and low-res images. Previous works [3, 15] define $f_{t}(.)$ as a simple weighted linear combination of $\tilde{\mathbf{z}}^{m}_{t}$ and $\mathbf{z}^{m}_{t}$ and coin the mechanism skip residual. We show in Fig. 3(c) that this yields suboptimal results. In contrast, we propose a Frequency Modulated approach to defining $f_{t}(.)$.

3.2 Frequency-Modulated Denoising

The conditioning of the denoising steps through the skip residual has been shown to improve consistency between low and high-resolution images. We however observe that it lacks control over the information transferred. More specifically, the goal of the test-time diffuse-denoise process is to take the upsampled low-resolution image and to produce an output that 1) preserves the global structure, and 2) improves the texture and high-frequency details. The skip residual mechanism however steers the output towards the input indiscriminately, which serves the first objective but can negatively impact the second. It would be desirable to instead harness the global structure information from the diffused latents of the forward process, while allowing the denoising process to handle the generation of details. To this end, we appeal to the frequency domain, where global structure and finer details are captured by low- and high-frequency components, respectively [17, 28, 31], and accordingly re-define the function $f_{t}(.)$, which controls information transfer from the forward diffusion into the denoising process.

Let $\mathcal{K}(t)$ be a high-pass filter for timestep $t$. The function $f_{t}(.)$ in Eq. 3 is then defined as follows:

$$f_{t}(\tilde{\mathbf{z}}_{t}^{m},\mathbf{z}_{t}^{m})=IDFT_{2D}\left(\mathcal{K}(t)\odot DFT_{2D}\left(\mathbf{z}_{t}^{m}\right)+(1-\mathcal{K}(t))\odot DFT_{2D}\left(\tilde{\mathbf{z}}_{t}^{m}\right)\right), \tag{4}$$

where $\odot$ denotes the Hadamard product. Essentially, the high-frequency coefficients of the denoised latent $\mathbf{z}_{t}^{m}$ are combined with the low-frequency coefficients of the diffused latent $\tilde{\mathbf{z}}_{t}^{m}$, modulated by the filter $\mathcal{K}(t)$. By linearity of the DFT and the convolution theorem, Eq. 4 can be further reformulated in the spatial domain as below:

$$f_{t}(\tilde{\mathbf{z}}_{t}^{m},\mathbf{z}_{t}^{m})=\mathbf{z}^{m}_{t}+\kappa(t)\circledast\bigl(\tilde{\mathbf{z}}_{t}^{m}-\mathbf{z}^{m}_{t}\bigr), \tag{5}$$

where $\kappa(t)=IDFT_{2D}\left(1-\mathcal{K}(t)\right)\in\mathbb{R}^{sh\times sw}$ is a convolutional kernel, and $\circledast$ denotes the circular convolution operator. Eq. 5 shows that the frequency modulation adds a low-frequency update to the denoised latent $\mathbf{z}^{m}_{t}$ directed towards the diffused latent $\tilde{\mathbf{z}}_{t}^{m}$, thereby preserving the global structural information from the upsampled latent. Furthermore, the circular convolution with $\kappa(t)$ in Eq. 5 can be interpreted as an additional (non-learnable) convolutional layer of the UNet, effectively providing it with a global receptive field and helping generate consistent structure without modifying the UNet architecture [34, 11] or using dilated sampling [3]. The result of our FM approach is shown in Fig. 3(d). In comparison, the skip residual approach of DemoFusion, shown in Fig. 3(c), produces inconsistencies like a missing left nostril and unnaturally small eyes.
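The equivalence between the Fourier-domain form (Eq. 4) and the circular-convolution form (Eq. 5) can be checked numerically. The sketch below is illustrative only: the ideal (binary) high-pass mask, its fixed `cutoff` parameter, and the function names are assumptions, as the text does not specify the shape of $\mathcal{K}(t)$ or how it varies with $t$.

```python
import numpy as np

def high_pass_mask(h, w, cutoff):
    """Ideal high-pass mask K: 0 inside a low-frequency disc, 1 outside.
    Hypothetical choice of filter; cutoff is in cycles per sample."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return (np.sqrt(fy**2 + fx**2) > cutoff).astype(float)

def fm_fourier(z_tilde, z, K):
    """Eq. 4: high frequencies from z, low frequencies from z_tilde."""
    Z = np.fft.fft2(z)          # fft2 acts on the last two axes
    Zt = np.fft.fft2(z_tilde)
    return np.fft.ifft2(K * Z + (1.0 - K) * Zt).real

def circ_conv2d(kernel, x):
    """Direct 2D circular convolution of an (h, w) kernel with an (h, w) array.
    Quadratic cost; written for clarity, not speed."""
    out = np.zeros_like(x)
    h, w = x.shape
    for dy in range(h):
        for dx in range(w):
            out += kernel[dy, dx] * np.roll(x, shift=(dy, dx), axis=(0, 1))
    return out

def fm_spatial(z_tilde, z, K):
    """Eq. 5: z + kappa (*) (z_tilde - z), with kappa = IDFT(1 - K)."""
    kappa = np.fft.ifft2(1.0 - K).real
    diff = z_tilde - z
    return z + np.stack([circ_conv2d(kappa, d) for d in diff])

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))         # stand-in for the denoised latent
z_tilde = rng.standard_normal((4, 8, 8))   # stand-in for the diffused latent
K = high_pass_mask(8, 8, cutoff=0.2)

out_fourier = fm_fourier(z_tilde, z, K)
out_spatial = fm_spatial(z_tilde, z, K)
```

The two outputs agree up to floating-point error. Setting the mask to all ones returns the denoised latent unchanged, and setting it to all zeros returns the diffused latent, which makes the role of $\mathcal{K}(t)$ as a per-frequency selector explicit.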

3.3 Attention Modulation

While the FM module successfully maintains global structure and solves the issue of object duplication as shown in Fig. 3(d), we note that local structures can be inconsistently generated due to the discrepancy between the training-time native resolution and the higher target resolution at inference time. For example, the top image in Fig. 3(d) shows a distorted mouth compared to the one at native resolution. Similarly, in the bottom example, fur texture is incorrectly generated on the shirt collar. That is, the high-frequency detail generated on the shirt collar is semantically related to the detail generated on the fox’s face rather than to the other parts of the shirt. We hypothesize this stems from incorrect attention maps during the high-res denoising stage. This motivates us to propose our Attention Modulation (AM) approach. We take inspiration from attention swapping, a recent method to combine information from two diffusion processes in a more localized manner [13, 5, 4], and extend the idea to transfer local structural information from the denoising process at native resolution to the one at target resolution.

In particular, the self-attention of an input tensor $\mathbf{z}$ is computed by first projecting it linearly into queries, keys, and values, $(Q,K,V)$, and then evaluating:

$$\mathrm{Att}(\mathbf{z})=\mathrm{softmax}\left(\frac{Q\cdot K^{T}}{\sqrt{d}}\right)V=M\cdot V \tag{6}$$

where $d$ indicates the feature dimensionality, and we refer to $M$ as the attention matrix.
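For concreteness, Eq. 6 can be sketched for a single attention head as follows. The projection matrices `Wq`, `Wk`, `Wv` and the token layout (one row per spatial location) are illustrative assumptions; real UNet attention layers are multi-head and include additional projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, Wq, Wk, Wv):
    """Single-head self-attention (Eq. 6).
    z: (n_tokens, d_in). Returns the output and the attention matrix M."""
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    d = Q.shape[-1]
    M = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (n_tokens, n_tokens)
    return M @ V, M

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))            # e.g. a flattened 4x4 feature map
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out, M = self_attention(tokens, Wq, Wk, Wv)
```

Each row of `M` is a probability distribution over all tokens, i.e. it records which image locations a given query location draws information from.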

In our case, we modify the self-attention at specific layers of the UNet of the high-resolution denoising process to incorporate information from the attention maps of the native resolution as:

$$\bar{M}^{m}=\lambda\cdot\mathcal{U}(M^{n},s)+(1-\lambda)\cdot M^{m} \tag{7}$$

where $M^{n}$ and $M^{m}$ are the attention matrices at native and target resolution respectively, $\lambda$ is a hyperparameter, and $\mathcal{U}$ is an $s$ -times upsampling function. The new attention matrix $\bar{M}^{m}$ is then used instead of $M^{m}$ during the high-res denoising process in Eq. 6.
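A possible reading of Eq. 7 is sketched below. How $\mathcal{U}$ acts on an attention matrix is not detailed in the text, so the version here is an assumption: the native matrix is viewed as an $(h, w, h, w)$ tensor, each spatial axis is repeated $s$ times (nearest neighbour), and the result is divided by $s^{2}$ so that every row still sums to one. The value of $\lambda$ and all names are likewise illustrative.

```python
import numpy as np

def upsample_attention(M_n, h, w, s):
    """Assumed form of U: nearest-neighbour upsampling of an (h*w, h*w)
    attention matrix to (s*h*s*w, s*h*s*w), with row-major token order."""
    A = M_n.reshape(h, w, h, w)            # (query_y, query_x, key_y, key_x)
    for axis in range(4):
        A = A.repeat(s, axis=axis)
    # Each key is replicated s*s times, so rescale to keep rows summing to 1.
    return A.reshape(s * h * s * w, s * h * s * w) / (s * s)

def modulate_attention(M_m, M_n, h, w, s, lam):
    """Eq. 7: blend the upsampled native attention with the high-res attention."""
    return lam * upsample_attention(M_n, h, w, s) + (1.0 - lam) * M_m

def random_row_stochastic(n, rng):
    """Helper producing a stand-in attention matrix with rows summing to 1."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
h, w, s = 2, 2, 2
M_n = random_row_stochastic(h * w, rng)            # native attention, (4, 4)
M_m = random_row_stochastic(s * h * s * w, rng)    # high-res attention, (16, 16)
M_bar = modulate_attention(M_m, M_n, h, w, s, lam=0.5)
```

Because both terms are row-stochastic under this choice of $\mathcal{U}$, the blended matrix $\bar{M}^{m}$ remains a valid attention matrix for any $\lambda\in[0,1]$.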

Applying our AM module at all layers of the UNet can lead to suboptimal performance due to over-regularization. We apply it instead only for layers in up-blocks of the UNet, as they are known to preserve layout information better [13]. Furthermore, we experimented with AM at various stages and found the highest benefit to be at up_block_0. Results shown in Fig. 3(e) demonstrate the benefit of the proposed AM module, particularly regarding better preservation of local structures such as the mouth and shirt collar, highlighted in yellow boxes.

4 Experiment

4.1 Experimental setup

To demonstrate the effectiveness of our approach, we pair it with a well-performing diffusion model, SDXL [19]. For completeness, we also pair our approach with the recent HiDiffusion [34], which replaces the attention mechanism of SDXL with windowed attention to improve the model latency. SDXL is trained at 1024×1024 resolution, which we refer to as $1\times$. We experiment with three unseen higher resolutions such that the model generates $2\times 2$, $3\times 3$, and $4\times 4$ times more pixels than the training setup. In the supplementary, we also include results with various aspect ratios, e.g. $2\times 4$, and also experiment with different variants of Stable Diffusion (SD), namely SD 1.5 [22] and SD 2.1 [22], which generate at 512×512 and 768×768 pixels respectively.

Evaluation set.

Following previous work [3, 7, 15, 11], we evaluate performance on a subset of the LAION-5B dataset [24]. Given the number of compared methods and the significant computational demands associated with the task, we randomly sample 10K images from LAION-5B, which we use as our real image set, and 1K captions, which we use as text prompts for the models.

Evaluation metrics.

Following prior work, we evaluate the quality and diversity of the generated images using Fréchet Inception Distance (FID) [9] and Kernel Inception Distance (KID) [2], computed between the generated and real images. Since FID requires resizing images to $299\times 299$, which discards much of the detail present in high-resolution outputs, it is typical to also adopt patch-level variants [3, 34, 15, 11]. Specifically, we extract 10 random crops from each image before calculating FID and KID, referring to these metrics as $\text{FID}_{c}$ and $\text{KID}_{c}$. To further evaluate the semantic similarity between image features and text prompts, we report the CLIP score [21]. To measure the efficiency of each method, we compute latencies on a single A40 GPU.
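The crop extraction step behind the patch-level metrics can be sketched as follows. The crop size is not stated in the text, so `crop_size` below is a hypothetical parameter, as are the function name and the uniform sampling of crop positions. The subsequent resizing and Inception feature extraction are omitted.

```python
import numpy as np

def random_crops(image, n_crops, crop_size, rng):
    """Extract n_crops square crops at uniformly random positions.
    image: (H, W, C). Returns an array of shape (n_crops, crop_size, crop_size, C)."""
    H, W = image.shape[:2]
    ys = rng.integers(0, H - crop_size + 1, size=n_crops)   # upper bound is exclusive
    xs = rng.integers(0, W - crop_size + 1, size=n_crops)
    return np.stack([image[y:y + crop_size, x:x + crop_size] for y, x in zip(ys, xs)])

rng = np.random.default_rng(0)
image = rng.random((256, 320, 3))          # stand-in for a generated image
crops = random_crops(image, n_crops=10, crop_size=64, rng=rng)
```

Evaluating on crops taken at full resolution keeps local texture visible to the feature extractor, which is the detail that whole-image resizing removes.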

Method	Scaling Factor	FID $\downarrow$	KID $\downarrow$	FID ${}_{c}\downarrow$	KID ${}_{c}\downarrow$	CLIP $\uparrow$	Latency(mins)
DemoFusion [3]	$2\times 2$	63.24	0.0084	36.75	0.0096	32.0	2.5
AccDiffusion [15]		59.42	0.0068	37.23	0.0105	31.69	2.6
FouriScale* [12]		78.54	0.0136	40.80	0.0130	29.8	2.3
HiDiffusion [34]		78.02	0.0136	51.41	0.0139	30.5	0.6
HiDiffusion [34] + FAM diffusion		69.61	0.0140	34.26	0.0084	32.32	0.8
SDXL [19]		59.47	0.0067	50.54	0.0136	30.6	0.8
SDXL [19] + FAM diffusion		58.91	0.0072	33.96	0.0080	32.35	1
DemoFusion [3]	$3\times 3$	68.82	0.0159	40.24	0.0122	32.0	8.6
AccDiffusion [15]		73.47	0.0210	43.64	0.014	31.50	10
FouriScale* [12]		73.57	0.0309	65.01	0.0357	28.54	6.2
HiDiffusion [34]		112.51	0.0325	68.84	0.021	28.43	1.5
HiDiffusion [34] + FAM diffusion		76.28	0.0007	36.70	0.010	32.26	1.8
SDXL [19]		78.41	0.0136	69.40	0.0210	28.44	2.2
SDXL [19] + FAM diffusion		69.25	0.0007	36.40	0.010	32.25	2.5
DemoFusion [3]	$4\times 4$	65.89	0.0087	48.44	0.0157	30.45	19.6
AccDiffusion [15]		73.97	0.0090	54.80	0.0187	30.15	20.5
FouriScale* [12]		105.24	0.0342	70.45	0.0223	27.86	14.7
HiDiffusion [34]		129.91	0.0483	156.98	0.0877	24.32	2.8
HiDiffusion [34] + FAM diffusion		59.05	0.0074	44.65	0.0134	32.31	3.1
SDXL [19]		160.10	0.0602	74.37	0.0242	26.70	5.4
SDXL [19] + FAM diffusion		58.91	0.0073	43.65	0.0130	32.33	6.1

Table 1: System-level comparisons with SDXL. * indicates inference with FreeU [26]

4.2 Main Results

We select DemoFusion [3], AccDiffusion [15], FouriScale [11], and HiDiffusion [34] as representative methods of the current state-of-the-art among high-resolution generation methods. As shown in Table 1, FAM diffusion achieves the best overall performance on $\text{FID}_{c}$, $\text{KID}_{c}$, and CLIP score in all cases. In the case of FID and KID, FAM diffusion provides substantial gains for larger scale factors, while producing similar results to DemoFusion on lower scale factors. However, these metrics heavily downsample high-resolution images before computation and thus do not capture finer details. This is a widely known issue for these metrics, as explained in Sec. 4.1. Finally, we note that our method adds only small latency overheads compared to direct inference at the target resolution, e.g. 0.2, 0.3, and 0.7 min at $2\times$, $3\times$, and $4\times$ scale factors respectively when combined with SDXL. In comparison, DemoFusion adds 14.2 min of latency over SDXL direct inference at the $4\times$ scale factor (19.6 vs. 5.4 min in Table 1). When compared to the frequency-based method FouriScale [11], FAM diffusion also shows notable improvements in both quality and latency. For instance, for 4K image generation ($4\times 4$) with SDXL, it achieves 43.65 vs. 70.45 on $\text{FID}_{c}$ and 32.33 vs. 27.86 on CLIP score (Table 1), while also being faster than FouriScale. Additionally, we observed that FAM diffusion can be seamlessly integrated into single-pass methods, such as HiDiffusion [34], to enhance performance while maintaining fast image generation, achieving an effective latency-quality trade-off. These results quantitatively validate the effectiveness of our method in improving the quality of image generation.
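The latency overheads relative to direct SDXL inference can be recomputed from the latency column of Table 1 (in minutes). The snippet is plain arithmetic on the table values; the dictionary layout and the helper name are illustrative.

```python
# Latencies in minutes, copied from the "Latency(mins)" column of Table 1.
latency_min = {
    "2x2": {"SDXL": 0.8, "SDXL+FAM": 1.0, "DemoFusion": 2.5},
    "3x3": {"SDXL": 2.2, "SDXL+FAM": 2.5, "DemoFusion": 8.6},
    "4x4": {"SDXL": 5.4, "SDXL+FAM": 6.1, "DemoFusion": 19.6},
}

def overhead(scale, method, baseline="SDXL"):
    """Extra minutes a method needs over the baseline at a given scale factor."""
    row = latency_min[scale]
    return round(row[method] - row[baseline], 1)

fam_overheads = [overhead(s, "SDXL+FAM") for s in ("2x2", "3x3", "4x4")]
demofusion_overheads = [overhead(s, "DemoFusion") for s in ("2x2", "3x3", "4x4")]
```

The FAM diffusion overheads stay below one minute at every scale factor, whereas the DemoFusion overhead over direct inference grows to more than fourteen minutes at $4\times 4$.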

In Figure 6, we present a comparison between DemoFusion, FouriScale, HiDiffusion, and FAM diffusion. We selected three complex textual prompts to highlight the image-generation capabilities of the model. For FouriScale, we used the default setting with FreeU [26]. As mentioned above, DemoFusion tends to generate repetitive content and artifacts with unreasonable local structures due to its patch-based generation approach (see for example the two small cat heads generated on the top-right image). FouriScale [11] and HiDiffusion [34] produce visually unappealing structures and extensive areas of irregular textures, which significantly degrade the overall visual quality. Additionally, we compare our method with the super-resolution approach BSRGAN [32], as shown in Figure 5. We observe that FAM diffusion effectively introduces or modifies high-frequency details that were not present in the original image, while preserving structural information, leading to more appealing and detailed images.

To further illustrate the generality of our approach, in the supplementary material we provide results of our approach in combination with SD1.5 and SD2.1.

4.3 Ablation Study

In this section, we conduct ablation studies and use SDXL with the $2\times 2$ scale factor setting.

Effectiveness of the components in the FAM diffusion

We study the effect of the two components of FAM diffusion, Frequency-Modulated Denoising (FM) and Attention Modulation (AM). The results shown in Figure 3 indicate the following: (1) Both direct inference from random noise and direct inference from the diffused latent at native resolution generate outputs with structural distortions and repeated patterns. (2) While the skip residuals of DemoFusion help maintain the global structure of the image, they still produce artifacts and poor local patterns. (3) Compared to skip residuals, FM reduces undesirable local patterns by leveraging the low-frequency information of the image at native resolution, which provides better structural guidance. (4) Attention Modulation resolves inconsistencies between local patterns and global structure by utilizing the attention map from the native resolution, offering strong guidance on the semantic relationships among latent tokens. Overall, FM and AM effectively address structural distortions and local pattern inconsistencies in high-resolution images, highlighting the meaningful contributions of FAM diffusion.

Effectiveness of the time-aware formulation on the FM module

We show here the effect of the time-varying formulation of FM, as illustrated in Figure 7(a). Specifically, the FM module incorporates low-frequency information from the corresponding diffused latent at each step $t$ . Instead, we can avoid this time-varying nature and utilize the upsampled latent as a single static reference. However, this approach results in images that appear noticeably blurrier and lose finer details associated with high-frequency information, highlighting the importance of the dynamic nature of the FM module throughout the denoising process.

Analysis of Attention Modulation

To better understand the principles underlying the AM module, we visualize in Figure 4 the self-attention maps obtained with a token from the mouth region (marked with a star) as the query and all tokens as keys and values. The resulting attention map computed using the low-resolution latent primarily encodes coarse information about the semantic relations among parts of the image, but lacks fine-grained contextual information across the entire face. In contrast, the attention maps at high resolution are more detailed, but fail to capture semantic relatedness, e.g. the mouth areas are not highlighted. After applying AM, the attention map effectively integrates local-global relationships with enhanced fine-grained detail. This analysis provides visual insights into how AM repairs inconsistencies in local patterns, contributing to more coherent global structures.

5 Conclusion

We introduced FAM diffusion, a training-free approach for high-resolution image generation with diffusion models. To address issues of object repetition and structural distortion, we proposed a Frequency Modulation strategy. By leveraging the Fourier domain, this method enhances guidance for high-resolution generation while avoiding latency overheads associated with multi-patch approaches. Additionally, we proposed an effective Attention Modulation mechanism to address inconsistent local texture patterns, a challenge largely overlooked in previous works. Extensive quantitative and qualitative evaluations highlight the effectiveness of our method. We further show that, contrary to previous works, our method incurs only marginal latency overheads.

References

Bar-Tal et al. [2023] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: fusing diffusion paths for controlled image generation. In International Conference on Machine Learning, 2023.
Bińkowski et al. [2018] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. International Conference on Learning Representations, 2018.
Du et al. [2024] Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. DemoFusion: Democratising high-resolution image generation with no $$$. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
Gu et al. [2023] Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, and Xin Eric Wang. Photoswap: Personalized subject swapping in images. Neural Information Processing Systems, 2023.
Gu et al. [2024] Jing Gu, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Yilin Wang, and Xin Eric Wang. SwapAnything: Enabling arbitrary object swapping in personalized image editing. European Conference on Computer Vision, 2024.
He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
He et al. [2024] Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In International Conference on Learning Representations, 2024.
Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Neural Information Processing Systems, 2017.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Neural Information Processing Systems, 2020.
Huang et al. [2024a] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European Conference on Computer Vision, 2024a.
Huang et al. [2024b] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. arXiv preprint arXiv:2403.12963, 2024b.
Jeong et al. [2024] Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, and Youngjung Uh. Visual style prompting with swapping self-attention. arXiv preprint arXiv:2402.12974, 2024.
Lee et al. [2023] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. SyncDiffusion: Coherent montage via synchronized joint diffusions. In Neural Information Processing Systems, 2023.
Lin et al. [2024] Zhihang Lin, Mingbao Lin, Zhao Meng, and Rongrong Ji. AccDiffusion: An accurate method for higher-resolution image generation. In European Conference on Computer Vision, 2024.
Liu et al. [2024] Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. HiPrompt: Tuning-free higher-resolution generation with hierarchical MLLM prompts. arXiv preprint arXiv:2409.02919, 2024.
Marr and Hildreth [1980] David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
Noroozi et al. [2024] Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. You only need one step: Fast super-resolution with stable diffusion via scale distillation. In European Conference on Computer Vision, 2024.
Podell et al. [2024] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024.
Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Neural Information Processing Systems - Datasets and Benchmarks Track, 2022.
Shi et al. [2024] Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. ResMaster: Mastering high-resolution image generation via structural and fine-grained guidance. arXiv preprint arXiv:2406.16476, 2024.
Si et al. [2024] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion U-Net. In IEEE Conference on Computer Vision and Pattern Recognition, 2024.
Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
Wandell [1995] BA Wandell. Foundations of vision, 1995.
Wang et al. [2024] Wenqing Wang, Haosen Yang, Josef Kittler, and Xiatian Zhu. Single image, any face: Generalisable 3D face generation. arXiv preprint arXiv:2409.16990, 2024.
Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In IEEE International Conference on Computer Vision, 2023.
Xu et al. [2020] Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, 2021.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE International Conference on Computer Vision, 2023.
Zhang et al. [2024] Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. HiDiffusion: Unlocking higher-resolution creativity and efficiency in pretrained diffusion models. In European Conference on Computer Vision, 2024.

Appendix A Appendix

To complement the main content of the paper, we provide additional details about the method in Sec. B, as well as additional quantitative and qualitative results in Sec. C.

Appendix B Additional technical details

B.1 Frequency Modulation details

Time-varying high-pass filter definition.

Our method operates in the frequency domain and uses a high-pass filter to steer the denoising process, as described in Eq. (4). In the following, we provide the formal definition of the time-varying high-pass filter $\mathcal{K}(t)$ that we use.

The high-pass filters $\mathcal{K}(t)$ have time-varying cut-off frequencies, defined as follows:

\begin{align}
\rho(t) &= \frac{t}{T} \tag{8}\\
\tau_{h}(t) &= h\cdot c\cdot(1-\rho(t)) \tag{9}\\
\tau_{w}(t) &= w\cdot c\cdot(1-\rho(t)) \tag{10}
\end{align}

where ${\tau_{h}}(t)$ and ${\tau_{w}}(t)$ are the vertical and horizontal cut-off frequencies at timestep $t$, respectively, $h$ and $w$ are the height and width of the latent, and $c$ is a constant scaling factor. Subsequently, the mask $\mathcal{K}(t)$, which is applied on the shifted frequency spectrum centered on $(x_{c},y_{c})$, is defined as

$$\mathcal{K}(t)=\begin{cases}\rho(t),&\text{if }\left|x-x_{c}\right|<\frac{\tau_{w}(t)}{2}\ \text{and}\ \left|y-y_{c}\right|<\frac{\tau_{h}(t)}{2},\\ 1,&\text{otherwise.}\end{cases}\tag{11}$$

The cut-off frequency grows as the denoising process progresses, while the scaling factor of the low-frequency coefficients decreases. Our frequency modulation is designed such that the guidance from the denoised latent ${\tilde{\mathbf{z}}}_{t}$ becomes more significant as $t\rightarrow 0$ . In our experiments, we set $c=0.5$ .
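As an illustration of Eqs. (8) to (11), the following is a minimal NumPy sketch of the mask, not the authors' implementation. The function name `high_pass_mask` is hypothetical, and the sketch assumes that the centre $(x_{c},y_{c})$ is the DC location of an `fftshift`-ed spectrum, i.e. `(w // 2, h // 2)`.

```python
import numpy as np

def high_pass_mask(h, w, t, T, c=0.5):
    """Sketch of the mask K(t) of Eq. (11) on a centered h x w spectrum."""
    rho = t / T                        # Eq. (8): scaling of low frequencies
    tau_h = h * c * (1.0 - rho)        # Eq. (9): cut-off along the height axis
    tau_w = w * c * (1.0 - rho)        # Eq. (10): cut-off along the width axis
    # Assumption: (x_c, y_c) is the DC bin of an fftshift-ed spectrum.
    y_c, x_c = h // 2, w // 2
    y = np.arange(h)[:, None]
    x = np.arange(w)[None, :]
    inside = (np.abs(x - x_c) < tau_w / 2) & (np.abs(y - y_c) < tau_h / 2)
    # Low-frequency box is scaled by rho(t); everything else passes unchanged.
    return np.where(inside, rho, 1.0)

T = 50
K_start = high_pass_mask(8, 8, t=T, T=T)      # first denoising step (t = T)
K_mid = high_pass_mask(8, 8, t=T // 2, T=T)   # halfway through
K_end = high_pass_mask(8, 8, t=0, T=T)        # last step (t = 0)
```

Under these assumptions, the mask is all ones at $t=T$ (no guidance), and the attenuated low-frequency box is largest, with weight zero, at $t=0$.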

Derivation of the Frequency Modulation in time-domain.

In the main paper, we mention that the frequency modulation introduced in Eq. (4) can be reformulated in the time domain as Eq. (5), and we discuss the corresponding benefits. Here, we provide a formal derivation to support the equivalence between the two formulations. For ease of presentation, we omit the timestep $t$ and resolution $m$ notations from operands.

Let $\mathbf{z}\in\mathbb{R}^{h\times w}$ be the 2D latent, and $\mathbf{Z}=DFT_{2D}\left(\mathbf{z}\right)\in\mathbb{C}^{h\times w}$ be the Fourier transform of $\mathbf{z}$ . Written in matrix form,

$$\mathbf{Z}=W_{r}\,\mathbf{z}\,W_{c},\tag{12}$$

where ${W_{r}}\in{\mathbb{C}^{h\times h}}$ and ${W_{c}}\in{\mathbb{C}^{w\times w}}$ are the row- and column-wise Fourier transform matrices, respectively. Let $\mathcal{K}\in\mathbb{R}^{h\times w}$ be the high-pass filter defined in the previous section. Our proposed mixing operation in the frequency domain is then formulated as:

$$\begin{aligned}
\hat{\mathbf{Z}} &= \mathcal{K}\odot DFT_{2D}(\mathbf{z})+(1-\mathcal{K})\odot DFT_{2D}(\tilde{\mathbf{z}})\\
&= \mathcal{K}\odot(W_{r}\mathbf{z}W_{c})+(1-\mathcal{K})\odot(W_{r}\tilde{\mathbf{z}}W_{c})\\
&= W_{r}\mathbf{z}W_{c}+(1-\mathcal{K})\odot\left(W_{r}(\tilde{\mathbf{z}}-\mathbf{z})W_{c}\right)
\end{aligned}$$

The inverse DFT of $\hat{\mathbf{Z}}$ , which is the outcome of Eq. 4, is formulated as:

$$\begin{aligned}
\hat{\mathbf{z}} &= IDFT_{2D}(\hat{\mathbf{Z}})\\
&= W_{r}^{-1}\left(W_{r}\mathbf{z}W_{c}+(1-\mathcal{K})\odot\left(W_{r}(\tilde{\mathbf{z}}-\mathbf{z})W_{c}\right)\right)W_{c}^{-1}\\
&= W_{r}^{-1}W_{r}\mathbf{z}W_{c}W_{c}^{-1}+W_{r}^{-1}\left((1-\mathcal{K})\odot\left(W_{r}(\tilde{\mathbf{z}}-\mathbf{z})W_{c}\right)\right)W_{c}^{-1}\\
&= \mathbf{z}+\left(W_{r}^{-1}(1-\mathcal{K})W_{c}^{-1}\right)\circledast\left(W_{r}^{-1}W_{r}(\tilde{\mathbf{z}}-\mathbf{z})W_{c}W_{c}^{-1}\right)\\
&= \mathbf{z}+k\circledast(\tilde{\mathbf{z}}-\mathbf{z}),
\end{aligned}$$

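resulting in Eq. (5) in the main paper, where $k=W_{r}^{-1}(1-\mathcal{K})W_{c}^{-1}=IDFT_{2D}(1-\mathcal{K})$ is a convolutional kernel and $\circledast$ denotes the circular convolution operator. The step that turns the element-wise product into a convolution follows from the convolution theorem: the inverse DFT of an element-wise product equals the circular convolution of the inverse DFTs of its factors.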
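The equivalence derived above can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the authors' code: the helper names `high_pass_mask` and `circular_conv2d` are hypothetical, the latents are random stand-ins for $\mathbf{z}$ and $\tilde{\mathbf{z}}$, and the mask is moved from the centered layout to NumPy's unshifted FFT layout with `np.fft.ifftshift`.

```python
import numpy as np

def high_pass_mask(h, w, rho, c=0.5):
    """Centered mask of Eq. (11) for a given rho = t / T (hypothetical helper)."""
    y = np.arange(h)[:, None] - h // 2
    x = np.arange(w)[None, :] - w // 2
    inside = (np.abs(x) < w * c * (1 - rho) / 2) & (np.abs(y) < h * c * (1 - rho) / 2)
    return np.where(inside, rho, 1.0)

def circular_conv2d(a, b):
    """Direct 2D circular convolution: sum over (i, j) of a[i, j] * b shifted by (i, j)."""
    out = np.zeros(a.shape, dtype=complex)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out += a[i, j] * np.roll(b, shift=(i, j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
h, w = 8, 12
z = rng.standard_normal((h, w))        # stand-in for the latent z
z_tilde = rng.standard_normal((h, w))  # stand-in for the guidance latent

# Mask in the unshifted layout used by np.fft.fft2.
K = np.fft.ifftshift(high_pass_mask(h, w, rho=0.4))

# Frequency-domain mixing, as in Eq. (4).
z_hat_freq = np.fft.ifft2(K * np.fft.fft2(z) + (1 - K) * np.fft.fft2(z_tilde))

# Equivalent formulation of Eq. (5): z + k (*) (z_tilde - z), with k = IDFT(1 - K).
k = np.fft.ifft2(1 - K)
z_hat_conv = z + circular_conv2d(k, z_tilde - z)

max_abs_diff = np.max(np.abs(z_hat_freq - z_hat_conv))
print(f"max |freq - conv| = {max_abs_diff:.2e}")
```

The two results should agree up to floating-point rounding. The direct convolution is written for clarity rather than speed.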

B.2 Attention Modulation analysis

As mentioned in Sec. 3.3, we take inspiration from recent literature that uses attention swapping to control local texture. However, rather than swapping attention, we mix the two attention paths. In Figure 8 we compare attention swapping with our proposed attention modulation. The results show the benefit of retaining the attention from the high-resolution path: replacing it outright with the attention of the low-resolution path discards information from the high-resolution denoising path. We empirically set $\lambda$ in Eq. (6) to $0.7$.
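To make the difference between swapping and mixing concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that Eq. (6) is a convex combination of two attention maps, that $\lambda$ weights the low-resolution path, and that the low-resolution map has already been resized to the shape of the high-resolution one. The function `mix_attention` and the random attention maps are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mix_attention(attn_high, attn_low, lam=0.7):
    """Convex mix of two attention maps.

    Assumption: lam weights the low-resolution path, so lam = 1 reduces to
    plain attention swapping and lam = 0 keeps the high-resolution attention.
    """
    return lam * attn_low + (1.0 - lam) * attn_high

rng = np.random.default_rng(0)
n = 6  # number of tokens (toy size)
attn_high = softmax(rng.standard_normal((n, n)))  # high-resolution path
attn_low = softmax(rng.standard_normal((n, n)))   # low-resolution path, assumed resized

mixed = mix_attention(attn_high, attn_low, lam=0.7)
swapped = mix_attention(attn_high, attn_low, lam=1.0)
```

Because both inputs are row-stochastic, the mixed map remains a valid attention distribution for any $\lambda\in[0,1]$, while still carrying a contribution from the high-resolution path that swapping removes.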

Appendix C Additional experimental results

C.1 FAM diffusion with different SD backbones

In Table 1 we show that our method outperforms several baselines when combined with SDXL. In addition to those main results, we further combine FAM diffusion with various SD backbones. The quantitative results in Table 2 demonstrate that our approach can seamlessly combine with different variants of SD and provides similarly large improvements in quality and image-text alignment across all experimental settings.

C.2 FAM diffusion with different aspect ratios

Thus far, we have used our method to generate high-resolution images by equally upscaling both the height and width. Here, we study the effect of using FAM diffusion to target different aspect ratios. In particular, starting from the SDXL model, we use our approach to target higher resolutions with different aspect ratios. The quantitative results in Table 3 and the qualitative results in Figures 9 through 11 highlight the versatility of our method, which adapts to various settings without compromising quality.

C.3 FAM diffusion with different conditioning terms

FAM diffusion integrates seamlessly with various LDM-based applications, such as ControlNet [33]. As shown in Figure 12, FAM diffusion combined with ControlNet [33] achieves controllable high-resolution generation, with examples showcasing the use of images and Canny edges as conditions.

Method	Resolution Scale Factor	$\text{FID}_{r}$ $\downarrow$	$\text{KID}_{r}$ $\downarrow$	$\text{FID}_{c}$ $\downarrow$	$\text{KID}_{c}$ $\downarrow$	CLIP Score $\uparrow$
SD 1.5	$2\times 2$	75.36	0.0122	43.99	0.0103	30.35
SD 1.5 + FAM diffusion	$2\times 2$	65.07	0.0087	34.06	0.0082	30.92
SD 2.1	$2\times 2$	86.62	0.0163	53.67	0.0137	29.66
SD 2.1 + FAM diffusion	$2\times 2$	64.77	0.0084	38.18	0.0091	31.13
SDXL	$2\times 2$	59.47	0.0067	50.54	0.0136	30.6
SDXL + FAM diffusion	$2\times 2$	58.91	0.0072	33.96	0.0080	32.35
SD 1.5	$3\times 3$	106.50	0.0251	48.92	0.0133	28.89
SD 1.5 + FAM diffusion	$3\times 3$	38.19	0.0011	43.99	0.0082	30.44
SD 2.1	$3\times 3$	137.05	0.0384	63.91	0.01719	27.81
SD 2.1 + FAM diffusion	$3\times 3$	64.8	0.0089	40.49	0.0114	31.13
SDXL	$3\times 3$	78.41	0.0136	69.40	0.0210	28.44
SDXL + FAM diffusion	$3\times 3$	69.25	0.0007	36.40	0.0100	32.25
SD 1.5	$4\times 4$	150.84	0.0474	55.97	0.0155	27.40
SD 1.5 + FAM diffusion	$4\times 4$	67.77	0.0086	40.21	0.0012	30.36
SD 2.1	$4\times 4$	177.06	0.0645	69.43	0.019	26.36
SD 2.1 + FAM diffusion	$4\times 4$	66.32	0.0085	41.37	0.0018	31.10
SDXL	$4\times 4$	160.10	0.0602	74.37	0.0242	26.70
SDXL + FAM diffusion	$4\times 4$	58.91	0.0073	43.65	0.0130	32.33

Table 2: Comparison of vanilla Stable Diffusion and our FAM diffusion.

Method	Scaling Factor	FID $\downarrow$	KID $\downarrow$	$\text{FID}_{c}$ $\downarrow$	$\text{KID}_{c}$ $\downarrow$	CLIP $\uparrow$
DemoFusion [3]	$2\times 4$	81.69	0.0112	54.48	0.0165	29.3
AccDiffusion [15]	$2\times 4$	70.42	0.0119	55.73	0.0205	29.0
FouriScale* [12]	$2\times 4$	71.86	0.0302	63.28	0.0322	25.8
HiDiffusion [34]	$2\times 4$	118.56	0.038	65.46	0.021	26.3
SDXL [19]	$2\times 4$	80.62	0.0236	67.46	0.0302	25.5
SDXL [19] + FAM diffusion	$2\times 4$	63.48	0.0090	41.44	0.0115	30.6

Table 3: System-level comparisons with SDXL. * indicates inference with FreeU [26].