CVPR 2023 - Scaling Up GANs for Text-to-Image Synthesis
[Figure 1 panels, with their prompts: "A golden luxury motorcycle parked at the King's palace. 35mm f/4.5.", "A portrait of a human growing colorful flowers from her hair. Hyperrealistic oil painting. Intricate details.", "a cute magical flying maltipoo at light speed, fantasy concept art, bokeh, wide sky", "A living room with a fireplace at a wood cabin. Interior design.", "a blue Porsche 356 parked in front of a yellow brick wall.", "Eiffel Tower, landscape photography", "A painting of a majestic royal tall ship in Age of Discovery.", "Isometric underwater Atlantis city with a Greek temple in a bubble.", "A hot air balloon in shape of a heart. Grand Canyon", "low poly bunny with cute eyes", "A cube made of denim on a wooden table".]
Figure 1. Our model, GigaGAN, shows that GAN frameworks can also be scaled up for general text-to-image synthesis tasks, generating a 512px output at an interactive speed of 0.13s and a 4096px output in 3.7s. Selected examples at 2K or 4K resolution are shown; please zoom in for more details. See Appendix C and our website for more uncurated comparisons.
[Figure 2 panels: input artwork from AdobeStock (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); with zoomed crops of the input, the three 1K results, and the GigaGAN 4K result.]
Figure 2. Our GAN-based upsampler can serve in the upsampling pipeline of many text-to-image models that often generate initial
outputs at low resolutions like 64px or 128px. We simulate such usage by applying our text-conditioned 8× superresolution model on a
low-res 128px artwork to obtain the 1K output, using “Portrait of a colored iguana dressed in a hoodie”. Then our model can be re-applied to
go beyond 4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33].
Zooming in is recommended for comparison between 1K and 4K.
[Figure 3 panels: input photo (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); with zoomed crops of the input, the three 1K results, and the GigaGAN 4K result.]
Figure 3. Our GAN-based upsampler, similar to Figure 2, can also be used as an off-the-shelf superresolution model for real images
with a large scaling factor by providing an appropriate description of the image. We apply our text-conditioned 8× superresolution model
on a low-res 128px photo to obtain the 1K output, using “A dog sitting in front of a mini tipi tent”. Then our model can be re-applied to
go beyond 4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33].
Zooming in is recommended for comparison between 1K and 4K.
GAN2 [42] and 6× larger than StyleGAN-XL [86] and XMC-GAN [103]. While our 1B parameter count is still lower than the largest recent synthesis models, such as Imagen (3.0B), DALL·E 2 (5.5B), and Parti (20B), we have not yet observed a quality saturation regarding the model size. GigaGAN achieves a zero-shot FID of 9.09 on the COCO2014 dataset, lower than the FID of DALL·E 2, Parti-750M, and Stable Diffusion.

Furthermore, GigaGAN has three major practical advantages compared to diffusion and autoregressive models. First, it is orders of magnitude faster, generating a 512px image in 0.13 seconds (Figure 1). Second, it can synthesize ultra high-resolution images at 4K resolution in 3.66 seconds. Third, it is endowed with a controllable latent vector space that lends itself to well-studied controllable image synthesis applications, such as style mixing (Figure 6), prompt interpolation (Figure 7), and prompt mixing (Figure 8).

In summary, our model is the first GAN-based method that successfully trains a billion-scale model on billions of real-world complex Internet images. This suggests that GANs are still a viable option for text-to-image synthesis and should be considered for future aggressive scaling. Please visit our website for additional results.

2. Related Works

Text-to-image synthesis. Generating a realistic image given a text description, as first explored by Mansimov et al. [58], is a challenging task. Earlier works adopted text-conditional GANs [76, 77, 93, 99, 104, 111] on specific domains [96] and datasets with a closed-world assumption [54]. With the development of diffusion models [15, 26], autoregressive (AR) transformers [12], and large-scale language encoders [71, 73], text-to-image synthesis has shown remarkable improvement on an open world of arbitrary text descriptions. GLIDE [63], DALL·E 2 [74], and Imagen [80] are representative diffusion models that show photorealistic outputs with the aid of a pretrained language encoder [71, 73]. AR models such as DALL·E [75], Make-A-Scene [20], CogView [16, 17], and Parti [101] also achieve impressive results. While these models exhibit unprecedented image synthesis ability, they require time-consuming iterative processes to achieve high-quality image sampling.

To accelerate sampling, several methods propose to reduce the number of sampling steps [57, 59, 83, 89] or to reuse pre-computed features [51]. The Latent Diffusion Model (LDM) [79] performs the reverse process in a low-dimensional latent space instead of pixel space. However, consecutive reverse processes are still computationally expensive, limiting the usage of large-scale text-to-image models in interactive applications.

GAN-based image synthesis. GANs [21] have been one of the primary families of generative models for natural image synthesis. As the sampling quality and diversity of GANs improved [39-42, 44, 72, 84], GANs have been deployed in various computer vision and graphics applications, such as text-to-image synthesis [76], image-to-image translation [29, 34, 49, 65, 66, 110], and image editing [1, 7, 69, 109]. Notably, StyleGAN-family models [40, 42] have shown impressive ability in image synthesis tasks for single-category domains [1, 31, 69, 98, 112]. Other works have explored class-conditional GANs [6, 36, 86, 102, 107] on datasets with a fixed set of object categories.

In this paper, we change the data regime from single- or multi-category datasets to extremely data-rich situations. We make the first expedition toward training a large-scale GAN for text-to-image generation on a vast amount of web-crawled text and image pairs, such as LAION2B-en [88] and COYO-700M [8]. Existing GAN-based text-to-image synthesis models [52, 76, 93, 99, 103, 104, 111] are trained on relatively small datasets, such as CUB-200 (12k training pairs), MS-COCO (82k), and LN-OpenImages (507k). Also, those models are evaluated only on the associated validation sets and have not been shown to perform large-scale text-to-image synthesis like diffusion or AR models. Concurrent with our method, StyleGAN-T [85] and GALIP [92] share similar goals to ours. However, GigaGAN and the aforementioned techniques were developed independently with distinct technical contributions. We hope these methods can complement each other and collectively address the limitations of GANs.

Super-resolution for large-scale text-to-image models. Large-scale models require prohibitive computational costs for both training and inference. To reduce memory and running time, cutting-edge text-to-image models [63, 74, 80, 101] have adopted cascaded generation processes where images are first generated at 64×64 resolution and then upsampled to 256×256 and 1024×1024 sequentially. However, the super-resolution networks are primarily based on diffusion models, which require many iterations. In contrast, our low-resolution image generators and upsamplers are based on GANs, reducing the computational costs of both stages. Unlike traditional super-resolution techniques [2, 18, 47, 95], which aim to faithfully reproduce low-resolution inputs or handle image degradation such as compression artifacts, our upsamplers for large-scale models serve a different purpose: they need to perform larger upsampling factors while potentially leveraging the input text prompt.

3. Method

We train a generator G(z, c) to predict an image x ∈ R^{H×W×3} given a latent code z ∼ N(0, 1) ∈ R^{128} and a text-conditioning signal c. We use a discriminator D(x, c)
[Figure 4 diagram: left, our high-capacity text-to-image generator (a pretrained CLIP text encoder and a learned text encoder T process the prompt, e.g. "an oil painting of a corgi"; a mapping network M takes the latent code z ∼ N(0, 1) and the global text code to produce the style w, which modulates a synthesis network of convolution, self-attention, and cross-attention blocks starting from a learned constant); right, sample-adaptive kernel selection, where softmax weights over a modulated filter bank produce the selected filter.]
Figure 4. Our GigaGAN high-capacity text-to-image generator. First, we extract text embeddings using a pretrained CLIP model and a
learned encoder T . The local text descriptors are fed to the generator using cross-attention. The global text descriptor, along with a latent
code z, is fed to a style mapping network M to produce style code w. The style code modulates the main generator using our style-adaptive
kernel selection, shown on the right. The generator outputs an image pyramid by converting the intermediate features into RGB images. To
achieve higher capacity, we use multiple attention and convolution layers at each scale (Appendix A2). We also use a separate upsampler
model, which is not shown in this diagram.
to judge the realism of the generated image, as compared to a sample from the training database D, which contains image-text pairs.

Although GANs [6, 39, 41] can successfully generate realistic images on single- and multi-category datasets [13, 41, 100], open-ended text-conditioned synthesis on Internet images remains challenging. We hypothesize that the current limitation stems from their reliance on convolutional layers. That is, the same convolution filters are challenged to model the general image synthesis function for all text conditioning across all locations of the image. In this light, we seek to inject more expressivity into our parameterization by dynamically selecting convolution filters based on the input conditioning and by capturing long-range dependence via the attention mechanism.

Below, we discuss our key contributions to making ConvNets more expressive (Section 3.1), followed by our designs for the generator (Section 3.2) and discriminator (Section 3.3). Lastly, we introduce a new, fast GAN-based upsampler model that can improve the inference quality and speed of our method and of diffusion models such as Imagen [80] and DALL·E 2 [74].

3.1. Modeling complex contextual interaction

Baseline StyleGAN generator. We base our architecture on the conditional version of StyleGAN2 [42], comprised of two networks G = G̃ ∘ M. The mapping network w = M(z, c) maps the inputs into a "style" vector w, which modulates a series of upsampling convolutional layers in the synthesis network G̃(w) to map a learned constant tensor to an output image x. Convolution is the main engine that generates all output pixels, with the w vector as the only source of information for modeling the conditioning.

Sample-adaptive kernel selection. To handle the highly diverse distribution of Internet images, we aim to increase the capacity of the convolution kernels. However, increasing the width of the convolution layers becomes too demanding, as the same operation is repeated across all locations.

We propose an efficient way to enhance the expressivity of convolutional kernels by creating them on-the-fly based on the text conditioning, as illustrated in Figure 4 (right). In this scheme, we instantiate a bank of N filters {K_i ∈ R^{C_in×C_out×K×K}}_{i=1}^{N}, instead of one, that takes a feature f ∈ R^{C_in} at each layer. The style vector w ∈ R^d then goes through an affine layer [W_filter, b_filter] ∈ R^{(d+1)×N} to predict a set of weights used to average across the filters, producing an aggregated filter K ∈ R^{C_in×C_out×K×K}:

    K = \sum_{i=1}^{N} K_i \cdot \mathrm{softmax}(W_{\mathrm{filter}}^{\top} w + b_{\mathrm{filter}})_i.    (1)

The filter is then used in the regular convolution pipeline of StyleGAN2, with a second affine layer [W_mod, b_mod] ∈ R^{(d+1)×C_in} for weight (de-)modulation [42]:

    g_{\mathrm{adaconv}}(f, w) = ((W_{\mathrm{mod}}^{\top} w + b_{\mathrm{mod}}) \otimes K) * f,    (2)

where ⊗ and ∗ represent (de-)modulation and convolution.

At a high level, the softmax-based weighting can be viewed as a differentiable filter selection process based on the input conditioning. Furthermore, since the filter selection is performed only once at each layer, the selection process is much faster than the actual convolution, decoupling the compute complexity from the resolution.
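The sample-adaptive convolution can be summarized in a short sketch. The following PyTorch snippet is a minimal illustration of Equations (1)-(2), not the released GigaGAN code: the class name, filter-bank size, initialization, and the exact demodulation details are assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelConv(nn.Module):
    """Softly selects a per-sample kernel from a filter bank (Eq. 1) and
    applies it as a StyleGAN2-style modulated convolution (Eq. 2)."""

    def __init__(self, c_in, c_out, ksize, n_filters, w_dim):
        super().__init__()
        # Filter bank {K_i}: one candidate kernel per bank entry.
        self.bank = nn.Parameter(0.02 * torch.randn(n_filters, c_out, c_in, ksize, ksize))
        self.select = nn.Linear(w_dim, n_filters)  # affine [W_filter, b_filter]
        self.modulate = nn.Linear(w_dim, c_in)     # affine [W_mod, b_mod]
        self.pad = ksize // 2

    def forward(self, f, w):
        # f: (B, c_in, H, W) features, w: (B, w_dim) style vectors.
        B, c_in, H, W = f.shape

        # Eq. (1): softmax weights over the bank, then a per-sample weighted
        # average of the candidate kernels. The selection happens once per
        # layer, so its cost does not depend on the spatial resolution.
        probs = F.softmax(self.select(w), dim=-1)               # (B, N)
        K = torch.einsum("bn,noikl->boikl", probs, self.bank)   # (B, c_out, c_in, k, k)

        # Eq. (2): modulate input channels by an affine function of w,
        # then demodulate as in StyleGAN2.
        s = self.modulate(w) + 1.0                               # (B, c_in)
        weight = K * s[:, None, :, None, None]
        demod = torch.rsqrt(weight.pow(2).sum(dim=(2, 3, 4)) + 1e-8)
        weight = weight * demod[:, :, None, None, None]

        # Apply a different kernel to each sample via a grouped convolution.
        weight = weight.reshape(-1, c_in, *weight.shape[3:])     # (B*c_out, c_in, k, k)
        out = F.conv2d(f.reshape(1, -1, H, W), weight, padding=self.pad, groups=B)
        return out.reshape(B, -1, H, W)
```

A layer built as `AdaptiveKernelConv(64, 64, 3, n_filters=8, w_dim=512)` maps a `(B, 64, H, W)` feature map and a `(B, 512)` style vector to a new `(B, 64, H, W)` feature map; because the kernel is assembled once per layer, the selection cost does not grow with image resolution.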
Our method shares a spirit with dynamic convolutions [23, 35, 91, 97] in that the convolution filters dynamically change per sample, but differs in that we explicitly instantiate a larger filter bank and select weights through a separate pathway conditioned on the w-space of StyleGAN.

Interleaving attention with convolution. Since a convolutional filter operates within its receptive field, it cannot contextualize itself in relation to distant parts of the image. One way to incorporate such long-range relationships is to use attention layers g_attention. While recent diffusion-based models [15, 27, 79] have commonly adopted attention mechanisms, StyleGAN architectures are predominantly convolutional, with notable exceptions such as BigGAN [6], GANformer [30], and ViTGAN [50].

We aim to improve the performance of StyleGAN by integrating attention layers with the convolutional backbone. However, simply adding attention layers to StyleGAN often results in training collapse, possibly because dot-product self-attention is not Lipschitz, as pointed out by Kim et al. [43]. As the Lipschitz continuity of discriminators has played a critical role in stable training [3, 22, 60], we use the L2 distance instead of the dot product as the attention logits to promote Lipschitz continuity [43], similar to ViTGAN [50].

To further improve performance, we find it crucial to match the architectural details of StyleGAN, such as the equalized learning rate [39] and weight initialization from a unit normal distribution. We scale down the L2-distance logits to roughly match the unit normal distribution at initialization and reduce the residual gain from the attention layers. We further improve stability by tying the key and query matrices [50] and applying weight decay.
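As a concrete illustration of these attention changes, the sketch below replaces dot-product logits with negative squared L2 distances, ties the key and query projection, scales the logits down, and uses a small residual gain. It is a simplified stand-in that assumes a generic token sequence, not the paper's exact attention block.

```python
import math
import torch
import torch.nn as nn

class L2SelfAttention(nn.Module):
    """Self-attention whose logits are negative squared L2 distances between
    tied query/key projections, scaled down and added back to the input with
    a small residual gain (a rough sketch of the stabilization tricks above)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.dim_head = heads, dim // heads
        self.to_qk = nn.Linear(dim, dim, bias=False)   # tied query and key matrix
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim, bias=False)
        self.gain = nn.Parameter(torch.tensor(0.1))    # reduced residual gain

    def forward(self, x):                               # x: (B, T, dim)
        B, T, _ = x.shape
        qk = self.to_qk(x).view(B, T, self.heads, self.dim_head).transpose(1, 2)
        v = self.to_v(x).view(B, T, self.heads, self.dim_head).transpose(1, 2)

        # ||q - k||^2 = ||q||^2 + ||k||^2 - 2 q.k, negated and scaled so the
        # logits are roughly in a unit-normal range at initialization.
        sq = qk.pow(2).sum(dim=-1)                                   # (B, H, T)
        dist2 = sq[..., :, None] + sq[..., None, :] - 2 * (qk @ qk.transpose(-2, -1))
        attn = (-dist2 / math.sqrt(self.dim_head)).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return x + self.gain * self.to_out(out)
```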
In the synthesis network G̃, the attention layers are interleaved with each convolutional block, leveraging the style vector w as an additional token. At each attention block, we add a separate cross-attention mechanism g_cross-attention to attend to individual word embeddings [4]. We use each input feature tensor as the query, and the text embeddings as the keys and values of the attention mechanism.

3.2. Generator design

Text and latent-code conditioning. First, we extract the text embedding from the prompt. Previous works [75, 80] have shown that leveraging a strong language model is essential for producing strong results. To do so, we tokenize the input prompt (after padding it to C = 77 words, following best practices [75, 80]) to produce a conditioning vector c ∈ R^{C×1024}, taking the features from the penultimate layer [80] of a frozen CLIP feature extractor [71]. To allow for additional flexibility, we apply additional attention layers T on top to process the word embeddings before passing them to the MLP-based mapping network. This results in a text embedding t = T(E_txt(c)) ∈ R^{C×1024}. Each component t_i of t captures the embedding of the i-th word in the sentence. We refer to these as t_local = t_{1:C}\EOT ∈ R^{(C-1)×1024}. The EOT ("end of text") component of t aggregates global information and is called t_global ∈ R^{1024}. We process this global text descriptor, along with the latent code z ∼ N(0, 1), via an MLP mapping network to extract the style w = M(z, t_global):

    (t_{\mathrm{local}}, t_{\mathrm{global}}) = T(E_{\mathrm{txt}}(c)), \quad w = M(z, t_{\mathrm{global}}).    (3)

Different from the original StyleGAN, we use both the text-based style code w to modulate the synthesis network G̃ and the word embeddings t_local as features for cross-attention:

    x = \tilde{G}(w, t_{\mathrm{local}}).    (4)

Similar to earlier works [58, 74, 80], the text-image alignment visually improves with cross-attention.

Synthesis network. Our synthesis network consists of a series of upsampling convolutional layers, with each layer enhanced with the adaptive kernel selection (Equation 1) and followed by our attention layers:

    f_{\ell+1} = g^{\ell}_{\mathrm{xa}}(g^{\ell}_{\mathrm{attn}}(g^{\ell}_{\mathrm{adaconv}}(f_{\ell}, w), w), t_{\mathrm{local}}),    (5)

where g^ℓ_xa, g^ℓ_attn, and g^ℓ_adaconv denote the ℓ-th cross-attention, self-attention, and weight-(de-)modulation layers. We find it beneficial to increase the depth of the network by adding more blocks at each layer. In addition, our generator outputs a multi-scale image pyramid with L = 5 levels instead of a single image at the highest resolution, similar to MSG-GAN [38] and AnycostGAN [53]. We refer to the pyramid as {x_i}_{i=0}^{L-1} = {x_0, x_1, ..., x_4}, with spatial resolutions {S_i}_{i=0}^{L-1} = {64, 32, 16, 8, 4}, respectively. The base level x_0 is the output image x. Each image of the pyramid is independently used to compute the GAN loss, as discussed in Section 3.3. We follow the findings of StyleGAN-XL [86] and turn off style mixing and path length regularization [42]. We include more training details in Appendix A.1.
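The conditioning pathway of Equations (3)-(5) can be sketched as follows. The frozen CLIP text encoder is replaced by a stand-in embedding, `TextConditioning` and `GeneratorBlock` are hypothetical names, and plain convolution and attention modules stand in for the adaptive-kernel convolution and L2 attention described earlier; the sketch only mirrors the data flow, not the actual architecture.

```python
import torch
import torch.nn as nn

class TextConditioning(nn.Module):
    """Sketch of Eq. (3): frozen text features are refined by a learned
    transformer T, split into per-word t_local and a global t_global, and
    t_global plus the latent z is mapped to a style code w."""

    def __init__(self, feat_dim=1024, w_dim=512, z_dim=128, vocab=49408):
        super().__init__()
        self.frozen_txt = nn.Embedding(vocab, feat_dim)          # stand-in for CLIP E_txt
        self.frozen_txt.requires_grad_(False)
        enc = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.T = nn.TransformerEncoder(enc, num_layers=2)        # learned encoder T
        self.M = nn.Sequential(                                  # mapping network M
            nn.Linear(feat_dim + z_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim),
        )

    def forward(self, tokens, z):
        t = self.T(self.frozen_txt(tokens))      # (B, 77, feat_dim)
        t_local, t_global = t[:, :-1], t[:, -1]  # EOT assumed to sit at the last slot
        w = self.M(torch.cat([t_global, z], dim=-1))
        return t_local, w

class GeneratorBlock(nn.Module):
    """Sketch of Eq. (5): modulated convolution, then self-attention, then
    cross-attention to the word embeddings t_local."""

    def __init__(self, channels, w_dim=512, txt_dim=1024):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # placeholder for g_adaconv
        self.style = nn.Linear(w_dim, channels)
        self.self_attn = nn.MultiheadAttention(channels, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, 4, kdim=txt_dim, vdim=txt_dim,
                                                batch_first=True)

    def forward(self, f, w, t_local):
        B, C, H, W = f.shape
        f = self.conv(f) * (1 + self.style(w))[:, :, None, None]  # crude modulation
        seq = f.flatten(2).transpose(1, 2)                         # (B, H*W, C) tokens
        seq = seq + self.self_attn(seq, seq, seq, need_weights=False)[0]
        seq = seq + self.cross_attn(seq, t_local, t_local, need_weights=False)[0]
        return seq.transpose(1, 2).reshape(B, C, H, W)

# Minimal usage: one block at one scale of the pyramid.
cond = TextConditioning()
t_local, w = cond(torch.randint(0, 49408, (2, 77)), torch.randn(2, 128))
out = GeneratorBlock(64)(torch.randn(2, 64, 16, 16), w, t_local)   # (2, 64, 16, 16)
```

Stacking such blocks per scale, with upsampling in between, mirrors the layered composition of Equation (5).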
3.3. Discriminator design

As shown in Figure 5, our discriminator consists of separate branches for processing the text, with a function t_D, and the image, with a function φ. The prediction of real vs. fake is made by comparing the features from the two branches using a function ψ. We introduce a new way of making predictions on multiple scales. Finally, we use additional CLIP and Vision-Aided GAN losses [44] to improve stability.

[Figure 5 diagram: the image branch φ (self-attention and strided convolution blocks) sweeps through the multi-scale input pyramid {x_i} and produces real/fake predictions ψ_j at multiple output scales, conditioned on the text feature t_D.]

Figure 5. Our discriminator consists of two branches for processing the image and the text conditioning t_D. The text branch processes the text similarly to the generator (Figure 4). The image branch receives an image pyramid and makes independent predictions for each image scale. Moreover, the predictions are made at all subsequent scales of the downsampling layers, making it a multi-scale input, multi-scale output (MS-I/O) discriminator.

Multi-scale image processing. We observe that the early, low-resolution layers of the generator become inactive, using small dynamic ranges irrespective of the provided prompts. StyleGAN2 [42] also observes this phenomenon, concluding that the network relies on the high-resolution layers as the model size increases. As recovering performance in the low frequencies, which contain complex structural information, is crucial, we redesign the model architecture to provide training signals across multiple scales.

Recall that the generator produces a pyramid {x_i}_{i=0}^{L-1}, with the full image x_0 at the pyramid base. MSG-GAN [38] improves performance by making a prediction on the entire pyramid at once, enforcing consistency across scales. However, in our large-scale setting, this harms stability, as it limits the generator from making adjustments to its initial low-resolution output.

Instead, we process each level of the pyramid independently. As shown in Figure 5, each level x_i makes a real/fake prediction at multiple scales i < j ≤ L. For example, the full x_0 makes predictions at L = 5 scales, the next level x_1 makes predictions at 4 scales, and so on. In total, our discriminator produces L(L-1)/2 predictions, supervising the multi-scale generations at multiple scales.

To extract features at different scales, we define a feature extractor φ_{i→j}: R^{X_i^D × X_i^D × 3} → R^{X_j^D × X_j^D × C_j}. Practically, each sub-network φ_{i→j} is a subset of the full φ ≜ φ_{0→L}, with i > 0 indicating late entry and j < L indicating early exit. Each layer in φ is composed of self-attention followed by convolution with stride 2, and the final layer flattens the spatial extent into a 1×1 tensor. This produces output resolutions {X_j^D} = {32, 16, 8, 4, 1} and allows us to inject the lower-resolution images of the pyramid into intermediate layers [39]. As we use a shared feature extractor across different levels and most of the added predictions are made at low resolutions, the increased computation overhead is manageable.

Multi-scale input, multi-scale output adversarial loss. In total, our training objective consists of the discriminator losses, along with our proposed matching loss, to encourage the discriminator to take the conditioning into account:

    V_{\mathrm{MS\text{-}I/O}}(G, D) = \sum_{i=0}^{L-1} \sum_{j=1}^{L} \big[ V_{\mathrm{GAN}}(G_i, D_{ij}) + V_{\mathrm{match}}(G_i, D_{ij}) \big],    (6)

where each prediction D_ij compares the image features φ_{i→j}(x_i) with the text features, ψ_j is implemented as a 4-layer 1×1 modulated convolution, and a Conv_{1×1} term is added as a skip connection to explicitly maintain an unconditional prediction branch [62].

Matching-aware loss. The previous GAN terms measure how closely the image x matches the conditioning c, as well as how realistic x looks irrespective of the conditioning. However, during early training, when artifacts are obvious, the discriminator heavily relies on making its decision independently of the conditioning and hesitates to account for the conditioning later.

To enforce the discriminator to incorporate conditioning, we match x with a random, independently sampled condition ĉ and present them as a fake pair:

    V_{\mathrm{match}} = \mathbb{E}_{x, c, \hat{c}} \big[ \log(1 + \exp(D(x, \hat{c}))) + \log(1 + \exp(D(G(c), \hat{c}))) \big],    (8)

where (x, c) and ĉ are separately sampled from p_data. This loss has previously been explored in text-to-image GAN works [76, 104], except that we find that enforcing the matching-aware loss on generated images from G, as well as on real images x, leads to clear gains in performance (Table 1).
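A loss-level sketch of the multi-scale adversarial objective with the matching-aware term follows. The discriminator interface `D(x, c, level)` returning one logit per output scale is an assumption for illustration, and the non-saturating (softplus) form of V_GAN is likewise assumed; only the matching-aware terms of Equation (8) are taken directly from the text.

```python
import torch
import torch.nn.functional as F

def ms_io_losses(D, pyramid_real, pyramid_fake, c, c_mismatch):
    """Multi-scale input/output GAN losses plus matching-aware terms.
    `pyramid_*` are lists of images at decreasing resolutions, and
    `D(x, c, level)` is assumed to return a list of logits, one per
    output scale reachable from that pyramid level."""
    d_loss, g_loss, match_loss = 0.0, 0.0, 0.0
    for i, (x_real, x_fake) in enumerate(zip(pyramid_real, pyramid_fake)):
        # V_GAN (assumed non-saturating): note log(1 + exp(t)) == softplus(t).
        for lr, lf in zip(D(x_real, c, i), D(x_fake.detach(), c, i)):
            d_loss = d_loss + F.softplus(-lr).mean() + F.softplus(lf).mean()
        for lf in D(x_fake, c, i):
            g_loss = g_loss + F.softplus(-lf).mean()
        # Eq. (8): real and generated images paired with a mismatched caption
        # are both presented to the discriminator as fake pairs.
        for lr, lf in zip(D(x_real, c_mismatch, i), D(x_fake.detach(), c_mismatch, i)):
            match_loss = match_loss + F.softplus(lr).mean() + F.softplus(lf).mean()
    return d_loss + match_loss, g_loss
```

In practice the mismatched condition ĉ can be obtained by shuffling the captions within a batch, e.g. `c_mismatch = c[torch.randperm(len(c))]`.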
CLIP contrastive loss. We further leverage off-the-shelf pretrained models as a loss function [44, 84, 90]. In particular, we enforce the generator to produce outputs that are identifiable by the pre-trained CLIP image and text encoders [71], E_img and E_txt, using the contrastive cross-entropy loss that was used to train them originally:

    \mathcal{L}_{\mathrm{CLIP}} = \mathbb{E}_{\{c_n\}} \Big[ -\log \frac{\exp(E_{\mathrm{img}}(G(c_0))^{\top} E_{\mathrm{txt}}(c_0))}{\sum_n \exp(E_{\mathrm{img}}(G(c_0))^{\top} E_{\mathrm{txt}}(c_n))} \Big],    (9)

where {c_n} = {c_0, ...} are captions sampled from the training data.
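Equation (9) amounts to a standard CLIP-style contrastive cross-entropy over the captions in a batch. The sketch below assumes frozen encoders that return embedding vectors, L2-normalized features, and a unit temperature; these are simplifications for illustration rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(E_img, E_txt, fake_images, captions):
    """Eq. (9): each generated image must be matched to its own caption
    among all captions in the batch by the frozen CLIP encoders."""
    with torch.no_grad():
        txt = F.normalize(E_txt(captions), dim=-1)        # (B, D), fixed targets
    img = F.normalize(E_img(fake_images), dim=-1)         # (B, D), gradients reach G
    logits = img @ txt.t()                                 # (B, B) similarity matrix
    targets = torch.arange(len(fake_images), device=logits.device)
    # Row n should select caption n: -log softmax over captions.
    return F.cross_entropy(logits, targets)
```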
Vision-aided adversarial loss. Lastly, we build an additional discriminator that uses the CLIP model as a backbone, known as Vision-Aided GAN [44]. We freeze the CLIP image encoder, extract features from its intermediate layers, and process them through a simple network with 3×3 conv layers to make real/fake predictions. We also incorporate conditioning through modulation, as in Equation 7. To stabilize training, we also add a fixed random projection layer, as proposed by Projected GAN [84]. We refer to this loss as L_Vision(G) (omitting the learnable discriminator parameters for clarity).

Our final objective is V(G, D) = V_MS-I/O(G, D) + L_CLIP(G) + L_Vision(G), with the weighting between the terms specified in Table A2.

3.4. GAN-based upsampler

Furthermore, the GigaGAN framework can easily be extended to train a text-conditioned super-resolution model, capable of upsampling the outputs of the base GigaGAN generator to obtain high-resolution images at 512px or 2K resolution. By training our pipeline in two separate stages, we can afford a higher-capacity 64px base model within the same computational resources.

In the upsampler, the synthesis network is rearranged into an asymmetric U-Net architecture, which processes the 64px input through 3 downsampling residual blocks, followed by 6 upsampling residual blocks with attention layers, to produce the 512px image. Skip connections exist at matching resolutions, similar to CoModGAN [106]. The model is trained with the same losses as the base model, as well as the LPIPS perceptual loss [105] with respect to the ground-truth high-resolution image. The Vision-aided GAN loss is not used for the upsampler. During training and at inference time, we apply moderate Gaussian noise augmentation to reduce the gap between real and GAN-generated images. Please refer to Appendix A.3 for more details.
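One possible training step for the text-conditioned upsampler, combining the adversarial objective, the LPIPS term, and the Gaussian noise augmentation mentioned above, is sketched below. The function names (`upsampler`, `lpips_fn`, `gan_loss_fn`), the 8× factor, and the noise level are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def upsampler_step(upsampler, lpips_fn, gan_loss_fn, hr_real, caption_emb, sigma=0.05):
    """One generator update for the super-resolution stage: downsample the
    real image to make the low-res input, perturb it with mild Gaussian
    noise, upsample with the GAN, and combine an adversarial term with an
    LPIPS term against the ground-truth high-resolution image."""
    lr = F.interpolate(hr_real, scale_factor=1 / 8, mode="bicubic", antialias=True)
    lr = lr + sigma * torch.randn_like(lr)            # moderate noise augmentation
    hr_fake = upsampler(lr, caption_emb)              # asymmetric U-Net generator
    loss = gan_loss_fn(hr_fake, caption_emb) + lpips_fn(hr_fake, hr_real).mean()
    return loss, hr_fake
```

The same mild noise can be applied to low-resolution inputs at inference time, which, as noted above, helps close the gap between real inputs and GAN-generated ones.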
Our GigaGAN framework becomes particularly effective for the super-resolution task compared to diffusion-based models, which cannot afford as many sampling steps as the base model at high resolution. The LPIPS regression loss also provides a stable learning signal. We believe that our GAN upsampler can serve as a drop-in replacement for the super-resolution stage of other generative models.

4. Experiments

Systematic, controlled evaluation of large-scale text-to-image synthesis tasks is difficult, as most existing models are not publicly available. Training a new model from scratch would be prohibitively costly, even if the training code were available. Still, we compare our model to recent text-to-image models, such as Imagen [80], Latent Diffusion Models (LDM) [79], Stable Diffusion [78], and Parti [101], based on the available information, while acknowledging considerable differences in the training dataset, number of iterations, batch size, and model size. In addition to text-to-image results, we evaluate our model on ImageNet class-conditional generation in Appendix B, for an apples-to-apples comparison with other methods in a more controlled setting.

For quantitative evaluation, we mainly use the Fréchet Inception Distance (FID) [25] for measuring the realism of the output distribution and the CLIP score for evaluating image-text alignment.

We conduct five different experiments. First, we show the effectiveness of our method by gradually incorporating each technical component one by one (Section 4.2). Second, our text-to-image synthesis results demonstrate that GigaGAN exhibits comparable FID to Stable Diffusion (SD-v1.5) [79] while generating results hundreds of times faster than diffusion or autoregressive models (Section 4.3). Third, we compare GigaGAN with a distillation-based diffusion model [59] and show that GigaGAN can synthesize higher-quality images faster than the distilled model. Fourth, we verify the advantage of GigaGAN's upsampler over other upsamplers in both conditional and unconditional super-resolution tasks. Lastly, we show that our large-scale GANs still enjoy the continuous and disentangled latent-space manipulation of GANs, enabling new image editing modes (Section 4.6).

4.1. Training and evaluation details

We implement GigaGAN based on the StudioGAN PyTorch library [37], following the standard FID evaluation protocol with the anti-aliasing bicubic resize function [67], unless otherwise noted. For text-to-image synthesis, we train our models on the union of the LAION2B-en [88] and COYO-700M [8] datasets, with the exception of the 128-to-1024 upsampler model, which is trained on Adobe's internal Stock images. The image-text pairs are preprocessed based on CLIP score [24], image resolution, and aesthetic score [87], similar to prior work [78]. We use CLIP ViT-L/14 [71] as the pre-trained text encoder and OpenCLIP ViT-G/14 [32] for CLIP score calculation [24], except in Table 1.
[Figure 6 panels: a style-swapping grid of samples for the prompt "A Toy sport sedan, CG art.", with coarse styles varying along one axis and fine styles along the other.]
Figure 6. Style mixing. Our GAN-based architecture retains a disentangled latent space, enabling us to blend the coarse style of one
sample with the fine style of another. All outputs are generated with the prompt “A Toy sport sedan, CG art.” The corresponding latent
codes are spliced together to produce a style-swapping grid.
[Figure 7 panels: an interpolation grid between prompts ending in ".. in a sunny day" and ".. in sunset".]
Figure 7. Prompt interpolation. GigaGAN enables smooth interpolation between prompts, as shown in the interpolation grid. The four
corners are generated from the same latent z but with different text prompts. The corresponding text embeddings t and style vectors w are
interpolated to create a smooth transition. The same z results in similar layouts. See Figure 8 for more precise control.
[Figure 8 panels: rows generated from prompts of the form "a cube on tabletop", "a ball on tabletop", and "a teddy bear on tabletop"; columns show "no mixing" and the texture prompts "crochet", "fur", "denim", and "brick".]
Figure 8. Prompt mixing. GigaGAN retains a disentangled latent space, enabling us to combine the coarse style of one sample with
the fine style of another. Moreover, GigaGAN can directly control the style with text prompts. Here we generate four outputs using the
prompts “a X on tabletop”, shown in the “no mixing” column. Then we re-compute the text embeddings t and the style codes w using
the new prompts "a X with the texture of Y on tabletop", such as "a cube with the texture of crochet on tabletop", and apply them to the second half of the generator's layers, achieving layout-preserving fine style control. The cross-attention mechanism automatically localizes the style to the object of interest.
All our models are trained and evaluated on A100 GPUs. We include more training and evaluation details in Appendix A.

4.2. Effectiveness of proposed components

First, we show the effectiveness of our formulation via an ablation study in Table 1. We set up a baseline by adding text conditioning to StyleGAN2 and tuning the configuration based on the findings of StyleGAN-XL. We first directly increase the model size of this baseline, but we find that this does not improve the FID and CLIP scores. Then, we add our components one by one and observe that they consistently improve performance. In particular, our model is more scalable, as the higher-capacity version of the final formulation achieves better performance.

4.3. Text-to-image synthesis

We proceed to train a larger model by increasing the capacity of the base generator and upsampler to 652.5M and 359.1M parameters, respectively. This results in an unprecedented size of GAN model, with a total parameter count of 1.0B. Table 2 compares the performance of our end-to-end pipeline to various text-to-image generative models [5, 10, 63, 74, 75, 78-80, 101, 108]. Note that there exist differences in the training dataset, the pretrained text encoders, and even the image resolutions. For example, GigaGAN initially synthesizes 512px images, which are resized to 256px before evaluation.

Table 2 shows that GigaGAN exhibits a lower FID than DALL·E 2 [74], Stable Diffusion [78], and Parti-750M [101]. While our model can be optimized to better match the feature distribution of real images than existing
Table 1. Ablation study on 64px text-to-image synthesis. To evaluate the effectiveness of our components, we start with a modified version of StyleGAN for text conditioning. While increasing the network width does not show satisfactory improvement, each addition of our contributions keeps improving the metrics. Finally, we increase the network width and scale up training to reach our final model. All ablated models are trained for 100k iterations at a batch size of 256, except for the Scale-up row (1350k iterations with a larger batch size). CLIP score is computed using CLIP ViT-B/32 [71].

Model | FID-10k ↓ | CLIP score ↑ | # Param.
StyleGAN2 | 29.91 | 0.222 | 27.8M
+ Larger (5.7×) | 34.07 | 0.223 | 158.9M
+ Tuned | 28.11 | 0.228 | 26.2M
+ Attention | 23.87 | 0.235 | 59.0M
+ Matching-aware D | 27.29 | 0.250 | 59.0M
+ Matching-aware G and D | 21.66 | 0.254 | 59.0M
+ Adaptive convolution | 19.97 | 0.261 | 80.2M
+ Deeper | 19.18 | 0.263 | 161.9M
+ CLIP loss | 14.88 | 0.280 | 161.9M
+ Multi-scale training | 14.92 | 0.300 | 164.0M
+ Vision-aided GAN | 13.67 | 0.287 | 164.0M
+ Scale-up (GigaGAN) | 9.18 | 0.307 | 652.5M

Table 3. Comparison to distilled diffusion models shows that GigaGAN achieves better FID and CLIP scores than the progressively distilled diffusion models [59] designed for fast inference. As GigaGAN generates outputs in a single feedforward pass, its inference speed is still faster. The evaluation setup differs from Table 2 to match SD-distilled's protocol [59].

Model | Steps | FID-5k ↓ | CLIP ↑ | Inf. time
SD-distilled-2 [59] | 2 | 37.3 | 0.27 | 0.23s
SD-distilled-4 [59] | 4 | 26.0 | 0.30 | 0.33s
SD-distilled-8 [59] | 8 | 26.9 | 0.30 | 0.52s
SD-distilled-16 [59] | 16 | 28.8 | 0.30 | 0.88s
GigaGAN | 1 | 21.1 | 0.32 | 0.13s

Table 4. Text-conditioned 128→1024 super-resolution on 10K random LAION samples, compared against the unconditional Real-ESRGAN [33] and the Stable Diffusion Upscaler [78]. GigaGAN enjoys the fast speed of a GAN-based model while achieving better FID, patch-FID [9], CLIP score, and LPIPS [105].

Model | # Param. | Inf. time | FID-10k ↓ | pFID ↓ | CLIP ↑ | LPIPS ↓
Real-ESRGAN [33] | 17M | 0.06s | 8.60 | 22.8 | 0.314 | 0.363
SD Upscaler [78] | 846M | 7.75s | 9.39 | 41.3 | 0.316 | 0.523
GigaGAN | 693M | 0.13s | 1.54 | 8.90 | 0.322 | 0.274
scaler). We also use the unconditional Real-ESRGAN [33] as another baseline. Table 4 measures the performance of the upsampler on 10K random images from the LAION dataset and shows that our GigaGAN upsampler significantly outperforms the other upsamplers in realism scores (FID and patch-FID [9]), text alignment (CLIP score), and closeness to the ground truth (LPIPS [105]). In addition, for a more controlled comparison, we train our model on the ImageNet unconditional super-resolution task and compare its performance with diffusion-based models, including SR3 [81] and LDM [79]. As shown in Table 5, GigaGAN achieves the best IS and FID scores with a single feedforward pass.

Figure 9. Failure cases. Our outputs with the same prompts as DALL·E 2. Each column conditions on "a teddy bear on a skateboard in Times Square", "a Vibrant portrait painting of Salvador Dali with a robotic half face", and "A close up of a handpalm with leaves growing from it". Compared to production-grade models such as DALL·E 2, our model exhibits limitations in realism and compositionality. See Appendix C for uncurated comparisons.

5. Discussion and Limitations

Our experiments provide a conclusive answer about the scalability of GANs: our new architecture can scale up to model sizes that enable text-to-image synthesis. However, the visual quality of our results is not yet comparable to that of production-grade models like DALL·E 2. Figure 9 shows several instances where our method fails to produce high-quality results compared to DALL·E 2, in terms of photorealism and text-to-image alignment, for the same input prompts used in their paper.

Nevertheless, we have tested capacities well beyond what is possible with a naive approach and achieved competitive visual quality with autoregressive and diffusion models trained with similar resources, while being orders of magnitude faster and enabling latent interpolation and stylization. Our GigaGAN architecture opens up a whole new design space for large-scale generative models and brings back key editing capabilities that became challenging with the transition to autoregressive and diffusion models. We expect our performance to improve with larger models, as seen in Table 1.

Acknowledgments. We thank Simon Niklaus, Alexandru Chiculita, and Markus Woodson for building the distributed training pipeline. We thank Nupur Kumari, Gaurav Parmar, Bill Peebles, Phillip Isola, Alyosha Efros, and Joonghyuk Shin for their helpful comments. We also want to thank Chenlin Meng, Chitwan Saharia, and Jiahui Yu for answering many questions about their fantastic work. We thank Kevin Duarte for discussions regarding upsampling beyond 4K. Part of this work was done while Minguk Kang was an intern at Adobe Research. Minguk Kang and Jaesik Park were supported by the IITP grant funded by the government of South Korea (MSIT) (POSTECH GSAI: 2019-0-01906 and Image restoration: 2021-0-00537).
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In IEEE International Conference on Computer Vision (ICCV), 2019.
[2] Saeed Anwar and Nick Barnes. Densely residual laplacian super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 2017.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
[5] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations (ICLR), 2019.
[7] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural Photo Editing with Introspective Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.
[8] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-Text Pair Dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[9] Lucy Chai, Michael Gharbi, Eli Shechtman, Phillip Isola, and Richard Zhang. Any-resolution training for high-resolution image synthesis. In European Conference on Computer Vision (ECCV), 2022.
[10] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
[11] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[12] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning (ICML), 2020.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[14] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Conference on Neural Information Processing Systems (NeurIPS), 2015.
[15] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[16] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[17] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
[18] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[19] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[20] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In European Conference on Computer Vision (ECCV), 2022.
[21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Conference on Neural Information Processing Systems (NeurIPS), 2014.
[22] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
[23] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In International Conference on Learning Representations (ICLR), 2017.
[24] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
[26] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[27] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. Journal of Machine Learning Research, 2022.
[28] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In Conference on Neural Information Processing Systems (NeurIPS) Workshop, 2022.
[29] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV), 2018.
[30] Drew A Hudson and Larry Zitnick. Generative adversarial transformers. In International Conference on Machine Learning (ICML), 2021.
[31] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering Interpretable GAN Controls. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[32] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. https://doi.org/10.5281/zenodo.5143773, 2021.
[33] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In IEEE International Conference on Computer Vision (ICCV) Workshop, 2021.
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[35] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
[36] Minguk Kang, Woohyeon Shim, Minsu Cho, and Jaesik Park. Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[37] Minguk Kang, Joonghyuk Shin, and Jaesik Park. StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis. arXiv preprint arXiv:2206.09479, 2022.
[38] Animesh Karnewar and Oliver Wang. MSG-GAN: Multi-scale gradients for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[40] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[41] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[43] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention. In International Conference on Machine Learning (ICML), 2021.
[44] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for GAN training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[45] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The Role of ImageNet Classes in Fréchet Inception Distance. arXiv preprint arXiv:2203.06026, 2022.
[46] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved Precision and Recall Metric for Assessing Generative Models. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
[47] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[48] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation using Residual Quantization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[49] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision (ECCV), 2018.
[50] Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. ViTGAN: Training GANs with vision transformers. In International Conference on Learning Representations (ICLR), 2022.
[51] Muyang Li, Ji Lin, Chenlin Meng, Stefano Ermon, Song Han, and Jun-Yan Zhu. Efficient spatially sparse inference for conditional GANs and diffusion models. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[52] Jiadong Liang, Wenjie Pei, and Feng Lu. CPGAN: Content-parsing generative adversarial networks for text-to-image synthesis. In European Conference on Computer Vision (ECCV), 2020.
[53] Ji Lin, Richard Zhang, Frieder Ganz, Song Han, and Jun-Yan Zhu. Anycost GANs for interactive image synthesis and editing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[54] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[55] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo Numerical Methods for Diffusion Models on Manifolds. In International Conference on Learning Representations (ICLR), 2022.
[56] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR), 2019.
[57] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[58] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating Images from Captions with Attention. In International Conference on Learning Representations (ICLR), 2016.
[59] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Conference on Neural Information Processing Systems (NeurIPS) Workshop, 2022.
[60] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[61] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Which Training Methods for GANs do actually Converge? In International Conference on Machine Learning (ICML), 2018.
[62] Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. In International Conference on Learning Representations (ICLR), 2018.
[63] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning (ICML), 2022.
[64] OpenAI. DALL·E API. https://openai.com/product/dall-e-2, 2022.
[65] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive Learning for Unpaired Image-to-Image Translation. In European Conference on Computer Vision (ECCV), 2020.
[66] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[67] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[68] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
[69] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In IEEE International Conference on Computer Vision (ICCV), 2021.
[70] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748, 2022.
[71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.
[72] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[73] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020.
[74] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[75] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning (ICML), 2021.
[76] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), 2016.
[77] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
[78] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Stable Diffusion. https://github.com/CompVis/stable-diffusion. Accessed: 2022-11-06.
[79] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[80] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487, 2022.
[81] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
[82] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
[83] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations (ICLR), 2022.
[84] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs Converge Faster. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[85] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv preprint arXiv:2301.09515, 2023.
[86] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, 2022.
[87] Christoph Schuhmann. CLIP+MLP Aesthetic Score Predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor.
[88] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[89] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR), 2021.
[90] Diana Sungatullina, Egor Zakharov, Dmitry Ulyanov, and Victor Lempitsky. Image manipulation with perceptual discriminators. In European Conference on Computer Vision (ECCV), 2018.
[91] Md Mehrab Tanjim. DynamicRec: A dynamic convolutional network for next item recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), 2020.
[92] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis. arXiv preprint arXiv:2301.12959, 2023.
[93] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[94] Ken Turkowski. Filters for common resampling tasks. Graphics Gems, 1990.
[95] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision (ECCV) Workshop, 2018.
[96] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
[97] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay Less Attention with Lightweight and Dynamic Convolutions. In International Conference on Learning Representations (ICLR), 2018.
[98] Jonas Wulff and Antonio Torralba. Improving inversion and generation diversity in StyleGAN using a gaussianized latent space. arXiv preprint arXiv:2009.06529, 2020.
[99] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[100] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[101] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[102] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-Attention Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 2019.
[103] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[104] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[105] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[106] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large Scale Image Completion via Co-Modulated Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2021.
[107] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. arXiv preprint arXiv:2006.10738, 2020.
[108] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards language-free training for text-to-image generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
17
Tong Sun. Lafite: Towards language-free training for text- Conference on Computer Vision (ICCV), pages 2223–2232,
to-image generation. In IEEE Conference on Computer Vi- 2017. 5
sion and Pattern Recognition (CVPR), 2022. 11, 12 [111] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-
[109] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and gan: Dynamic memory generative adversarial networks for
Alexei A Efros. Generative visual manipulation on the nat- text-to-image synthesis. In IEEE Conference on Computer
ural image manifold. In European Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5
Vision (ECCV), 2016. 5 [112] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and
[110] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Peter Wonka. Improved stylegan embedding: Where are the
Efros. Unpaired image-to-image translation using cycle- good latents? arXiv preprint arXiv:2012.09036, 2020. 5
consistent adversarial networks. In IEEE International
18
Appendices

We first provide training and evaluation details in Appendix A. Then, we share results on ImageNet, with visual comparisons to existing methods, in Appendix B. Lastly, in Appendix C, we show more visuals of our text-to-image synthesis results and compare them with LDM [79], Stable Diffusion [78], and DALL·E 2 [74].

A. Training and evaluation details

A.1. Text-to-image synthesis

We train GigaGAN on a combined dataset of LAION2B-en [88] and COYO-700M [8] in the PyTorch framework [68]. For training, we apply center cropping, which results in a square image whose side length equals the shorter side of the original image. Then, we resize the image to 64 × 64 resolution using the PIL.LANCZOS [94] resizer, which supports anti-aliasing [67]. We filter the training image–text pairs by image resolution (≥ 512), CLIP score (> 0.3) [24], and aesthetics score (> 5.0) [87], and we remove watermarked images. We train our GigaGAN based on the configurations denoted in the fourth and fifth columns of Table A2.
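To make the preprocessing concrete, the following is a minimal sketch (not our training pipeline) of the center crop, the anti-aliased LANCZOS downsampling, and the metadata-based filtering; the metadata field names (clip_score, aesthetic_score, is_watermarked) are hypothetical placeholders for whatever columns a LAION-style metadata table provides.

```python
from PIL import Image

def center_crop_and_resize(path: str, size: int = 64) -> Image.Image:
    """Center-crop to a square whose side equals the shorter image side,
    then downsample with the anti-aliased LANCZOS filter."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    img = img.crop((left, top, left + s, top + s))
    return img.resize((size, size), resample=Image.LANCZOS)

def keep_sample(meta: dict) -> bool:
    """Filtering rule sketched from the description above; the metadata
    field names are illustrative placeholders."""
    return (
        min(meta["width"], meta["height"]) >= 512
        and meta["clip_score"] > 0.3
        and meta["aesthetic_score"] > 5.0
        and not meta["is_watermarked"]
    )
```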
For evaluation, we use 40,504 real and 30,000 generated images from the COCO2014 [54] validation set, as described in Imagen [80]. We apply the same center cropping and resize the real and generated images to 299 × 299 resolution using PIL.BICUBIC, as suggested by clean-fid [67]. We use the clean-fid library [67] for the FID calculation.
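For reference, the FID computation with the clean-fid library boils down to a call like the one below; the folder names are placeholders, and we assume the real and generated images have already been pre-processed as described above (clean-fid can also handle the resizing internally).

```python
from cleanfid import fid

# Placeholder folders of pre-processed real (COCO2014 val) and generated images.
real_dir = "coco2014_val_real"
fake_dir = "generated_samples"

# clean-fid extracts Inception features with a consistent, anti-aliased resizing
# pipeline and returns the Frechet Inception Distance between the two folders.
score = fid.compute_fid(real_dir, fake_dir)
print(f"zero-shot FID: {score:.2f}")
```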
A.2. ImageNet

We follow the training and evaluation protocol proposed by Kang et al. [37] to make a fair comparison against other cutting-edge generative models. We use the same cropping strategy to process images for training and evaluation as in our text-to-image experiments. Then, we resize the images to the target resolution (64 × 64 for the base generator or 256 × 256 for the super-resolution stack) using the PIL.LANCZOS [94] resizer, which supports anti-aliasing [67]. Using the pre-processed training images, we train GigaGAN based on the configurations denoted in the second and third columns of Table A2.

For evaluation, we upsample the real and generated images to 299 × 299 resolution using the PIL.BILINEAR resizer. To compute FID, we generate 50k images without truncation tricks [6, 41] and compare them with the entire training dataset. We use the pre-calculated features of real images provided by StudioGAN [37] and 50k generated images for the Precision & Recall [46] calculation.

A.3. Super-resolution results

For model training, we preprocess ImageNet in the same way as in Section A.2 and use the configuration in the last column of Table A2. To compare our model with SR3 [81] and LDM fairly, we follow the evaluation procedure described in the SR3 and LDM papers.

B. ImageNet experiments

B.1. Quantitative results

We train a class-conditional GAN on the ImageNet dataset [13], for which an apples-to-apples comparison is possible using the same dataset and evaluation pipeline. Our GAN achieves generation quality comparable to cutting-edge generative models without relying on a pretrained ImageNet classifier, which can act favorably toward automated metrics [45]. We apply L2 self-attention, the style-adaptive convolution kernel, and the matching-aware loss to our model, and use a wider synthesis network to train the base 64px model with a batch size of 1024. Additionally, we train a separate 256px class-conditional upsampler model and combine the two with an end-to-end finetuning stage. Table A1 shows that our method generates high-fidelity images.

Table A1. Class-conditional synthesis on ImageNet 256px. Our method performs competitively against large diffusion and transformer models. Shaded methods leverage a pretrained ImageNet classifier at training or inference time, which could act favorably toward the automated metrics [45]. † indicates IS [82] and FID [25] are borrowed from the original DiT paper [70].

Model | Type | IS [82] | FID [25] | Precision/Recall [46] | Size
CDM [27] | Diffusion | 158.71 | 4.88 | -/- | -
LDM-8-G [79] | Diffusion | 209.52 | 7.76 | -/- | 506M
LDM-4-G [79] | Diffusion | 247.67 | 3.60 | -/- | 400M
DiT-XL/2† [70] | Diffusion | 278.24 | 2.27 | -/- | 675M
Mask-GIT [11] | Transformer | 216.38 | 5.40 | 0.87/0.60 | 227M
VQ-GAN [19] | Transformer | 314.61 | 5.20 | 0.81/0.57 | 1.4B
RQ-Transformer [48] | Transformer | 339.41 | 3.83 | 0.85/0.60 | 3.8B
GigaGAN | GAN | 225.52 | 3.45 | 0.84/0.61 | 569M

B.2. Qualitative results

We provide visual results from ADM-G-U, LDM, StyleGAN-XL [86], and GigaGAN in Figures A1 and A2. Although StyleGAN-XL has the lowest FID, its visual quality appears worse than that of ADM and GigaGAN. StyleGAN-XL struggles to synthesize the overall image structure, leading to less realistic images. In contrast, GigaGAN appears to synthesize the overall structure better than StyleGAN-XL and faithfully captures fine-grained details, such as the wing patterns of a monarch butterfly and the white fur of an arctic fox. Compared to GigaGAN, ADM-G-U synthesizes the image structure more plausibly but falls short in reflecting the aforementioned fine-grained details.
Table A2. Hyperparameters for GigaGAN training. We denote the Projection Discriminator [62] as PD, R1 regularization [61] as R1, Learned Perceptual Image Patch Similarity [105] as LPIPS, Adam with decoupled weight decay [56] as AdamW, and the pretrained ViT-B/32 visual encoder [71] as CLIP-ViT-B/32-V.
Figure A1. Uncurated images (above: Tench and below: Monarch) from ADM-G-U [15], LDM-4-G [79], GigaGAN (ours), and StyleGAN-XL [86]. The FID values of the respective models are 4.01, 3.60, 3.45, and 2.32.
Figure A2. Uncurated images (above: Lorikeet and below: Arctic fox) from ADM-G-U [15], LDM-4-G [79], GigaGAN (ours), and StyleGAN-XL [86]. The FID values of the respective models are 4.01, 3.60, 3.45, and 2.32.
C. Text-to-image synthesis results

C.1. Truncation trick at inference

Similar to the classifier guidance [15] and classifier-free guidance [28] used in diffusion models such as LDM, our GAN model can leverage the truncation trick [6, 41] at inference time:

$\mathbf{w}_{\mathrm{trunc}} = \mathrm{lerp}(\mathbf{w}_{\mathrm{mean}}, \mathbf{w}, \psi)$,   (9)

$\mathbf{w}_{\mathrm{trunc}} = \mathrm{lerp}(\mathbf{w}_{\mathrm{mean},c}, \mathrm{lerp}(\mathbf{w}_{\mathrm{mean}}, \mathbf{w}, \psi), \psi)$,   (10)

where lerp denotes linear interpolation, $\mathbf{w}_{\mathrm{mean}}$ is the mean style vector, and $\mathbf{w}_{\mathrm{mean},c}$ is the mean style vector conditioned on the text prompt $c$.

Quantitatively, the effect of truncation is similar to the guidance technique of diffusion models. As shown in Figure A3, the CLIP score increases with more truncation, while the FID also increases due to reduced diversity.
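A minimal PyTorch sketch of Eqs. (9) and (10) is given below; w_mean and w_mean_c stand for precomputed global and prompt-conditional mean style vectors, and the function name is illustrative rather than part of a released API.

```python
import torch

def truncate_style(w: torch.Tensor, w_mean: torch.Tensor,
                   w_mean_c: torch.Tensor, psi: float) -> torch.Tensor:
    """Truncation trick of Eqs. (9)-(10): psi = 1.0 returns w unchanged
    (no truncation); smaller psi pulls w toward the mean style vectors."""
    w_global = torch.lerp(w_mean, w, psi)       # Eq. (9): toward the global mean
    return torch.lerp(w_mean_c, w_global, psi)  # Eq. (10): toward the prompt mean
```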
C.2. Comparison to diffusion models

Finally, we show randomly sampled results of our model and compare them with publicly available diffusion models: LDM [79], Stable Diffusion [78], and DALL·E 2 [74].
Figure A4. The visual effect of our truncation trick. We demonstrate the effect of truncation by decreasing the truncation value ψ from 1.0 (no truncation) to 0.9, 0.7, 0.5, 0.3, and 0.1 (strong truncation). We show six example outputs for each of the text prompts “digital painting of a confident and severe looking northern war goddess, extremely long blond braided hair, beautiful blue eyes and red lips.” and “Magritte painting of a clock on a beach.”. At ψ = 1.0 (no truncation), diversity is high, but text–image alignment is not satisfactory. As the truncation gets stronger, text–image alignment improves at the cost of diversity. We find that a truncation value between 0.7 and 0.8 produces the best results.
Figure A5. Style mixing. GigaGAN maintains a disentangled latent space, allowing us to blend the coarse style of one sample with the fine style of another. The corresponding latent codes are spliced together to produce a style-swapping grid, with coarse styles varying along one axis and fine styles along the other, for the prompts “A modern style house, DSLR.” and “A male Headshot Picture.”. The outputs are generated from the same prompt but with different latent codes.
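The style-swapping grid can be sketched roughly as follows, assuming a StyleGAN-like interface in which a mapping network produces one style vector per synthesis layer; generator, mapping, the latent dimension, and the layer cut-off are hypothetical stand-ins used only to illustrate the splicing.

```python
import torch

def style_mixing_grid(generator, mapping, text_emb, n: int = 4, cut: int = 6):
    """Splice per-layer style vectors: rows supply the coarse (early-layer)
    styles, columns supply the fine (late-layer) styles. Assumed shapes:
    mapping(z, text_emb) -> [1, num_layers, 512]."""
    zs = torch.randn(n, 512)                       # latent dim assumed to be 512
    ws = [mapping(z.unsqueeze(0), text_emb) for z in zs]

    grid = []
    for w_coarse in ws:                            # row: source of coarse styles
        row = []
        for w_fine in ws:                          # column: source of fine styles
            w_mix = torch.cat([w_coarse[:, :cut], w_fine[:, cut:]], dim=1)
            row.append(generator(w_mix, text_emb))
        grid.append(row)
    return grid                                    # grid[i][j]: coarse i, fine j
```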
Figure A6. Prompt interpolation. GigaGAN enables smooth interpolation between prompts, as shown in the interpolation grid. The four corners are generated from the same latent code but with different text prompts, for example “A modern mansion ..” vs. “A victorian mansion ..” crossed with “.. in a sunny day” vs. “.. in sunset”, and “Roses.” vs. “Sunflowers.” crossed with “oil painting” vs. “photograph”. The corresponding text embeddings and style vectors are interpolated to create a smooth transition, and using the same latent code results in similar layouts across prompts.
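Under the same assumed interface, the interpolation grid can be sketched as below: the four corner prompts are encoded once, and both the text embeddings and the corresponding style vectors are bilinearly interpolated while the latent code is held fixed (all function names are illustrative).

```python
import torch

def bilerp(c00, c01, c10, c11, u: float, v: float) -> torch.Tensor:
    """Bilinear interpolation between four corner tensors."""
    top = torch.lerp(c00, c01, u)
    bottom = torch.lerp(c10, c11, u)
    return torch.lerp(top, bottom, v)

def prompt_interpolation_grid(generator, mapping, encode_text, prompts, steps=5):
    """prompts = (top_left, top_right, bottom_left, bottom_right) strings."""
    z = torch.randn(1, 512)                    # shared latent -> similar layouts
    embs = [encode_text(p) for p in prompts]   # four corner text embeddings
    ws = [mapping(z, e) for e in embs]         # four corner style vectors

    grid = []
    for i in range(steps):
        v = i / (steps - 1)
        row = []
        for j in range(steps):
            u = j / (steps - 1)
            t = bilerp(*embs, u, v)            # interpolated text embedding
            w = bilerp(*ws, u, v)              # interpolated style vectors
            row.append(generator(w, t))
        grid.append(row)
    return grid
```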
Figure A7. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A loft bed with a dresser underneath it”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models. Still, we observe that our model falls behind in structural coherence, such as the number of legs on the bed frames. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A8. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A green vase filed with red roses sitting on top of table”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in structural coherence, such as the symmetry of the vases. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A9. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A zebra in the grass who is cleaning himself”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in details, such as the precise stripe pattern and the positioning of the eyes. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A10. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A teddy bear on a skateboard in times square”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in details, such as the exact shape of the skateboards. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A11. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “Vibrant portrait painting of Salvador Dalí with a robotic half face”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in structural details, such as the detailed shape of the eyes. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A12. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “Three men in military suits are sitting on a bench”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in details such as facial expressions and attire. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Panels, left to right: input artwork from AdobeStock (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); GigaGAN Upsampler (4K).
Figure A13. Our GAN-based upsampler can serve as the upsampler for many text-to-image models that generate initial outputs at low
resolutions like 64px or 128px. We simulate such usage by applying our 8× superresolution model on a low-res 128px artwork to obtain
the 1K output, using “Portrait of a kitten dressed in a bow tie. Red Rose. Valentine’s day.”. Then our model can be re-applied to go beyond
4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33]. Zooming
in is recommended for comparison between 1K and 4K outputs.
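The re-application described in this caption can be expressed as a simple loop; upsample_8x is a placeholder for a text-conditioned 8× super-resolution model, not an actual API of our release.

```python
from PIL import Image

def upscale_to(img: Image.Image, prompt: str, upsample_8x, target: int = 4096):
    """Repeatedly apply a text-conditioned 8x upsampler until the shorter side
    reaches the target, e.g. 128px -> 1024px -> beyond 4K."""
    while min(img.size) < target:
        img = upsample_8x(img, prompt)  # hypothetical model call
    return img
```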
Panels, left to right: input artwork from AdobeStock (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); GigaGAN Upsampler (4K).
Figure A14. Our GAN-based upsampler can serve as the upsampler for many text-to-image models that generate initial outputs at low
resolutions like 64px or 128px. We simulate such usage by applying our 8× superresolution model on a low-res 128px artwork to obtain
the 1K output, using “Heart shaped pancakes with honey and strawberry for Valentine’s Day”. Then our model can be re-applied to go
beyond 4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33].
Zooming in is recommended for comparison between 1K and 4K outputs.
Panels, left to right: input photo (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); GigaGAN Upsampler (4K).
Figure A15. Our GAN-based upsampler can also be used as an off-the-shelf superresolution model for real images with a large scaling
factor by providing an appropriate description of the image. We apply our text-conditioned 8× superresolution model on a low-res 128px
photo to obtain the 1K output, using “An elephant spraying water with its trunk”. Then our model can be re-applied to go beyond 4K.
We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33]. Zooming in is
recommended for comparison between 1K and 4K outputs.