CVPR 2023 - Scaling Up GANs for Text-to-Image Synthesis
[Figure 1 panels, with their prompts: "A golden luxury motorcycle parked at the King's palace. 35mm f/4.5.", "A portrait of a human growing colorful flowers from her hair. Hyperrealistic oil painting. Intricate details.", "a cute magical flying maltipoo at light speed, fantasy concept art, bokeh, wide sky", "A living room with a fireplace at a wood cabin. Interior design.", "a blue Porsche 356 parked in front of a yellow brick wall.", "Eiffel Tower, landscape photography", "A painting of a majestic royal tall ship in Age of Discovery.", "Isometric underwater Atlantis city with a Greek temple in a bubble.", "A hot air balloon in shape of a heart. Grand Canyon", "low poly bunny with cute eyes", "A cube made of denim on a wooden table".]
Figure 1. Our model, GigaGAN, shows that GAN frameworks can also be scaled up for general text-to-image synthesis tasks, generating a 512px output at an interactive speed of 0.13s and a 4096px output in 3.7s. Selected examples at 2K or 4K resolution are shown; please zoom in for more details. See Appendix C and our website for more uncurated comparisons.
[Figure 2 panels: input artwork from AdobeStock (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); with zoomed crops of the input, the three 1K results, and the GigaGAN 4K result.]
Figure 2. Our GAN-based upsampler can serve in the upsampling pipeline of many text-to-image models that often generate initial
outputs at low resolutions like 64px or 128px. We simulate such usage by applying our text-conditioned 8× superresolution model on a
low-res 128px artwork to obtain the 1K output, using “Portrait of a colored iguana dressed in a hoodie”. Then our model can be re-applied to
go beyond 4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33].
Zooming in is recommended for comparison between 1K and 4K.
[Figure 3 panels: input photo (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); with zoomed crops of the input, the three 1K results, and the GigaGAN 4K result.]
Figure 3. Our GAN-based upsampler, similar to Figure 2, can also be used as an off-the-shelf superresolution model for real images
with a large scaling factor by providing an appropriate description of the image. We apply our text-conditioned 8× superresolution model
on a low-res 128px photo to obtain the 1K output, using “A dog sitting in front of a mini tipi tent”. Then our model can be re-applied to
go beyond 4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33].
Zooming in is recommended for comparison between 1K and 4K.
GAN2 [42] and 6× larger than StyleGAN-XL [86] and XMC-GAN [103]. While our 1B parameter count is still lower than the largest recent synthesis models, such as Imagen (3.0B), DALL·E 2 (5.5B), and Parti (20B), we have not yet observed a quality saturation regarding the model size. GigaGAN achieves a zero-shot FID of 9.09 on the COCO2014 dataset, lower than the FID of DALL·E 2, Parti-750M, and Stable Diffusion.

Furthermore, GigaGAN has three major practical advantages compared to diffusion and autoregressive models. First, it is orders of magnitude faster, generating a 512px image in 0.13 seconds (Figure 1). Second, it can synthesize ultra high-resolution images at 4K resolution in 3.66 seconds. Third, it is endowed with a controllable latent vector space that lends itself to well-studied controllable image synthesis applications, such as style mixing (Figure 6), prompt interpolation (Figure 7), and prompt mixing (Figure 8).

In summary, our model is the first GAN-based method that successfully trains a billion-scale model on billions of real-world complex Internet images. This suggests that GANs are still a viable option for text-to-image synthesis and should be considered for future aggressive scaling. Please visit our website for additional results.

2. Related Works

Text-to-image synthesis. Generating a realistic image given a text description, as first explored by Mansimov et al. [58], is a challenging task. Earlier works adopted text-conditional GANs [76, 77, 93, 99, 104, 111] on specific domains [96] and datasets with a closed-world assumption [54]. With the development of diffusion models [15, 26], autoregressive (AR) transformers [12], and large-scale language encoders [71, 73], text-to-image synthesis has shown remarkable improvement on an open world of arbitrary text descriptions. GLIDE [63], DALL·E 2 [74], and Imagen [80] are representative diffusion models that show photorealistic outputs with the aid of a pretrained language encoder [71, 73]. AR models such as DALL·E [75], Make-A-Scene [20], CogView [16, 17], and Parti [101] also achieve impressive results. While these models exhibit unprecedented image synthesis ability, they require time-consuming iterative processes to achieve high-quality image sampling.

To accelerate sampling, several methods propose to reduce the number of sampling steps [57, 59, 83, 89] or to reuse pre-computed features [51]. The Latent Diffusion Model (LDM) [79] performs the reverse process in a low-dimensional latent space instead of pixel space. However, consecutive reverse processes are still computationally expensive, limiting the usage of large-scale text-to-image models in interactive applications.

GAN-based image synthesis. GANs [21] have been one of the primary families of generative models for natural image synthesis. As the sampling quality and diversity of GANs improved [39-42, 44, 72, 84], GANs have been deployed in various computer vision and graphics applications, such as text-to-image synthesis [76], image-to-image translation [29, 34, 49, 65, 66, 110], and image editing [1, 7, 69, 109]. Notably, StyleGAN-family models [40, 42] have shown impressive ability in image synthesis tasks for single-category domains [1, 31, 69, 98, 112]. Other works have explored class-conditional GANs [6, 36, 86, 102, 107] on datasets with a fixed set of object categories.

In this paper, we change the data regime from single- or multi-category datasets to extremely data-rich situations. We make the first expedition toward training a large-scale GAN for text-to-image generation on a vast amount of web-crawled text and image pairs, such as LAION2B-en [88] and COYO-700M [8]. Existing GAN-based text-to-image synthesis models [52, 76, 93, 99, 103, 104, 111] are trained on relatively small datasets, such as CUB-200 (12k training pairs), MS-COCO (82k), and LN-OpenImages (507k). Also, those models are evaluated only on the associated validation sets and have not been shown to perform large-scale text-to-image synthesis like diffusion or AR models. Concurrent with our method, StyleGAN-T [85] and GALIP [92] share similar goals to ours. However, GigaGAN and the aforementioned techniques were developed independently with distinct technical contributions. We hope these methods can complement each other and collectively address the limitations of GANs.

Super-resolution for large-scale text-to-image models. Large-scale models require prohibitive computational costs for both training and inference. To reduce memory and running time, cutting-edge text-to-image models [63, 74, 80, 101] have adopted cascaded generation processes where images are first generated at 64×64 resolution and then upsampled to 256×256 and 1024×1024 sequentially. However, the super-resolution networks are primarily based on diffusion models, which require many iterations. In contrast, our low-resolution image generators and upsamplers are based on GANs, reducing the computational costs of both stages. Unlike traditional super-resolution techniques [2, 18, 47, 95], which aim to faithfully reproduce low-resolution inputs or handle image degradation such as compression artifacts, our upsamplers for large-scale models serve a different purpose: they need to perform larger upsampling factors while potentially leveraging the input text prompt.

3. Method

We train a generator G(z, c) to predict an image x ∈ R^{H×W×3} given a latent code z ∼ N(0, 1) ∈ R^{128} and a text-conditioning signal c. We use a discriminator D(x, c)
[Figure 4 diagram: left, our high-capacity text-to-image generator (a pretrained CLIP text encoder and a learned text encoder T process the prompt, e.g. "an oil painting of a corgi"; a mapping network M takes the latent code z ∼ N(0, 1) and the global text code to produce the style w, which modulates a synthesis network of convolution, self-attention, and cross-attention blocks starting from a learned constant); right, sample-adaptive kernel selection, where softmax weights over a modulated filter bank produce the selected filter.]
Figure 4. Our GigaGAN high-capacity text-to-image generator. First, we extract text embeddings using a pretrained CLIP model and a
learned encoder T . The local text descriptors are fed to the generator using cross-attention. The global text descriptor, along with a latent
code z, is fed to a style mapping network M to produce style code w. The style code modulates the main generator using our style-adaptive
kernel selection, shown on the right. The generator outputs an image pyramid by converting the intermediate features into RGB images. To
achieve higher capacity, we use multiple attention and convolution layers at each scale (Appendix A2). We also use a separate upsampler
model, which is not shown in this diagram.
to judge the realism of the generated image, as compared to a sample from the training database D, which contains image-text pairs.

Although GANs [6, 39, 41] can successfully generate realistic images on single- and multi-category datasets [13, 41, 100], open-ended text-conditioned synthesis on Internet images remains challenging. We hypothesize that the current limitation stems from their reliance on convolutional layers. That is, the same convolution filters are challenged to model the general image synthesis function for all text conditioning across all locations of the image. In this light, we seek to inject more expressivity into our parameterization by dynamically selecting convolution filters based on the input conditioning and by capturing long-range dependence via the attention mechanism.

Below, we discuss our key contributions to making ConvNets more expressive (Section 3.1), followed by our designs for the generator (Section 3.2) and discriminator (Section 3.3). Lastly, we introduce a new, fast GAN-based upsampler model that can improve the inference quality and speed of our method and of diffusion models such as Imagen [80] and DALL·E 2 [74].

3.1. Modeling complex contextual interaction

Baseline StyleGAN generator. We base our architecture on the conditional version of StyleGAN2 [42], comprised of two networks G = G̃ ∘ M. The mapping network w = M(z, c) maps the inputs into a "style" vector w, which modulates a series of upsampling convolutional layers in the synthesis network G̃(w) to map a learned constant tensor to an output image x. Convolution is the main engine that generates all output pixels, with the w vector as the only source of information for modeling the conditioning.

Sample-adaptive kernel selection. To handle the highly diverse distribution of Internet images, we aim to increase the capacity of the convolution kernels. However, increasing the width of the convolution layers becomes too demanding, as the same operation is repeated across all locations.

We propose an efficient way to enhance the expressivity of convolutional kernels by creating them on-the-fly based on the text conditioning, as illustrated in Figure 4 (right). In this scheme, we instantiate a bank of N filters {K_i ∈ R^{C_in×C_out×K×K}}_{i=1}^{N}, instead of one, that takes a feature f ∈ R^{C_in} at each layer. The style vector w ∈ R^d then goes through an affine layer [W_filter, b_filter] ∈ R^{(d+1)×N} to predict a set of weights used to average across the filters, producing an aggregated filter K ∈ R^{C_in×C_out×K×K}:

    K = \sum_{i=1}^{N} K_i \cdot \mathrm{softmax}(W_{\mathrm{filter}}^{\top} w + b_{\mathrm{filter}})_i.    (1)

The filter is then used in the regular convolution pipeline of StyleGAN2, with a second affine layer [W_mod, b_mod] ∈ R^{(d+1)×C_in} for weight (de-)modulation [42]:

    g_{\mathrm{adaconv}}(f, w) = ((W_{\mathrm{mod}}^{\top} w + b_{\mathrm{mod}}) \otimes K) * f,    (2)

where ⊗ and ∗ represent (de-)modulation and convolution.

At a high level, the softmax-based weighting can be viewed as a differentiable filter selection process based on the input conditioning. Furthermore, since the filter selection is performed only once at each layer, the selection process is much faster than the actual convolution, decoupling the compute complexity from the resolution.
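The sample-adaptive convolution can be summarized in a short sketch. The following PyTorch snippet is a minimal illustration of Equations (1)-(2), not the released GigaGAN code: the class name, filter-bank size, initialization, and the exact demodulation details are assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelConv(nn.Module):
    """Softly selects a per-sample kernel from a filter bank (Eq. 1) and
    applies it as a StyleGAN2-style modulated convolution (Eq. 2)."""

    def __init__(self, c_in, c_out, ksize, n_filters, w_dim):
        super().__init__()
        # Filter bank {K_i}: one candidate kernel per bank entry.
        self.bank = nn.Parameter(0.02 * torch.randn(n_filters, c_out, c_in, ksize, ksize))
        self.select = nn.Linear(w_dim, n_filters)  # affine [W_filter, b_filter]
        self.modulate = nn.Linear(w_dim, c_in)     # affine [W_mod, b_mod]
        self.pad = ksize // 2

    def forward(self, f, w):
        # f: (B, c_in, H, W) features, w: (B, w_dim) style vectors.
        B, c_in, H, W = f.shape

        # Eq. (1): softmax weights over the bank, then a per-sample weighted
        # average of the candidate kernels. The selection happens once per
        # layer, so its cost does not depend on the spatial resolution.
        probs = F.softmax(self.select(w), dim=-1)               # (B, N)
        K = torch.einsum("bn,noikl->boikl", probs, self.bank)   # (B, c_out, c_in, k, k)

        # Eq. (2): modulate input channels by an affine function of w,
        # then demodulate as in StyleGAN2.
        s = self.modulate(w) + 1.0                               # (B, c_in)
        weight = K * s[:, None, :, None, None]
        demod = torch.rsqrt(weight.pow(2).sum(dim=(2, 3, 4)) + 1e-8)
        weight = weight * demod[:, :, None, None, None]

        # Apply a different kernel to each sample via a grouped convolution.
        weight = weight.reshape(-1, c_in, *weight.shape[3:])     # (B*c_out, c_in, k, k)
        out = F.conv2d(f.reshape(1, -1, H, W), weight, padding=self.pad, groups=B)
        return out.reshape(B, -1, H, W)
```

A layer built as `AdaptiveKernelConv(64, 64, 3, n_filters=8, w_dim=512)` maps a `(B, 64, H, W)` feature map and a `(B, 512)` style vector to a new `(B, 64, H, W)` feature map; because the kernel is assembled once per layer, the selection cost does not grow with image resolution.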
Our method shares a spirit with dynamic convolutions [23, 35, 91, 97] in that the convolution filters dynamically change per sample, but differs in that we explicitly instantiate a larger filter bank and select weights through a separate pathway conditioned on the w-space of StyleGAN.

Interleaving attention with convolution. Since a convolutional filter operates within its receptive field, it cannot contextualize itself in relation to distant parts of the image. One way to incorporate such long-range relationships is to use attention layers g_attention. While recent diffusion-based models [15, 27, 79] have commonly adopted attention mechanisms, StyleGAN architectures are predominantly convolutional, with notable exceptions such as BigGAN [6], GANformer [30], and ViTGAN [50].

We aim to improve the performance of StyleGAN by integrating attention layers with the convolutional backbone. However, simply adding attention layers to StyleGAN often results in training collapse, possibly because dot-product self-attention is not Lipschitz, as pointed out by Kim et al. [43]. As the Lipschitz continuity of discriminators has played a critical role in stable training [3, 22, 60], we use the L2 distance instead of the dot product as the attention logits to promote Lipschitz continuity [43], similar to ViTGAN [50].

To further improve performance, we find it crucial to match the architectural details of StyleGAN, such as the equalized learning rate [39] and weight initialization from a unit normal distribution. We scale down the L2-distance logits to roughly match the unit normal distribution at initialization and reduce the residual gain from the attention layers. We further improve stability by tying the key and query matrices [50] and applying weight decay.
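As a concrete illustration of these attention changes, the sketch below replaces dot-product logits with negative squared L2 distances, ties the key and query projection, scales the logits down, and uses a small residual gain. It is a simplified stand-in that assumes a generic token sequence, not the paper's exact attention block.

```python
import math
import torch
import torch.nn as nn

class L2SelfAttention(nn.Module):
    """Self-attention whose logits are negative squared L2 distances between
    tied query/key projections, scaled down and added back to the input with
    a small residual gain (a rough sketch of the stabilization tricks above)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.dim_head = heads, dim // heads
        self.to_qk = nn.Linear(dim, dim, bias=False)   # tied query and key matrix
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim, bias=False)
        self.gain = nn.Parameter(torch.tensor(0.1))    # reduced residual gain

    def forward(self, x):                               # x: (B, T, dim)
        B, T, _ = x.shape
        qk = self.to_qk(x).view(B, T, self.heads, self.dim_head).transpose(1, 2)
        v = self.to_v(x).view(B, T, self.heads, self.dim_head).transpose(1, 2)

        # ||q - k||^2 = ||q||^2 + ||k||^2 - 2 q.k, negated and scaled so the
        # logits are roughly in a unit-normal range at initialization.
        sq = qk.pow(2).sum(dim=-1)                                   # (B, H, T)
        dist2 = sq[..., :, None] + sq[..., None, :] - 2 * (qk @ qk.transpose(-2, -1))
        attn = (-dist2 / math.sqrt(self.dim_head)).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return x + self.gain * self.to_out(out)
```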
In the synthesis network G̃, the attention layers are interleaved with each convolutional block, leveraging the style vector w as an additional token. At each attention block, we add a separate cross-attention mechanism g_cross-attention to attend to individual word embeddings [4]. We use each input feature tensor as the query, and the text embeddings as the keys and values of the attention mechanism.

3.2. Generator design

Text and latent-code conditioning. First, we extract the text embedding from the prompt. Previous works [75, 80] have shown that leveraging a strong language model is essential for producing strong results. To do so, we tokenize the input prompt (after padding it to C = 77 words, following best practices [75, 80]) to produce a conditioning vector c ∈ R^{C×1024}, taking the features from the penultimate layer [80] of a frozen CLIP feature extractor [71]. To allow for additional flexibility, we apply additional attention layers T on top to process the word embeddings before passing them to the MLP-based mapping network. This results in a text embedding t = T(E_txt(c)) ∈ R^{C×1024}. Each component t_i of t captures the embedding of the i-th word in the sentence. We refer to these as t_local = t_{1:C}\EOT ∈ R^{(C-1)×1024}. The EOT ("end of text") component of t aggregates global information and is called t_global ∈ R^{1024}. We process this global text descriptor, along with the latent code z ∼ N(0, 1), via an MLP mapping network to extract the style w = M(z, t_global):

    (t_{\mathrm{local}}, t_{\mathrm{global}}) = T(E_{\mathrm{txt}}(c)), \quad w = M(z, t_{\mathrm{global}}).    (3)

Different from the original StyleGAN, we use both the text-based style code w to modulate the synthesis network G̃ and the word embeddings t_local as features for cross-attention:

    x = \tilde{G}(w, t_{\mathrm{local}}).    (4)

Similar to earlier works [58, 74, 80], the text-image alignment visually improves with cross-attention.

Synthesis network. Our synthesis network consists of a series of upsampling convolutional layers, with each layer enhanced with the adaptive kernel selection (Equation 1) and followed by our attention layers:

    f_{\ell+1} = g^{\ell}_{\mathrm{xa}}(g^{\ell}_{\mathrm{attn}}(g^{\ell}_{\mathrm{adaconv}}(f_{\ell}, w), w), t_{\mathrm{local}}),    (5)

where g^ℓ_xa, g^ℓ_attn, and g^ℓ_adaconv denote the ℓ-th cross-attention, self-attention, and weight-(de-)modulation layers. We find it beneficial to increase the depth of the network by adding more blocks at each layer. In addition, our generator outputs a multi-scale image pyramid with L = 5 levels instead of a single image at the highest resolution, similar to MSG-GAN [38] and AnycostGAN [53]. We refer to the pyramid as {x_i}_{i=0}^{L-1} = {x_0, x_1, ..., x_4}, with spatial resolutions {S_i}_{i=0}^{L-1} = {64, 32, 16, 8, 4}, respectively. The base level x_0 is the output image x. Each image of the pyramid is independently used to compute the GAN loss, as discussed in Section 3.3. We follow the findings of StyleGAN-XL [86] and turn off style mixing and path length regularization [42]. We include more training details in Appendix A.1.
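The conditioning pathway of Equations (3)-(5) can be sketched as follows. The frozen CLIP text encoder is replaced by a stand-in embedding, `TextConditioning` and `GeneratorBlock` are hypothetical names, and plain convolution and attention modules stand in for the adaptive-kernel convolution and L2 attention described earlier; the sketch only mirrors the data flow, not the actual architecture.

```python
import torch
import torch.nn as nn

class TextConditioning(nn.Module):
    """Sketch of Eq. (3): frozen text features are refined by a learned
    transformer T, split into per-word t_local and a global t_global, and
    t_global plus the latent z is mapped to a style code w."""

    def __init__(self, feat_dim=1024, w_dim=512, z_dim=128, vocab=49408):
        super().__init__()
        self.frozen_txt = nn.Embedding(vocab, feat_dim)          # stand-in for CLIP E_txt
        self.frozen_txt.requires_grad_(False)
        enc = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.T = nn.TransformerEncoder(enc, num_layers=2)        # learned encoder T
        self.M = nn.Sequential(                                  # mapping network M
            nn.Linear(feat_dim + z_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim),
        )

    def forward(self, tokens, z):
        t = self.T(self.frozen_txt(tokens))      # (B, 77, feat_dim)
        t_local, t_global = t[:, :-1], t[:, -1]  # EOT assumed to sit at the last slot
        w = self.M(torch.cat([t_global, z], dim=-1))
        return t_local, w

class GeneratorBlock(nn.Module):
    """Sketch of Eq. (5): modulated convolution, then self-attention, then
    cross-attention to the word embeddings t_local."""

    def __init__(self, channels, w_dim=512, txt_dim=1024):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # placeholder for g_adaconv
        self.style = nn.Linear(w_dim, channels)
        self.self_attn = nn.MultiheadAttention(channels, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(channels, 4, kdim=txt_dim, vdim=txt_dim,
                                                batch_first=True)

    def forward(self, f, w, t_local):
        B, C, H, W = f.shape
        f = self.conv(f) * (1 + self.style(w))[:, :, None, None]  # crude modulation
        seq = f.flatten(2).transpose(1, 2)                         # (B, H*W, C) tokens
        seq = seq + self.self_attn(seq, seq, seq, need_weights=False)[0]
        seq = seq + self.cross_attn(seq, t_local, t_local, need_weights=False)[0]
        return seq.transpose(1, 2).reshape(B, C, H, W)

# Minimal usage: one block at one scale of the pyramid.
cond = TextConditioning()
t_local, w = cond(torch.randint(0, 49408, (2, 77)), torch.randn(2, 128))
out = GeneratorBlock(64)(torch.randn(2, 64, 16, 16), w, t_local)   # (2, 64, 16, 16)
```

Stacking such blocks per scale, with upsampling in between, mirrors the layered composition of Equation (5).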
3.3. Discriminator design

As shown in Figure 5, our discriminator consists of separate branches for processing the text, with a function t_D, and the image, with a function φ. The prediction of real vs. fake is made by comparing the features from the two branches using a function ψ. We introduce a new way of making predictions on multiple scales. Finally, we use additional CLIP and Vision-Aided GAN losses [44] to improve stability.

[Figure 5 diagram: the image branch φ (self-attention and strided convolution blocks) sweeps through the multi-scale input pyramid {x_i} and produces real/fake predictions ψ_j at multiple output scales, conditioned on the text feature t_D.]

Figure 5. Our discriminator consists of two branches for processing the image and the text conditioning t_D. The text branch processes the text similarly to the generator (Figure 4). The image branch receives an image pyramid and makes independent predictions for each image scale. Moreover, the predictions are made at all subsequent scales of the downsampling layers, making it a multi-scale input, multi-scale output (MS-I/O) discriminator.

Multi-scale image processing. We observe that the early, low-resolution layers of the generator become inactive, using small dynamic ranges irrespective of the provided prompts. StyleGAN2 [42] also observes this phenomenon, concluding that the network relies on the high-resolution layers as the model size increases. As recovering performance in the low frequencies, which contain complex structural information, is crucial, we redesign the model architecture to provide training signals across multiple scales.

Recall that the generator produces a pyramid {x_i}_{i=0}^{L-1}, with the full image x_0 at the pyramid base. MSG-GAN [38] improves performance by making a prediction on the entire pyramid at once, enforcing consistency across scales. However, in our large-scale setting, this harms stability, as it limits the generator from making adjustments to its initial low-resolution output.

Instead, we process each level of the pyramid independently. As shown in Figure 5, each level x_i makes a real/fake prediction at multiple scales i < j ≤ L. For example, the full x_0 makes predictions at L = 5 scales, the next level x_1 makes predictions at 4 scales, and so on. In total, our discriminator produces L(L-1)/2 predictions, supervising the multi-scale generations at multiple scales.

To extract features at different scales, we define a feature extractor φ_{i→j}: R^{X_i^D × X_i^D × 3} → R^{X_j^D × X_j^D × C_j}. Practically, each sub-network φ_{i→j} is a subset of the full φ ≜ φ_{0→L}, with i > 0 indicating late entry and j < L indicating early exit. Each layer in φ is composed of self-attention followed by convolution with stride 2, and the final layer flattens the spatial extent into a 1×1 tensor. This produces output resolutions {X_j^D} = {32, 16, 8, 4, 1} and allows us to inject the lower-resolution images of the pyramid into intermediate layers [39]. As we use a shared feature extractor across different levels and most of the added predictions are made at low resolutions, the increased computation overhead is manageable.

Multi-scale input, multi-scale output adversarial loss. In total, our training objective consists of the discriminator losses, along with our proposed matching loss, to encourage the discriminator to take the conditioning into account:

    V_{\mathrm{MS\text{-}I/O}}(G, D) = \sum_{i=0}^{L-1} \sum_{j=1}^{L} \big[ V_{\mathrm{GAN}}(G_i, D_{ij}) + V_{\mathrm{match}}(G_i, D_{ij}) \big],    (6)

where each prediction D_ij compares the image features φ_{i→j}(x_i) with the text features, ψ_j is implemented as a 4-layer 1×1 modulated convolution, and a Conv_{1×1} term is added as a skip connection to explicitly maintain an unconditional prediction branch [62].

Matching-aware loss. The previous GAN terms measure how closely the image x matches the conditioning c, as well as how realistic x looks irrespective of the conditioning. However, during early training, when artifacts are obvious, the discriminator heavily relies on making its decision independently of the conditioning and hesitates to account for the conditioning later.

To enforce the discriminator to incorporate conditioning, we match x with a random, independently sampled condition ĉ and present them as a fake pair:

    V_{\mathrm{match}} = \mathbb{E}_{x, c, \hat{c}} \big[ \log(1 + \exp(D(x, \hat{c}))) + \log(1 + \exp(D(G(c), \hat{c}))) \big],    (8)

where (x, c) and ĉ are separately sampled from p_data. This loss has previously been explored in text-to-image GAN works [76, 104], except that we find that enforcing the matching-aware loss on generated images from G, as well as on real images x, leads to clear gains in performance (Table 1).
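A loss-level sketch of the multi-scale adversarial objective with the matching-aware term follows. The discriminator interface `D(x, c, level)` returning one logit per output scale is an assumption for illustration, and the non-saturating (softplus) form of V_GAN is likewise assumed; only the matching-aware terms of Equation (8) are taken directly from the text.

```python
import torch
import torch.nn.functional as F

def ms_io_losses(D, pyramid_real, pyramid_fake, c, c_mismatch):
    """Multi-scale input/output GAN losses plus matching-aware terms.
    `pyramid_*` are lists of images at decreasing resolutions, and
    `D(x, c, level)` is assumed to return a list of logits, one per
    output scale reachable from that pyramid level."""
    d_loss, g_loss, match_loss = 0.0, 0.0, 0.0
    for i, (x_real, x_fake) in enumerate(zip(pyramid_real, pyramid_fake)):
        # V_GAN (assumed non-saturating): note log(1 + exp(t)) == softplus(t).
        for lr, lf in zip(D(x_real, c, i), D(x_fake.detach(), c, i)):
            d_loss = d_loss + F.softplus(-lr).mean() + F.softplus(lf).mean()
        for lf in D(x_fake, c, i):
            g_loss = g_loss + F.softplus(-lf).mean()
        # Eq. (8): real and generated images paired with a mismatched caption
        # are both presented to the discriminator as fake pairs.
        for lr, lf in zip(D(x_real, c_mismatch, i), D(x_fake.detach(), c_mismatch, i)):
            match_loss = match_loss + F.softplus(lr).mean() + F.softplus(lf).mean()
    return d_loss + match_loss, g_loss
```

In practice the mismatched condition ĉ can be obtained by shuffling the captions within a batch, e.g. `c_mismatch = c[torch.randperm(len(c))]`.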
CLIP contrastive loss. We further leverage off-the-shelf pretrained models as a loss function [44, 84, 90]. In particular, we enforce the generator to produce outputs that are identifiable by the pre-trained CLIP image and text encoders [71], E_img and E_txt, using the contrastive cross-entropy loss that was used to train them originally:

    \mathcal{L}_{\mathrm{CLIP}} = \mathbb{E}_{\{c_n\}} \Big[ -\log \frac{\exp(E_{\mathrm{img}}(G(c_0))^{\top} E_{\mathrm{txt}}(c_0))}{\sum_n \exp(E_{\mathrm{img}}(G(c_0))^{\top} E_{\mathrm{txt}}(c_n))} \Big],    (9)

where {c_n} = {c_0, ...} are captions sampled from the training data.
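Equation (9) amounts to a standard CLIP-style contrastive cross-entropy over the captions in a batch. The sketch below assumes frozen encoders that return embedding vectors, L2-normalized features, and a unit temperature; these are simplifications for illustration rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(E_img, E_txt, fake_images, captions):
    """Eq. (9): each generated image must be matched to its own caption
    among all captions in the batch by the frozen CLIP encoders."""
    with torch.no_grad():
        txt = F.normalize(E_txt(captions), dim=-1)        # (B, D), fixed targets
    img = F.normalize(E_img(fake_images), dim=-1)         # (B, D), gradients reach G
    logits = img @ txt.t()                                 # (B, B) similarity matrix
    targets = torch.arange(len(fake_images), device=logits.device)
    # Row n should select caption n: -log softmax over captions.
    return F.cross_entropy(logits, targets)
```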
Vision-aided adversarial loss. Lastly, we build an additional discriminator that uses the CLIP model as a backbone, known as Vision-Aided GAN [44]. We freeze the CLIP image encoder, extract features from its intermediate layers, and process them through a simple network with 3×3 conv layers to make real/fake predictions. We also incorporate conditioning through modulation, as in Equation 7. To stabilize training, we also add a fixed random projection layer, as proposed by Projected GAN [84]. We refer to this loss as L_Vision(G) (omitting the learnable discriminator parameters for clarity).

Our final objective is V(G, D) = V_MS-I/O(G, D) + L_CLIP(G) + L_Vision(G), with the weighting between the terms specified in Table A2.

3.4. GAN-based upsampler

Furthermore, the GigaGAN framework can easily be extended to train a text-conditioned super-resolution model, capable of upsampling the outputs of the base GigaGAN generator to obtain high-resolution images at 512px or 2K resolution. By training our pipeline in two separate stages, we can afford a higher-capacity 64px base model within the same computational resources.

In the upsampler, the synthesis network is rearranged into an asymmetric U-Net architecture, which processes the 64px input through 3 downsampling residual blocks, followed by 6 upsampling residual blocks with attention layers, to produce the 512px image. Skip connections exist at matching resolutions, similar to CoModGAN [106]. The model is trained with the same losses as the base model, as well as the LPIPS perceptual loss [105] with respect to the ground-truth high-resolution image. The Vision-aided GAN loss is not used for the upsampler. During training and at inference time, we apply moderate Gaussian noise augmentation to reduce the gap between real and GAN-generated images. Please refer to Appendix A.3 for more details.
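One possible training step for the text-conditioned upsampler, combining the adversarial objective, the LPIPS term, and the Gaussian noise augmentation mentioned above, is sketched below. The function names (`upsampler`, `lpips_fn`, `gan_loss_fn`), the 8× factor, and the noise level are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def upsampler_step(upsampler, lpips_fn, gan_loss_fn, hr_real, caption_emb, sigma=0.05):
    """One generator update for the super-resolution stage: downsample the
    real image to make the low-res input, perturb it with mild Gaussian
    noise, upsample with the GAN, and combine an adversarial term with an
    LPIPS term against the ground-truth high-resolution image."""
    lr = F.interpolate(hr_real, scale_factor=1 / 8, mode="bicubic", antialias=True)
    lr = lr + sigma * torch.randn_like(lr)            # moderate noise augmentation
    hr_fake = upsampler(lr, caption_emb)              # asymmetric U-Net generator
    loss = gan_loss_fn(hr_fake, caption_emb) + lpips_fn(hr_fake, hr_real).mean()
    return loss, hr_fake
```

The same mild noise can be applied to low-resolution inputs at inference time, which, as noted above, helps close the gap between real inputs and GAN-generated ones.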
Our GigaGAN framework becomes particularly effective for the super-resolution task compared to diffusion-based models, which cannot afford as many sampling steps as the base model at high resolution. The LPIPS regression loss also provides a stable learning signal. We believe that our GAN upsampler can serve as a drop-in replacement for the super-resolution stage of other generative models.

4. Experiments

Systematic, controlled evaluation of large-scale text-to-image synthesis tasks is difficult, as most existing models are not publicly available. Training a new model from scratch would be prohibitively costly, even if the training code were available. Still, we compare our model to recent text-to-image models, such as Imagen [80], Latent Diffusion Models (LDM) [79], Stable Diffusion [78], and Parti [101], based on the available information, while acknowledging considerable differences in the training dataset, number of iterations, batch size, and model size. In addition to text-to-image results, we evaluate our model on ImageNet class-conditional generation in Appendix B, for an apples-to-apples comparison with other methods in a more controlled setting.

For quantitative evaluation, we mainly use the Fréchet Inception Distance (FID) [25] for measuring the realism of the output distribution and the CLIP score for evaluating image-text alignment.

We conduct five different experiments. First, we show the effectiveness of our method by gradually incorporating each technical component one by one (Section 4.2). Second, our text-to-image synthesis results demonstrate that GigaGAN exhibits comparable FID to Stable Diffusion (SD-v1.5) [79] while generating results hundreds of times faster than diffusion or autoregressive models (Section 4.3). Third, we compare GigaGAN with a distillation-based diffusion model [59] and show that GigaGAN can synthesize higher-quality images faster than the distilled model. Fourth, we verify the advantage of GigaGAN's upsampler over other upsamplers in both conditional and unconditional super-resolution tasks. Lastly, we show that our large-scale GANs still enjoy the continuous and disentangled latent-space manipulation of GANs, enabling new image editing modes (Section 4.6).

4.1. Training and evaluation details

We implement GigaGAN based on the StudioGAN PyTorch library [37], following the standard FID evaluation protocol with the anti-aliasing bicubic resize function [67], unless otherwise noted. For text-to-image synthesis, we train our models on the union of the LAION2B-en [88] and COYO-700M [8] datasets, with the exception of the 128-to-1024 upsampler model, which is trained on Adobe's internal Stock images. The image-text pairs are preprocessed based on CLIP score [24], image resolution, and aesthetic score [87], similar to prior work [78]. We use CLIP ViT-L/14 [71] as the pre-trained text encoder and OpenCLIP ViT-G/14 [32] for CLIP score calculation [24], except in Table 1.
[Figure 6 panels: a style-swapping grid of samples for the prompt "A Toy sport sedan, CG art.", with coarse styles varying along one axis and fine styles along the other.]
Figure 6. Style mixing. Our GAN-based architecture retains a disentangled latent space, enabling us to blend the coarse style of one
sample with the fine style of another. All outputs are generated with the prompt “A Toy sport sedan, CG art.” The corresponding latent
codes are spliced together to produce a style-swapping grid.
[Figure 7 panels: an interpolation grid between prompts ending in ".. in a sunny day" and ".. in sunset".]
Figure 7. Prompt interpolation. GigaGAN enables smooth interpolation between prompts, as shown in the interpolation grid. The four
corners are generated from the same latent z but with different text prompts. The corresponding text embeddings t and style vectors w are
interpolated to create a smooth transition. The same z results in similar layouts. See Figure 8 for more precise control.
[Figure 8 panels: rows generated from prompts of the form "a cube on tabletop", "a ball on tabletop", and "a teddy bear on tabletop"; columns show "no mixing" and the texture prompts "crochet", "fur", "denim", and "brick".]
Figure 8. Prompt mixing. GigaGAN retains a disentangled latent space, enabling us to combine the coarse style of one sample with
the fine style of another. Moreover, GigaGAN can directly control the style with text prompts. Here we generate four outputs using the
prompts “a X on tabletop”, shown in the “no mixing” column. Then we re-compute the text embeddings t and the style codes w using
the new prompts "a X with the texture of Y on tabletop", such as "a cube with the texture of crochet on tabletop", and apply them to the second half of the generator's layers, achieving layout-preserving fine style control. The cross-attention mechanism automatically localizes the style to the object of interest.
All our models are trained and evaluated on A100 GPUs. We include more training and evaluation details in Appendix A.

4.2. Effectiveness of proposed components

First, we show the effectiveness of our formulation via an ablation study in Table 1. We set up a baseline by adding text conditioning to StyleGAN2 and tuning the configuration based on the findings of StyleGAN-XL. We first directly increase the model size of this baseline, but we find that this does not improve the FID and CLIP scores. Then, we add our components one by one and observe that they consistently improve performance. In particular, our model is more scalable, as the higher-capacity version of the final formulation achieves better performance.

4.3. Text-to-image synthesis

We proceed to train a larger model by increasing the capacity of the base generator and upsampler to 652.5M and 359.1M parameters, respectively. This results in an unprecedented size of GAN model, with a total parameter count of 1.0B. Table 2 compares the performance of our end-to-end pipeline to various text-to-image generative models [5, 10, 63, 74, 75, 78-80, 101, 108]. Note that there exist differences in the training dataset, the pretrained text encoders, and even the image resolutions. For example, GigaGAN initially synthesizes 512px images, which are resized to 256px before evaluation.

Table 2 shows that GigaGAN exhibits a lower FID than DALL·E 2 [74], Stable Diffusion [78], and Parti-750M [101]. While our model can be optimized to better match the feature distribution of real images than existing
Table 1. Ablation study on 64px text-to-image synthesis. To evaluate the effectiveness of our components, we start with a modified version of StyleGAN for text conditioning. While increasing the network width does not show satisfactory improvement, each addition of our contributions keeps improving the metrics. Finally, we increase the network width and scale up training to reach our final model. All ablated models are trained for 100k iterations at a batch size of 256, except for the Scale-up row (1350k iterations with a larger batch size). CLIP score is computed using CLIP ViT-B/32 [71].

Model | FID-10k ↓ | CLIP score ↑ | # Param.
StyleGAN2 | 29.91 | 0.222 | 27.8M
+ Larger (5.7×) | 34.07 | 0.223 | 158.9M
+ Tuned | 28.11 | 0.228 | 26.2M
+ Attention | 23.87 | 0.235 | 59.0M
+ Matching-aware D | 27.29 | 0.250 | 59.0M
+ Matching-aware G and D | 21.66 | 0.254 | 59.0M
+ Adaptive convolution | 19.97 | 0.261 | 80.2M
+ Deeper | 19.18 | 0.263 | 161.9M
+ CLIP loss | 14.88 | 0.280 | 161.9M
+ Multi-scale training | 14.92 | 0.300 | 164.0M
+ Vision-aided GAN | 13.67 | 0.287 | 164.0M
+ Scale-up (GigaGAN) | 9.18 | 0.307 | 652.5M

Table 3. Comparison to distilled diffusion models shows that GigaGAN achieves better FID and CLIP scores than the progressively distilled diffusion models [59] designed for fast inference. As GigaGAN generates outputs in a single feedforward pass, its inference speed is still faster. The evaluation setup differs from Table 2 to match SD-distilled's protocol [59].

Model | Steps | FID-5k ↓ | CLIP ↑ | Inf. time
SD-distilled-2 [59] | 2 | 37.3 | 0.27 | 0.23s
SD-distilled-4 [59] | 4 | 26.0 | 0.30 | 0.33s
SD-distilled-8 [59] | 8 | 26.9 | 0.30 | 0.52s
SD-distilled-16 [59] | 16 | 28.8 | 0.30 | 0.88s
GigaGAN | 1 | 21.1 | 0.32 | 0.13s

Table 4. Text-conditioned 128→1024 super-resolution on 10K random LAION samples, compared against the unconditional Real-ESRGAN [33] and the Stable Diffusion Upscaler [78]. GigaGAN enjoys the fast speed of a GAN-based model while achieving better FID, patch-FID [9], CLIP score, and LPIPS [105].

Model | # Param. | Inf. time | FID-10k ↓ | pFID ↓ | CLIP ↑ | LPIPS ↓
Real-ESRGAN [33] | 17M | 0.06s | 8.60 | 22.8 | 0.314 | 0.363
SD Upscaler [78] | 846M | 7.75s | 9.39 | 41.3 | 0.316 | 0.523
GigaGAN | 693M | 0.13s | 1.54 | 8.90 | 0.322 | 0.274
scaler). We also use the unconditional Real-ESRGAN [33] as another baseline. Table 4 measures the performance of the upsampler on 10K random images from the LAION dataset and shows that our GigaGAN upsampler significantly outperforms the other upsamplers in realism scores (FID and patch-FID [9]), text alignment (CLIP score), and closeness to the ground truth (LPIPS [105]). In addition, for a more controlled comparison, we train our model on the ImageNet unconditional super-resolution task and compare its performance with diffusion-based models, including SR3 [81] and LDM [79]. As shown in Table 5, GigaGAN achieves the best IS and FID scores with a single feedforward pass.

Figure 9. Failure cases. Our outputs with the same prompts as DALL·E 2. Each column conditions on "a teddy bear on a skateboard in Times Square", "a Vibrant portrait painting of Salvador Dali with a robotic half face", and "A close up of a handpalm with leaves growing from it". Compared to production-grade models such as DALL·E 2, our model exhibits limitations in realism and compositionality. See Appendix C for uncurated comparisons.

5. Discussion and Limitations

Our experiments provide a conclusive answer about the scalability of GANs: our new architecture can scale up to model sizes that enable text-to-image synthesis. However, the visual quality of our results is not yet comparable to that of production-grade models like DALL·E 2. Figure 9 shows several instances where our method fails to produce high-quality results compared to DALL·E 2, in terms of photorealism and text-to-image alignment, for the same input prompts used in their paper.

Nevertheless, we have tested capacities well beyond what is possible with a naive approach and achieved competitive visual quality with autoregressive and diffusion models trained with similar resources, while being orders of magnitude faster and enabling latent interpolation and stylization. Our GigaGAN architecture opens up a whole new design space for large-scale generative models and brings back key editing capabilities that became challenging with the transition to autoregressive and diffusion models. We expect our performance to improve with larger models, as seen in Table 1.

Acknowledgments. We thank Simon Niklaus, Alexandru Chiculita, and Markus Woodson for building the distributed training pipeline. We thank Nupur Kumari, Gaurav Parmar, Bill Peebles, Phillip Isola, Alyosha Efros, and Joonghyuk Shin for their helpful comments. We also want to thank Chenlin Meng, Chitwan Saharia, and Jiahui Yu for answering many questions about their fantastic work. We thank Kevin Duarte for discussions regarding upsampling beyond 4K. Part of this work was done while Minguk Kang was an intern at Adobe Research. Minguk Kang and Jaesik Park were supported by the IITP grant funded by the government of South Korea (MSIT) (POSTECH GSAI: 2019-0-01906 and Image restoration: 2021-0-00537).
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In IEEE International Conference on Computer Vision (ICCV), 2019.
[2] Saeed Anwar and Nick Barnes. Densely residual laplacian super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 2017.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
[5] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations (ICLR), 2019.
[7] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural Photo Editing with Introspective Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.
[8] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-Text Pair Dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[9] Lucy Chai, Michael Gharbi, Eli Shechtman, Phillip Isola, and Richard Zhang. Any-resolution training for high-resolution image synthesis. In European Conference on Computer Vision (ECCV), 2022.
[10] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
[11] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[12] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning (ICML), 2020.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[14] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Conference on Neural Information Processing Systems (NeurIPS), 2015.
[15] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[16] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[17] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
[18] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[19] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[20] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In European Conference on Computer Vision (ECCV), 2022.
[21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Conference on Neural Information Processing Systems (NeurIPS), 2014.
[22] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
[23] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In International Conference on Learning Representations (ICLR), 2017.
[24] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
[26] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[27] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. Journal of Machine Learning Research, 2022.
[28] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In Conference on Neural Information Processing Systems (NeurIPS) Workshop, 2022.
[29] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV), 2018.
[30] Drew A Hudson and Larry Zitnick. Generative adversarial transformers. In International Conference on Machine Learning (ICML), 2021.
[31] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering Interpretable GAN Controls. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[32] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. https://doi.org/10.5281/zenodo.5143773, 2021.
[33] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In IEEE International Conference on Computer Vision (ICCV) Workshop, 2021.
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[35] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
[36] Minguk Kang, Woohyeon Shim, Minsu Cho, and Jaesik Park. Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[37] Minguk Kang, Joonghyuk Shin, and Jaesik Park. StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis. arXiv preprint arXiv:2206.09479, 2022.
[38] Animesh Karnewar and Oliver Wang. MSG-GAN: Multi-scale gradients for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[40] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[41] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[43] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention. In International Conference on Machine Learning (ICML), 2021.
[44] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for GAN training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[45] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The Role of ImageNet Classes in Fréchet Inception Distance. arXiv preprint arXiv:2203.06026, 2022.
[46] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved Precision and Recall Metric for Assessing Generative Models. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
[47] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[48] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation using Residual Quantization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[49] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision (ECCV), 2018.
[50] Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. ViTGAN: Training GANs with vision transformers. In International Conference on Learning Representations (ICLR), 2022.
[51] Muyang Li, Ji Lin, Chenlin Meng, Stefano Ermon, Song Han, and Jun-Yan Zhu. Efficient spatially sparse inference for conditional GANs and diffusion models. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[52] Jiadong Liang, Wenjie Pei, and Feng Lu. CPGAN: Content-parsing generative adversarial networks for text-to-image synthesis. In European Conference on Computer Vision (ECCV), 2020.
[53] Ji Lin, Richard Zhang, Frieder Ganz, Song Han, and Jun-Yan Zhu. Anycost GANs for interactive image synthesis and editing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[54] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[55] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo Numerical Methods for Diffusion Models on Manifolds. In International Conference on Learning Representations (ICLR), 2022.
[56] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR), 2019.
[57] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[58] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating Images from Captions with Attention. In International Conference on Learning Representations (ICLR), 2016.
[59] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Conference on Neural Information Processing Systems (NeurIPS) Workshop, 2022.
[60] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[61] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Which Training Methods for GANs do actually Converge? In International Conference on Machine Learning (ICML), 2018.
[62] Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. In International Conference on Learning Representations (ICLR), 2018.
[63] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning (ICML), 2022.
[64] OpenAI. DALL·E API. https://openai.com/product/dall-e-2, 2022.
[65] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive Learning for Unpaired Image-to-Image Translation. In European Conference on Computer Vision (ECCV), 2020.
[66] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[67] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[68] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
[69] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In IEEE International Conference on Computer Vision (ICCV), 2021.
[70] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748, 2022.
[71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.
[72] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[73] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2020.
[74] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[75] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning (ICML), 2021.
[76] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), 2016.
[77] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
[78] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Stable Diffusion. https://github.com/CompVis/stable-diffusion. Accessed: 2022-11-06.
[79] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[80] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487, 2022.
[81] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
[82] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
[83] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations (ICLR), 2022.
[84] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs Converge Faster. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[85] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv preprint arXiv:2301.09515, 2023.
[86] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, 2022.
[87] Christoph Schuhmann. CLIP+MLP Aesthetic Score Predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor.
[88] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[89] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. In International Conference on Learning Representations (ICLR), 2021.
[90] Diana Sungatullina, Egor Zakharov, Dmitry Ulyanov, and Victor Lempitsky. Image manipulation with perceptual discriminators. In European Conference on Computer Vision (ECCV), 2018.
[91] Md Mehrab Tanjim. DynamicRec: A dynamic convolutional network for next item recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), 2020.
[92] Ming Tao, Bing-Kun Bao, Hao Tang, and Changsheng Xu. GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis. arXiv preprint arXiv:2301.12959, 2023.
[93] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[94] Ken Turkowski. Filters for common resampling tasks. Graphics Gems, 1990.
[95] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision (ECCV) Workshop, 2018.
[96] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
[97] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay Less Attention with Lightweight and Dynamic Convolutions. In International Conference on Learning Representations (ICLR), 2018.
[98] Jonas Wulff and Antonio Torralba. Improving inversion and generation diversity in StyleGAN using a gaussianized latent space. arXiv preprint arXiv:2009.06529, 2020.
[99] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[100] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[101] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[102] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-Attention Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 2019.
[103] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[104] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[105] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[106] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large Scale Image Completion via Co-Modulated Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2021.
[107] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. arXiv preprint arXiv:2006.10738, 2020.
[108] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: Towards language-free training for text-to-image generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
17
Tong Sun. Lafite: Towards language-free training for text- Conference on Computer Vision (ICCV), pages 2223–2232,
to-image generation. In IEEE Conference on Computer Vi- 2017. 5
sion and Pattern Recognition (CVPR), 2022. 11, 12 [111] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-
[109] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and gan: Dynamic memory generative adversarial networks for
Alexei A Efros. Generative visual manipulation on the nat- text-to-image synthesis. In IEEE Conference on Computer
ural image manifold. In European Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5
Vision (ECCV), 2016. 5 [112] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and
[110] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Peter Wonka. Improved stylegan embedding: Where are the
Efros. Unpaired image-to-image translation using cycle- good latents? arXiv preprint arXiv:2012.09036, 2020. 5
consistent adversarial networks. In IEEE International
18
Appendices

We first provide training and evaluation details in Appendix A. Then, we share results on ImageNet, with visual comparisons to existing methods, in Appendix B. Lastly, in Appendix C, we show more visuals of our text-to-image synthesis results and compare them with LDM [79], Stable Diffusion [78], and DALL·E 2 [74].

A. Training and evaluation details

A.1. Text-to-image synthesis

We train GigaGAN on a combined dataset of LAION2B-en [88] and COYO-700M [8] in the PyTorch framework [68]. For training, we apply center cropping, which results in a square image whose side length equals the shorter side of the original image. Then, we resize the image to 64 × 64 resolution using the PIL.LANCZOS [94] resizer, which supports anti-aliasing [67]. We filter the training image–text pairs by image resolution (≥ 512), CLIP score (> 0.3) [24], and aesthetics score (> 5.0) [87], and we remove watermarked images. We train our GigaGAN based on the configurations denoted in the fourth and fifth columns of Table A2.
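To make the preprocessing concrete, the following is a minimal sketch (not our training pipeline) of the center crop, the anti-aliased LANCZOS downsampling, and the metadata-based filtering; the metadata field names (clip_score, aesthetic_score, is_watermarked) are hypothetical placeholders for whatever columns a LAION-style metadata table provides.

```python
from PIL import Image

def center_crop_and_resize(path: str, size: int = 64) -> Image.Image:
    """Center-crop to a square whose side equals the shorter image side,
    then downsample with the anti-aliased LANCZOS filter."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    img = img.crop((left, top, left + s, top + s))
    return img.resize((size, size), resample=Image.LANCZOS)

def keep_sample(meta: dict) -> bool:
    """Filtering rule sketched from the description above; the metadata
    field names are illustrative placeholders."""
    return (
        min(meta["width"], meta["height"]) >= 512
        and meta["clip_score"] > 0.3
        and meta["aesthetic_score"] > 5.0
        and not meta["is_watermarked"]
    )
```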
For evaluation, we use 40,504 real and 30,000 generated images from the COCO2014 [54] validation set, as described in Imagen [80]. We apply the same center cropping and resize the real and generated images to 299 × 299 resolution using PIL.BICUBIC, as suggested by clean-fid [67]. We use the clean-fid library [67] for the FID calculation.
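For reference, the FID computation with the clean-fid library boils down to a call like the one below; the folder names are placeholders, and we assume the real and generated images have already been pre-processed as described above (clean-fid can also handle the resizing internally).

```python
from cleanfid import fid

# Placeholder folders of pre-processed real (COCO2014 val) and generated images.
real_dir = "coco2014_val_real"
fake_dir = "generated_samples"

# clean-fid extracts Inception features with a consistent, anti-aliased resizing
# pipeline and returns the Frechet Inception Distance between the two folders.
score = fid.compute_fid(real_dir, fake_dir)
print(f"zero-shot FID: {score:.2f}")
```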
A.2. ImageNet

We follow the training and evaluation protocol proposed by Kang et al. [37] to make a fair comparison against other cutting-edge generative models. We use the same cropping strategy to process images for training and evaluation as in our text-to-image experiments. Then, we resize the images to the target resolution (64 × 64 for the base generator or 256 × 256 for the super-resolution stack) using the PIL.LANCZOS [94] resizer, which supports anti-aliasing [67]. Using the pre-processed training images, we train GigaGAN based on the configurations denoted in the second and third columns of Table A2.

For evaluation, we upsample the real and generated images to 299 × 299 resolution using the PIL.BILINEAR resizer. To compute FID, we generate 50k images without truncation tricks [6, 41] and compare them with the entire training dataset. We use the pre-calculated features of real images provided by StudioGAN [37] and 50k generated images for the Precision & Recall [46] calculation.

A.3. Super-resolution results

For model training, we preprocess ImageNet in the same way as in Section A.2 and use the configuration in the last column of Table A2. To compare our model with SR3 [81] and LDM fairly, we follow the evaluation procedure described in the SR3 and LDM papers.

B. ImageNet experiments

B.1. Quantitative results

We train a class-conditional GAN on the ImageNet dataset [13], for which an apples-to-apples comparison is possible using the same dataset and evaluation pipeline. Our GAN achieves generation quality comparable to cutting-edge generative models without relying on a pretrained ImageNet classifier, which can act favorably toward automated metrics [45]. We apply L2 self-attention, the style-adaptive convolution kernel, and the matching-aware loss to our model, and use a wider synthesis network to train the base 64px model with a batch size of 1024. Additionally, we train a separate 256px class-conditional upsampler model and combine the two with an end-to-end finetuning stage. Table A1 shows that our method generates high-fidelity images.

Table A1. Class-conditional synthesis on ImageNet 256px. Our method performs competitively against large diffusion and transformer models. Shaded methods leverage a pretrained ImageNet classifier at training or inference time, which could act favorably toward the automated metrics [45]. † indicates IS [82] and FID [25] are borrowed from the original DiT paper [70].

Model | Type | IS [82] | FID [25] | Precision/Recall [46] | Size
CDM [27] | Diffusion | 158.71 | 4.88 | -/- | -
LDM-8-G [79] | Diffusion | 209.52 | 7.76 | -/- | 506M
LDM-4-G [79] | Diffusion | 247.67 | 3.60 | -/- | 400M
DiT-XL/2† [70] | Diffusion | 278.24 | 2.27 | -/- | 675M
Mask-GIT [11] | Transformer | 216.38 | 5.40 | 0.87/0.60 | 227M
VQ-GAN [19] | Transformer | 314.61 | 5.20 | 0.81/0.57 | 1.4B
RQ-Transformer [48] | Transformer | 339.41 | 3.83 | 0.85/0.60 | 3.8B
GigaGAN | GAN | 225.52 | 3.45 | 0.84/0.61 | 569M

B.2. Qualitative results

We provide visual results from ADM-G-U, LDM, StyleGAN-XL [86], and GigaGAN in Figures A1 and A2. Although StyleGAN-XL has the lowest FID, its visual quality appears worse than that of ADM and GigaGAN. StyleGAN-XL struggles to synthesize the overall image structure, leading to less realistic images. In contrast, GigaGAN appears to synthesize the overall structure better than StyleGAN-XL and faithfully captures fine-grained details, such as the wing patterns of a monarch butterfly and the white fur of an arctic fox. Compared to GigaGAN, ADM-G-U synthesizes the image structure more plausibly but falls short in reflecting the aforementioned fine-grained details.
Table A2. Hyperparameters for GigaGAN training. We denote the Projection Discriminator [62] as PD, R1 regularization [61] as R1, Learned Perceptual Image Patch Similarity [105] as LPIPS, Adam with decoupled weight decay [56] as AdamW, and the pretrained ViT-B/32 visual encoder [71] as CLIP-ViT-B/32-V.
Figure A1. Uncurated images (above: Tench and below: Monarch) from ADM-G-U [15], LDM-4-G [79], GigaGAN (ours), and StyleGAN-XL [86]. The FID values of the respective models are 4.01, 3.60, 3.45, and 2.32.
Figure A2. Uncurated images (above: Lorikeet and below: Arctic fox) from ADM-G-U [15], LDM-4-G [79], GigaGAN (ours), and StyleGAN-XL [86]. The FID values of the respective models are 4.01, 3.60, 3.45, and 2.32.
C. Text-to-image synthesis results

C.1. Truncation trick at inference

Similar to the classifier guidance [15] and classifier-free guidance [28] used in diffusion models such as LDM, our GAN model can leverage the truncation trick [6, 41] at inference time:

$\mathbf{w}_{\mathrm{trunc}} = \mathrm{lerp}(\mathbf{w}_{\mathrm{mean}}, \mathbf{w}, \psi)$,   (9)

$\mathbf{w}_{\mathrm{trunc}} = \mathrm{lerp}(\mathbf{w}_{\mathrm{mean},c}, \mathrm{lerp}(\mathbf{w}_{\mathrm{mean}}, \mathbf{w}, \psi), \psi)$,   (10)

where lerp denotes linear interpolation, $\mathbf{w}_{\mathrm{mean}}$ is the mean style vector, and $\mathbf{w}_{\mathrm{mean},c}$ is the mean style vector conditioned on the text prompt $c$.

Quantitatively, the effect of truncation is similar to the guidance technique of diffusion models. As shown in Figure A3, the CLIP score increases with more truncation, while the FID also increases due to reduced diversity.
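A minimal PyTorch sketch of Eqs. (9) and (10) is given below; w_mean and w_mean_c stand for precomputed global and prompt-conditional mean style vectors, and the function name is illustrative rather than part of a released API.

```python
import torch

def truncate_style(w: torch.Tensor, w_mean: torch.Tensor,
                   w_mean_c: torch.Tensor, psi: float) -> torch.Tensor:
    """Truncation trick of Eqs. (9)-(10): psi = 1.0 returns w unchanged
    (no truncation); smaller psi pulls w toward the mean style vectors."""
    w_global = torch.lerp(w_mean, w, psi)       # Eq. (9): toward the global mean
    return torch.lerp(w_mean_c, w_global, psi)  # Eq. (10): toward the prompt mean
```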
C.2. Comparison to diffusion models

Finally, we show randomly sampled results of our model and compare them with publicly available diffusion models: LDM [79], Stable Diffusion [78], and DALL·E 2 [74].
Figure A4. The visual effect of our truncation trick. We demonstrate the effect of truncation by decreasing the truncation value ψ from 1.0 (no truncation) to 0.9, 0.7, 0.5, 0.3, and 0.1 (strong truncation). We show six example outputs for each of the text prompts “digital painting of a confident and severe looking northern war goddess, extremely long blond braided hair, beautiful blue eyes and red lips.” and “Magritte painting of a clock on a beach.”. At ψ = 1.0 (no truncation), diversity is high, but text–image alignment is not satisfactory. As the truncation gets stronger, text–image alignment improves at the cost of diversity. We find that a truncation value between 0.7 and 0.8 produces the best results.
Figure A5. Style mixing. GigaGAN maintains a disentangled latent space, allowing us to blend the coarse style of one sample with the fine style of another. The corresponding latent codes are spliced together to produce a style-swapping grid, with coarse styles varying along one axis and fine styles along the other, for the prompts “A modern style house, DSLR.” and “A male Headshot Picture.”. The outputs are generated from the same prompt but with different latent codes.
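The style-swapping grid can be sketched roughly as follows, assuming a StyleGAN-like interface in which a mapping network produces one style vector per synthesis layer; generator, mapping, the latent dimension, and the layer cut-off are hypothetical stand-ins used only to illustrate the splicing.

```python
import torch

def style_mixing_grid(generator, mapping, text_emb, n: int = 4, cut: int = 6):
    """Splice per-layer style vectors: rows supply the coarse (early-layer)
    styles, columns supply the fine (late-layer) styles. Assumed shapes:
    mapping(z, text_emb) -> [1, num_layers, 512]."""
    zs = torch.randn(n, 512)                       # latent dim assumed to be 512
    ws = [mapping(z.unsqueeze(0), text_emb) for z in zs]

    grid = []
    for w_coarse in ws:                            # row: source of coarse styles
        row = []
        for w_fine in ws:                          # column: source of fine styles
            w_mix = torch.cat([w_coarse[:, :cut], w_fine[:, cut:]], dim=1)
            row.append(generator(w_mix, text_emb))
        grid.append(row)
    return grid                                    # grid[i][j]: coarse i, fine j
```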
Figure A6. Prompt interpolation. GigaGAN enables smooth interpolation between prompts, as shown in the interpolation grid. The four corners are generated from the same latent code but with different text prompts, for example “A modern mansion ..” vs. “A victorian mansion ..” crossed with “.. in a sunny day” vs. “.. in sunset”, and “Roses.” vs. “Sunflowers.” crossed with “oil painting” vs. “photograph”. The corresponding text embeddings and style vectors are interpolated to create a smooth transition, and using the same latent code results in similar layouts across prompts.
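Under the same assumed interface, the interpolation grid can be sketched as below: the four corner prompts are encoded once, and both the text embeddings and the corresponding style vectors are bilinearly interpolated while the latent code is held fixed (all function names are illustrative).

```python
import torch

def bilerp(c00, c01, c10, c11, u: float, v: float) -> torch.Tensor:
    """Bilinear interpolation between four corner tensors."""
    top = torch.lerp(c00, c01, u)
    bottom = torch.lerp(c10, c11, u)
    return torch.lerp(top, bottom, v)

def prompt_interpolation_grid(generator, mapping, encode_text, prompts, steps=5):
    """prompts = (top_left, top_right, bottom_left, bottom_right) strings."""
    z = torch.randn(1, 512)                    # shared latent -> similar layouts
    embs = [encode_text(p) for p in prompts]   # four corner text embeddings
    ws = [mapping(z, e) for e in embs]         # four corner style vectors

    grid = []
    for i in range(steps):
        v = i / (steps - 1)
        row = []
        for j in range(steps):
            u = j / (steps - 1)
            t = bilerp(*embs, u, v)            # interpolated text embedding
            w = bilerp(*ws, u, v)              # interpolated style vectors
            row.append(generator(w, t))
        grid.append(row)
    return grid
```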
Figure A7. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A loft bed with a dresser underneath it”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models. Still, we observe that our model falls behind in structural coherence, such as the number of legs on the bed frames. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A8. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A green vase filed with red roses sitting on top of table”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in structural coherence, such as the symmetry of the vases. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A9. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A zebra in the grass who is cleaning himself”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in details, such as the precise stripe pattern and the positioning of the eyes. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A10. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “A teddy bear on a skateboard in times square”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in details, such as the exact shape of the skateboards. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A11. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “Vibrant portrait painting of Salvador Dalí with a robotic half face”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in structural details, such as the detailed shape of the eyes. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Figure A12. Random outputs of our model, Latent Diffusion Model [79], Stable Diffusion [78], and DALL·E 2 [74] (1024px), using the prompt “Three men in military suits are sitting on a bench”. We show two versions of our model, one without truncation and the other with truncation. Our model enjoys faster speed than the diffusion models in both cases. Still, we observe that our model falls behind in details such as facial expressions and attire. For LDM and Stable Diffusion, we use 250 and 50 sampling steps with DDIM / PLMS [55], respectively. For DALL·E 2, we generate images using the official DALL·E service [64].
Panels, left to right: input artwork from AdobeStock (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); GigaGAN Upsampler (4K).
Figure A13. Our GAN-based upsampler can serve as the upsampler for many text-to-image models that generate initial outputs at low
resolutions like 64px or 128px. We simulate such usage by applying our 8× superresolution model on a low-res 128px artwork to obtain
the 1K output, using “Portrait of a kitten dressed in a bow tie. Red Rose. Valentine’s day.”. Then our model can be re-applied to go beyond
4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33]. Zooming
in is recommended for comparison between 1K and 4K outputs.
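The re-application described in this caption can be expressed as a simple loop; upsample_8x is a placeholder for a text-conditioned 8× super-resolution model, not an actual API of our release.

```python
from PIL import Image

def upscale_to(img: Image.Image, prompt: str, upsample_8x, target: int = 4096):
    """Repeatedly apply a text-conditioned 8x upsampler until the shorter side
    reaches the target, e.g. 128px -> 1024px -> beyond 4K."""
    while min(img.size) < target:
        img = upsample_8x(img, prompt)  # hypothetical model call
    return img
```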
Panels, left to right: input artwork from AdobeStock (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); GigaGAN Upsampler (4K).
Figure A14. Our GAN-based upsampler can serve as the upsampler for many text-to-image models that generate initial outputs at low
resolutions like 64px or 128px. We simulate such usage by applying our 8× superresolution model on a low-res 128px artwork to obtain
the 1K output, using “Heart shaped pancakes with honey and strawberry for Valentine’s Day”. Then our model can be re-applied to go
beyond 4K. We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33].
Zooming in is recommended for comparison between 1K and 4K outputs.
Panels, left to right: input photo (128px); Real-ESRGAN (1024px, 0.06s); SD Upscaler (1024px, 7.75s); GigaGAN Upsampler (1024px, 0.13s); GigaGAN Upsampler (4K).
Figure A15. Our GAN-based upsampler can also be used as an off-the-shelf superresolution model for real images with a large scaling
factor by providing an appropriate description of the image. We apply our text-conditioned 8× superresolution model on a low-res 128px
photo to obtain the 1K output, using “An elephant spraying water with its trunk”. Then our model can be re-applied to go beyond 4K.
We compare our model with the text-conditioned upscaler of Stable Diffusion [78] and unconditional Real-ESRGAN [33]. Zooming in is
recommended for comparison between 1K and 4K outputs.