Esser Taming Transformers For High-Resolution Image Synthesis CVPR 2021 Paper
Figure 1. Our approach enables transformers to synthesize high-resolution images like this one, which contains 1280 × 460 pixels.
spatial invariance through the use of shared weights across all positions. This makes them ineffective if a more holistic understanding of the input is required.

Our key insight to obtain an effective and expressive model is that, taken together, convolutional and transformer architectures can model the compositional nature of our visual world [44]: We use a convolutional approach to efficiently learn a codebook of context-rich visual parts and, subsequently, learn a model of their global compositions. The long-range interactions within these compositions require an expressive transformer architecture to model distributions over their constituent visual parts. Furthermore, we utilize an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure to alleviate the need for modeling low-level statistics with the transformer architecture. Allowing transformers to concentrate on their unique strength — modeling long-range relations — enables them to generate high-resolution images as in Fig. 1, a feat which previously has been out of reach. Our formulation directly gives control over the generated images by means of conditioning information regarding desired object classes or spatial layouts. Finally, experiments demonstrate that our approach retains the advantages of transformers by outperforming previous codebook-based state-of-the-art approaches based on convolutional architectures.

2. Related Work

The Transformer Family  The defining characteristic of the transformer architecture [64] is that it models interactions between its inputs solely through attention [2, 32, 45], which enables it to faithfully handle interactions between inputs regardless of their relative position to one another. Originally applied to language tasks, inputs to the transformer were given by tokens, but other signals, such as those obtained from audio [37] or images [8], can be used. Each layer of the transformer then consists of an attention mechanism, which allows for interaction between inputs at different positions, followed by a position-wise fully connected network, which is applied to all positions independently. More specifically, the (self-)attention mechanism can be described by mapping an intermediate representation with three position-wise linear layers into three representations, query Q ∈ R^{N×d_k}, key K ∈ R^{N×d_k} and value V ∈ R^{N×d_v}, to compute the output as

    \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^t}{\sqrt{d_k}}\right) V \in \mathbb{R}^{N \times d_v}.    (1)

When performing autoregressive maximum-likelihood learning, non-causal entries of QK^t, i.e. all entries above its diagonal, are set to −∞, and the final output of the transformer is given after a linear, point-wise transformation to predict logits of the next sequence element. Since the attention mechanism relies on the computation of inner products between all pairs of elements in the sequence, its computational complexity increases quadratically with the sequence length. While the ability to consider interactions between all elements is the reason transformers efficiently learn long-range interactions, it is also the reason transformers quickly become infeasible, especially on images, where the sequence length itself scales quadratically with the resolution. Different approaches have been proposed to reduce the computational requirements to make transformers feasible for longer sequences. [48] and [66] restrict the receptive fields of the attention modules, which reduces the expressivity and, especially for high-resolution images, introduces unjustified assumptions on the independence of pixels. [12] and [24] retain the full receptive field but can reduce costs for a sequence of length n only from n² to n√n, which makes resolutions beyond 64 pixels still prohibitively expensive.
can be described by mapping an intermediate representa-
tion with three position-wise linear layers into three repre- Two-Stage Approaches Closest to ours are two-stage ap-
sentations, query Q ∈ RN ×dk , key K ∈ RN ×dk and value proaches which first learn an encoding of data and after-
V ∈ RN ×dv , to compute the output as wards learn, in a second stage, a probabilistic model of this
QK t encoding. [13] demonstrated both theoretical and empirical
Attn(Q, K, V ) = softmax √ V ∈ RN ×dv . (1) evidence on the advantages of first learning a data repre-
dk sentation with a Variational Autoencoder (VAE) [34, 54],
When performing autoregressive maximum-likelihood and then again learning its distribution with a VAE. [17, 68]
learning, non-causal entries of QK t , i.e. all entries be- demonstrate similar gains when using an unconditional nor-
low its diagonal, are set to −∞ and the final output of the malizing flow for the second stage, and [55, 56] when using
transformer is given after a linear, point-wise transforma- a conditional normalizing flow. To improve training effi-
tion to predict logits of the next sequence element. Since ciency of Generative Adversarial Networks (GANs), [39]
Figure 2. Our approach uses a convolutional VQGAN to learn a codebook of context-rich visual parts, whose composition is subsequently
modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures and a
patch-based discriminator enables strong compression while retaining high perceptual quality. This method introduces the efficiency of
convolutional approaches to transformer-based high-resolution image synthesis.
sent images with codes from a learned, discrete codebook Z = {z_k}_{k=1}^{K} ⊂ R^{n_z} (see Fig. 2 for an overview). More precisely, we approximate a given image x by x̂ = G(z_q). We obtain z_q using the encoding ẑ = E(x) ∈ R^{h×w×n_z} and a subsequent element-wise quantization q(·) of each spatial code ẑ_{ij} ∈ R^{n_z} onto its closest codebook entry z_k:

    z_q = q(\hat{z}) := \left(\arg\min_{z_k \in Z} \|\hat{z}_{ij} - z_k\|\right) \in \mathbb{R}^{h \times w \times n_z}.    (2)

The reconstruction x̂ ≈ x is then given by

    \hat{x} = G(z_q) = G(q(E(x))).    (3)

Backpropagation through the non-differentiable quantization operation in Eq. (3) is achieved by a straight-through gradient estimator, which simply copies the gradients from the decoder to the encoder [3], such that the model and codebook can be trained end-to-end via the loss function

    \mathcal{L}_{VQ}(E, G, Z) = \|x - \hat{x}\|^2 + \|\mathrm{sg}[E(x)] - z_q\|_2^2 + \beta\,\|\mathrm{sg}[z_q] - E(x)\|_2^2.    (4)

Here, L_rec = ‖x − x̂‖² is a reconstruction loss, sg[·] denotes the stop-gradient operation, and ‖sg[z_q] − E(x)‖²₂ is the so-called “commitment loss” with weighting factor β [63].
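As an illustration of Eqs. (2)-(4), here is a minimal PyTorch sketch of nearest-codebook quantization with the straight-through gradient estimator and the resulting training loss. It is a simplified sketch under our own naming (quantize, vq_loss), not the released VQGAN code, and it uses a plain L2 reconstruction term where the paper uses a perceptual loss.

import torch
import torch.nn.functional as F

def quantize(z_hat, codebook):
    # z_hat: (..., n_z) spatial codes from E(x); codebook: (K, n_z) entries of Z.
    dist = torch.cdist(z_hat.reshape(-1, z_hat.shape[-1]), codebook)   # pairwise distances
    indices = dist.argmin(dim=-1)                                      # index sequence, cf. Eq. (8)
    z_q = codebook[indices].reshape(z_hat.shape)
    # Straight-through estimator [3]: forward pass uses z_q, backward pass
    # copies the decoder gradients to the encoder output z_hat.
    z_q_st = z_hat + (z_q - z_hat).detach()
    return z_q_st, z_q, indices

def vq_loss(x, x_rec, z_hat, z_q, beta=0.25):
    # Eq. (4); sg[.] is realized by .detach(); beta = 0.25 is a common choice following [63].
    rec = F.mse_loss(x_rec, x)
    codebook_term = F.mse_loss(z_q, z_hat.detach())    # ||sg[E(x)] - z_q||^2
    commit_term = F.mse_loss(z_hat, z_q.detach())      # ||sg[z_q] - E(x)||^2, weighted by beta
    return rec + codebook_term + beta * commit_term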
Learning a Perceptually Rich Codebook  Using transformers to represent images as a distribution over latent image constituents requires us to push the limits of compression and learn a rich codebook. To do so, we propose VQGAN, a variant of the original VQVAE, and use a discriminator and perceptual loss [36, 26, 35, 16] to keep good perceptual quality at increased compression rate. Note that this is in contrast to previous works which applied pixel-based [62, 53] and transformer-based autoregressive models [8] on top of only a shallow quantization model. More specifically, we replace the L2 loss used in [63] for L_rec by a perceptual loss and introduce an adversarial training procedure with a patch-based discriminator D [25] that aims to differentiate between real and reconstructed images:

    \mathcal{L}_{GAN}(\{E, G, Z\}, D) = \left[\log D(x) + \log(1 - D(\hat{x}))\right]    (5)

The complete objective for finding the optimal compression model Q* = {E*, G*, Z*} then reads

    Q^* = \arg\min_{E, G, Z}\,\max_{D}\; \mathbb{E}_{x \sim p(x)}\left[\mathcal{L}_{VQ}(E, G, Z) + \lambda\,\mathcal{L}_{GAN}(\{E, G, Z\}, D)\right],    (6)

where we compute the adaptive weight λ according to

    \lambda = \frac{\nabla_{G_L}[\mathcal{L}_{rec}]}{\nabla_{G_L}[\mathcal{L}_{GAN}] + \delta},    (7)

where L_rec is the perceptual reconstruction loss [71], ∇_{G_L}[·] denotes the gradient of its input w.r.t. the last layer L of the decoder, and δ = 10⁻⁶ is used for numerical stability. To aggregate context from everywhere, we apply a single attention layer on the lowest resolution. This training procedure significantly reduces the sequence length when unrolling the latent code and thereby enables the application of powerful transformer models.
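A hedged PyTorch sketch of the adaptive weight in Eq. (7): the helper assumes access to the weight tensor of the last decoder layer (here called last_layer) and both scalar losses. The gradient-norm ratio and the final clamp illustrate the balancing idea and are not the authors' exact implementation.

import torch

def adaptive_gan_weight(rec_loss, gan_loss, last_layer, delta=1e-6, max_weight=1e4):
    # Eq. (7): ratio of gradient norms of L_rec and L_GAN w.r.t. the last decoder layer L,
    # where last_layer is e.g. the weight tensor of the final decoder convolution.
    rec_grad = torch.autograd.grad(rec_loss, last_layer, retain_graph=True)[0]
    gan_grad = torch.autograd.grad(gan_loss, last_layer, retain_graph=True)[0]
    lam = rec_grad.norm() / (gan_grad.norm() + delta)   # delta = 1e-6 for numerical stability
    return lam.clamp(max=max_weight).detach()           # clamp is a stabilizing heuristic (our addition)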
3.2. Learning the Composition of Images with Transformers

Latent Transformers  With E and G available, we can now represent images in terms of the codebook-indices of their encodings. More precisely, the quantized encoding of an image x is given by z_q = q(E(x)) ∈ R^{h×w×n_z} and is equivalent to a sequence s ∈ {0, . . . , |Z|−1}^{h×w} of indices from the codebook, which is obtained by replacing each code by its index in the codebook Z:

    s_{ij} = k \quad \text{such that} \quad (z_q)_{ij} = z_k.    (8)

By mapping the indices of a sequence s back to their corresponding codebook entries, z_q = (z_{s_{ij}}) is readily recovered and decoded to an image x̂ = G(z_q).
Thus, after choosing some ordering of the indices in s, image generation can be formulated as autoregressive next-index prediction: Given indices s_{<i}, the transformer learns to predict the distribution of possible next indices, i.e. p(s_i | s_{<i}), to compute the likelihood of the full representation as p(s) = \prod_i p(s_i | s_{<i}). This allows us to directly maximize the log-likelihood of the data representations:

    \mathcal{L}_{\mathrm{Transformer}} = \mathbb{E}_{x \sim p(x)}\left[-\log p(s)\right].    (9)
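To make the factorization behind Eq. (9) concrete, a minimal PyTorch sketch of the training objective and of sampling a sequence of codebook indices follows. Here, transformer stands for any causal model that maps an index sequence of shape (B, L) to next-index logits of shape (B, L, |Z|) (e.g. a GPT-style decoder); all names are ours, and the first index of each sequence is assumed to be given (e.g. by a start token or a conditioning prefix).

import torch
import torch.nn.functional as F

def transformer_nll(transformer, s):
    # Eq. (9) for an index sequence s of shape (B, L): predict s[:, i] from s[:, :i]
    # via teacher forcing; the very first index is treated as given.
    logits = transformer(s[:, :-1])                              # (B, L-1, |Z|)
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]), s[:, 1:].reshape(-1))

@torch.no_grad()
def sample_indices(transformer, prefix, length, temperature=1.0):
    # Autoregressive sampling: repeatedly draw s_i ~ p(s_i | s_<i) and append it.
    s = prefix                                                   # (B, L0) start/conditioning indices
    for _ in range(length):
        logits = transformer(s)[:, -1] / temperature             # logits of the next index
        next_idx = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        s = torch.cat([s, next_idx], dim=1)
    return s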
Conditioned Synthesis  In many image synthesis tasks a user demands control over the generation process by providing additional information from which an example shall be synthesized. This information, which we will call c, could be a single label describing the overall image class or even another image itself. The task is then to learn the likelihood of the sequence given this information c:

    p(s|c) = \prod_i p(s_i | s_{<i}, c).    (10)

If the conditioning information c has spatial extent, we first learn another VQGAN to obtain again an index-based representation r ∈ {0, . . . , |Z_c|−1}^{h_c×w_c} with the newly obtained codebook Z_c. Due to the autoregressive structure of the transformer, we can then simply prepend r to s and restrict the computation of the negative log-likelihood to entries p(s_i | s_{<i}, r). This “decoder-only” strategy has also been successfully used for text-summarization tasks [40].
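A brief sketch of this prepending strategy, reusing the hypothetical transformer interface from above: the conditioning indices r are concatenated in front of the target indices s, and the loss is restricted to the positions that predict entries of s. For simplicity the sketch assumes r and s share one index vocabulary; in practice the two codebooks Z_c and Z would be merged or offset.

import torch
import torch.nn.functional as F

def conditional_nll(transformer, r, s):
    # Eq. (10) via the decoder-only strategy: r has shape (B, L_r), s has shape (B, L_s).
    seq = torch.cat([r, s], dim=1)                     # prepend conditioning indices
    logits = transformer(seq[:, :-1])                  # position t predicts seq[:, t + 1]
    logits_s = logits[:, r.shape[1] - 1:]              # keep only positions that predict s
    return F.cross_entropy(logits_s.reshape(-1, logits_s.shape[-1]), s.reshape(-1))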
[Table: Negative Log-Likelihood (NLL). Columns: Data / # params; Transformer (P-SNAIL steps); Transformer (P-SNAIL time); PixelSNAIL (fixed time).]
(iv): Stochastic superresolution, where low-resolution images serve as the conditioning information and are thereby upsampled. We train our model for an upsampling factor of 8 on ImageNet and show results in Fig. 6.
(v): Class-conditional image synthesis: Here, the condi-
tioning information c is a single index describing the class
label of interest. Results on conditional sampling for the
RIN dataset are demonstrated in Fig. 4.
All of these examples make use of the same methodology.
Instead of requiring task specific architectures or modules,
the flexibility of the transformer allows us to learn appropri-
ate interactions for each task, while the VQGAN — which
can be reused across different tasks — leads to short se-
quence lengths. In combination, the presented approach can
be understood as an efficient, general purpose mechanism
for conditional image synthesis. Note that additional results
for each experiment can be found in the appendix, Sec. C.
Figure 5. Samples generated from semantic layouts on S-FLCKR. Sizes from top-to-bottom: 1280 × 832, 1024 × 416 and 1280 × 240 pixels. Best viewed zoomed in. A larger visualization can be found in the appendix, see Fig. 17.

Figure 6. Applying the sliding attention window approach (Fig. 3) to various conditional image synthesis tasks. Top: Depth-to-image on RIN, 2nd row: Stochastic superresolution on IN, 3rd and 4th row: Semantic synthesis on S-FLCKR, bottom: Edge-guided synthesis on IN. The resulting images vary between 368 × 496 and 1024 × 576, hence they are best viewed zoomed in.
Results  Fig. 7 shows results for unconditional synthesis of faces on FacesHQ, the combination of CelebA-HQ [27] and FFHQ [29]. It clearly demonstrates the benefits of powerful VQGANs by increasing the effective receptive field of the transformer. For small receptive fields, or equivalently small f, the model cannot capture coherent structures. For an intermediate value of f = 8, the overall structure of images can be approximated, but inconsistencies of facial features, such as a half-bearded face, and of viewpoints in different parts of the image arise. Only our full setting of f = 16 can synthesize high-fidelity samples. For analogous results in the conditional setting on S-FLCKR, we refer to the appendix (Fig. 10 and Sec. B).
To assess the effectiveness of our approach quantitatively, we compare results between training a transformer directly on pixels and training it on top of a VQGAN's latent code with f = 2, given a fixed computational budget. Again, we follow [8] and learn a dictionary of 512 RGB values on CIFAR10 to operate directly on pixel space and train the same transformer architecture on top of our VQGAN with a latent code of size 16 × 16 = 256. We observe improvements of 18.63% for FIDs and 14.08× faster sampling of images.
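These comparisons, and those in Sec. 4.4 below, are reported in terms of FID, the Fréchet distance between Gaussians fitted to Inception features of real and generated images. As a reference, a minimal NumPy/SciPy sketch of the distance itself is given here; the feature extraction with an Inception network (as in the evaluation protocol based on [43]) is omitted, and the helper name is ours.

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    # FID between two feature sets of shape (N, d):
    # ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2}).
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real      # discard tiny imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))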
Dataset       ours    SPADE [46]     Pix2PixHD (+aug) [65]   CRN [9]
COCO-Stuff    22.4    22.6/23.9(*)   111.5 (54.2)            70.4
ADE20K        35.5    33.9/35.7(*)   81.8 (41.5)             73.3

Table 2. FID score comparison for semantic image synthesis (256 × 256 pixels). (*): Recalculated with our evaluation protocol based on [43] on the validation splits of each dataset.

4.4. Quantitative Comparison to Existing Models

In this section we investigate how our approach quantitatively compares to existing models for generative image synthesis. In particular, we assess the performance of our model in terms of FID and compare to a variety of established models (GANs, VAEs, Flows, AR, Hybrid) on (i) semantic synthesis in Tab. 2 (where we compare to [46, 65, 31, 9]) and (ii) unconditional face synthesis in Tab. 3. Furthermore, to address a direct comparison to the original VQVAE-2 model [53], we train a class-conditional ImageNet transformer on 256 × 256 images, using a VQGAN with dim Z = 16384 and f = 16, and additionally compare to BigGAN [4] and MSP [18] in Tab. 4. Note that our model uses ≃ 10× fewer parameters than VQVAE-2,
which has an estimated parameter count of 13.5B (estimation based on https://github.com/rosinality/vq-vae-2-pytorch). While some task-specialized GAN models report better FID scores, our approach provides a unified model that works well across a wide range of tasks while retaining the ability to encode and reconstruct images. It thereby bridges the gap between purely adversarial and likelihood-based approaches. Fig. 11, 12, 13 and Fig. 14 contain qualitative samples corresponding to the quantitative analysis in Tab. 4.

CelebA-HQ 256 × 256                FFHQ 256 × 256
Method               FID ↓         Method                   FID ↓
GLOW [33]            69.0          VDVAE (t = 0.7) [11]     38.8
NVAE [60]            40.3          VDVAE (t = 1.0)          33.5
PIONEER (B.) [21]    39.2 (25.3)   VDVAE (t = 0.8)          29.8
NCPVAE [1]           24.8          VDVAE (t = 0.9)          28.5
VAEBM [67]           20.4          VQGAN+P.SNAIL            21.9
Style ALAE [49]      19.2          BigGAN                   12.4
DC-VAE [47]          15.8          ours                     11.4
ours                 10.7          U-Net GAN (+aug) [58]    10.9 (7.6)
PGGAN [27]           8.0           StyleGAN2 (+aug) [30]    3.8 (3.6)

Table 3. FID score comparison for face image synthesis. CelebA-HQ results reproduced from [1, 47, 67, 22], FFHQ from [58, 28].

Dataset        ours (+R)      VQVAE-2 (+R)    BigGAN (-deep)   MSP
IN 256, 50K    19.8 (11.2)    38.1 (∼ 10)     7.1 (7.3)        n.a.
IN 256, 18K    23.5           n.a.            9.6 (9.7)        50.4

Table 4. FID score comparison for class-conditional synthesis. “+R”: classifier-based rejection sampling as proposed in VQVAE-2. FID*-values (calculated on reconstructed data, analogous to [53]): ours: 13.5 (8.1), VQVAE-2: 19 (5). BigGAN (-deep) evaluated via https://tfhub.dev/deepmind truncated at 1.0.

How good is the VQGAN?  Reconstruction FIDs obtained via the codebook provide a lower bound on the achievable FID of the generative model trained on it. To quantify the performance gains of our VQGAN over VQVAE-2, we evaluate this metric on ImageNet and report results in Tab. 5. Our VQGAN outperforms VQVAE-2 while providing significantly more compression (seq. length of 256 vs. 5120 = 32² + 64²). As expected, larger versions of VQGAN (either in terms of larger codebook sizes or increased code lengths) further improve performance. Using the same hierarchical codebook setting as in VQVAE-2 with our model provides the best reconstruction FID, albeit at the cost of a very long and thus impractical sequence. Furthermore, Fig. 9 qualitatively shows that a standard VQVAE cannot achieve such compressions; the corresponding reconstruction FIDs read: VQVAE 254.4; VQGAN 5.7. Sampling from this VQVAE cannot achieve FIDs below 254.4, whereas our VQGAN achieves 21.93 with PixelSNAIL and 11.44 with a transformer (see Tab. 3).

Model      Codebook Size       dim Z    FID ↓
VQVAE-2    64 × 64 & 32 × 32   512      ∼ 10
VQGAN      16 × 16             1024     8.0
VQGAN      16 × 16             16384    4.9
VQGAN      64 × 64 & 32 × 32   512      1.7

Table 5. Reconstruction FID on ImageNet (validation split). VQVAE-2 reported their reconstruction FID as “∼ 10”.
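As a sketch of how such a reconstruction FID could be measured, the following reuses the hypothetical quantize and frechet_distance helpers from above together with an encoder E, a decoder G and a feature extractor of one's choice; all of these names are assumptions for illustration, not the authors' evaluation code.

import torch

@torch.no_grad()
def reconstruction_features(E, G, codebook, images, feature_extractor):
    # Encode, quantize and decode a batch of images, then return features of
    # originals and reconstructions for the Frechet distance.
    z_hat = E(images).permute(0, 2, 3, 1)              # assume E yields (B, n_z, h, w)
    z_q_st, _, _ = quantize(z_hat, codebook)           # nearest-codebook lookup
    recons = G(z_q_st.permute(0, 3, 1, 2))             # back to channels-first for the decoder
    return feature_extractor(images), feature_extractor(recons)

# fid = frechet_distance(real_features, fake_features)  # NumPy arrays of extracted features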
5. Conclusion

This paper addressed the fundamental challenges that previously confined transformers to low-resolution images. We proposed an approach which represents images as a composition of perceptually rich image constituents and thereby overcomes the infeasible quadratic complexity when modeling images directly in pixel space. Modeling constituents with a CNN architecture and their compositions with a transformer architecture taps into the full potential of their complementary strengths and thereby allowed us to represent the first results on high-resolution image synthesis with a transformer-based architecture. In experiments, our approach demonstrates the efficiency of convolutional inductive biases and the expressivity of transformers by synthesizing images in the megapixel range and outperforming state-of-the-art convolutional approaches. Equipped with a general mechanism for conditional synthesis, it offers many opportunities for novel neural rendering approaches.

This work has been supported by the German Research Foundation (DFG) projects 371923335, 421703927 and a hardware donation from NVIDIA corporation.
References

[1] Jyoti Aneja, Alexander G. Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational autoencoders with noise contrastive priors. CoRR, abs/2010.02917, 2020.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016.
[3] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In 7th International Conference on Learning Representations, ICLR, 2019.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165, 2020.
[6] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
[7] Liang-Chieh Chen, G. Papandreou, I. Kokkinos, Kevin Murphy, and A. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[8] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. 2020.
[9] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1520–1529. IEEE Computer Society, 2017.
[10] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 863–871. PMLR, 2018.
[11] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020.
[12] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.
[13] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In 7th International Conference on Learning Representations, ICLR, 2019.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, 2009.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. 2020.
[16] Alexey Dosovitskiy and Thomas Brox. Generating Images with Perceptual Similarity Metrics based on Deep Networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, NeurIPS, 2016.
[17] Patrick Esser, Robin Rombach, and Björn Ommer. A Disentangling Invertible Interpretation Network for Explaining Latent Representations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020.
[18] Jeffrey De Fauw, Sander Dieleman, and Karen Simonyan. Hierarchical autoregressive image models with auxiliary decoders. CoRR, abs/1903.04933, 2019.
[19] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, NeurIPS, 2014.
[20] Seungwook Han, Akash Srivastava, Cole L. Hurwitz, Prasanna Sattigeri, and David D. Cox. not-so-biggan: Generating high-fidelity images on a small compute budget. CoRR, abs/2009.04433, 2020.
[21] Ari Heljakka, Arno Solin, and Juho Kannala. Pioneer networks: Progressively growing generative autoencoder. In C. V. Jawahar, Hongdong Li, Greg Mori, and Konrad Schindler, editors, Computer Vision - ACCV 2018 - 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, Revised Selected Papers, Part I, 2018.
[22] Ari Heljakka, Arno Solin, and Juho Kannala. Towards photographic image manipulation with balanced growing of generative autoencoders. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pages 3109–3118. IEEE, 2020.
[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[24] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. CoRR, abs/1912.12180, 2019.
[25] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017.
[26] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 694–711. Springer, 2016.
[27] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
[28] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[29] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR) 2019, Long Beach, CA, USA, June 16-20, 2019, pages 4401–4410. Computer Vision Foundation / IEEE, 2019.
[30] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8107–8116. IEEE, 2020.
[31] Prateek Katiyar and Anna Khoreva. Improving augmentation and evaluation schemes for semantic image synthesis, 2021.
[32] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks, 2017.
[33] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS, 2018.
[34] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014.
[35] Alex Lamb, Vincent Dumoulin, and Aaron C. Courville. Discriminative regularization for generative models. CoRR, abs/1602.03220, 2016.
[36] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2015.
[37] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In AAAI, pages 6706–6713. AAAI Press, 2019.
[38] Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. COCO-GAN: Generation by parts via conditional coordinating. In ICCV, pages 4511–4520. IEEE, 2019.
[39] Jinlin Liu, Yuan Yao, and Jianqiang Ren. An acceleration framework for high resolution image synthesis. CoRR, abs/1909.03611, 2019.
[40] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In ICLR (Poster). OpenReview.net, 2018.
[41] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[42] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[43] Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. toshas/torch-fidelity: Version 0.2.0, May 2020.
[44] B. Ommer and J. M. Buhmann. Learning the compositional nature of visual objects. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
[45] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference, 2016.
[46] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.
[47] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder, 2020.
[48] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 4052–4061. PMLR, 2018.
[49] Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 14092–14101. IEEE, 2020.
[50] A. Radford. Improving language understanding by generative pre-training. 2018.
[51] A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[52] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[53] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2, 2019.
[54] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014.
[55] Robin Rombach, Patrick Esser, and Björn Ommer. Making sense of CNNs: Interpreting deep representations and their invariances with INNs. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XVII, volume 12362 of Lecture Notes in Computer Science, pages 647–664. Springer, 2020.
[56] Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2784–2797. Curran Associates, Inc., 2020.
[57] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Computer vision with a single (robust) classifier. In ArXiv preprint arXiv:1906.09453, 2019.
[58] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A U-Net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. IEEE, 2020.
[59] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS), December 2019.
[60] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[61] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1747–1756. JMLR.org, 2016.
[62] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders, 2016.
[63] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, NeurIPS, 2017.
[65] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[66] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR. OpenReview.net, 2020.
[67] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models, 2021.
[68] Zhisheng Xiao, Qing Yan, Yi-an Chen, and Yali Amit. Generative latent flow: A framework for non-adversarial image generation. CoRR, abs/1905.10485, 2019.
[69] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[70] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen. Cross-Domain Correspondence Learning for Exemplar-Based Image Translation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020.
[71] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
[72] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.
[73] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow, 2017.
[74] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image synthesis with semantic region-adaptive normalization, 2019.