
Taming Transformers for High-Resolution Image Synthesis

Patrick Esser* Robin Rombach* Björn Ommer


Heidelberg Collaboratory for Image Processing, IWR, Heidelberg University, Germany
*Both authors contributed equally to this work

Figure 1. Our approach enables transformers to synthesize high-resolution images like this one, which contains 1280×460 pixels.

Abstract

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers. Project page at https://git.io/JLlvY.

1. Introduction

Transformers are on the rise—they are now the de-facto standard architecture for language tasks [64, 50, 51, 5] and are increasingly adapted in other areas such as audio [12] and vision [8, 15]. In contrast to the predominant vision architecture, convolutional neural networks (CNNs), the transformer architecture contains no built-in inductive prior on the locality of interactions and is therefore free to learn complex relationships among its inputs. However, this generality also implies that it has to learn all relationships, whereas CNNs have been designed to exploit prior knowledge about strong local correlations within images. Thus, the increased expressivity of transformers comes with quadratically increasing computational costs, because all pairwise interactions are taken into account. The resulting energy and time requirements of state-of-the-art transformer models thus pose fundamental problems for scaling them to high-resolution images with millions of pixels.

Observations that transformers tend to learn convolutional structures [15] thus beg the question: Do we have to re-learn everything we know about the local structure and regularity of images from scratch each time we train a vision model, or can we efficiently encode inductive image biases while still retaining the flexibility of transformers? We hypothesize that low-level image structure is well described by a local connectivity, i.e. a convolutional architecture, whereas this structural assumption ceases to be effective on higher semantic levels. Moreover, CNNs not only exhibit a strong locality bias, but also a bias towards
spatial invariance through the use of shared weights across all positions. This makes them ineffective if a more holistic understanding of the input is required.

Our key insight to obtain an effective and expressive model is that, taken together, convolutional and transformer architectures can model the compositional nature of our visual world [44]: We use a convolutional approach to efficiently learn a codebook of context-rich visual parts and, subsequently, learn a model of their global compositions. The long-range interactions within these compositions require an expressive transformer architecture to model distributions over their constituent visual parts. Furthermore, we utilize an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure to alleviate the need for modeling low-level statistics with the transformer architecture. Allowing transformers to concentrate on their unique strength — modeling long-range relations — enables them to generate high-resolution images as in Fig. 1, a feat which previously has been out of reach. Our formulation directly gives control over the generated images by means of conditioning information regarding desired object classes or spatial layouts. Finally, experiments demonstrate that our approach retains the advantages of transformers by outperforming previous codebook-based state-of-the-art approaches based on convolutional architectures.

2. Related Work

The Transformer Family The defining characteristic of the transformer architecture [64] is that it models interactions between its inputs solely through attention [2, 32, 45], which enables it to faithfully handle interactions between inputs regardless of their relative position to one another. Originally applied to language tasks, inputs to the transformer were given by tokens, but other signals, such as those obtained from audio [37] or images [8], can be used. Each layer of the transformer then consists of an attention mechanism, which allows for interaction between inputs at different positions, followed by a position-wise fully connected network, which is applied to all positions independently. More specifically, the (self-)attention mechanism can be described by mapping an intermediate representation with three position-wise linear layers into three representations, query Q ∈ R^{N×d_k}, key K ∈ R^{N×d_k} and value V ∈ R^{N×d_v}, to compute the output as

Attn(Q, K, V) = softmax(QK^t / √d_k) V ∈ R^{N×d_v}.   (1)

When performing autoregressive maximum-likelihood learning, non-causal entries of QK^t, i.e. all entries above its diagonal, are set to −∞, and the final output of the transformer is given after a linear, point-wise transformation to predict logits of the next sequence element. Since the attention mechanism relies on the computation of inner products between all pairs of elements in the sequence, its computational complexity increases quadratically with the sequence length. While the ability to consider interactions between all elements is the reason transformers efficiently learn long-range interactions, it is also the reason transformers quickly become infeasible, especially on images, where the sequence length itself scales quadratically with the resolution. Different approaches have been proposed to reduce the computational requirements to make transformers feasible for longer sequences. [48] and [66] restrict the receptive fields of the attention modules, which reduces the expressivity and, especially for high-resolution images, introduces unjustified assumptions on the independence of pixels. [12] and [24] retain the full receptive field but can reduce costs for a sequence of length n only from n^2 to n√n, which makes resolutions beyond 64 pixels still prohibitively expensive.
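As an illustration of Eq. (1) together with the causal masking described above, a minimal sketch of masked scaled dot-product attention in PyTorch could look as follows; the tensor shapes and the toy usage are our own illustrative choices, not the authors' implementation:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask, cf. Eq. (1).

    q, k: (N, d_k) and v: (N, d_v) for a single sequence of length N.
    Entries above the diagonal of q k^t (attention to future positions)
    are set to -inf before the softmax, as required for autoregressive
    maximum-likelihood learning.
    """
    n, d_k = q.shape
    scores = q @ k.t() / math.sqrt(d_k)                   # (N, N) pairwise inner products
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))      # remove non-causal entries
    return F.softmax(scores, dim=-1) @ v                  # (N, d_v)

# toy usage: 8 tokens with d_k = d_v = 16
q, k, v = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
out = causal_attention(q, k, v)
```

The quadratic cost discussed next is visible directly in the (N, N) score matrix: doubling the sequence length quadruples both memory and compute for this step.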
Convolutional Approaches The two-dimensional structure of images suggests that local interactions are particularly important. CNNs exploit this structure by restricting interactions between input variables to a local neighborhood defined by the kernel size of the convolutional kernel. Applying a kernel thus results in costs that scale linearly with the overall sequence length (the number of pixels in the case of images) and quadratically in the kernel size, which, in modern CNN architectures, is often fixed to a small constant such as 3 × 3. This inductive bias towards local interactions thus leads to efficient computations, but the wide range of specialized layers which are introduced into CNNs to handle different synthesis tasks [46, 70, 59, 74, 73] suggests that this bias is often too restrictive.

Convolutional architectures have been used for autoregressive modeling of images [61, 62, 10] but, for low-resolution images, previous works [48, 12, 24] demonstrated that transformers consistently outperform their convolutional counterparts. Our approach allows us to efficiently model high-resolution images with transformers while retaining their advantages over state-of-the-art convolutional approaches.

Two-Stage Approaches Closest to ours are two-stage approaches which first learn an encoding of data and afterwards learn, in a second stage, a probabilistic model of this encoding. [13] demonstrated both theoretical and empirical evidence on the advantages of first learning a data representation with a Variational Autoencoder (VAE) [34, 54], and then again learning its distribution with a VAE. [17, 68] demonstrate similar gains when using an unconditional normalizing flow for the second stage, and [55, 56] when using a conditional normalizing flow. To improve training efficiency of Generative Adversarial Networks (GANs), [39]
learns a GAN [19] on representations of an autoencoder and [20] on low-resolution wavelet coefficients which are then decoded to images with a learned generator.

[63] presents the Vector Quantised Variational Autoencoder (VQVAE), an approach to learn discrete representations of images, and models their distribution autoregressively with a convolutional architecture. [53] extends this approach to use a hierarchy of learned representations. However, these methods still rely on convolutional density estimation, which makes it difficult to capture long-range interactions in high-resolution images. [8] models images autoregressively with transformers in order to evaluate the suitability of generative pretraining to learn image representations for downstream tasks. Since input resolutions of 32 × 32 pixels are still quite computationally expensive [8], a VQVAE is used to encode images up to a resolution of 192 × 192. In an effort to keep the learned discrete representation as spatially invariant as possible with respect to the pixels, a shallow VQVAE with small receptive field is employed. In contrast, we demonstrate that a powerful first stage, which captures as much context as possible in the learned representation, is critical to enable efficient high-resolution image synthesis with transformers.

Figure 2. Our approach uses a convolutional VQGAN to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method introduces the efficiency of convolutional approaches to transformer-based high-resolution image synthesis.

3. Approach

Our goal is to exploit the highly promising learning capabilities of transformer models [64] and introduce them to high-resolution image synthesis up to the megapixel range. Previous work [48, 8] which applied transformers to image generation demonstrated promising results for images up to a size of 64 × 64 pixels but, due to the quadratically increasing cost in sequence length, cannot simply be scaled to higher resolutions.

High-resolution image synthesis requires a model that understands the global composition of images, enabling it to generate locally realistic as well as globally consistent patterns. Therefore, instead of representing an image with pixels, we represent it as a composition of perceptually rich image constituents from a codebook. By learning an effective code, as described in Sec. 3.1, we can significantly reduce the description length of compositions, which allows us to efficiently model their global interrelations within images with a transformer architecture as described in Sec. 3.2. This approach, summarized in Fig. 2, is able to generate realistic and consistent high resolution images both in an unconditional and a conditional setting.

3.1. Learning an Effective Codebook of Image Constituents for Use in Transformers

To utilize the highly expressive transformer architecture for image synthesis, we need to express the constituents of an image in the form of a sequence. Instead of building on individual pixels, complexity necessitates an approach that uses a discrete codebook of learned representations, such that any image x ∈ R^{H×W×3} can be represented by a spatial collection of codebook entries z_q ∈ R^{h×w×n_z}, where n_z is the dimensionality of codes. An equivalent representation is a sequence of h · w indices which specify the respective entries in the learned codebook. To effectively learn such a discrete spatial codebook, we propose to directly incorporate the inductive biases of CNNs and incorporate ideas from neural discrete representation learning [63]. First, we learn a convolutional model consisting of an encoder E and a decoder G, such that taken together, they learn to
represent images with codes from a learned, discrete codebook Z = {z_k}_{k=1}^{K} ⊂ R^{n_z} (see Fig. 2 for an overview). More precisely, we approximate a given image x by x̂ = G(z_q). We obtain z_q using the encoding ẑ = E(x) ∈ R^{h×w×n_z} and a subsequent element-wise quantization q(·) of each spatial code ẑ_ij ∈ R^{n_z} onto its closest codebook entry z_k:

z_q = q(ẑ) := ( arg min_{z_k ∈ Z} ‖ẑ_ij − z_k‖ ) ∈ R^{h×w×n_z}.   (2)

The reconstruction x̂ ≈ x is then given by

x̂ = G(z_q) = G(q(E(x))).   (3)

Backpropagation through the non-differentiable quantization operation in Eq. (3) is achieved by a straight-through gradient estimator, which simply copies the gradients from the decoder to the encoder [3], such that the model and codebook can be trained end-to-end via the loss function

L_VQ(E, G, Z) = ‖x − x̂‖^2 + ‖sg[E(x)] − z_q‖_2^2 + β‖sg[z_q] − E(x)‖_2^2.   (4)

Here, L_rec = ‖x − x̂‖^2 is a reconstruction loss, sg[·] denotes the stop-gradient operation, and ‖sg[z_q] − E(x)‖_2^2 is the so-called “commitment loss” with weighting factor β [63].
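A compact sketch of the element-wise quantization of Eq. (2), the straight-through estimator, and the auxiliary terms of Eq. (4) might look as follows in PyTorch. The codebook size, code dimensionality, and the use of mean squared errors instead of summed squared norms are illustrative choices, not the configuration used in the paper; the reconstruction term of Eq. (4) would be computed separately from the decoder output.

```python
import torch
import torch.nn.functional as F

def quantize(z_hat, codebook, beta=0.25):
    """Element-wise quantization (Eq. 2) with a straight-through estimator.

    z_hat:    (B, h, w, n_z) continuous encoder output E(x)
    codebook: (K, n_z) learnable embedding matrix Z
    Returns the quantized z_q (gradients copied straight through to z_hat),
    the index map s of Eq. (8), and the codebook/commitment terms of Eq. (4).
    """
    flat = z_hat.reshape(-1, z_hat.shape[-1])              # (B*h*w, n_z)
    dist = torch.cdist(flat, codebook)                     # distances to all K entries
    s = dist.argmin(dim=-1)                                # index of the closest entry
    z_q = codebook[s].view_as(z_hat)                       # arg-min lookup, Eq. (2)

    codebook_loss = F.mse_loss(z_q, z_hat.detach())        # ~ ||sg[E(x)] - z_q||^2
    commit_loss = beta * F.mse_loss(z_hat, z_q.detach())   # ~ beta ||sg[z_q] - E(x)||^2

    z_q = z_hat + (z_q - z_hat).detach()                   # straight-through gradients [3]
    return z_q, s.view(z_hat.shape[:-1]), codebook_loss + commit_loss

# toy usage: a 16x16 latent with n_z = 256 and K = 1024 entries (illustrative sizes)
codebook = torch.nn.Parameter(torch.randn(1024, 256))
z_hat = torch.randn(2, 16, 16, 256, requires_grad=True)
z_q, s, aux_loss = quantize(z_hat, codebook)
```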
Learning a Perceptually Rich Codebook Using transformers to represent images as a distribution over latent image constituents requires us to push the limits of compression and learn a rich codebook. To do so, we propose VQGAN, a variant of the original VQVAE, and use a discriminator and perceptual loss [36, 26, 35, 16] to keep good perceptual quality at increased compression rate. Note that this is in contrast to previous works which applied pixel-based [62, 53] and transformer-based autoregressive models [8] on top of only a shallow quantization model. More specifically, we replace the L2 loss used in [63] for L_rec by a perceptual loss and introduce an adversarial training procedure with a patch-based discriminator D [25] that aims to differentiate between real and reconstructed images:

L_GAN({E, G, Z}, D) = [log D(x) + log(1 − D(x̂))].   (5)

The complete objective for finding the optimal compression model Q* = {E*, G*, Z*} then reads

Q* = arg min_{E,G,Z} max_D E_{x∼p(x)} [ L_VQ(E, G, Z) + λ L_GAN({E, G, Z}, D) ],   (6)

where we compute the adaptive weight λ according to

λ = ∇_{G_L}[L_rec] / ( ∇_{G_L}[L_GAN] + δ ),   (7)

where L_rec is the perceptual reconstruction loss [71], ∇_{G_L}[·] denotes the gradient of its input w.r.t. the last layer L of the decoder, and δ = 10^−6 is used for numerical stability. To aggregate context from everywhere, we apply a single attention layer on the lowest resolution. This training procedure significantly reduces the sequence length when unrolling the latent code and thereby enables the application of powerful transformer models.
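A sketch of how the adaptive weight of Eq. (7) could be computed in PyTorch is shown below, interpreting ∇_{G_L}[·] as the norm of the gradient with respect to the last decoder layer's weight. The variable name last_layer_weight and the detachment of the resulting scalar are our own assumptions for illustration, not the authors' exact implementation.

```python
import torch

def adaptive_weight(rec_loss, gan_loss, last_layer_weight, delta=1e-6):
    """lambda = ||grad_{G_L}[L_rec]|| / (||grad_{G_L}[L_GAN]|| + delta), cf. Eq. (7).

    rec_loss and gan_loss must be scalars in the current autograd graph;
    last_layer_weight is the weight tensor of the decoder's final layer G_L.
    """
    rec_grad = torch.autograd.grad(rec_loss, last_layer_weight, retain_graph=True)[0]
    gan_grad = torch.autograd.grad(gan_loss, last_layer_weight, retain_graph=True)[0]
    lam = rec_grad.norm() / (gan_grad.norm() + delta)
    return lam.detach()   # used as a scalar weight, not back-propagated through
```

Balancing the two losses this way keeps the adversarial term from dominating the perceptual reconstruction term, regardless of their absolute scales.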
3.2. Learning the Composition of Images with Transformers

Latent Transformers With E and G available, we can now represent images in terms of the codebook-indices of their encodings. More precisely, the quantized encoding of an image x is given by z_q = q(E(x)) ∈ R^{h×w×n_z} and is equivalent to a sequence s ∈ {0, . . . , |Z|−1}^{h×w} of indices from the codebook, which is obtained by replacing each code by its index in the codebook Z:

s_ij = k  such that  (z_q)_ij = z_k.   (8)

By mapping indices of a sequence s back to their corresponding codebook entries, z_q = (z_{s_ij}) is readily recovered and decoded to an image x̂ = G(z_q).

Thus, after choosing some ordering of the indices in s, image generation can be formulated as autoregressive next-index prediction: Given indices s_{<i}, the transformer learns to predict the distribution of possible next indices, i.e. p(s_i | s_{<i}), to compute the likelihood of the full representation as p(s) = ∏_i p(s_i | s_{<i}). This allows us to directly maximize the log-likelihood of the data representations:

L_Transformer = E_{x∼p(x)} [− log p(s)].   (9)
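A minimal training-step sketch for this second stage follows; any autoregressive transformer with a categorical output head over the K codebook indices would do, and the model, optimizer, and shifting scheme below are placeholders rather than the architecture used in the paper.

```python
import torch
import torch.nn.functional as F

def transformer_nll(logits, s):
    """Negative log-likelihood of Eq. (9) for a batch of index sequences.

    logits: (B, L, K) next-index predictions p(s_i | s_<i) from the transformer
    s:      (B, L)    ground-truth codebook indices (e.g. the flattened 16x16 code)
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), s.reshape(-1))

def train_step(model, optimizer, s):
    # teacher forcing: feed s_<i and predict s_i; how the very first index is
    # handled (start token vs. unconditional prior) is glossed over here
    logits = model(s[:, :-1])              # (B, L-1, K)
    loss = transformer_nll(logits, s[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```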
Conditioned Synthesis In many image synthesis tasks a user demands control over the generation process by providing additional information from which an example shall be synthesized. This information, which we will call c, could be a single label describing the overall image class or even another image itself. The task is then to learn the likelihood of the sequence given this information c:

p(s|c) = ∏_i p(s_i | s_{<i}, c).   (10)

If the conditioning information c has spatial extent, we first learn another VQGAN to obtain again an index-based representation r ∈ {0, . . . , |Z_c|−1}^{h_c×w_c} with the newly obtained codebook Z_c. Due to the autoregressive structure of the transformer, we can then simply prepend r to s and restrict the computation of the negative log-likelihood to entries p(s_i | s_{<i}, r). This “decoder-only” strategy has also been successfully used for text-summarization tasks [40].
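For conditioning information with spatial extent, the prepending strategy just described can be sketched as below. The placeholder model and the handling of the two codebooks' vocabularies (offsets, shared embeddings, etc.) are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def conditional_nll(model, r, s):
    """Decoder-only conditioning, cf. Eq. (10): prepend the condition indices r
    to the target indices s and compute the NLL only on the entries of s.

    r: (B, L_c) indices from the conditioning VQGAN's codebook Z_c
    s: (B, L)   indices from the image VQGAN's codebook Z
    """
    tokens = torch.cat([r, s], dim=1)           # [r; s] as one long sequence
    logits = model(tokens[:, :-1])              # predict token i from tokens < i
    image_logits = logits[:, r.size(1) - 1 :]   # positions that predict entries of s
    return F.cross_entropy(
        image_logits.reshape(-1, image_logits.size(-1)), s.reshape(-1)
    )
```

Because the loss is restricted to the image part of the sequence, the transformer is free to attend to all of r at every step without having to model the conditioning itself.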
Figure 3. Sliding attention window.

Generating High-Resolution Images The attention mechanism of the transformer puts limits on the sequence length h · w of its inputs s. While we can adapt the number of downsampling blocks m of our VQGAN to reduce images of size H × W to h = H/2^m × w = W/2^m, we observe degradation of the reconstruction quality beyond a critical value of m, which depends on the considered dataset. To generate images in the megapixel regime, we therefore have to work patch-wise and crop images to restrict the length of s to a maximally feasible size during training. To sample images, we then use the transformer in a sliding-window manner as illustrated in Fig. 3. Our VQGAN ensures that the available context is still sufficient to faithfully model images, as long as either the statistics of the dataset are approximately spatially invariant or spatial conditioning information is available. In practice, this is not a restrictive requirement, because when it is violated, i.e. unconditional image synthesis on aligned data, we can simply condition on image coordinates, similar to [38].
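A sketch of the sliding-window sampling of Fig. 3 on the level of latent indices is given below. The window size, the row-major ordering, and the callback sample_next_index (which would wrap the transformer's next-index prediction plus sampling) are illustrative assumptions rather than the authors' exact procedure.

```python
import torch

def sliding_window_sample(sample_next_index, h, w, window=16):
    """Autoregressively fill an h x w grid of codebook indices, conditioning each
    position only on a window x window neighborhood that slides with it (Fig. 3)."""
    s = torch.zeros(h, w, dtype=torch.long)
    for i in range(h):
        for j in range(w):
            # top-left corner of the attention window, clipped to the grid
            top = min(max(i - window // 2, 0), max(h - window, 0))
            left = min(max(j - window // 2, 0), max(w - window, 0))
            patch = s[top : top + window, left : left + window]
            # sample index (i, j) given its position inside the cropped patch
            s[i, j] = sample_next_index(patch, i - top, j - left)
    return s
```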
4. Experiments

This section evaluates the ability of our approach to retain the advantages of transformers over their convolutional counterparts (Sec. 4.1) while integrating the effectiveness of convolutional architectures to enable high-resolution image synthesis (Sec. 4.2). Furthermore, in Sec. 4.3, we investigate how codebook quality affects our approach. We close the analysis by providing a quantitative comparison to a wide range of existing approaches for generative image synthesis in Sec. 4.4. Based on initial experiments, we usually set |Z| = 1024 and train all subsequent transformer models to predict sequences of length 16 · 16, as this is the maximum feasible length to train a GPT2-medium architecture (307M parameters) [51] on a GPU with 12GB VRAM. More details on architectures and hyperparameters can be found in the appendix (Tab. 6 and Tab. 7).

4.1. Attention Is All You Need in the Latent Space

Transformers show state-of-the-art results on a wide variety of tasks, including autoregressive image modeling. However, evaluations of previous works were limited to transformers working directly on (low-resolution) pixels [48, 12, 24], or to deliberately shallow pixel encodings [8]. This raises the question if our approach retains the advantages of transformers over convolutional approaches.

To answer this question, we use a variety of conditional and unconditional tasks and compare the performance between our transformer-based approach and a convolutional approach. For each task, we train a VQGAN with m = 4 downsampling blocks, and, if needed, another one for the conditioning information, and then train both a transformer and a PixelSNAIL [10] model on the same representations, as the latter has been used in previous state-of-the-art two-stage approaches [53]. For a thorough comparison, we vary the model capacities between 85M and 310M parameters and adjust the number of layers in each model to match one another. We observe that PixelSNAIL trains roughly twice as fast as the transformer and thus, for a fair comparison, report the negative log-likelihood both for the same amount of training time (P-SNAIL time) and for the same amount of training steps (P-SNAIL steps).

Negative Log-Likelihood (NLL)
Data / # params     Transformer (P-SNAIL steps)   Transformer (P-SNAIL time)   PixelSNAIL (fixed time)
RIN / 85M           4.78                          4.84                         4.96
LSUN-CT / 310M      4.63                          4.69                         4.89
IN / 310M           4.78                          4.83                         4.96
D-RIN / 180M        4.70                          4.78                         4.88
S-FLCKR / 310M      4.49                          4.57                         4.64

Table 1. Comparing Transformer and PixelSNAIL architectures across different datasets and model sizes. For all settings, transformers outperform the state-of-the-art model from the PixelCNN family, PixelSNAIL, in terms of NLL. This holds both when comparing NLL at fixed times (PixelSNAIL trains roughly 2 times faster) and when trained for a fixed number of steps. See Sec. 4.1 for the abbreviations.

Results Tab. 1 reports results for unconditional image modeling on ImageNet (IN) [14], Restricted ImageNet (RIN) [57], consisting of a subset of animal classes from ImageNet, LSUN Churches and Towers (LSUN-CT) [69], and for conditional image modeling of RIN conditioned on depth maps obtained with the approach of [52] (D-RIN) and of landscape images collected from Flickr conditioned on semantic layouts (S-FLCKR) obtained with the approach of [7]. Note that for the semantic layouts, we train the first stage using a cross-entropy reconstruction loss due to their discrete nature. The results show that the transformer consistently outperforms PixelSNAIL across all tasks when trained for the same amount of time, and the gap increases even further when trained for the same number of steps. These results demonstrate that the gains of transformers carry over to our proposed two-stage setting.

4.2. A Unified Model for Image Synthesis Tasks

The versatility and generality of the transformer architecture makes it a promising candidate for image synthesis. In the conditional case, additional information c such as class labels or segmentation maps are used and the goal is to learn the distribution of images as described in Eq. (10). Using the same setting as in Sec. 4.1 (i.e. image size 256 × 256,
latent size 16 × 16), we perform various conditional image synthesis experiments:

(i): Semantic image synthesis, where we condition on semantic segmentation masks of ADE20K [72], a web-scraped landscapes dataset (S-FLCKR) and COCO-Stuff [6]. Results are depicted in Fig. 4, 5 and 6.
(ii): Structure-to-image, where we use either depth or edge information to synthesize images from both RIN and IN (see Sec. 4.1). The resulting depth-to-image and edge-to-image translations are visualized in Fig. 4 and Fig. 6.
(iii): Pose-guided synthesis: Instead of using the semantically rich information of either segmentation or depth maps, Fig. 4 shows that the same approach as for the previous experiments can be used to build a shape-conditional generative model on the DeepFashion [41] dataset.
(iv): Stochastic superresolution, where low-resolution images serve as the conditioning information and are thereby upsampled. We train our model for an upsampling factor of 8 on ImageNet and show results in Fig. 6.
(v): Class-conditional image synthesis: Here, the conditioning information c is a single index describing the class label of interest. Results on conditional sampling for the RIN dataset are demonstrated in Fig. 4.

Figure 4. Transformers within our setting unify a wide range of image synthesis tasks. We show 256 × 256 synthesis results across different conditioning inputs and datasets, all obtained with the same approach to exploit inductive biases of effective CNN-based VQGAN architectures in combination with the expressivity of transformer architectures. Top row: Completions from unconditional training on ImageNet. 2nd row: Depth-to-Image on RIN. 3rd row: Semantically guided synthesis on ADE20K. 4th row: Pose-guided person generation on DeepFashion. Bottom row: Class-conditional samples on RIN.

All of these examples make use of the same methodology. Instead of requiring task-specific architectures or modules, the flexibility of the transformer allows us to learn appropriate interactions for each task, while the VQGAN — which can be reused across different tasks — leads to short sequence lengths. In combination, the presented approach can be understood as an efficient, general-purpose mechanism for conditional image synthesis. Note that additional results for each experiment can be found in the appendix, Sec. C.

High-Resolution Synthesis The sliding window approach introduced in Sec. 3.2 enables image synthesis beyond a resolution of 256 × 256 pixels. We evaluate this approach on unconditional image generation on LSUN-CT and FacesHQ (see Sec. 4.3) and conditional synthesis on D-RIN, COCO-Stuff and S-FLCKR, where we show results in Fig. 1, 6 and the supplementary (Fig. 17-27). Note that this approach can in principle be used to generate images of arbitrary ratio and size, given that the image statistics of the dataset of interest are approximately spatially invariant or spatial information is available. Impressive results can be achieved by applying this method to image generation from semantic layouts on S-FLCKR, where a strong VQGAN can be learned with m = 5, so that its codebook together with the conditioning information provides the transformer with enough context for image generation in the megapixel regime.

4.3. Building Context-Rich Vocabularies

How important are context-rich vocabularies? To investigate this question, we ran experiments where the transformer architecture is kept fixed while the amount of context encoded into the representation of the first stage is varied through the number of downsampling blocks of our VQGAN. We specify the amount of context encoded in terms of the reduction factor in the side length between image inputs and the resulting representations, i.e. a first stage encoding images of size H × W into discrete codes of size H/f × W/f is denoted by a factor f. For f = 1, we reproduce the approach of [8] and replace our VQGAN by a k-means clustering of RGB values with k = 512. During training, we always crop images to obtain inputs of size 16 × 16 for the transformer, i.e. when modeling images with a factor f in the first stage, we use crops of size 16f × 16f. To sample from the models, we always apply them in a sliding window manner as described in Sec. 3.
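The relationship between the factor f, the latent code size, and the transformer's sequence length can be made concrete with a small calculation; the 256 × 256 input resolution below is just an example consistent with the numbers quoted in the text:

```python
# side lengths of the latent code and resulting sequence length for an H x W image
def latent_shape(H, W, f):
    return H // f, W // f

for f in (1, 2, 8, 16):
    h, w = latent_shape(256, 256, f)
    print(f"f={f:2d}: code {h}x{w}, sequence length {h * w}")
# f=16 yields a 16x16 code, i.e. the sequence length 16*16 = 256 used for training;
# a 16x16 crop of the code corresponds to a 16f x 16f image crop during training.
```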
Figure 5. Samples generated from semantic layouts on S-FLCKR. Sizes from top to bottom: 1280 × 832, 1024 × 416 and 1280 × 240 pixels. Best viewed zoomed in. A larger visualization can be found in the appendix, see Fig. 17.

Figure 6. Applying the sliding attention window approach (Fig. 3) to various conditional image synthesis tasks. Top: Depth-to-image on RIN, 2nd row: Stochastic superresolution on IN, 3rd and 4th row: Semantic synthesis on S-FLCKR, bottom: Edge-guided synthesis on IN. The resulting images vary between 368 × 496 and 1024 × 576 pixels, hence they are best viewed zoomed in.

Results Fig. 7 shows results for unconditional synthesis of faces on FacesHQ, the combination of CelebA-HQ [27] and FFHQ [29]. It clearly demonstrates the benefits of powerful VQGANs by increasing the effective receptive field of the transformer. For small receptive fields, or equivalently small f, the model cannot capture coherent structures. For an intermediate value of f = 8, the overall structure of images can be approximated, but inconsistencies of facial features, such as a half-bearded face, and of viewpoints in different parts of the image arise. Only our full setting of f = 16 can synthesize high-fidelity samples. For analogous results in the conditional setting on S-FLCKR, we refer to the appendix (Fig. 10 and Sec. B).

To assess the effectiveness of our approach quantitatively, we compare results between training a transformer directly on pixels and training it on top of a VQGAN's latent code with f = 2, given a fixed computational budget. Again, we follow [8] and learn a dictionary of 512 RGB values on CIFAR10 to operate directly on pixel space and train the same transformer architecture on top of our VQGAN with a latent code of size 16 × 16 = 256. We observe improvements of 18.63% for FIDs and 14.08× faster sampling of images.

Dataset      ours    SPADE [46]       Pix2PixHD (+aug) [65]   CRN [9]
COCO-Stuff   22.4    22.6 / 23.9(*)   111.5 (54.2)            70.4
ADE20K       35.5    33.9 / 35.7(*)   81.8 (41.5)             73.3

Table 2. FID score comparison for semantic image synthesis (256 × 256 pixels). (*): Recalculated with our evaluation protocol based on [43] on the validation splits of each dataset.
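The evaluation protocol mentioned in the caption above is based on torch-fidelity [43]. Assuming that package's calculate_metrics interface (an assumption about the library version; the directory names are placeholders), computing FID between generated samples and a validation split might look like:

```python
# pip install torch-fidelity  (API assumed from version 0.2.x; check the package docs)
import torch_fidelity

metrics = torch_fidelity.calculate_metrics(
    input1="samples/",   # directory of generated images (placeholder path)
    input2="val/",       # directory of validation images (placeholder path)
    fid=True,
)
print(metrics["frechet_inception_distance"])
```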
4.4. Quantitative Comparison to Existing Models

In this section we investigate how our approach quantitatively compares to existing models for generative image synthesis. In particular, we assess the performance of our model in terms of FID and compare to a variety of established models (GANs, VAEs, Flows, AR, Hybrid) on (i) semantic synthesis in Tab. 2 (where we compare to [46, 65, 31, 9]) and (ii) unconditional face synthesis in Tab. 3. Furthermore, to address a direct comparison to the original VQVAE-2 model [53], we train a class-conditional ImageNet transformer on 256 × 256 images, using a VQGAN with dim Z = 16384 and f = 16, and additionally compare to BigGAN [4] and MSP [18] in Tab. 4. Note that our model uses ≃ 10× fewer parameters than VQVAE-2,
which has an estimated parameter count of 13.5B (estimation based on https://github.com/rosinality/vq-vae-2-pytorch). While some task-specialized GAN models report better FID scores, our approach provides a unified model that works well across a wide range of tasks while retaining the ability to encode and reconstruct images. It thereby bridges the gap between purely adversarial and likelihood-based approaches. Fig. 11, 12, 13 and Fig. 14 contain qualitative samples corresponding to the quantitative analysis in Tab. 4.

Figure 7. Evaluating the importance of an effective codebook for HQ-Faces (CelebA-HQ and FFHQ) for a fixed sequence length |s| = 16·16 = 256. Globally consistent structures can only be modeled with a context-rich vocabulary (right). All samples are generated with temperature t = 1.0 and top-k sampling with k = 100. The last row reports the speedup over the f1 baseline, which operates directly on pixels and takes 7258 seconds to produce a sample on an NVIDIA GeForce GTX Titan X. Speed-ups for downsampling factors f1, f2, f8, f16: 1.0, 3.86, 65.81, 280.68.
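The caption above refers to sampling with temperature t = 1.0 and top-k filtering with k = 100; a generic sketch of such a sampling step over the K codebook logits follows (the function is our own minimal version, not the paper's code):

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=100, temperature=1.0):
    """Sample one codebook index from a (K,) logit vector with temperature and top-k."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)   # keep only the k most likely indices
    probs = F.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

# toy usage with a random logit vector over a 1024-entry codebook
index = sample_top_k(torch.randn(1024), k=100, temperature=1.0)
```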

CelebA-HQ 256 × 256                     FFHQ 256 × 256
Method                FID ↓             Method                     FID ↓
GLOW [33]             69.0              VDVAE (t = 0.7) [11]       38.8
NVAE [60]             40.3              VDVAE (t = 1.0)            33.5
PIONEER (B.) [21]     39.2 (25.3)       VDVAE (t = 0.8)            29.8
NCPVAE [1]            24.8              VDVAE (t = 0.9)            28.5
VAEBM [67]            20.4              VQGAN+P.SNAIL              21.9
Style ALAE [49]       19.2              BigGAN                     12.4
DC-VAE [47]           15.8              ours                       11.4
ours                  10.7              U-Net GAN (+aug) [58]      10.9 (7.6)
PGGAN [27]            8.0               StyleGAN2 (+aug) [30]      3.8 (3.6)

Table 3. FID score comparison for face image synthesis. CelebA-HQ results reproduced from [1, 47, 67, 22], FFHQ from [58, 28].

Dataset       ours (+R)     VQVAE-2 (+R)   BigGAN (-deep)   MSP
IN 256, 50K   19.8 (11.2)   38.1 (∼ 10)    7.1 (7.3)        n.a.
IN 256, 18K   23.5          n.a.           9.6 (9.7)        50.4

Table 4. FID score comparison for class-conditional synthesis. “+R”: classifier-based rejection sampling as proposed in VQVAE-2. FID*-values (calculated on reconstructed data, analogous to [53]): ours: 13.5 (8.1), VQVAE-2: 19 (5). BigGAN (-deep) evaluated via https://tfhub.dev/deepmind, truncated at 1.0.

How good is the VQGAN? Reconstruction FIDs obtained via the codebook provide a lower bound on the achievable FID of the generative model trained on it. To quantify the performance gains of our VQGAN over VQVAE-2, we evaluate this metric on ImageNet and report results in Tab. 5. Our VQGAN outperforms VQVAE-2 while providing significantly more compression (seq. length of 256 vs. 5120 = 32² + 64²). As expected, larger versions of VQGAN (either in terms of larger codebook sizes or increased code lengths) further improve performance. Using the same hierarchical codebook setting as in VQVAE-2 with our model provides the best reconstruction FID, albeit at the cost of a very long and thus impractical sequence. Furthermore, Fig. 9 qualitatively shows that a standard VQVAE cannot achieve such compressions; the corresponding reconstruction FIDs read: VQVAE 254.4; VQGAN 5.7. Sampling from this VQVAE cannot achieve FIDs below 254.4, whereas our VQGAN achieves 21.93 with PixelSNAIL and 11.44 with a transformer (see Tab. 3).

Model     Codebook Size       dim Z    FID ↓
VQVAE-2   64 × 64 & 32 × 32   512      ∼ 10
VQGAN     16 × 16             1024     8.0
VQGAN     16 × 16             16384    4.9
VQGAN     64 × 64 & 32 × 32   512      1.7

Table 5. Reconstruction FID on ImageNet (validation split). VQVAE-2 reported their reconstruction FID as “∼ 10”.

5. Conclusion

This paper addressed the fundamental challenges that previously confined transformers to low-resolution images. We proposed an approach which represents images as a composition of perceptually rich image constituents and thereby overcomes the infeasible quadratic complexity when modeling images directly in pixel space. Modeling constituents with a CNN architecture and their compositions with a transformer architecture taps into the full potential of their complementary strengths and thereby allowed us to represent the first results on high-resolution image synthesis with a transformer-based architecture. In experiments, our approach demonstrates the efficiency of convolutional inductive biases and the expressivity of transformers by synthesizing images in the megapixel range and outperforming state-of-the-art convolutional approaches. Equipped with a general mechanism for conditional synthesis, it offers many opportunities for novel neural rendering approaches.

This work has been supported by the German Research Foundation (DFG) projects 371923335, 421703927 and a hardware donation from NVIDIA corporation.
References

[1] Jyoti Aneja, Alexander G. Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational autoencoders with noise contrastive priors. CoRR, abs/2010.02917, 2020.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016.
[3] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR, 2019.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[6] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
[7] Liang-Chieh Chen, G. Papandreou, I. Kokkinos, Kevin Murphy, and A. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[8] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. 2020.
[9] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1520-1529. IEEE Computer Society, 2017.
[10] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 863-871. PMLR, 2018.
[11] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020.
[12] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.
[13] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In 7th International Conference on Learning Representations, ICLR, 2019.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, 2009.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. 2020.
[16] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, NeurIPS, 2016.
[17] Patrick Esser, Robin Rombach, and Björn Ommer. A disentangling invertible interpretation network for explaining latent representations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020.
[18] Jeffrey De Fauw, Sander Dieleman, and Karen Simonyan. Hierarchical autoregressive image models with auxiliary decoders. CoRR, abs/1903.04933, 2019.
[19] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, NeurIPS, 2014.
[20] Seungwook Han, Akash Srivastava, Cole L. Hurwitz, Prasanna Sattigeri, and David D. Cox. not-so-biggan: Generating high-fidelity images on a small compute budget. CoRR, abs/2009.04433, 2020.
[21] Ari Heljakka, Arno Solin, and Juho Kannala. Pioneer networks: Progressively growing generative autoencoder. In C. V. Jawahar, Hongdong Li, Greg Mori, and Konrad Schindler, editors, Computer Vision - ACCV 2018 - 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, Revised Selected Papers, Part I, 2018.
[22] Ari Heljakka, Arno Solin, and Juho Kannala. Towards photographic image manipulation with balanced growing of generative autoencoders. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pages 3109-3118. IEEE, 2020.
[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[24] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. CoRR, abs/1912.12180, 2019.
[25] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017.
[26] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 694-711. Springer, 2016.
[27] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
[28] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[29] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 4401-4410. Computer Vision Foundation / IEEE, 2019.
[30] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8107-8116. IEEE, 2020.
[31] Prateek Katiyar and Anna Khoreva. Improving augmentation and evaluation schemes for semantic image synthesis, 2021.
[32] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks, 2017.
[33] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS, 2018.
[34] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014.
[35] Alex Lamb, Vincent Dumoulin, and Aaron C. Courville. Discriminative regularization for generative models. CoRR, abs/1602.03220, 2016.
[36] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2015.
[37] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In AAAI, pages 6706-6713. AAAI Press, 2019.
[38] Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. COCO-GAN: Generation by parts via conditional coordinating. In ICCV, pages 4511-4520. IEEE, 2019.
[39] Jinlin Liu, Yuan Yao, and Jianqiang Ren. An acceleration framework for high resolution image synthesis. CoRR, abs/1909.03611, 2019.
[40] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. In ICLR (Poster). OpenReview.net, 2018.
[41] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[42] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[43] Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. toshas/torch-fidelity: Version 0.2.0, May 2020.
[44] B. Ommer and J. M. Buhmann. Learning the compositional nature of visual objects. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2007.
[45] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference, 2016.
[46] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.
[47] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder, 2020.
[48] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 4052-4061. PMLR, 2018.
[49] Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 14092-14101. IEEE, 2020.
[50] A. Radford. Improving language understanding by generative pre-training. 2018.
[51] A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[52] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[53] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2, 2019.
[54] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML, 2014.
[55] Robin Rombach, Patrick Esser, and Björn Ommer. Making sense of CNNs: Interpreting deep representations and their invariances with INNs. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XVII, volume 12362 of Lecture Notes in Computer Science, pages 647-664. Springer, 2020.
[56] Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2784-2797. Curran Associates, Inc., 2020.
[57] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Computer vision with a single (robust) classifier. In arXiv preprint arXiv:1906.09453, 2019.
[58] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A U-Net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204-8213. IEEE, 2020.
[59] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS), December 2019.
[60] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[61] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1747-1756. JMLR.org, 2016.
[62] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders, 2016.
[63] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, NeurIPS, 2017.
[65] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[66] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR. OpenReview.net, 2020.
[67] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models, 2021.
[68] Zhisheng Xiao, Qing Yan, Yi-an Chen, and Yali Amit. Generative latent flow: A framework for non-adversarial image generation. CoRR, abs/1905.10485, 2019.
[69] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[70] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen. Cross-domain correspondence learning for exemplar-based image translation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020.
[71] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[72] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.
[73] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow, 2017.
[74] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image synthesis with semantic region-adaptive normalization, 2019.
