CVPR2020-Image Processing Using Multi-Code GAN Prior

The document presents a novel approach called mGANprior, which utilizes multiple latent codes to enhance image reconstruction quality in various image processing tasks using Generative Adversarial Networks (GANs). This method allows for better recovery of target images by composing feature maps with adaptive channel importance, significantly outperforming existing techniques. The proposed approach is applicable to tasks such as image colorization, super-resolution, and inpainting without the need for retraining the GAN models.


Image Processing Using Multi-Code GAN Prior

Jinjin Gu¹,², Yujun Shen¹, Bolei Zhou¹

¹ The Chinese University of Hong Kong
² The Chinese University of Hong Kong, Shenzhen
[email protected], {sy116, bzhou}@ie.cuhk.edu.hk
arXiv:1912.07116v2 [cs.CV] 31 Mar 2020

Figure 1: Multi-code GAN prior facilitates many image processing applications using the reconstruction from fixed PGGAN [23] models. Panels: (a) Image Reconstruction, (b) Image Colorization, (c) Image Super-Resolution, (d) Image Denoising, (e) Image Inpainting, (f) Semantic Manipulation.

Abstract

Despite the success of Generative Adversarial Networks (GANs) in image synthesis, applying trained GAN models to real image processing remains challenging. Previous methods typically invert a target image back to the latent space either by back-propagation or by learning an additional encoder. However, the reconstructions from both methods are far from ideal. In this work, we propose a novel approach, called mGANprior, to incorporate well-trained GANs as an effective prior for a variety of image processing tasks. In particular, we employ multiple latent codes to generate multiple feature maps at some intermediate layer of the generator, then compose them with adaptive channel importance to recover the input image. Such an over-parameterization of the latent space significantly improves the image reconstruction quality, outperforming existing competitors. The resulting high-fidelity image reconstruction enables the trained GAN models to serve as a prior for many real-world applications, such as image colorization, super-resolution, image inpainting, and semantic manipulation. We further analyze the properties of the layer-wise representation learned by GAN models and shed light on what knowledge each layer is capable of representing.¹

¹ Code is available at this link.

1. Introduction

Recently, Generative Adversarial Networks (GANs) [16] have advanced image generation by improving the synthesis quality [23, 8, 24] and stabilizing the training process [1, 7, 17]. The capability to produce high-quality images makes GANs applicable to many image processing tasks, such as semantic face editing [27, 36], super-resolution [28, 42], image-to-image translation [53, 11, 31], etc. However, most of these GAN-based approaches require a special design of network structures [27, 53] or loss functions [36, 28] for a particular task, limiting their generalization ability. On the other hand, large-scale GAN models, like StyleGAN [24] and BigGAN [8], can synthesize photo-realistic images after being trained with millions of diverse images. Their neural representations are shown to contain various levels of semantics underlying the observed data [21, 15, 35, 44]. Reusing these models as a prior for real image processing with minor effort could potentially lead to wider applications but remains much less explored.

The main challenge towards this goal is that the standard GAN model is initially designed for synthesizing images from random noise, and is thus unable to take real images for any post-processing. A common practice is to invert a given image back to a latent code such that it can be reconstructed by the generator. In this way, the inverted code can be used for further processing. To reverse the generation process, existing approaches fall into two types. One is to directly optimize the latent code by minimizing the reconstruction error through back-propagation [30, 12, 32]. The other is to train an extra encoder to learn the mapping from the image space to the latent space [34, 52, 6, 5]. However, the reconstructions achieved by both methods are far from ideal, especially when the given image has high resolution. Consequently, the low-quality reconstructed image cannot be used for image processing tasks.

In principle, it is impossible to recover every detail of an arbitrary real image using a single latent code; otherwise, we would have an unbeatable image compression method. In other words, the expressiveness of the latent code is limited due to its finite dimensionality. Therefore, to faithfully recover a target image, we propose to employ multiple latent codes and compose their corresponding feature maps at some intermediate layer of the generator. Utilizing multiple latent codes allows the generator to recover the target image using all the possible composition knowledge learned in the deep generative representation. The experiments show that our approach significantly improves the image reconstruction quality. More importantly, being able to better reconstruct the input image, our approach facilitates various real image processing applications by using pre-trained GAN models as a prior without retraining or modification, as shown in Fig.1. We summarize our contributions as follows:

• We propose mGANprior, short for multi-code GAN prior, as an effective GAN inversion method that uses multiple latent codes and adaptive channel importance. The method faithfully reconstructs the given real image, surpassing existing approaches.
• We apply the proposed mGANprior to a range of real-world applications, such as image colorization, super-resolution, image inpainting, semantic manipulation, etc., demonstrating its potential in real image processing.
• We further analyze the internal representation of different layers in a GAN generator by composing the features from the inverted latent codes at each layer respectively.

2. Related Work

GAN Inversion. The task of GAN inversion aims at reversing a given image back to a latent code with a pre-trained GAN model. As an important step for applying GANs to real-world applications, it has attracted increasing attention recently. To invert a fixed generator in a GAN, existing methods either optimize the latent code based on gradient descent [30, 12, 32] or learn an extra encoder to project the image space back to the latent space [34, 52, 6, 5]. Bau et al. [3] proposed to use an encoder to provide a better initialization for optimization. There are also some models that take invertibility into account at the training stage [14, 13, 26]. However, all the above methods only consider using a single latent code to recover the input image, and the reconstruction quality is far from ideal, especially when the test image shows a huge domain gap to the training data. That is because the input image may not lie in the synthesis space of the generator, in which case a perfect inversion with a single latent code does not exist. By contrast, we propose to increase the number of latent codes, which significantly improves the inversion quality no matter whether the target image is in-domain or out-of-domain.

Image Processing with GANs. GANs have been widely used for real image processing due to their great power of synthesizing photo-realistic images. These applications include image denoising [9, 25], image inpainting [45, 47], super-resolution [28, 42], image colorization [38, 20], style mixing [19, 10], semantic image manipulation [41, 29], etc. However, current GAN-based models are usually designed for a particular task with specialized architectures [19, 41] or loss functions [28, 10], and trained with paired data by taking one image as input and the other as supervision [45, 20]. In contrast, our approach can reuse the knowledge contained in a well-trained GAN model and further enable a single GAN model to serve as a prior for all the aforementioned tasks without retraining or modification. It is worth noting that our method can achieve similar or even better results than existing GAN-based methods that are specifically trained for a certain task.

Deep Model Prior. Generally, the impressive performance of deep convolutional models can be attributed to their capacity for capturing statistical information from large-scale data as a prior. Such a prior can be inversely used for image generation and image reconstruction [40, 39, 2]. Upchurch et al. [40] inverted a discriminative model, starting from deep convolutional features, to achieve semantic image transformation. Ulyanov et al. [39] reconstructed the target image with a U-Net structure to show that the structure of a generator network is sufficient to capture the low-level image statistics prior to any learning. Athar et al. [2] learned a universal image prior for a variety of image restoration tasks. Some work theoretically explored the prior provided by deep generative models [32, 18], but the results of applying a GAN prior to real image processing are still unsatisfying. A recent work [3] applied a generative image prior to semantic photo manipulation, but it can only edit some partial regions of the input image and fails to apply to other tasks like colorization or super-resolution. That is because it only inverts the GAN model to some intermediate feature space instead of the earliest hidden space. By contrast, our method reverses the entire generative process, i.e., from the image space to the initial latent space, which supports more flexible image processing tasks.

[Diagram: each latent code $z_n$ is passed through $G_1^{(\ell)}$ to give a feature map $F_n^{(\ell)}$, which is multiplied channel-wise by $\alpha_n$; the sum $\sum_{n=1}^{N} F_n^{(\ell)} \odot \alpha_n$ is passed through $G_2^{(\ell)}$ to give the inversion result $x^{inv}$, which is compared with the target image $x$ using MSE + perceptual loss.]

Figure 2: Pipeline of GAN inversion using multiple latent codes $\{z_n\}_{n=1}^{N}$. The generative features from these latent codes are composed at some intermediate layer (i.e., the $\ell$-th layer) of the generator, weighted by the adaptive channel importance scores $\{\alpha_n\}_{n=1}^{N}$. All latent codes and the corresponding channel importance scores are jointly optimized to recover a target image.

3. Multi-Code GAN Prior

A well-trained generator $G(\cdot)$ of a GAN can synthesize high-quality images by sampling codes from the latent space $\mathcal{Z}$. Given a target image $x$, the GAN inversion task aims at reversing the generation process by finding the adequate code to recover $x$. It can be formulated as

$$z^* = \arg\min_{z \in \mathcal{Z}} \mathcal{L}(G(z), x), \tag{1}$$

where $\mathcal{L}(\cdot, \cdot)$ denotes the objective function.

However, due to the highly non-convex nature of this optimization problem, previous methods fail to ideally reconstruct an arbitrary image by optimizing a single latent code. To this end, we propose to use multiple latent codes and compose their corresponding intermediate feature maps with adaptive channel importance, as illustrated in Fig.2.

3.1. GAN Inversion with Multiple Latent Codes

The expressiveness of a single latent code may not be enough to recover all the details of a certain image. Then, how about using $N$ latent codes $\{z_n\}_{n=1}^{N}$, each of which can help reconstruct some sub-regions of the target image? In the following, we introduce how to utilize multiple latent codes for GAN inversion.

Feature Composition. One key difficulty after introducing multiple latent codes is how to integrate them in the generation process. A straightforward solution is to fuse the images generated by each $z_n$ in the image space $\mathcal{X}$. However, $\mathcal{X}$ is not naturally a linear space, so linearly combining synthesized images is not guaranteed to produce a meaningful image, let alone recover the input in detail. A recent work [5] pointed out that inverting a generative model from the image space to some intermediate feature space is much easier than to the latent space. Accordingly, we propose to combine the latent codes by composing their intermediate feature maps. More concretely, the generator $G(\cdot)$ is divided into two sub-networks, i.e., $G_1^{(\ell)}(\cdot)$ and $G_2^{(\ell)}(\cdot)$. Here, $\ell$ is the index of the intermediate layer at which feature composition is performed. With such a separation, for any $z_n$, we can extract the corresponding spatial feature $F_n^{(\ell)} = G_1^{(\ell)}(z_n)$ for further composition.

Adaptive Channel Importance. Recall that we would like each $z_n$ to recover some particular regions of the target image. Bau et al. [4] observed that different units (i.e., channels) of the generator in a GAN are responsible for generating different visual concepts such as objects and textures. Based on this observation, we introduce the adaptive channel importance $\alpha_n$ for each $z_n$ to help them align with different semantics. Here, $\alpha_n \in \mathbb{R}^C$ is a $C$-dimensional vector and $C$ is the number of channels in the $\ell$-th layer of $G(\cdot)$. We expect each entry of $\alpha_n$ to represent how important the corresponding channel of the feature map $F_n^{(\ell)}$ is. With such composition, the reconstructed image can be generated with

$$x^{inv} = G_2^{(\ell)}\Big(\sum_{n=1}^{N} F_n^{(\ell)} \odot \alpha_n\Big), \tag{2}$$

where $\odot$ denotes the channel-wise multiplication as

$$\{F_n^{(\ell)} \odot \alpha_n\}_{i,j,c} = \{F_n^{(\ell)}\}_{i,j,c} \times \{\alpha_n\}_c. \tag{3}$$

Here, $i$ and $j$ indicate the spatial location, while $c$ stands for the channel index.

Optimization Objective. After introducing the feature composition technique together with the adaptive channel importance to integrate multiple latent codes, there are $2N$ sets of parameters to be optimized in total. Accordingly, we reformulate Eq.(1) as

$$\{z_n^*\}_{n=1}^{N}, \{\alpha_n^*\}_{n=1}^{N} = \arg\min_{\{z_n\}_{n=1}^{N}, \{\alpha_n\}_{n=1}^{N}} \mathcal{L}(x^{inv}, x). \tag{4}$$
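The composition in Eqs.(2) and (3) can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions: `g1` and `g2` are hypothetical stand-ins for $G_1^{(\ell)}$ and $G_2^{(\ell)}$ (a random projection and a 1x1 channel projection, not PGGAN layers), the sizes are arbitrary, and the random initialization of the codes and importance scores is an assumption. The optimization of Eq.(4) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): N latent codes, C channels, H x W feature maps.
N, C, H, W = 4, 8, 5, 5
LATENT_DIM = 16

# Hypothetical stand-ins for the two halves of a generator split at layer l.
W1 = rng.standard_normal((LATENT_DIM, C * H * W)) * 0.1
W2 = rng.standard_normal((C, 3)) * 0.1


def g1(z):
    """Toy G_1: latent code -> feature map of shape (C, H, W)."""
    return np.tanh(z @ W1).reshape(C, H, W)


def g2(feature):
    """Toy G_2: feature map (C, H, W) -> 3-channel 'image' (3, H, W)."""
    return np.einsum("chw,ck->khw", feature, W2)


def compose(features, alphas):
    """Eqs.(2)-(3): sum over n of F_n scaled channel-wise by alpha_n.

    features has shape (N, C, H, W); alphas has shape (N, C) and is
    broadcast over the two spatial axes.
    """
    return np.sum(features * alphas[:, :, None, None], axis=0)


z = rng.standard_normal((N, LATENT_DIM))   # N latent codes
alpha = rng.standard_normal((N, C))        # N channel importance vectors
features = np.stack([g1(z_n) for z_n in z])
x_inv = g2(compose(features, alpha))
```

With a single code and all-ones importance, `compose` returns that code's own feature map, so the multi-code formulation contains single-code generation as a special case.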

To improve the reconstruction quality, we define the objective function by leveraging both low-level and high-level information. In particular, we use the pixel-wise reconstruction error as well as the $\ell_1$ distance between the perceptual features [22] extracted from the two images.² Therefore, the objective function is as follows:

$$\mathcal{L}(x_1, x_2) = \|x_1 - x_2\|_2^2 + \|\phi(x_1) - \phi(x_2)\|_1, \tag{5}$$

where $\phi(\cdot)$ denotes the perceptual feature extractor. We use the gradient descent algorithm to find the optimal latent codes as well as the corresponding channel importance scores.

² In this experiment, we use a pre-trained VGG-16 model [37] as the feature extractor, and the output of layer conv4_3 is used.

3.2. Multi-Code GAN Prior for Image Processing

After inversion, we apply the reconstruction result as a multi-code GAN prior to a variety of image processing tasks. Each task requires an image as a reference, which is the input image for processing. For example, the image colorization task deals with grayscale images and the image inpainting task restores images with missing holes. Given an input, we apply the proposed multi-code GAN inversion method to reconstruct it and then post-process the reconstructed image to approximate the input. When the approximation is close enough to the input, we assume the reconstruction before post-processing is what we want. Here, to adapt mGANprior to a specific task, we modify Eq.(5) based on the post-processing function:

• For the image colorization task, with a grayscale image $I_{gray}$ as the input, we expect the inversion result to have the same gray channel as $I_{gray}$ with

$$\mathcal{L}_{color} = \mathcal{L}(\mathrm{gray}(x^{inv}), I_{gray}), \tag{6}$$

where $\mathrm{gray}(\cdot)$ stands for the operation that takes the gray channel of an image.

• For the image super-resolution task, with a low-resolution image $I_{LR}$ as the input, we downsample the inversion result to approximate $I_{LR}$ with

$$\mathcal{L}_{SR} = \mathcal{L}(\mathrm{down}(x^{inv}), I_{LR}), \tag{7}$$

where $\mathrm{down}(\cdot)$ stands for the downsampling operation.

• For the image inpainting task, with an intact image $I_{ori}$ and a binary mask $m$ indicating known pixels, we only reconstruct the uncorrupted parts and let the GAN model fill in the missing pixels automatically with

$$\mathcal{L}_{inp} = \mathcal{L}(x^{inv} \circ m, I_{ori} \circ m), \tag{8}$$

where $\circ$ denotes the element-wise product.

[Figure 3 image grid. Columns: PGGAN CelebA-HQ, PGGAN Church, PGGAN Bedroom. Rows: Target Image, (a) Optimization, (b) Encoder, (c) Encoder + Optimization, (d) Ours.]

Figure 3: Qualitative comparison of different GAN inversion methods, including (a) optimizing a single latent code [32], (b) learning an encoder [52], (c) using the encoder as initialization for optimization [5], and (d) our proposed mGANprior.

4. Experiments

We conduct extensive experiments on state-of-the-art GAN models, i.e., PGGAN [23] and StyleGAN [24], to verify the effectiveness of mGANprior. These models are trained on various datasets, including CelebA-HQ [23] and FFHQ [24] for faces as well as LSUN [46] for scenes.

4.1. Comparison with Other Inversion Methods

There have been many attempts at GAN inversion in the literature. In this section, we compare our multi-code inversion approach with the following baseline methods: (a) optimizing a single latent code $z$ as in Eq.(1) [32], (b) learning an encoder to reverse the generator [52], and (c) combining (a) and (b) by using the output of the encoder as the initialization for further optimization [5].
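Baseline (a), optimizing a single latent code as in Eq.(1), can be sketched on a toy problem. The sketch below assumes a linear "generator" $G(z) = Az$ and a plain squared-error objective, which is far simpler than a real GAN with the loss of Eq.(5); the function name, sizes, and step-size rule are illustrative assumptions. It is meant only to show why a target outside the generator's range cannot be reconstructed exactly by one code.

```python
import numpy as np


def invert_single_code(A, target, steps=500):
    """Gradient descent on ||A z - x||^2 for a toy linear generator G(z) = A z."""
    z = np.zeros(A.shape[1])
    # Step size derived from the largest singular value keeps the iteration stable.
    lr = 0.5 / np.linalg.norm(A, 2) ** 2
    for _ in range(steps):
        residual = A @ z - target
        z -= lr * 2.0 * (A.T @ residual)   # gradient of the squared error
    return z


rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8))        # 8-dim latent space, 64-dim "image" space

x_in = A @ rng.standard_normal(8)       # target inside the generator's range
x_out = rng.standard_normal(64)         # arbitrary target, generally outside it

err_in = np.sum((A @ invert_single_code(A, x_in) - x_in) ** 2)
err_out = np.sum((A @ invert_single_code(A, x_out) - x_out) ** 2)
```

Here `err_in` shrinks to numerical zero, while `err_out` stays at the least-squares residual no matter how long the optimization runs. For a real generator the problem is also non-convex, so this toy understates the difficulty.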

To quantitatively evaluate the inversion results, we introduce the Peak Signal-to-Noise Ratio (PSNR) to measure the similarity between the original input and the reconstruction result at the pixel level, as well as the LPIPS metric [49], which is known to align with human perception. We make comparisons on three PGGAN [23] models that are trained on LSUN bedroom (indoor scene), LSUN church (outdoor scene), and CelebA-HQ (human face) respectively. For each model, we invert 300 real images for testing.

Tab.1 and Fig.3 show the quantitative and qualitative comparisons respectively. From Tab.1, we can tell that mGANprior beats other competitors on all three models at both the pixel level (PSNR) and the perception level (LPIPS). We also observe in Fig.3 that existing methods fail to recover the details of the target image, which is due to the limited representation capability of a single latent code. By contrast, our method achieves much more satisfying reconstructions with most details, benefiting from multiple latent codes. We even recover an eastern face with a model trained on western data (CelebA-HQ [23]).

Table 1: Quantitative comparison of different GAN inversion methods, including (a) optimizing a single latent code [32], (b) learning an encoder [52], (c) using the encoder as initialization for optimization [5], and (d) our proposed mGANprior. ↑ means the higher the better while ↓ means the lower the better.

          Bedroom            Church             Face
Method    PSNR↑    LPIPS↓    PSNR↑    LPIPS↓    PSNR↑    LPIPS↓
(a)       17.19    0.5897    17.15    0.5339    19.17    0.5797
(b)       11.59    0.6247    11.58    0.5961    11.18    0.6992
(c)       18.34    0.5201    17.81    0.4789    20.33    0.5321
(d)       25.13    0.1578    22.76    0.1799    23.59    0.4432

4.2. Analysis on Inverted Codes

As described in Sec.3, our method achieves high-fidelity GAN inversion with N latent codes and N importance factors. Taking PGGAN as an example, if we choose the 6th layer (i.e., with 512 channels) as the composition layer with N = 10, the number of parameters to optimize is 10 × (512 + 512) = 10,240, which is 20 times the dimension of the original latent space. In this section, we perform a detailed analysis of the inverted codes.

Number of Codes. Obviously, there is a trade-off between the dimension of the optimization space and the inversion quality. To better analyze this trade-off, we evaluate our method by varying the number of latent codes to optimize. Fig.4 shows that the more latent codes are used, the better the reconstruction we are able to obtain. However, this does not imply that the performance can be infinitely improved by increasing the number of latent codes. From Fig.4, we can see that after the number reaches 20, there is no significant improvement from involving more latent codes.

Different Composition Layers. The layer at which feature composition is performed also affects the performance of the proposed mGANprior. We thus compose the latent codes on various layers of PGGAN (i.e., from the 1st to the 8th) and compare the inversion quality, as shown in Fig.4. In general, a higher composition layer leads to a better inversion effect. However, as revealed in [4], higher layers contain the information of local pixel patterns such as edges and colors rather than the high-level semantics. Composing features at higher layers makes it hard to reuse the semantic knowledge learned by GANs. This will be discussed more in Sec.4.4.

[Figure 4 plot. Vertical axis: Correlation (0.65 to 1). Horizontal axis: Layer 1 to Layer 8. Curves: 2, 5, 10, 20, and 30 latent codes.]

Figure 4: Effects on inversion performance by the number of latent codes used and the feature composition position.

Role of Each Latent Code. We employ multiple latent codes, expecting each of them to take charge of inverting a particular region so that they complement each other. In this part, we visualize the roles that different latent codes play in the inversion process. As pointed out by [4], for a particular layer in a GAN model, different units (channels) control different semantic concepts. Recall that mGANprior uses adaptive channel importance to help determine what kind of semantics a particular $z$ should focus on. Therefore, for each $z_n$, we set the elements in $\alpha_n$ that are larger than 0.2 to 0, getting $\alpha'_n$. Then we compute the difference map between the reconstructions using $\alpha_n$ and $\alpha'_n$. With the help of a segmentation model [51], we can also get the segmentation maps for various visual concepts, such as tower and tree. We finally annotate each latent code based on the Intersection-over-Union (IoU) metric between the corresponding difference map and all candidate segmentation maps. Fig.5 shows the segmentation result and the IoU maps of some chosen latent codes. It turns out that the latent codes are specialized to invert different meaningful image regions to compose the whole image. This is also a huge advantage of using multiple latent codes over using a single code.

[Figure 5 image grid. Top row: Target Image, Inversion, Segmentation. Bottom row: z #1: Tower (IoU=0.21), z #7: Tree (IoU=0.21), z #9: Building (IoU=0.40), z #14: Road (IoU=0.33), z #17: Tree (IoU=0.22).]

Figure 5: Visualization of the role of each latent code. On the top row are the target image, inversion result, and the corresponding segmentation mask, respectively. On the bottom row are several latent codes annotated with a specific semantic label.

4.3. Image Processing Applications

With the high-fidelity image reconstruction, our multi-code inversion method facilitates many image processing tasks with pre-trained GANs as a prior. In this section, we apply the proposed mGANprior to a variety of real-world applications to demonstrate its effectiveness, including image colorization, image super-resolution, image inpainting and denoising, as well as semantic manipulation and style mixing. For each application, the GAN model is fixed.

Image Colorization. Given a grayscale image as input, we can colorize it with mGANprior as described in Sec.3.2. We compare our inversion method with optimizing the intermediate feature maps [3]. We also compare with DIP [39], which uses a discriminative model as prior, and Zhang et al. [48], which is specially designed for the colorization task. We do experiments on PGGAN models trained for bedroom and church synthesis, and use the area under the curve of the cumulative error distribution over the ab color space as the evaluation metric, following [48]. Tab.2 and Fig.6 show the quantitative and qualitative comparisons respectively. It turns out that using the discriminative model as prior fails to colorize the image adequately. That is because discriminative models focus on learning high-level representations, which are not suitable for low-level tasks. On the contrary, using the generative model as prior leads to much more satisfying colorful images. We also achieve results comparable to the model whose primary goal is image colorization (Fig.6 (c) and (d)). This benefits from the rich knowledge learned by GANs. Note that Zhang et al. [48] is proposed for general image colorization, while our approach can only be applied to a certain image category corresponding to the given GAN model. A larger GAN model trained on a more diverse dataset should improve its generalization ability.

Table 2: Quantitative evaluation results on the colorization task with bedroom and church images. AuC refers to the area under the curve of the cumulative error distribution over the ab color space [48]. ↑ means a higher score is better.

Method                              Bedroom AuC (%)↑    Church AuC (%)↑
Grayscale input                     88.02               85.50
(a) Optimizing feature maps [3]     85.41               86.10
(b) DIP [39]                        84.33               83.31
(c) Zhang et al. [48]               88.55               89.13
(d) Ours                            90.02               89.43

[Figure 6 image grid. Columns: Grayscale Image, (a) Optimizing Feature Maps, (b) DIP, (c) Zhang et al., (d) Ours, Ground Truth.]

Figure 6: Qualitative comparison of different colorization methods, including (a) inversion by optimizing feature maps [3], (b) DIP [39], (c) Zhang et al. [48], and (d) our mGANprior.

Image Super-Resolution. We also evaluate our approach on the image super-resolution (SR) task. We do experiments on the PGGAN model trained for face synthesis and set the SR factor to 16. Such a large factor is very challenging for the SR task. We compare with DIP [39] as well as the state-of-the-art SR methods, RCAN [50] and ESRGAN [42]. Besides PSNR and LPIPS, we introduce the Naturalness Image Quality Evaluator (NIQE) [33] as an extra metric. Tab.3 shows the quantitative comparison. We can conclude that our approach achieves comparable or even better performance than the advanced learning-based competitors. A visualization example is also shown in Fig.7, where our method reconstructs the human eye with more details. Compared to existing learning-based models, like RCAN and ESRGAN, our mGANprior is more flexible with respect to the SR factor. This suggests that the freely-trained PGGAN model has spontaneously learned rich knowledge such that it can be used as a prior to enhance a low-resolution (LR) image.

Table 3: Quantitative comparison of different super-resolution methods with SR factor 16. Competitors include DIP [39], RCAN [50], and ESRGAN [42]. ↑ means the higher the better while ↓ means the lower the better.

Method              PSNR↑    LPIPS↓    NIQE↓
(a) DIP [39]        26.87    0.4236    4.66
(b) RCAN [50]       28.82    0.4579    5.70
(c) ESRGAN [42]     25.26    0.3862    3.27
(d) Ours            26.93    0.3584    3.19

[Figure 7 image grid. Panels: LR Image, (a) DIP, (b) RCAN, (c) ESRGAN, (d) Ours, Ground Truth.]

Figure 7: Qualitative comparison of different super-resolution methods with SR factor 16. Competitors include DIP [39], RCAN [50], and ESRGAN [42].

[Figure 8 image grid. Columns: Corrupted Image, (a) Single Latent Code, (b) Optimizing Feature Maps, (c) DIP, (d) Ours, Ground Truth.]

Figure 8: Qualitative comparison of different inpainting methods, including (a) inversion by optimizing a single latent code [30, 32], (b) inversion by optimizing feature maps [3], (c) DIP [39], and (d) our mGANprior.

[Figure 9 image grid. Four tuples, each showing Target Image, Inversion, and two manipulation results: Expression (Neutral, Laugh), Gender (Female, Male), Age (Young, Old), Pose (Left, Right).]

Figure 9: Real face manipulation with respect to four attributes. In each four-element tuple, from left to right are: input face, inversion result, and manipulation results by making a particular semantic more negative and more positive.

Table 4: Quantitative comparison of different inpainting methods. We test with both centrally cropping a 64 × 64 box and randomly cropping 80% of the pixels. ↑ means a higher score is better.

                                    Center Crop         Random Crop
Method                              PSNR↑    SSIM↑      PSNR↑    SSIM↑
(a) Single latent code [30, 32]     10.37    0.1672     12.79    0.1783
(b) Optimizing feature maps [3]     14.75    0.4563     18.72    0.2793
(c) DIP [39]                        17.92    0.4327     18.02    0.2823
(d) Ours                            21.43    0.5320     22.11    0.5532

Image Inpainting and Denoising. We further extend our approach to image restoration tasks, like image inpainting and image denoising. We first corrupt the image contents by randomly cropping or adding noise, and then use different algorithms to restore them. Experiments are conducted on PGGAN models and we compare with several baseline inversion methods as well as DIP [39]. PSNR and Structural SIMilarity (SSIM) [43] are used as evaluation metrics.
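The corruption and evaluation steps just described can be sketched in NumPy. This is a minimal illustration, not the experimental code: the helper names are hypothetical, the 128 × 128 image size is an assumption (the text only gives the 64 × 64 box and the 80% ratio), only the pixel term of the masked objective in Eq.(8) is shown, and PSNR uses its standard definition with a peak value of 1.

```python
import numpy as np


def center_mask(height, width, box):
    """Binary mask of known pixels: 1 = known, 0 = removed central box."""
    mask = np.ones((height, width))
    top, left = (height - box) // 2, (width - box) // 2
    mask[top:top + box, left:left + box] = 0.0
    return mask


def random_mask(height, width, drop_ratio, rng):
    """Binary mask that removes roughly `drop_ratio` of the pixels at random."""
    return (rng.random((height, width)) >= drop_ratio).astype(float)


def masked_sq_error(x_inv, target, mask):
    """Pixel term of Eq.(8): squared error restricted to the known pixels."""
    return np.sum(((x_inv - target) * mask) ** 2)


def psnr(reference, estimate, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB for images with values in [0, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)


rng = np.random.default_rng(0)
image = rng.random((128, 128))            # stand-in for an intact image
m_center = center_mask(128, 128, 64)      # central 64 x 64 box removed
m_random = random_mask(128, 128, 0.8, rng)  # about 80% of pixels removed
corrupted = image * m_center
```

Because the mask zeroes out the hole, any candidate that agrees with the corrupted image on the known pixels has zero masked error, so the content of the hole is decided entirely by the generator.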

CelebA-HQ
PGGAN

Target Image Grayscale Image


PGGAN
Church
Conference Room
PGGAN

Corrupted Image Layer 2 Layer 4 Layer 8 Ground Truth


Figure 11: Colorization and inpainting results with mGANprior
using different composition layers. AuC (the higher the better) for
colorization task are 86.83%, 87.44%, 90.02% with respect to the
Bedroom

2nd, 4th, and 8th layer respectively. PSNR (the higher the better)
PGGAN

for inpainting task are 21.19db, 22.11db, 20.70db with respect to


the 2nd, 4th, and 8th layer respectively. Images in green boxes
Single Latent Code Ours (Layer 2) Ours (Layer 4) Ours (Layer 8) indicate the best results.
Figure 10: Comparison of the inversion results using different
with PGGAN CelebA-HQ model looks like a face instead
GAN models as well as performing feature composition at dif-
ferent layers. Each row stands for a PGGAN model trained on
a specific dataset as prior, while each column shows results by
composing feature maps at a certain layer.

Tab.4 shows the quantitative comparison, where our approach
achieves the best performance in both the center-crop and
random-crop settings. Fig.8 includes some examples of restoring
corrupted images. It is clear that both existing inversion
methods and DIP fail to adequately fill in the missing pixels or
completely remove the added noise. By contrast, our method is
able to use well-trained GANs as a prior to convincingly repair
the corrupted images with meaningful filled content.
Semantic Manipulation. Besides the aforementioned low-level
applications, we also test our approach on some high-level tasks,
such as semantic manipulation and style mixing. As pointed out by
prior work [21, 15, 35], GANs already encode some interpretable
semantics inside the latent space. From this point of view, our
inversion method provides a feasible way to utilize these learned
semantics for real image manipulation. We apply the latent-code
manipulation framework proposed in [35] to achieve semantic
facial attribute editing. Fig.9 shows the manipulation results.
We see that mGANprior can provide rich enough information for
semantic manipulation.

4.4. Knowledge Representation in GANs

As discussed above, the major limitation of using a single latent
code is its limited expressiveness, especially when the test
image presents a domain gap to the training data. Here we verify
whether using multiple codes can help alleviate this problem. In
particular, we try to use GAN models trained for synthesizing
face, church, conference room, and bedroom, to invert a bedroom
image. As shown in Fig.10, when using a single latent code, the
reconstructed image still lies in the original training domain
(e.g., the inversion with the PGGAN CelebA-HQ model looks like a
face instead of a bedroom). On the contrary, our approach is able
to compose a bedroom image no matter what data the GAN generator
is trained with.
We further analyze the layer-wise knowledge of a well-trained GAN
model by performing feature composition at different layers.
Fig.10 suggests that the higher the layer used, the better the
reconstruction will be. That is because reconstruction focuses on
recovering low-level pixel values, and GANs tend to represent
abstract semantics at bottom layers while representing content
details at top layers. We also observe that the 4th layer is good
enough for the bedroom model to invert a bedroom image, but the
other three models need the 8th layer for satisfactory inversion.
The reason is that bedrooms have different semantics from faces,
churches, and conference rooms, so the high-level knowledge
(contained in the bottom layers) of these models cannot be
reused. We further make a per-layer analysis by applying our
approach to the image colorization and image inpainting tasks, as
shown in Fig.11. The colorization task gets the best result at
the 8th layer, while the inpainting task gets it at the 4th
layer. That is because colorization is more like a low-level
rendering task, while inpainting requires the GAN prior to fill
in the missing content with meaningful objects. This is
consistent with the analysis from Fig.10, namely that low-level
knowledge from the GAN prior can be reused at higher layers,
while high-level knowledge can be reused at lower layers.

5. Conclusion

We present mGANprior, which employs multiple latent codes to
reconstruct real images with a pre-trained GAN model. It enables
these GAN models to serve as a powerful prior for a variety of
image processing tasks.
Acknowledgement: This work is supported in part by the Early
Career Scheme (ECS) through the Research Grants Council of Hong
Kong under Grant No.24206219 and in part by SenseTime
Collaborative Grant.
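
To make the layer-wise feature composition analyzed in Section 4.4 concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each of N latent codes has already been pushed through the lower part of a generator to give a feature map of shape (C, H, W) at the chosen layer, and that each code has its own vector of C channel importance weights. The function name `compose_features` and the array layout are hypothetical choices made for illustration only.

```python
import numpy as np


def compose_features(features, alphas):
    """Compose N intermediate feature maps into one (illustrative sketch).

    features: array of shape (N, C, H, W), one feature map per latent code.
    alphas:   array of shape (N, C), per-code channel importance weights.
    Returns an array of shape (C, H, W).
    """
    features = np.asarray(features, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    if features.ndim != 4 or alphas.shape != features.shape[:2]:
        raise ValueError("expected features (N, C, H, W) and alphas (N, C)")
    # Broadcast every channel weight over the spatial dimensions,
    # then sum the weighted maps over the N latent codes.
    return (features * alphas[:, :, None, None]).sum(axis=0)


# Toy usage: three codes, four channels, 2x2 spatial resolution.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 2, 2))
uniform = np.full((3, 4), 1.0 / 3.0)
composed = compose_features(feats, uniform)
```

With uniform weights the composition reduces to a plain average of the feature maps. In an inversion setting the weights and the latent codes would instead be optimized against a reconstruction loss, and the composed map would be passed through the remaining generator layers; neither step is shown here.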

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
1
[2] ShahRukh Athar, Evgeny Burnaev, and Victor Lempitsky.
Latent convolutional models. In ICLR, 2019. 2
[3] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff,
Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic
photo manipulation with a generative image prior. In
SIGGRAPH, 2019. 2, 6, 7
[4] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou,
Joshua B. Tenenbaum, William T. Freeman, and Antonio
Torralba. Gan dissection: Visualizing and understanding
generative adversarial networks. In ICLR, 2019. 3, 5
[5] David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles,
Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Invert-
ing layers of a large generator. In ICLR Workshop, 2019. 2,
3, 4, 5
[6] David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles,
Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing
what a gan cannot generate. In ICCV, 2019. 2
[7] David Berthelot, Thomas Schumm, and Luke Metz. Be-
gan: Boundary equilibrium generative adversarial networks.
arXiv preprint arXiv:1703.10717, 2017. 1
[8] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large
scale gan training for high fidelity natural image synthesis.
In ICLR, 2019. 1
[9] Jingwen Chen, Jiawei Chen, Hongyang Chao, and Ming
Yang. Image blind denoising with generative adversarial
network based noise modeling. In CVPR, 2018. 2
[10] Xinyuan Chen, Chang Xu, Xiaokang Yang, Li Song, and
Dacheng Tao. Gated-gan: Adversarial gated networks for
multi-collection style transfer. TIP, 2018. 2
[11] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha,
Sunghun Kim, and Jaegul Choo. Stargan: Unified genera-
tive adversarial networks for multi-domain image-to-image
translation. In CVPR, 2018. 1
[12] Antonia Creswell and Anil Anthony Bharath. Inverting the
generator of a generative adversarial network. TNNLS, 2018.
2
[13] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Ad-
versarial feature learning. In ICLR, 2017. 2
[14] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier
Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron
Courville. Adversarially learned inference. In ICLR, 2017.
2
[15] Lore Goetschalckx, Alex Andonian, Aude Oliva, and Phillip
Isola. Ganalyze: Toward visual definitions of cognitive
image properties. In ICCV, 2019. 1, 8
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In NeurIPS,
2014. 1
[17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent
Dumoulin, and Aaron C Courville. Improved training of
wasserstein gans. In NeurIPS, 2017. 1
[18] Paul Hand and Vladislav Voroninski. Global guarantees for
enforcing deep generative priors by empirical risk. IEEE
Transactions on Information Theory, 2019. 2
[19] Guang-Yuan Hao, Hong-Xing Yu, and Wei-Shi Zheng. Mix-
gan: learning concepts from different domains for mixture
generation. In IJCAI, 2018. 2
[20] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A
Efros. Image-to-image translation with conditional adver-
sarial networks. In CVPR, 2017. 2
[21] Ali Jahanian, Lucy Chai, and Phillip Isola. On
the "steerability" of generative adversarial networks. In
ICLR, 2020. 1, 8
[22] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual
losses for real-time style transfer and super-resolution. In
ECCV, 2016. 4
[23] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of gans for improved quality, stability,
and variation. In ICLR, 2018. 1, 4, 5
[24] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks. In
CVPR, 2019. 1, 4
[25] Dong-Wook Kim, Jae Ryun Chung, and Seung-Won Jung.
Grdn: Grouped residual dense network for real image
denoising and gan-based real-world noise modeling. In
CVPR Workshop, 2019. 2
[26] Durk P Kingma and Prafulla Dhariwal. Glow: Generative
flow with invertible 1x1 convolutions. In NeurIPS, 2018. 2
[27] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, An-
toine Bordes, Ludovic Denoyer, and Marc’Aurelio Ranzato.
Fader networks: Manipulating images by sliding attributes.
In NeurIPS, 2017. 1
[28] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero,
Andrew Cunningham, Alejandro Acosta, Andrew Aitken,
Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-
realistic single image super-resolution using a generative
adversarial network. In CVPR, 2017. 1, 2
[29] Xiaodan Liang, Hao Zhang, Liang Lin, and Eric Xing.
Generative semantic manipulation with mask-contrasting
gan. In ECCV, 2018. 2
[30] Zachary C Lipton and Subarna Tripathi. Precise recovery of
latent vectors from generative adversarial networks. In ICLR
Workshop, 2017. 2, 7
[31] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo
Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsuper-
vised image-to-image translation. In ICCV, 2019. 1
[32] Fangchang Ma, Ulas Ayaz, and Sertac Karaman. Invertibility
of convolutional generative networks from partial measure-
ments. In NeurIPS, 2018. 2, 4, 5, 7
[33] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik.
Making a completely blind image quality analyzer. IEEE
Signal Processing Letters, 2012. 6
[34] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu,
and Jose M Álvarez. Invertible conditional gans for image
editing. In NeurIPS Workshop, 2016. 2
[35] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Inter-
preting the latent space of gans for semantic face editing. In
CVPR, 2020. 1, 8

[36] Yujun Shen, Ping Luo, Junjie Yan, Xiaogang Wang, and
Xiaoou Tang. Faceid-gan: Learning a symmetry three-player
gan for identity-preserving face synthesis. In CVPR, 2018. 1
[37] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. In ICLR,
2015. 4
[38] Patricia L Suárez, Angel D Sappa, and Boris X Vintimilla.
Infrared image colorization based on a triplet dcgan archi-
tecture. In CVPR Workshop, 2017. 2
[39] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky.
Deep image prior. In CVPR, 2018. 2, 6, 7
[40] Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless,
Noah Snavely, Kavita Bala, and Kilian Weinberger. Deep
feature interpolation for image content changes. In CVPR,
2017. 2
[41] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,
Jan Kautz, and Bryan Catanzaro. High-resolution image
synthesis and semantic manipulation with conditional gans.
In CVPR, 2018. 2
[42] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu,
Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan:
Enhanced super-resolution generative adversarial networks.
In ECCV Workshop, 2018. 1, 2, 6, 7
[43] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P
Simoncelli, et al. Image quality assessment: from error
visibility to structural similarity. TIP, 2004. 7
[44] Ceyuan Yang, Yujun Shen, and Bolei Zhou. Semantic
hierarchy emerges in deep generative representations for
scene synthesis. arXiv preprint arXiv:1911.09267, 2019. 1
[45] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G
Schwing, Mark Hasegawa-Johnson, and Minh N Do. Seman-
tic image inpainting with deep generative models. In CVPR,
2017. 2
[46] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas
Funkhouser, and Jianxiong Xiao. Lsun: Construction of a
large-scale image dataset using deep learning with humans
in the loop. arXiv preprint arXiv:1506.03365, 2015. 4
[47] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu,
and Thomas S Huang. Generative image inpainting with
contextual attention. In CVPR, 2018. 2
[48] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
image colorization. In ECCV, 2016. 6
[49] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman,
and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR, 2018. 5
[50] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng
Zhong, and Yun Fu. Image super-resolution using very deep
residual channel attention networks. In ECCV, 2018. 6, 7
[51] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela
Barriuso, and Antonio Torralba. Scene parsing through
ade20k dataset. In CVPR, 2017. 5
[52] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and
Alexei A Efros. Generative visual manipulation on the
natural image manifold. In ECCV, 2016. 2, 4, 5
[53] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In ICCV, 2017. 1
