Progressive Face Super-Resolution Via Attention To Facial Landmark
1 Introduction
Face Super-Resolution (SR) is a domain-specific SR task which aims to reconstruct High Resolution (HR) face images from Low Resolution (LR) face images while restoring facial details. When LR face images are enlarged to high resolution, the resulting HR images suffer from face distortion: the finer details of faces disappear, leading to misperception of facial attributes. In an attempt to address this problem, previous studies [13, 26]
embedded additional facial attribute vectors into network feature maps to reflect facial at-
tributes in super-resolved face images. These approaches require prior information for face
SR; however, the additional information is difficult to obtain in the wild. Other studies incor-
porate facial landmark information by employing auxiliary networks such as face alignment
network [1, 25], and prior estimation network [5]. However, these approaches tend to con-
centrate on the localization of facial landmarks without sufficient consideration of the facial
attributes in the areas around landmarks.
Different from previous works, we propose a face SR method which restores the original facial details more precisely by imposing strong constraints on the landmark areas. To stably generate photo-realistic 8× upscaled images, we adopt a progressive training method [2, 11, 19, 20] which grows both the generator and the discriminator progressively. We also introduce a novel facial attention loss which makes our SR network restore accurate facial details. The attention loss is applied in both the intermediate and the last step of our progressive training.
By constraining the outputs with the attention loss at each step, the output images of each step reflect more accurate facial details. To compute the attention loss, we extract heatmaps from a pre-trained Face Alignment Network (FAN). The extracted heatmaps are used as weights on the pixel differences in the areas adjacent to the landmarks. Instead of using the state-of-the-art FAN [4], we suggest a compressed version of FAN, called the distilled FAN, which is trained by a hint-based method [17]. The distilled FAN delivers comparable performance to the original FAN while being much more compact. With this approach, we obtain SR-oriented landmark heatmaps and significantly reduce the overall training time. As a result, our method generates super-resolved face images which successfully reflect accurate details of facial components.
For the evaluation, we measure the performance of our method on both aligned and un-
aligned face images from CelebA [14] and AFLW [10] datasets. To compare the quality
of our results, we calculate the conventional measurements of the average Peak Signal to
Noise Ratio (PSNR), Structural SIMilarity (SSIM) [22], and Multi-Scale Structural SIMi-
larity (MS-SSIM) [21]. By conducting an ablation study, we verify that the proposed loss
function is beneficial to super-resolving LR face images; we demonstrate the superiority of
our method by comparing the results with those of previous studies. We further conduct a Mean-Opinion-Score (MOS) [12] test to measure perceptual quality. The experimental
results show that our network successfully generates high-fidelity face images, accurately
preserving the original features around the facial landmarks. In summary, our contributions
are as follows:
1. To the best of our knowledge, while progressive training has been used for natural image SR, ours is the first method that leverages progressive training for face SR. We constrain each step of the SR network and generate high-quality face images reflecting the details of facial components.
2. The facial attention loss makes the SR network learn to restore facial details by focusing on the areas adjacent to facial landmarks, which is verified by our super-resolved results.
3. We compress the state-of-the-art FAN into a smaller network using a hint-based method (see the sketch below). With the distilled FAN, we extract meaningful landmark heatmaps which are more suitable for the face SR task, and we reduce the overall training time.
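A minimal sketch of the hint-based objective [17], assuming hypothetical teacher/student feature tensors and an illustrative 1×1 regressor that matches the student's channel width to the teacher's hint layer:

```python
import torch.nn as nn
import torch.nn.functional as F

def hint_loss(student_feat, teacher_feat, regressor: nn.Module):
    """FitNets-style hint loss: the student's intermediate features are
    mapped by a small regressor (e.g., a 1x1 convolution) to the teacher's
    feature size, then matched with an L2 penalty."""
    return F.mse_loss(regressor(student_feat), teacher_feat)

# Example regressor matching a 128-channel student layer to a 256-channel
# teacher hint layer (channel widths here are illustrative, not the paper's).
regressor = nn.Conv2d(128, 256, kernel_size=1)
```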
2 Related work
Utilizing facial information, such as facial attributes and spatial configuration of facial com-
ponents, is the key factor in face SR. Yu et al. [26] interweave multiple spatial transformer
networks to satisfy the requirement of face alignment and embed facial attribute vectors to lower the ambiguity in facial attributes. Lee et al. [13] fuse the information of both
image domain and attribute domain in order to reflect facial attributes in super-resolved im-
ages. These methods preserve facial attributes indicated by facial attribute vectors. However,
attribute vectors are not only difficult to acquire in the wild but also limited to describing par-
tial facial attributes.
While preserving facial landmark locations, Chen et al. [5] propose a face shape prior estimation network, which provides accurate facial geometry estimated from coarse HR face images. Yu et al. [25] estimate the spatial positions of facial
components to preserve the original spatial structure in upscaled face images. In addition,
Bulat et al. [1] propose a heatmap loss to localize landmarks in super-resolved images so as
to upscale input face images 4×. Although these methods preserve the spatial configuration
of facial components, they fail to fully reflect accurate facial attributes. In contrast to previous works, our method carefully considers the facial attributes around the landmarks, restoring facial details without prior information while preserving landmark locations.
3 Approach
In this section, we describe our method for enhanced face SR. To generate high-fidelity super-resolved face images that reflect the facial attributes of target face images,
three main approaches are used: progressive training, facial attention loss, and distillation of
Face Alignment Network (FAN).
Figure 1: Our network architecture overview. (k: kernel size, n: output channel, s: stride)
Our generator network is composed of convolution layers, transposed convolution layers (ConvTrans), and Rectified Linear Units (ReLU) as activation functions.
The discriminator network has a corresponding architecture to the generator network, which
is comprised of convolution layers (Conv), average pooling layers (AvgPool), and Leaky
ReLU. To improve the discriminator performance, we calculate the standard deviation of the input batch, replicate the value into an additional constant feature map, and concatenate it at the end of the discriminator [19]. We use additional convolution layers in each step to convert the intermediate feature maps into RGB images, and vice versa.
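The minibatch standard-deviation trick can be sketched in PyTorch as follows; this is a plausible rendering of the layer described above, not necessarily our exact implementation:

```python
import torch
import torch.nn as nn

class MinibatchStdDev(nn.Module):
    """Appends one constant feature map holding the average standard
    deviation of the input batch, as in progressive GANs [19]."""
    def forward(self, x):                       # x: (B, C, H, W)
        std = x.std(dim=0, unbiased=False)      # (C, H, W): per-location std over the batch
        mean_std = std.mean()                   # single scalar statistic
        stat = mean_std.expand(x.size(0), 1, x.size(2), x.size(3))
        return torch.cat([x, stat], dim=1)      # (B, C+1, H, W)
```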
In Step 1, each network employs one block for training and learns to upscale images
2×. These 2× upscaled outputs from the generator go through the corresponding part of
the discriminator, and the outputs are then compared with target images. In Step 2, the 2×
upscaled outputs from Step 1 are upscaled 2× again by nearest-neighbor interpolation, and
then the interpolated outputs are added to 4× upscaled outputs from Step 2. This process
is expressed as $(1-\alpha)\, f(G_{N-1}(I)) + \alpha\, G_N(I)$, where $G$ is our SR network, $f$ is nearest-neighbor (NN) interpolation, and $I$ and $N \in \{2, 3\}$ denote the input images and the step number, respectively. The weight $\alpha$ increases linearly from zero to one. The upscaled outputs are compared with the corresponding target images through the discriminator. The same procedure is applied in Step 3 (8×). This method allows the network to learn to super-resolve face images effectively and stably, with a different loss at each step.
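A minimal PyTorch sketch of this fade-in, assuming prev_rgb is the previous step's RGB output and curr_rgb is the current step's output:

```python
import torch
import torch.nn.functional as F

def blended_output(prev_rgb: torch.Tensor, curr_rgb: torch.Tensor, alpha: float):
    """Fade-in between steps: (1 - alpha) * f(G_{N-1}(I)) + alpha * G_N(I),
    where f is nearest-neighbor 2x upsampling and alpha ramps from 0 to 1."""
    up = F.interpolate(prev_rgb, scale_factor=2, mode='nearest')  # f(G_{N-1}(I))
    return (1.0 - alpha) * up + alpha * curr_rgb                  # blended output
```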
Facial attention loss The facial attention loss weights the pixel difference between the super-resolved and target images by the landmark heatmaps:

$$\mathcal{L}_{attention} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} M^{*}_{x,y} \left( I^{HR}_{x,y} - G(I^{LR})_{x,y} \right)^2 \quad (2)$$

where $G$ is the face SR network, and $I^{HR}$ and $I^{LR}$ are the target face images and the input LR images, respectively. The landmark attention heatmap $M^{*}$ holds the channel-wise maximum values of the target heatmap $M$ generated from the target face images. To compensate for the variance between landmarks, the heatmap $M$ is min-max normalized into [0, 1]. The heatmap $M$ has dimension $N \times rW \times rH$, where $N$ is the number of landmarks, and $W$ and $H$ are the width and height of the input image. The upscale factor $r$ is set to 4 and 8 in Step 2 and Step 3, respectively. To apply attention to images with enough information, we apply the facial attention loss to the upscaled outputs of size 64×64 and 128×128.
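The weighting described above can be sketched as follows; the mean reduction and tensor shapes are our assumptions:

```python
import torch

def facial_attention_loss(sr, hr, heatmaps):
    """Heatmap-weighted pixel loss. sr, hr: (B, 3, rH, rW) images;
    heatmaps: (B, N, rH, rW) target landmark heatmaps M."""
    # Min-max normalize each landmark map into [0, 1] to compensate
    # for the variance between landmarks.
    flat = heatmaps.flatten(2)                               # (B, N, rH*rW)
    mn = flat.min(dim=2, keepdim=True).values
    mx = flat.max(dim=2, keepdim=True).values
    norm = ((flat - mn) / (mx - mn + 1e-8)).view_as(heatmaps)
    # Channel-wise max over the N landmark maps gives M*.
    m_star = norm.max(dim=1, keepdim=True).values            # (B, 1, rH, rW)
    # Weight the squared pixel difference by M* (broadcast over RGB channels).
    return (m_star * (sr - hr) ** 2).mean()
```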
Perceptual loss A perceptual loss [7] is employed to prevent blurry and unrealistic face images and to obtain more realistic HR images. The loss over the pre-trained VGG16 [18] features at a given layer $i$ is defined as:
$$\mathcal{L}_{feat/i} = \frac{1}{W_i H_i} \sum_{x=1}^{W_i} \sum_{y=1}^{H_i} \left( \phi_i(I^{HR})_{x,y} - \phi_i(G(I^{LR}))_{x,y} \right)^2 \quad (3)$$

where $\phi_i$ denotes the feature map obtained after the last convolutional layer of the $i$-th block.
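A sketch of Eq. (3) using torchvision's pre-trained VGG16; the cut-off index for the i-th block is an assumption:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class VGGFeatureLoss(torch.nn.Module):
    """Perceptual loss over pre-trained VGG16 features (Eq. (3))."""
    def __init__(self, cutoff=16):
        super().__init__()
        # features[:16] ends after the last conv/ReLU of the third block in
        # torchvision's layout; the exact cut-off is an assumption.
        self.phi = vgg16(pretrained=True).features[:cutoff].eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)

    def forward(self, sr, hr):
        # Mean squared distance between feature maps of SR and HR images.
        return F.mse_loss(self.phi(sr), self.phi(hr))
```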
Adversarial loss We use the WGAN loss [3] to stabilize the training process. In WGAN, the loss function is defined as the Wasserstein distance between the distribution of target images $I^{HR} \sim P_r$ and that of the generated images $\tilde{I} \sim P_g$. To further improve training stability, we apply the gradient penalty term of WGAN-GP [23], which enforces the 1-Lipschitz condition on the discriminator. With $\hat{I}$ denoting an image sampled along straight lines between pairs of samples from $P_r$ and $P_g$, the loss function is as follows:

$$\mathcal{L}_{WGAN} = \mathbb{E}_{\tilde{I} \sim P_g}\!\left[ D(\tilde{I}) \right] - \mathbb{E}_{I^{HR} \sim P_r}\!\left[ D(I^{HR}) \right] + \lambda\, \mathbb{E}_{\hat{I} \sim P_{\hat{I}}}\!\left[ \left( \lVert \nabla_{\hat{I}} D(\hat{I}) \rVert_2 - 1 \right)^2 \right] \quad (4)$$
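The gradient penalty can be sketched with a standard WGAN-GP-style implementation:

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP term: penalize deviation of the critic's gradient norm from 1
    at points interpolated between real and generated images.
    lam = 10 is the conventional choice; the value used here is not stated."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                                create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```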
Heatmap loss As proposed in [1], the heatmap loss improves the structural consistency of face images by minimizing the distance between the heatmaps of the generated images and those of the target images. The heatmap loss is defined as:

$$\mathcal{L}_{heatmap} = \frac{1}{r^2 N W H} \sum_{n=1}^{N} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( M^{n}_{x,y} - \tilde{M}^{n}_{x,y} \right)^2 \quad (5)$$

where $N$ is the number of landmarks, and $M$ and $\tilde{M}$ are computed as $M = F_d(I^{HR})$ and $\tilde{M} = F_d(G(I^{LR}))$, with $F_d$ denoting the distilled FAN.
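Eq. (5) amounts to a mean-squared error between distilled-FAN heatmaps, as sketched below; fan_d stands for the distilled FAN and its output shape is assumed:

```python
import torch
import torch.nn.functional as F

def heatmap_loss(fan_d, sr, hr):
    """Eq. (5) as a mean-squared error between the distilled-FAN heatmaps
    of generated and target images; fan_d is assumed to map an image to
    (B, N, rW, rH) landmark heatmaps."""
    with torch.no_grad():
        m = fan_d(hr)          # target heatmaps M = F_d(I^HR)
    m_tilde = fan_d(sr)        # generated heatmaps M~ = F_d(G(I^LR))
    return F.mse_loss(m_tilde, m)
```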
Figure 4: Our image results (a) with the original FAN, (b) with the distilled FAN.
Overall training loss Since the landmark-based losses are applied only in Steps 2 and 3, we do not include $\mathcal{L}_{heatmap}$ and $\mathcal{L}_{attention}$ in Step 1. The final objective is a weighted sum of the pixel, perceptual, adversarial, heatmap, and facial attention losses.
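A sketch of how the losses might be combined per step; the lambda weights below are placeholders, since the exact coefficients are not given here:

```python
def total_loss(l_pixel, l_feat, l_wgan, l_heatmap, l_attention, step,
               lam_feat=1.0, lam_adv=1e-3, lam_hm=1.0, lam_att=1.0):
    """Per-step objective; the lambda values are placeholders, not the
    paper's coefficients."""
    loss = l_pixel + lam_feat * l_feat + lam_adv * l_wgan
    if step >= 2:                       # landmark losses only in Steps 2 and 3
        loss = loss + lam_hm * l_heatmap + lam_att * l_attention
    return loss
```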
4 Experiments
4.1 Implementation details
Datasets In our experiments, we use two kinds of datasets: an aligned one and an unaligned one. The aligned CelebA dataset [14] is used to test how accurately the facial details can be restored. The unaligned CelebA and AFLW [10] datasets are used to verify the applicability of our face SR network in the real world. The aligned face images are cropped into squares.
The face images of the unaligned dataset are cropped based on the bounding box areas. The
cropped images are resized into 128×128 pixels to be used as targets of Step 3, and bilinearly
downsampled into 64×64 pixels as targets of Step 2, 32×32 pixels as targets of Step 1, and
16×16 pixels as LR inputs. We use all 162,770 images as a training set, and 19,867 images
as a test set from aligned CelebA dataset. For the unaligned dataset, we use 80,000 cropped
images from unaligned CelebA, and 20,000 from AFLW for training. As a test set, 5,000
images from CelebA and 4,384 images from AFLW are used.
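The preprocessing above can be sketched as follows; the torchvision calls are a plausible rendering, not our exact pipeline (torchvision's resize defaults to bilinear interpolation):

```python
from PIL import Image
import torchvision.transforms.functional as TF

def make_pyramid(img: Image.Image):
    """Build the target pyramid described above: a 128x128 target for Step 3,
    bilinearly downsampled 64x64 and 32x32 targets, and a 16x16 LR input."""
    hr = TF.resize(img, [128, 128])
    return {
        'step3_target': TF.to_tensor(hr),
        'step2_target': TF.to_tensor(TF.resize(hr, [64, 64])),
        'step1_target': TF.to_tensor(TF.resize(hr, [32, 32])),
        'lr_input':     TF.to_tensor(TF.resize(hr, [16, 16])),
    }
```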
Training details We implement our face SR network in PyTorch [16]. We train our networks using the Adam optimizer [9] with a learning rate of 1 × 10⁻³ and a mini-batch size of 16. The number of training iterations per step is a hyperparameter; we empirically train for 50K, 50K, and 100K iterations in Steps 1, 2, and 3, respectively. Running the full 200K iterations takes ∼1 day on a single Titan X GPU. In addition, we train the distilled FAN using the Adam optimizer with a learning rate of 1 × 10⁻⁴, a mini-batch size of 16, and 100K iterations.
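For reference, a sketch of the optimizer configuration just described; the placeholder module stands in for our SR generator:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the SR generator; the optimizer
# settings follow the text (Adam, lr 1e-3, mini-batch size 16).
generator = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)
iters_per_step = {1: 50_000, 2: 50_000, 3: 100_000}   # Steps 1-3
batch_size = 16
```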
Table 1: Parameters, NME evaluation, PSNR, SSIM, and MS-SSIM comparison results
Figure 5: Ablation study results on aligned and unaligned datasets. (a) L_pixel + L_WGAN (b) L_pixel + L_WGAN + L_feat (c) L_pixel + L_WGAN + L_feat + L_hm (d) L_Ours without progressive training (e) L_Ours.
                                          Aligned                  Unaligned
Method                              PSNR   SSIM   MS-SSIM    PSNR   SSIM   MS-SSIM
L_pixel + L_WGAN                    21.62  0.616  0.873      21.62  0.626  0.863
L_pixel + L_WGAN + L_feat           21.89  0.649  0.887      22.26  0.663  0.884
L_pixel + L_WGAN + L_feat + L_hm    21.95  0.650  0.890      22.53  0.679  0.892
L_Ours (no progressive)             22.24  0.660  0.893      22.83  0.680  0.895
L_Ours                              22.66  0.685  0.902      22.96  0.695  0.897

Table 2: PSNR, SSIM, and MS-SSIM values for the ablation study on aligned and unaligned datasets.
method. The qualitative comparison of outputs is shown in Figure 5(d). There is some degradation of facial details in the super-resolved images from our non-progressively trained network, which is trained by minimizing L_Ours. As shown in Table 2, the measured values also improve with the progressive training method.
                          Aligned                         Unaligned
Method          PSNR   SSIM   MS-SSIM  MOS       PSNR   SSIM   MS-SSIM  MOS
Bilinear        20.75  0.574  0.782    1.72      21.86  0.624  0.795    1.52
URDGN [24]      21.99  0.622  0.875    2.55      23.01  0.643  0.874    2.45
FSRGAN [5]      22.27  0.601  0.841    2.46      20.95  0.515  0.741    2.28
FSRNet [5]      22.62  0.641  0.847    2.34      21.19  0.607  0.760    2.19
VDSR [8]        22.94  0.652  0.880    2.10      23.70  0.682  0.882    1.89
Ours            22.66  0.685  0.902    3.73      22.96  0.695  0.897    3.73

Table 3: PSNR, SSIM, MS-SSIM, and MOS values for the comparison with baseline methods on aligned and unaligned datasets.
are significantly blurred. As we can see, the results of FSRNet and FSRGAN show realistic facial details, but they contain artifacts and partially blurred facial components. URDGN produces relatively clear images but generates distorted faces. The results show that our method outperforms the other methods, especially on SSIM and MS-SSIM, and generates photo-realistic face images while restoring accurate facial attributes. More image results are shown in the Supplementary Materials.
We also conduct a MOS test to quantify image quality as perceived by humans. We asked 26 raters to assign a score from 1 (bad quality) to 5 (excellent quality) to all the super-resolved images and the high-resolution target images. The raters were calibrated on 20 images of Nearest Neighbor (score 1) and HR (score 5) results before the main test. The raters rated 8 versions of each image on the aligned and unaligned datasets: Nearest Neighbor, Bilinear, URDGN, FSRNet, FSRGAN, VDSR, Ours, and HR images (GT). Each rater rated 240 randomly presented images per dataset (480 images in total). More details are given in the Supplementary Materials.
5 Conclusion
We propose a novel face SR method which fully reflects facial details. To achieve this, we adopt a progressive training method to generate photo-realistic face images and to learn the restoration of facial details with different guidance in each step. In addition, we propose a new facial attention loss which gives large weights to facial features in the areas adjacent to landmarks, so that the facial details are well expressed in super-resolved images. However, the original FAN produces landmark heatmaps that include occluded landmark areas, which degrades super-resolution performance. We therefore suggest distilling the face alignment network to produce heatmaps more suitable for SR. Moreover, our distilled face alignment network has a relatively lightweight architecture, so the overall training time is reduced from ∼3 days to ∼1 day. Our experiments demonstrate that the proposed method restores more accurate facial details. In particular, our method produces high-quality face images which are perceptually similar to real images. In summary, the proposed method allows our face SR network to super-resolve face images with more precise facial details.
We give attention to specific areas of faces and propose a method to obtain heatmaps suitable for face SR. If a better method is developed to obtain heatmaps that represent facial landmark areas well, even better performance could be achieved with our approach. Since our approach restores lost information by focusing on specific areas, our mechanism could further be applied to any task that requires restoring lost information through super-resolution, such as super-resolution of medical, satellite, and microscopic images.
6 Acknowledgements
This work was supported by the Engineering Research Center of Excellence (ERC) Program of the National Research Foundation (NRF), Korean Ministry of Science & ICT (MSIT) (Grant No. NRF-2017R1A5A1014708), by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion), and by Hyundai Motor Company through the HYUNDAI-TECHNION-KAIST Consortium.
References
[1] Adrian Bulat and Georgios Tzimiropoulos. Super-fan: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. In CVPR, 2018.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Image super-resolution via progressive
cascading residual network. In CVPR Workshops, 2018.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial net-
works. In ICML, 2017.
[4] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face align-
ment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017.
[5] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. Fsrnet: End-to-end learning
face super-resolution with facial priors. In CVPR, 2018.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR,
2016.
[7] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer
and super-resolution. In ECCV, 2016.
[8] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very
deep convolutional networks. In CVPR, 2016.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[10] Martin Koestinger, Paul Wohlhart, Peter M Roth, and Horst Bischof. Annotated facial land-
marks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV
workshop, 2011.
[11] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid
networks for fast and accurate super-resolution. In CVPR, 2017.
[12] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro
Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-
realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[13] Cheng-Han Lee, Kaipeng Zhang, Hu-Cheng Lee, Chia-Wen Cheng, and Winston Hsu. Attribute
augmented convolutional neural network for face hallucination. In CVPR Workshops, 2018.
[14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the
wild. In ICCV, 2015.
[15] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose esti-
mation. In ECCV, 2016.
[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
pytorch. In NeurIPS workshop, 2017.
[17] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and
Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2015.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
[20] Yifan Wang, Federico Perazzi, Brian McWilliams, Alexander Sorkine-Hornung, Olga Sorkine-
Hornung, and Christopher Schroers. A fully progressive approach to single-image super-
resolution. In CVPR Workshops, 2018.
[21] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multi-scale structural similarity for image
quality assessment. In IEEE Asilomar Conference on Signals, Systems and Computers, 2003.
[22] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
[23] Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. Improving the improved train-
ing of wasserstein gans: A consistency term and its dual effect. In ICLR, 2018.
[24] Xin Yu and Fatih Murat Porikli. Ultra-resolving face images by discriminative generative net-
works. In ECCV, 2016.
[25] Xin Yu, Basura Fernando, Bernard Ghanem, Fatih Porikli, and Richard Hartley. Face super-
resolution guided by facial component heatmaps. In ECCV, 2018.
[26] Xin Yu, Basura Fernando, Richard Hartley, and Fatih Porikli. Super-resolving very low-resolution
face images with supplementary attributes. In CVPR, 2018.
[27] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in
the wild. In CVPR, 2012.