
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 12, DECEMBER 2023

DR-GAN: Distribution Regularization for Text-to-Image Generation

Hongchen Tan, Xiuping Liu, Baocai Yin, Member, IEEE, and Xin Li, Senior Member, IEEE

Manuscript received 30 July 2021; revised 11 December 2021 and 14 February 2022; accepted 1 April 2022. Date of publication 20 April 2022; date of current version 1 December 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0111900, in part by the National Natural Science Foundation of China under Grant 61976040, in part by the National Science Foundation of USA under Grant OIA-1946231 and Grant CBET-2115405, and in part by the Chinese Postdoctoral Science Foundation under Grant 2021M700303. (Corresponding author: Xin Li.)

Hongchen Tan and Baocai Yin are with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing 100124, China (e-mail: [email protected]; [email protected]). Xiuping Liu is with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]). Xin Li is with the School of Electrical Engineering and Computer Science, and the Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70808 USA (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2022.3165573. Digital Object Identifier 10.1109/TNNLS.2022.3165573

Abstract— This article presents a new text-to-image (T2I) generation model, named distribution regularization generative adversarial network (DR-GAN), which generates images from text descriptions through improved distribution learning. In DR-GAN, we introduce two novel modules: a semantic disentangling module (SDM) and a distribution normalization module (DNM). SDM combines a spatial self-attention mechanism (SSAM) and a new semantic disentangling loss (SDL) to help the generator distill key semantic information for image generation. DNM uses a variational auto-encoder (VAE) to normalize and denoise the image latent distribution, which helps the discriminator better distinguish synthesized images from real images. DNM also adopts a distribution adversarial loss (DAL) to guide the generator to align with normalized real image distributions in the latent space. Extensive experiments on two public datasets demonstrate that DR-GAN achieves competitive performance in the T2I task. The code is available at https://github.com/Tan-H-C/DR-GAN-Distribution-Regularization-for-Text-to-Image-Generation.

Index Terms— Distribution normalization, generative adversarial network, semantic disentanglement mechanism, text-to-image (T2I) generation.

I. INTRODUCTION

Generating photographic images from text descriptions [known as text-to-image generation (T2I)] is a challenging cross-modal generation technique and a core component in many computer vision tasks such as image editing [27], [47], story visualization [49], and multimedia retrieval [18]. Compared with image generation [16], [21], [25] and image processing [5], [6], [22] tasks within the same modality, it is difficult to build a heterogeneous semantic bridge between text and image [37], [44], [50]. Many state-of-the-art T2I algorithms [3], [9], [24], [29], [33], [39] first extract text features and then use generative adversarial networks (GANs) [7] to generate the corresponding image. Their essence is to map a text feature distribution to the image distribution. However, two factors prevent GAN-based T2I methods from capturing the real image distribution: 1) the abstractness and ambiguity of text descriptions make it difficult for the generator to capture the key semantic information for image generation [32], [50]; and 2) the diversity of visual information makes the distribution of images so complex that it is difficult for GAN-based T2I models to capture the real image distribution from the text feature distribution [8]. Thus, this work explores better distribution learning strategies to enhance GAN-based T2I models.

In multimodal perceptual information, the semantics of a text description is usually abstract and ambiguous, whereas image information is usually concrete and carries a lot of spatial structure. Text and image information are expressed in different patterns, which makes it difficult to achieve semantic correlation based on feature vectors or tensors. Thus, it is difficult for the generator to accurately capture key semantics from text descriptions for image generation. As a result, in the intermediate stages of generation, the image features contain many non-key semantics. Such inaccurate semantics often lead to ineffective image distribution learning, and the generated images are then often semantically inconsistent, with chaotic structures and details. To alleviate this issue, our first strategy is to design an information disentangling mechanism on the intermediate features, to better distill key information before performing cross-modal distribution learning.

In addition, images often contain diverse visual information, messy backgrounds, and other non-key visual content, so their latent distribution is often complex. Moreover, the distribution of images is difficult to model explicitly [8]. This means that we cannot directly and explicitly learn the target image distribution from the text features. As an outstanding image generation model, GANs [7] learn the target data distribution implicitly by sampling data from the true or fake data distribution. However, such a complex image distribution makes it difficult for the discriminator in GANs to distinguish whether the current input image is sampled from the real image distribution or the generated image distribution. So, our second strategy is to design an effective distribution normalization mechanism to normalize the image latent distribution. The mechanism aims to help the discriminator better learn the distribution decision boundary between generated and real images.

Based on the above two strategies, we built a new T2I generation model, the distribution regularization generative adversarial network (DR-GAN). DR-GAN contains two novel modules: a semantic disentangling module (SDM) and a distribution normalization module (DNM). In SDM, we introduce a spatial self-attention mechanism (SSAM) and propose a new semantic disentangling loss (SDL) to help the generator better distill key information from texts and images while capturing the image distribution. In DNM, we introduce a variational auto-encoder (VAE) [23] into GAN-based T2I methods to normalize image distributions in the latent space. We also propose a distribution adversarial loss (DAL) to align the learned distribution with the real distribution in the normalized latent space. With DNM and SDM, our DR-GAN can generate an image latent distribution that better matches the real image distribution, and thus generate higher-quality images. The main contributions are summarized as follows.

1) We propose an SDM to help the generator distill key information (and filter out the non-key information) from both text and image features.
2) We design a new DNM which introduces the VAE into the GAN-based T2I pipeline so that it can more effectively normalize and denoise image latent distributions.
3) Extensive experimental results and analysis show the efficacy of DR-GAN on two benchmarks, CUB-Bird [43] and the large-scale MS-COCO [26], over four metrics.

II. RELATED WORK

A. GANs in Text-to-Image Generation

With the recent successes of GANs [7], a large number of GAN-based T2I methods [3], [9], [11], [24], [29], [33], [34], [36], [39], [46], [51] have boosted the performance of the T2I task. Reed et al. [34] first introduced the adversarial process to generate images from text descriptions. However, their method can only generate images with 64 × 64 resolution, and the quality of the images is not good. Then, StackGAN/StackGAN++ [11], [51] and HDGAN [52] adopted multistage generation patterns to progressively enhance the quality and detail of synthesized images. Thereafter, the multistage generation framework has been widely used in GAN-based T2I methods. Based on this framework, AttnGAN [46], DMGAN [29], and CPGAN [17] adopted word-level or object-level attention mechanisms to help the generator enhance the semantics of local regions or objects. MirrorGAN [33] combined T2I generation and image-to-text generation to improve the global semantic consistency between the text description and the generated image. SDGAN [9] and SEGAN [36] combined the Siamese network and the contrastive loss to enhance the semantics of the synthesized image. Like these T2I methods, we also adopt the multistage generation framework to build our DR-GAN. But different from them, we help the generator to better distill key information for distribution learning.

B. GANs in Distribution Learning

GANs [7], as a latent distribution learning strategy, have been widely adopted in various generation tasks [1], [16], [21], [34], [47], [49]. However, GANs tend to suffer from unstable training, mode collapse, and uncontrollable behavior, which hinder them from effectively modeling the real data distribution. Recently, many approaches [4], [11], [28], [30], [31], [51] have been explored to overcome these issues. LSGANs [28] overcome the vanishing gradient problem by introducing a least-squares loss to replace the entropy loss. WGAN [28] introduced the Earth mover (EM) distance to improve the stability of distribution learning and provide meaningful learning curves useful for hyperparameter search and debugging. MDGAN [4] improved the stability of distribution learning by utilizing an encoder E(x) : x → z to produce the latent variable z for the generator G. SN-GANs [30] proposed a novel weight normalization method named spectral normalization to better stabilize the training of the discriminator. F-GAN [31] adopted the Kullback–Leibler (KL) divergence to help align the generated image distribution with the real data distribution. In the T2I task, many GAN-based methods introduce various strategies such as multistage generation patterns [11], [51], attention mechanisms [29], [46], and cycle-consistency mechanisms [33] to help match the synthesized distribution with the real image distribution. However, diverse visual information, messy backgrounds, and other non-key visual content in images usually make the image distribution complicated, which makes distribution learning more difficult. Thus, our idea is to explore an effective distribution normalization strategy to overcome this challenge.

III. DR-GAN FOR TEXT-TO-IMAGE GENERATION

A. Overview

Most recent GAN-based T2I methods [9], [24], [29], [33], [36], [40], [46] adopt a multistage generation framework to progressively map the text embedding distribution to the image distribution and synthesize high-quality images. Like all these methods, we also adopt such a generation pattern from AttnGAN [46] as our baseline to build DR-GAN.

Fig. 1. Framework of the proposed DR-GAN.

As shown in Fig. 1, DR-GAN has a Text Encoder [46], a conditioning augmentation module [51] F_ca, m generation modules G_o^i (i = 0, 1, 2, ..., m − 1), and two new designs: m semantic disentangling modules (SDMs) SDM_i and m DNMs DNM_i, i = 0, 1, 2, ..., m − 1.

The Text Encoder transforms the input text description (a single sentence) into the sentence feature s′ and the word features W. The F_ca [51] converts the sentence feature s′ to a conditioning sentence feature s.


The SDM distills key information from the text or image features in the intermediate stages of generation, for better approximating the real image distribution. In the testing stage, the SDMs take noise z ∼ N(0, 1), the sentence feature s, and the word features W to produce a series of hidden features H_i (i = 0, 1, 2, ..., m − 1); in the training stage, besides z, s, and W, the i-th SDM_i also takes the i-th scale real image I_i^∗ as input to generate H_i. Then, G_o^i takes H_i to generate the i-th scale image Î_i. The DNM_i contains a VAE module and a discriminator: the former normalizes image latent distributions, and the latter distinguishes between the real image and the synthesized image. The generation-stage information flow is formulated as

    SDM_0 : H_0 = SDM_0(z, F_ca(s′), I_0^∗)        (training stage)
    SDM_i : H_i = SDM_i(H_{i−1}, W, I_i^∗)          (training stage)
    SDM_0 : H_0 = SDM_0(z, F_ca(s′))                (testing stage)
    SDM_i : H_i = SDM_i(H_{i−1}, W)                 (testing stage)
    G_o^i : Î_i = G_o^i(H_i),  i = 0, 1, 2, ..., m − 1.    (1)

Note that SDM_0 only contains a series of convolution layers and upsampling modules. The proposed SDM is adopted in the SDM_i, i = 0, 1, 2, ..., m − 1.

B. Semantic Disentangling Module (SDM)

The semantic abstractness and ambiguity of text descriptions often make it difficult for the generator to capture the accurate semantics of the given sentence. Such inaccurate semantics are not conducive to distribution learning and then lead to incorrect spatial structures and semantics in synthesized images. To this end, we propose an SDM (see Fig. 2) to help generators suppress irrelevant spatial information and highlight relevant spatial information for generating high-quality images.

Fig. 2. Architecture of the proposed SDM. WAM: word-level attention mechanism [46]; RIRM: real image reconstruction module; SD Loss: semantic disentangling loss.

We build our SDM on the widely adopted cascaded attentional generative model (CAGM) introduced by AttnGAN [46], because the word-level attention mechanism (WAM) in the CAGM can effectively enhance the semantic details of the generated images.

Initially, the SDM was designed to directly extract the key and non-key information from text features. Because text and image belong to heterogeneous domains, it is difficult to achieve reasonable semantic matching in the feature space [50]. Compared with the text features, the word-level context features from the WAM contain more structural and semantic information and semantically better match the image features. Besides, the semantics of the word-level context features and image features come from the input word features W and sentence feature s. Therefore, SDM is designed to distill key information and filter out non-key information on the word-level context features and image features of the CAGM.

In this section, we first revisit the WAM [46] to acquire the word-level context features and image features. Second, we introduce the SSAM to represent the key and non-key information. Finally, we introduce the semantic disentangling loss (SDL) to drive the SDM to conduct the semantic disentangling.

1) Word-Level Attention Mechanism (WAM) [46]: In cases where there is no ambiguity, we omit subscripts in this WAM description. As shown in Fig. 2, the WAM has two inputs: the word features W ∈ R^{D×T} and the image features H ∈ R^{D̂×N} from the previous hidden layer.

First, the word features are mapped to the same latent semantic space as the image features, i.e., W′ = UW, W′ = {w′_i ∈ R^{D̂} | i = 1, 2, ..., T}, where U ∈ R^{D̂×D} is a perceptual layer. Each column of H = {h_j ∈ R^{D̂} | j = 1, 2, ..., N} (hidden features) is a feature vector of an image sub-region.

Second, for the j-th image sub-region, its dynamic representation over word vectors w.r.t. h_j is

    q_j = Σ_{i=1}^{T} θ_{j,i} w′_i,  where θ_{j,i} = exp(S_{j,i}) / Σ_{k=1}^{T} exp(S_{j,k}).    (2)

Here, S_{j,i} = h_j^T w′_i, and θ_{j,i} indicates the weight the model assigns to the i-th word when generating the j-th sub-region of the image.

Third, the word-level context feature for H is denoted by Q′ = (q_1, q_2, ..., q_N) ∈ R^{D̂×N}.

As shown in Fig. 2, the WAM generates a word-level context feature Q′ from the given word features W and image features H. Here, Q′ is a weighted combination of word features that expresses the image feature H, and it can effectively enrich the semantics of image details [46]. Due to the abstractness and ambiguity of the text description, the generator is prone to parsing wrong or inaccurate semantics, and such inaccurate semantic extraction leads to incorrect semantics and structures in Q′ and H. Thus, it is necessary to distill the key information from the word-level context feature Q′ and the intermediate image feature H for image generation. Next, we use the spatial self-attention mechanism to represent the key and non-key features of Q′ and H, respectively.
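As a concrete illustration of (2), the following PyTorch-style sketch computes the word-level context features from word and image features; the tensor shapes, the choice of a 1 × 1 convolution for the perceptual layer U, and the module name are our own assumptions rather than the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAttention(nn.Module):
    """Sketch of the word-level attention mechanism (WAM), Eq. (2).

    Assumed shapes: W is (B, D, T) word features, H is (B, Dh, N) image
    sub-region features. Returns Q' of shape (B, Dh, N).
    """
    def __init__(self, word_dim: int, hidden_dim: int):
        super().__init__()
        # U in Eq. (2): perceptual layer mapping words into the image feature space.
        self.proj = nn.Conv1d(word_dim, hidden_dim, kernel_size=1, bias=False)

    def forward(self, W: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        Wp = self.proj(W)                              # (B, Dh, T): W' = U W
        scores = torch.bmm(H.transpose(1, 2), Wp)      # (B, N, T): S_{j,i} = h_j^T w'_i
        theta = F.softmax(scores, dim=-1)              # attention over words per sub-region
        Qp = torch.bmm(Wp, theta.transpose(1, 2))      # (B, Dh, N): q_j = sum_i theta_{j,i} w'_i
        return Qp

# Usage (shapes assumed): Qp = WordLevelAttention(word_dim=256, hidden_dim=64)(W, H)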


2) Spatial Self-Attention Mechanism: As shown in Fig. 2, we use a spatial self-attention mechanism to represent the key and non-key information of the word-level context feature Q′_i and the intermediate image feature H_{i−1}, respectively.

First, we represent the key and non-key information of the image feature H_{i−1}. The spatial attention mask Mask_i^H of the feature H_{i−1} is defined as

    Mask_i^H = Sig.(Conv2_{1×1}(ReLU(Conv1_{3×3}(H_{i−1}))))    (3)

where H_{i−1} is followed by a 3 × 3 convolutional layer Conv1_{3×3}(·) and a 1 × 1 convolutional layer Conv2_{1×1}(·), and ReLU and Sigmoid (Sig.) are used as activation functions. We use H_i^+ = Mask_i^H ⊙ H_{i−1} to express the key spatial information of H_{i−1}, and use H_i^− = H_{i−1} − H_i^+ to express the non-key information.

Second, we represent the key and non-key information of the word-level context feature Q′_i. Although Q′_i can reflect the semantic matching between words and image sub-regions, Q′_i is still defined in the text embedding space and lacks the necessary spatial structure information. Thus, we design a convolution module "ResBlock" to convert the word-context matrix Q′_i to the refined feature Q_i. The spatial attention mask Mask_i^Q of the feature Q_i is defined as

    Mask_i^Q = Sig.(Conv2_{1×1}(ReLU(Conv1_{3×3}(Q_i)))).    (4)

We use Q_i^+ = Mask_i^Q ⊙ Q_i to express the key spatial information of Q_i, and use Q_i^− = Q_i − Q_i^+ to express the non-key information.
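The following sketch illustrates one way to implement the mask in (3) and (4) and the key/non-key split; the channel sizes and module names are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn

class SpatialSelfAttentionMask(nn.Module):
    """Sketch of the spatial self-attention masks in Eqs. (3)-(4).

    Given a feature map (either H_{i-1} or the refined word-context feature
    Q_i), predict a single-channel spatial mask and split the features into
    key (F+) and non-key (F-) parts.
    """
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)  # Conv1_{3x3}
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=1)                    # Conv2_{1x1}

    def forward(self, feat: torch.Tensor):
        mask = torch.sigmoid(self.conv2(torch.relu(self.conv1(feat))))  # (B, 1, h, w)
        key = mask * feat          # F+ : key spatial information
        non_key = feat - key       # F- : non-key spatial information
        return key, non_key, mask

# Usage (channel count assumed): H_plus, H_minus, _ = SpatialSelfAttentionMask(64)(H_prev)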
3) Semantic Disentangling Loss: To drive the SDM to better distinguish between the key and non-key information of Q_i and H_{i−1}, we further design a new SDL term. In the generation task, the generated image distribution and the real image distribution are assumed to be the same type of distribution [12]. If the mean and variance of two such distributions are the same, then the two distributions are identical. Therefore, we use constraints on the mean and variance to separate key information from non-key information of Q_i and H_{i−1} when constructing the SDL loss. Specifically, we push the mean and variance of the key information to approximate those of real images in the latent space, and vice versa. The SDL on H_{i−1} is defined as

    L_SDL^{H_i} = SP(‖μ(H_i^+) − μ(H_i^∗)‖ − ‖μ(H_i^−) − μ(H_i^∗)‖)
                + SP(‖σ(H_i^+) − σ(H_i^∗)‖ − ‖σ(H_i^−) − σ(H_i^∗)‖).    (5)

Similarly, the SDL on the feature map Q_i is denoted as

    L_SDL^{Q_i} = SP(‖μ(Q_i^+) − μ(H_i^∗)‖ − ‖μ(Q_i^−) − μ(H_i^∗)‖)
                + SP(‖σ(Q_i^+) − σ(H_i^∗)‖ − ‖σ(Q_i^−) − σ(H_i^∗)‖).    (6)

Here, μ(·) and σ(·) compute the mean and variance of feature maps within a batch, and SP(x) = ln(1 + e^x). H_i^∗ is the feature map of the corresponding real image I_i^∗. We introduce the real image reconstruction module (RIRM) shown in Fig. 2 to acquire the real image features H_i^∗. The RIRM contains an encoder and a decoder. The encoder takes the real image I_i^∗ as input and outputs the real image feature H_i^∗. The decoder takes the real image feature H_i^∗ and reconstructs the real image under the reconstruction loss ‖RIRM(I_i^∗) − I_i^∗‖₁. Note that the decoder and the generation modules G_o form a Siamese network, which can provide high-quality real image features for the SDM. The loss functions in the SDM are summarized as

    L_SDL_i = λ₁ L_SDL^{H_i} + λ₂ L_SDL^{Q_i} + λ₃ ‖RIRM(I_i^∗) − I_i^∗‖₁.    (7)

Here, ‖RIRM(I_i^∗) − I_i^∗‖₁ is the reconstruction loss in the RIRM. Finally, the purified key features Q_i^+ and H_i^+ are concatenated and then fused and upsampled (by using a ResBlock and an upsampling module) to get the next-stage feature map H_i.
Fig. 3. Image feature H1 (Stage-1), word-context matrix Q2 (Stage-2), image feature H2 (Stage-2), and H2's corresponding generated image Î2 (Stage-2). The examples in the first row are from our Baseline (Base., AttnGAN [46]); the examples in the second row are from Baseline+SDM (Base.+SDM). The warmer the color, the stronger the feature response.

In Fig. 3, we show the image feature H1 (stage-1), the word-level context feature Q2 (stage-2), the image feature H2 (stage-2), and the synthesized image Î2 (stage-2). The first row is the visualization of the Baseline (Base.) and the second row is the visualization of Baseline+SDM (Base.+SDM).

In the first row (Baseline) of Fig. 3: due to the abstractness and ambiguity of text descriptions, it is difficult for the generator to accurately capture key semantics, and some non-key semantics get mixed up in the generation process; as a result, strange structures and details easily appear in the generated image features, such as H1. The calculation of Q2 requires the participation of H1, which further leads to strange structures and details in Q2. Consequently, the structure of the generated image Î2 is chaotic. In the second row (Base.+SDM) of Fig. 3: based on the key-information selection strategy driven by the SDM, the non-key structural information in H1 and Q2 can be better filtered out. As a result, the structure and semantics of the image feature H2 become more reasonable, and so does the structure of the synthesized image Î2.

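Before turning to the DNM, the sketch below summarizes how the SDL terms in (5)–(7) could be computed in PyTorch; the scalar batch-statistics reduction and the default weights (taken from the values reported in the parametric sensitivity analysis later in the article) are our assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def sdl_term(key: torch.Tensor, non_key: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """One semantic disentangling term, as in Eqs. (5)-(6).

    Pushes the batch mean/variance of the key features toward those of the
    real-image features H* (from the RIRM) and pushes the non-key features
    away, through the softplus SP(x) = ln(1 + e^x).
    """
    mu_k, mu_n, mu_r = key.mean(), non_key.mean(), real.mean()
    sd_k, sd_n, sd_r = key.std(), non_key.std(), real.std()
    mean_gap = (mu_k - mu_r).abs() - (mu_n - mu_r).abs()
    std_gap = (sd_k - sd_r).abs() - (sd_n - sd_r).abs()
    return F.softplus(mean_gap) + F.softplus(std_gap)

def sdl_loss(h_plus, h_minus, q_plus, q_minus, h_real, recon, real_img,
             lam1=1e-3, lam2=1e-1, lam3=1e-5):
    """Total SDM loss of Eq. (7): weighted H-term, Q-term, and RIRM L1 reconstruction."""
    rirm_l1 = F.l1_loss(recon, real_img)
    return (lam1 * sdl_term(h_plus, h_minus, h_real)
            + lam2 * sdl_term(q_plus, q_minus, h_real)
            + lam3 * rirm_l1)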

C. Distribution Normalization Module (DNM)

The aforementioned SDM can help the generator distill key semantic information for image generation. But diverse visual information, messy backgrounds, and other non-key visual content in images usually make the image distribution more complicated. On the discriminator's side, such complicated image distributions make the distinction of real and synthesized images harder, and the discriminator may fail to effectively identify the synthesized image.

We know that data normalization mechanisms [2], [15], [41], [45] can reduce the noise and internal covariate shift of data, and further improve manifold learning efficiency in the deep learning community. In the discrimination stage of a GAN, images are first sampled from the real image distribution and the synthesized image distribution. Second, the discriminator determines whether the current input image is sampled from the real image distribution or the synthesized image distribution, i.e., True or False. Due to the diversity of images and the randomness of synthesized images, the distributions of synthesized and real images are complicated. Such complex image distributions make it difficult for the discriminator to distinguish whether the current input image is sampled from the real or the generated image distribution, and make it difficult for the generator to align the generated distribution with the real image distribution. Therefore, it is necessary to reduce the complexity of the distribution. Normalization is an effective strategy for denoising and for reducing complexity, so we introduce a normalization of the latent distribution.

As a generative model, the VAE [23] can effectively denoise the latent distribution and reduce its complexity. The VAE assumes that the latent embedding vector of an image follows a Gaussian distribution N(μ̃, σ̃), and then normalizes N(μ̃, σ̃) toward a standard normal distribution N(0, 1). Owing to the image reconstruction in the VAE, the normalized embedding vector can preserve key semantic visual information. Thus, we build a VAE module in the DNM to normalize the image latent distributions and help the discriminator better distinguish between the "True" image and the "Fake" image.

Fig. 4. Architecture of the DNM. The input x is a generated image Î_i or a real image I_i^∗. The discriminator D_i connects an encoder E_i^D(·) to a logical classifier ψ_i(·). The VAE module consists of a variational encoder (that stacks E_i^D(·) and a variational sampling module ϕ_i(·)), and a decoder D_i^E(·).

The structure of the i-th DNM is shown in Fig. 4. The DNM contains two sub-modules: the discriminator D_i and the VAE module A_i. To simplify the notation, we omit the subscript i; x can be a generated image Î or a real image I^∗. The discriminator is composed of an encoder E^D(·) and a logical classifier ψ(·). E^D(·) encodes the image x into an embedding vector v. The embedding v, combined with the text embedding s, is fed to the logical classifier ψ(·), which identifies whether x is a real or generated image. As mentioned above, diverse visual information, messy backgrounds, and other non-key visual content in images make the distribution of the embedding vectors v complicated and make the identification of x harder. Thus, we adopt a VAE module to normalize and denoise the latent distribution of the embedding vectors v. In addition to reducing the complexity of the image latent distribution, using a VAE also pushes the encoded image feature vector v to record important image semantics (through reconstruction).

Our VAE module A adopts a standard VAE architecture [23]. As shown in Fig. 4, A has a variational encoder (which consists of the encoder E^D(·) and a variational sampling module ϕ(·)), and a decoder D^E(·).

The information flow of the VAE module A is as follows.
1) Given an image x, x is first fed to the encoder E^D(·), and E^D(·) outputs the image latent embedding v.
2) ϕ(·) infers the mean and variance of v and builds a Gaussian distribution N(μ̃(ϕ(v)), σ̃(ϕ(v))). N(μ̃(ϕ(v)), σ̃(ϕ(v))) is further normalized toward a normal distribution by KL(N(μ̃(ϕ(v)), σ̃(ϕ(v))) ‖ N(0, 1)). To make this procedure differentiable, the re-sampling trick [23] is adopted to get z^∗ = z · σ̃(ϕ(v)) + μ̃(ϕ(v)), z ∼ N(0, 1).
3) z^∗ and the text embedding s are concatenated and then fed to the decoder D^E(·) to reconstruct the image x^∗. Note that the decoder takes both s and z^∗ for reconstruction, because the image generation here is conditioned on the text description.

1) Distribution Adversarial Loss: Following the alternating optimization mechanism of GANs, our VAE module is trained together with the discriminator. Based on the variational lower bound of the VAE [23], the loss function of the VAE module in DNM_i can be defined as

    LD_i^D = ‖Î_i − D_i^E(ϕ_i(E^D(Î_i)), s)‖₁ + ‖I_i^∗ − D_i^E(ϕ_i(E^D(I_i^∗)), s)‖₁
           + KL(N(μ̃_i(ϕ_i(E^D(Î_i))), σ̃_i(ϕ_i(E^D(Î_i)))) ‖ N(0, 1))
           + KL(N(μ̃_i(ϕ_i(E^D(I_i^∗))), σ̃_i(ϕ_i(E^D(I_i^∗)))) ‖ N(0, 1)).    (8)

In each generator training step, the generated images have an unnormalized distribution, while the distributions of real images have been normalized in the discriminator's training step. Hence, it is difficult for the generator to produce distributions that approximate the normalized real image distribution. Therefore, 1) we want to normalize the generated image distribution in the VAE module during the generator's training step; and 2) we want to align the normalized generated distribution with the normalized real image distribution. To achieve these two goals, we define a distribution consistency loss, i.e.,

    LG_i^D = KL(N(μ̃_i(ϕ_i(E^D(Î_i))), σ̃_i(ϕ_i(E^D(Î_i)))) ‖ N(0, 1)) + ‖I_i^∗ − D_i^E(ϕ_i(E^D(Î_i)), s)‖₁    (9)

where the first term is designed for the first goal and the second term for the second goal.

We denote the two loss functions LG_i^D and LD_i^D as the DAL terms. In the discriminator's training stage, LD_i^D helps the discriminator better distinguish the synthesized image from the real image and better learn the distribution decision boundary between the generated and real image latent distributions. In the generator's training stage, LG_i^D helps the generator learn and capture the real image distribution in the normalized latent space.

Our DNM, combining the VAE and the DAL, can effectively reduce the complexity of the distribution constructed by the image embedding v and enrich the high-level semantic information of the image embedding v. Such a normalized embedding v helps the discriminator better distinguish between the "Fake" image and the "True" image. Consequently, the generator can also better align the generated distribution with the real image distribution.

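The sketch below illustrates the DNM components described above: a variational head implementing the re-sampling trick and text-conditioned reconstruction, and the two DAL terms of (8) and (9). The layer sizes, the toy 64 × 64 decoder, and the function names are our assumptions; enc stands for the discriminator encoder E^D(·).

import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalHead(nn.Module):
    """Sketch of the DNM's variational sampling module phi(.) and decoder D^E(.)."""
    def __init__(self, feat_dim: int, z_dim: int, sent_dim: int, img_channels: int = 3):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, z_dim)
        self.to_logvar = nn.Linear(feat_dim, z_dim)
        # Toy decoder: maps [z*, s] to a 64x64 image-sized tensor (sizes assumed).
        self.decode = nn.Sequential(
            nn.Linear(z_dim + sent_dim, 8 * 8 * 64), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(scale_factor=8, mode="nearest"),
            nn.Conv2d(64, img_channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, v: torch.Tensor, s: torch.Tensor):
        mu, logvar = self.to_mu(v), self.to_logvar(v)
        z_star = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # re-sampling trick
        recon = self.decode(torch.cat([z_star, s], dim=1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL(N(mu,sigma) || N(0,1))
        return recon, kl

def dal_discriminator_loss(vae, enc, fake_img, real_img, s):
    """Sketch of Eq. (8): reconstruct both images and normalize both latents."""
    rec_f, kl_f = vae(enc(fake_img), s)
    rec_r, kl_r = vae(enc(real_img), s)
    return F.l1_loss(rec_f, fake_img) + F.l1_loss(rec_r, real_img) + kl_f + kl_r

def dal_generator_loss(vae, enc, fake_img, real_img, s):
    """Sketch of Eq. (9): normalize the generated latent and pull its
    reconstruction toward the real image."""
    rec_f, kl_f = vae(enc(fake_img), s)
    return kl_f + F.l1_loss(rec_f, real_img)

# In training, these DAL terms are added to the usual adversarial losses with
# weights lambda_4 (generator side) and lambda_5 (discriminator side), as in
# Eqs. (12)-(13) of Section III-D below.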

D. Objective Functions in DR-GAN

Combining the above modules, at the i-th stage of DR-GAN, the generative loss LG_i and discriminative loss LD_i are defined as

    LG_i = −(1/2) E_{Î_i∼P_{G_i}}[log D_i(Î_i)] − (1/2) E_{Î_i∼P_{G_i}}[log D_i(Î_i, s)]    (10)

where the first (unconditional) term drives the generator to produce high-quality images toward the real image distribution to fool the discriminator, and the second (conditional) term drives the generated images to better match the text descriptions. D_i(·) = ψ̂(E_i^D(·)) is the unconditional discriminator and D_i(·, ·) = ψ(E_i^D(·), s) is the conditional discriminator; ψ̂(·) and ψ(·) are the unconditional and conditional logical classifiers, respectively.

The discriminator D_i is trained to classify the input image into the "Fake" or "True" class by minimizing the cross-entropy loss

    LD_i = [−(1/2) E_{I_i^∗∼P_{data_i}}[log D_i(I_i^∗)] − (1/2) E_{Î_i∼P_{G_i}}[log(1 − D_i(Î_i))]]
         + [−(1/2) E_{I_i^∗∼P_{data_i}}[log D_i(I_i^∗, s)] − (1/2) E_{Î_i∼P_{G_i}}[log(1 − D_i(Î_i, s))]]    (11)

where the first bracket is the unconditional loss, the second bracket is the conditional loss, I_i^∗ is drawn from the real image distribution P_{data_i} at the i-th scale, and Î_i is drawn from the distribution P_{G_i} of the generated images at the same scale.

To generate realistic images, the final objective functions in the generation training stage (LG) and discrimination training stage (LD) are, respectively, defined as

    LG = Σ_{i=0}^{m−1} (LG_i + λ₄ LG_i^D + L_SDL_i) + α L_DAMSM    (12)

    LD = Σ_{i=0}^{m−1} (LD_i + λ₅ LD_i^D).    (13)

The loss function L_DAMSM [46] is designed to measure the matching degree between images and text descriptions; the DAMSM loss makes generated images better conditioned on text descriptions. DR-GAN has three-stage generators (m = 3), like the most recent GAN-based T2I methods [3], [24], [29], [33], [39], [46].

IV. EXPERIMENTAL RESULTS

In this section, we perform extensive experiments to evaluate the proposed DR-GAN. First, we compare our DR-GAN with other SOTA GAN-based T2I methods [11], [20], [24], [29], [46]. Second, we discuss the effectiveness of each new module introduced in DR-GAN: SDM and DNM. All the experiments are performed on one GTX 2080 Ti using the PyTorch toolbox.

A. Experiment Settings

1) Datasets: We conduct experiments on two widely used datasets, CUB-Bird [43] and MS-COCO [26]. The CUB-Bird [43] dataset contains 11 788 bird images, and each bird image has ten sentences describing the fine-grained visual details. The MS-COCO dataset [26] contains 80k training images and 40k test images, and each image has five sentences describing the visual information of the scene. We pre-process and split the images of the two datasets following the same settings as in [37] and [55].

2) Evaluation: We compare our DR-GAN with other GAN-based T2I approaches from three aspects: image diversity, distribution consistency, and semantic consistency. Each model generates 30 000 images conditioned on the text descriptions from the unseen test set for evaluation. The ↑ means that the higher the value, the better the performance of the model, and vice versa.

1) Image Diversity: Following almost all T2I approaches, we adopt the fine-tuned Inception models [51] to calculate the inception score (IS ↑), which measures image diversity.
2) Distribution Consistency: We use the Fréchet inception distance (FID ↓) [12] and mode score (MS ↑) [4] to evaluate the distribution consistency between generated images and real images. The image features in FID and MS are extracted by a pre-trained Inception-V3 network [35].
3) Semantic Consistency: We use the R-precision ↑ and the Human Perceptual score (H.P. score ↑) to evaluate the semantic consistency between the text description and the synthesized image.

3) R-Precision: Following [46], we also use R-precision to evaluate semantic consistency. Given a pre-trained image-to-text retrieval model, we use generated images to query their corresponding text descriptions. First, given a generated image x̂ conditioned on sentence s and 99 randomly sampled sentences {s_i : 1 ≤ i ≤ 99}, we rank these 100 sentences with the pre-trained image-to-text retrieval model. If the ground-truth sentence s is ranked highest, we count this as a successful retrieval. For all the images in the test dataset, we perform this retrieval task once and finally count the percentage of successful retrievals as the R-precision score. Higher R-precision means greater semantic consistency.

4) Human Perceptual Score (H.P. Score): To get the H.P. score, we randomly select 2000 text descriptions from the CUB-Bird test set and 2000 text descriptions from the MS-COCO test set. Given the same text description, 30 volunteers (not including any author) are asked to rank the images generated by the different methods. The average ratio of being ranked best by human users is calculated to evaluate the compared methods.

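The R-precision protocol described above can be summarized by the following sketch; the retrieval model, the feature shapes, and the use of cosine similarity are assumptions for illustration only.

import torch

def r_precision(image_feats: torch.Tensor, text_feats: torch.Tensor) -> float:
    """Sketch of the R-precision protocol.

    image_feats: (N, d) features of generated images from a pre-trained
    image-to-text retrieval model; text_feats: (N, d) features of their
    ground-truth sentences. Each image's ground-truth sentence competes
    against 99 randomly sampled mismatched sentences; we count how often
    it is ranked first.
    """
    image_feats = torch.nn.functional.normalize(image_feats, dim=1)
    text_feats = torch.nn.functional.normalize(text_feats, dim=1)
    n, hits = image_feats.size(0), 0
    for i in range(n):
        neg = torch.randperm(n)
        neg = neg[neg != i][:99]                                       # 99 mismatched sentences
        cand = torch.cat([text_feats[i:i + 1], text_feats[neg]], dim=0)  # (100, d)
        sims = cand @ image_feats[i]                                   # (100,)
        if sims.argmax().item() == 0:                                  # ground truth ranked highest
            hits += 1
    return hits / n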

TABLE I. IS ↑, FID ↓, MS ↑, R-precision ↑, and Human Perceptual score (H.P. score) ↑ by some SOTA GAN-based T2I models and our DR-GAN on the CUB-Bird and MS-COCO test sets. † indicates the scores are computed from images generated by the open-sourced models. ∗ indicates the scores are reported in DMGAN [29]. ∗∗ indicates the scores are reported in AttnGAN+O.P. [40]. Other results were reported in the original articles. Bold marks the best result.

Fig. 5. Images of 256 × 256 resolution generated by our DR-GAN, DM-GAN [29], and AttnGAN [46] conditioned on text descriptions from the MS-COCO (upper part) and CUB-Bird (bottom part) test sets.

B. Comparison With State-of-the-Arts

1) Image Diversity: We use IS to evaluate the image diversity. As shown in Table I, DR-GAN achieves the second-highest IS score on the CUB-Bird test set and the highest IS score on the MS-COCO test set. The IS score (5.23) of RiFeGAN [20] is higher than ours (4.90) on the CUB-Bird dataset, but RiFeGAN [20] uses ten sentences to train the generator on the CUB-Bird dataset, while our DR-GAN only uses one sentence, following the standard T2I problem formulation.


On the larger-scale and challenging MS-COCO dataset, RiFeGAN [20] uses five sentences to train the model but obtains a significantly lower IS score (31.70) than our DR-GAN (34.59). When the given database only contains images with a single sentence (which is common in practical tasks such as story visualization [49], text-to-video [48], and other text-guided image generation [47]), RiFeGAN [20] cannot be used. In contrast, methods such as our DR-GAN, AttnGAN [46], and DMGAN [29], which only need one sentence per image, can be used.

2) Distribution Consistency: We use FID and MS to evaluate the distribution consistency between the generated image distribution and the real image distribution. As shown in Table I, compared with these GAN-based T2I methods, our DR-GAN achieves competitive performance on the CUB-Bird and MS-COCO test sets in terms of FID and MS. The FID of Obj-GAN [24] is lower than that of our DR-GAN. This is because Obj-GAN [24] and similar methods [3], [13], [14], [19], [40] require additional information, including the bounding boxes and shapes of the objects of interest, for training and synthesis. This additional information can help the generator better capture the object layout. It is available in the MS-COCO dataset but is often unavailable for other datasets such as CUB-Bird. In general, producing extra descriptions or object bounding boxes and shapes for a new database to train the generator is expensive, which limits scalability and usability in more general text-guided image generation. Thus, our DR-GAN still achieves competitive performance in the distribution consistency evaluation.

3) Semantic Consistency: We use the R-precision and the Human Perceptual score (H.P.) to evaluate the semantic consistency. As shown in Table I, compared with AttnGAN and DMGAN [29], our DR-GAN also achieves the best performance on the semantic consistency evaluation on these two datasets.

TABLE II. Performance of applying SDM and DNM on other GAN-based T2I methods on the CUB-Bird test set. The measures include IS ↑, FID ↓, and MS ↑. AttnGAN− indicates that we remove the WAM in AttnGAN [46]. † indicates the scores are computed from images generated by the open-sourced models. Other results were reported in the original articles.

4) Generalization: To evaluate the generalizability of our proposed SDM and DNM, we integrate the SDM and DNM modules into several well-known/SOTA GAN-based T2I models, including StackGAN v2 [11], AttnGAN− [46], and DMGAN [29]. Here, AttnGAN− denotes the AttnGAN [46] with its WAM removed. As shown in Table II, SDM and DNM help all these GAN-based T2I models achieve better performance on all three measures. This indicates that SDM and DNM can be used as general mechanisms in the T2I generation task.

5) Visualization: We qualitatively evaluate our DR-GAN and some GAN-based T2I methods by image visualization. As shown in Fig. 5, compared with AttnGAN [46] and DMGAN [29], our DR-GAN can synthesize higher-quality images on the CUB-Bird and MS-COCO test sets.

On the CUB-Bird test set: 1) compared with our DR-GAN, the birds synthesized by AttnGAN and DMGAN contain some strange or incomplete details and structures; because SDM can better distill key information for image generation, our DR-GAN performs better on structure generation; and 2) due to the introduction of DNM, DR-GAN can better capture the real image distribution. Therefore, compared with AttnGAN and DMGAN, the bird images generated by DR-GAN are more realistic and contain fuller semantic details. In particular, the birds' wings in Fig. 5(b), (c), (d), and (k) have very detailed textures and full colors. On the MS-COCO test set: the semantics of the text descriptions are rather sparse, and it is very difficult for the generator to capture sufficient semantics to generate high-quality images; therefore, most current methods cannot synthesize vivid images. Compared with AttnGAN and DMGAN, our proposed DR-GAN is more reasonable in object layout and richer in semantic details.

Besides, in Fig. 6, we show the images synthesized by StackGAN v2 [11], AttnGAN− [46], and DMGAN [29], and by StackGAN v2+SDM+DNM (StackGAN v2∗), AttnGAN−+SDM+DNM (AttnGAN−∗), and DMGAN+SDM+DNM (DMGAN∗). As shown in Fig. 6, our SDM and DNM can further help these methods (StackGAN v2 [11], AttnGAN− [46], and DMGAN [29]) improve the structure and enrich the semantic details of the synthesized birds, so as to make the generated images more realistic.

Fig. 6. Images of 256 × 256 resolution generated by StackGAN v2 [11], StackGAN v2+SDM+DNM (StackGAN v2∗), AttnGAN− [46], AttnGAN−+SDM+DNM (AttnGAN−∗), DMGAN [29], and DMGAN+SDM+DNM (DMGAN∗) conditioned on text descriptions from the CUB-Bird test set.

TABLE III. Training time, training epochs, model size, and testing time of our DR-GAN and other SOTA T2I methods on the MS-COCO dataset.

6) Model Cost: As shown in Table III, we compare our proposed DR-GAN with other SOTA T2I methods under four model-cost measures: training time, training epochs, model size, and testing time. Using the MS-COCO dataset as an example, compared with AttnGAN (Baseline) and DM-GAN, our proposed DR-GAN lies between AttnGAN and DM-GAN in terms of model cost while achieving the highest performance.

C. Ablation Study

1) Effectiveness of New Modules: In this subsection, we evaluate the effectiveness of each new component qualitatively and quantitatively.


The numerical results are documented in Table IV. The visualization results are shown in Figs. 7–9.

TABLE IV. IS ↑, FID ↓, and MS ↑ produced by combining different components of DR-GAN on the CUB-Bird and MS-COCO test sets. Our Baseline (Base.) is AttnGAN [46]. DR-GAN = Base.+SDM+DNM.

2) Quantitative Results: We evaluate the effectiveness of the two new components, SDM and DNM, in terms of three measures. Our Baseline (Base.) is AttnGAN [46]. As shown in Table IV, both SDM and DNM can effectively improve the performance of the baseline on these two datasets over the three measures.

First, we introduce the SDM into the baseline (Base.), i.e., Base.+SDM. As shown in Table IV, Base.+SDM leads to 7.80% and 19.31% improvement in IS, 35.36% and 11.47% improvement in FID, and 7.44% and 29.23% improvement in MS, on the CUB-Bird and MS-COCO test sets, respectively.

Second, we introduce the DNM into the baseline (Base.), i.e., Base.+DNM. As shown in Table IV, Base.+DNM leads to 9.86% and 20.74% improvement in IS, 36.95% and 16.23% improvement in FID, and 11.63% and 31.92% improvement in MS, on the CUB-Bird and MS-COCO test sets, respectively.

Finally, when we introduce both the SDM and the DNM into the baseline (Base.), i.e., DR-GAN, our DR-GAN obtains 12.38% and 33.60% improvement over the baseline in IS, 37.61% and 21.67% improvement in FID, and 13.02% and 43.23% improvement in MS, on the CUB-Bird and MS-COCO test sets, respectively. Our DR-GAN achieves 4.90 and 34.59 in terms of IS, 14.96 and 27.80 in terms of FID, and 4.86 and 33.96 in terms of MS, on the CUB-Bird and MS-COCO test sets, respectively.

3) Qualitative Results: We also qualitatively evaluate the effectiveness of each component by image visualization (see Figs. 7–9).

In Fig. 7, we can clearly see that the DNM and SDM can each effectively improve the quality of the generated images.

Fig. 7. Images of 256 × 256 resolution generated by our Baseline (Base.), Base.+SDM, Base.+DNM, and DR-GAN conditioned on text descriptions from the CUB-Bird test set.

a) Semantic disentangling module (SDM): Compared with the Baseline, the introduction of SDM better improves the bird body structure on the CUB-Bird test set, and also improves the target layout in complex scenes to some extent.


This is because SDM can directly extract key features and filter non-key features from the spatial perspective of the feature maps. Based on the constraints from the real-image feature distribution statistics, SDM tries to filter out the non-key structural information that is not conducive to distribution learning. For the COCO data, the excessive sparsity and abstraction of the text semantics prevent the generator from producing vivid images; but with SDM, the overall layout and structure of the generated images are significantly improved compared to the Baseline.

b) Distribution normalization module (DNM): Compared with the Baseline, the introduction of DNM better improves the visual representation and the semantic expression of details to some extent. This is because DNM normalizes the latent distributions of real and generated images, which drives the generator to better approximate the real image distribution. Therefore, the generated images are better in terms of visual representation and detail semantics. Compared with SDM, DNM lacks direct intervention on the image features; therefore, compared with Base.+SDM, the structure of the images generated under Base.+DNM guidance is slightly worse.

c) Semantic disentangling module + distribution normalization module (SDM+DNM): We introduce both SDM and DNM into the Baseline (Base.), i.e., DR-GAN. Based on the respective advantages of SDM and DNM, compared with the Baseline, our proposed DR-GAN performs better in visual semantics, structure, layout, and so on.

Besides, we present the top-3 word-guided attention maps and the three-stage generated images (Î0 64 × 64, Î1 128 × 128, Î2 256 × 256) from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test dataset.


Fig. 8. Top-3 word-guided attention maps and synthesized images at different stages from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test dataset.

To facilitate display, we rescale the generated images at different stages to the same size in Fig. 8. For AttnGAN and DMGAN, we observe that when the quality of the initial-stage generated image Î0 is very poor, Î0 causes confusion in the word-attention regions. Due to the lack of direct intervention on the features, the confused attention maps and image features continue to confuse Î1 and the attention maps in SDM2. In contrast, with the proposed SDM, the model gradually captures key information and filters out the non-key information for image generation in the subsequent stages. As a result, the attention information and the resulting images gradually become more reasonable.

Finally, we use t-SNE [42] to visualize the 2-D distribution of generated images and real images. As shown in Fig. 9, in the initial stage (200 epochs on the CUB-Bird dataset), the difference between the distribution of the Baseline's generated images and the real image distribution is very large. When we introduce SDM into the Baseline, i.e., Baseline+SDM, we observe a narrowing of the difference between the generated and real image distributions. When we introduce DNM into Baseline+SDM, i.e., DR-GAN, we observe a further narrowing of the difference in the 2-D distribution between the generated and real images, and DNM makes the scatter-plot area more compact. Through the t-SNE visualization, we can see that DNM and SDM are effective for distribution learning.

Fig. 9. Visualization results of the synthesized images and real images under t-SNE [42]. The green scattered points are the real images, and the red scattered points are the generated images.

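A minimal sketch of how such a visualization could be produced is given below, assuming image features extracted by any fixed encoder (e.g., a pre-trained Inception-V3); the library calls (scikit-learn's TSNE, matplotlib) and styling are ours, not the paper's.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_feats: np.ndarray, fake_feats: np.ndarray, path: str = "tsne.png"):
    """Project real/generated image features to 2-D and save a scatter plot."""
    feats = np.concatenate([real_feats, fake_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n = real_feats.shape[0]
    plt.scatter(emb[:n, 0], emb[:n, 1], s=4, c="green", label="real images")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=4, c="red", label="generated images")
    plt.legend()
    plt.savefig(path, dpi=200)
    plt.close()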

TABLE V. IS ↑, FID ↓, and MS ↑ produced by different parts of SDM on the CUB-Bird test set.

4) Validity of Different Parts of SDM: The IS ↑, FID ↓, and MS ↑ of the Baseline (Base.) are 4.36 ± 0.02, 23.98, and 4.30, respectively. As shown in Table V, setting λ3 = 0 means that the RIRM does not participate in the SDM, so the SD loss does not receive high-quality real image features, the distribution-statistics constraint deviates from the feature distribution of the real images, and the image quality deteriorates seriously. Thus, compared with Base. and Base.+SDM, the performance of Base.+SDM (λ3 = 0) is poor. When we remove the SDL, the performance of the model also deteriorates, because the necessary loss to drive the attention mechanism to extract key information is missing. In all, SDM can effectively improve the quality of model generation.

Fig. 10. Architecture of the DNM variants. The first case is the VAE without the variational sampling module, that is, general encoding and decoding (AE). The second case is the VAE framework adopted by us.

TABLE VI. IS ↑, FID ↓, and MS ↑ produced by Base.+DNM and Base.+DNM∗ on the CUB-Bird test set.

5) Impact of the Self-Encoding Mode on Performance: We modify the VAE module to observe its impact on model performance. In the first case in Fig. 10, we remove the variational sampling module ϕ(·) from the VAE module. Accordingly, the DAL is rewritten as

    LD_i^D = ‖Î_i − D_i^E(ϕ_i(E^D(Î_i)), s)‖₁ + ‖I_i^∗ − D_i^E(ϕ_i(E^D(I_i^∗)), s)‖₁    (14)

    LG_i^D = ‖I_i^∗ − D_i^E(ϕ_i(E^D(Î_i)), s)‖₁.    (15)

We name such a model DNM∗. As shown in Table VI, we compare Base.+DNM and Base.+DNM∗ on the CUB-Bird dataset under the three measures. Compared with Base.+DNM (which contains the VAE), the performance of Base.+DNM∗ decreases significantly. However, the performance of Base.+DNM∗ is still better than that of the Baseline. This is because the image encoding and decoding in the AE can enhance the expression of image semantics in v_i. In all, compared with a general AE, adopting the VAE enables the generator to obtain higher performance.

TABLE VII. Comparison with other semantic consistency strategies in some outstanding T2I methods under the IS score. "∗" represents the performance of the semantic constraint strategies in these methods under the IS score.

6) Discussion of Semantic Consistency Strategies: We compare the performance of SDM with some outstanding semantic consistency strategies. Several methods (such as CSM-GAN∗ [38], SE-GAN∗ [36], and MirrorGAN∗ [33]) introduce various semantic constraints to improve the quality of image generation. As shown in Table VII, we report the performance of the semantic constraint strategies of these methods under the IS score on the CUB-Bird dataset. SD-GAN [9] and SE-GAN [36] constrain the semantic consistency between synthesized images and real images. CSM-GAN [38] constrains the semantic consistency between the synthesized images and the text descriptions. MirrorGAN [33] introduces semantic consistency between the text descriptions reconstructed from the image and the real text descriptions. Compared with these methods, the distribution-statistic constraints (Baseline+SDM) gain better performance under the IS score. We think the randomness of images and texts leads to the instability of semantic constraints, whereas the constraints on mean and variance can guide the learning direction of the distributions for the generator and help it better align the generated image distribution with the real image distribution.

D. Parametric Sensitivity Analysis

In this subsection, we show and analyze the sensitivity of the hyper-parameters λ1, λ2, λ3, λ4, λ5, and α. In Tables I to VII (excluding Table III), the six parameters are assigned the values at which DR-GAN's performance is around the median value in Fig. 11: λ1 = 10⁻³, λ2 = 10⁻¹, λ3 = 10⁻⁵, λ4 = 1.0, λ5 = 1.0, and α = 5.0. The FID is the evaluation index used to assess the quality of the synthesized images on the CUB-Bird dataset. The FID results of DR-GAN under different values of the hyper-parameters λ1, λ2, λ3, λ4, λ5, and α are shown in Fig. 11.

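For reference, the reported weights can be collected in a single configuration; the dictionary name is ours, and the values are those stated in this subsection (chosen around the median FID in Fig. 11).

DRGAN_LOSS_WEIGHTS = {
    "lambda1": 1e-3,   # SDL term on the image feature H, Eq. (5)
    "lambda2": 1e-1,   # SDL term on the word-context feature Q, Eq. (6)
    "lambda3": 1e-5,   # RIRM reconstruction term in Eq. (7)
    "lambda4": 1.0,    # generator-side DAL weight, Eq. (12)
    "lambda5": 1.0,    # discriminator-side DAL weight, Eq. (13)
    "alpha": 5.0,      # DAMSM loss weight, Eq. (12)
}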

Fig. 11. FID ↓ scores of our DR-GAN under different values of the hyper-parameters λ1 , λ2 , λ3 , λ4 , λ5 , and α on the CUB-Bird test set.

different values of the hyper-parameters λ1 , λ2 , λ3 , λ4 , λ5 , and would be suppressed. The bad real image features mislead
α are shown in Fig. 11. the generator to learn the bad image distribution and make the
1) Hyper-Parameters λ1, λ2: In the Semantic Disentangling Loss (7), the losses (5) and (6) are proposed to drive the SDM to better distill the key information from the image feature H and the word-level context feature Q for image generation. λ1 and λ2 are important balance parameters in the SDL (7). As shown in Fig. 11: 1) the SDL has a great influence on the overall performance of DR-GAN; when λ1 = 0 or λ2 = 0, the FID score of DR-GAN increases, which means that the quality of the generated images deteriorates; and 2) as the values of λ1 or λ2 increase, the FID score also increases. A larger weight means that model training pays more attention to the regression of the distribution statistics. The statistics reflect the overall information of the distribution, and the statistic constraints are designed to assist the generator in better approximating the real image distribution; the learning of an accurate distribution still requires the GAN itself to approximate the real distribution through implicit sampling. Therefore, the weight of the SDL should be selected appropriately. Based on the results shown in Fig. 11, the suitable values of λ1 lie in {0.001, 0.01, 0.1}, and those of λ2 lie in {0.001, 0.1}.
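To make the role of these weights concrete, the sketch below shows one way a mean/variance regression term of the kind weighted by λ1 and λ2 could be computed over feature maps in PyTorch; it assumes channel-wise statistics and is a simplified stand-in for losses (5) and (6), whose exact form is defined earlier in the paper.

    import torch

    def statistic_regression(feat_gen, feat_real):
        # feat_*: (B, C, H, W) feature maps; statistics over spatial dims.
        mu_g, mu_r = feat_gen.mean(dim=(2, 3)), feat_real.mean(dim=(2, 3))
        var_g, var_r = feat_gen.var(dim=(2, 3)), feat_real.var(dim=(2, 3))

        mean_term = torch.abs(mu_g - mu_r).mean()   # weighted by lambda_1
        var_term = torch.abs(var_g - var_r).mean()  # weighted by lambda_2
        return mean_term, var_term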
2) Hyper-Parameter λ3: In the SDL (7), the parameter λ3 adjusts the weight of the reconstruction loss ||RIRM(I_i^*) − I_i^*||_1, which provides the real image feature H^* for the other terms in the SDL. When λ3 = 0, the reconstruction mechanism (i.e., RIRM) does not work, so the performance of DR-GAN drops significantly. With this loss removed, it is difficult for the SDM to match the valid mean and variance of the real image features; as a result, the real image features would be mixed with more semantic information irrelevant to image generation, or some important image semantic information would be suppressed. Such degraded real image features mislead the generator to learn a poor image distribution and lower the quality of the generated images. Besides, we found that the performance decreased as the value of λ3 increased, because the decoder in RIRM and the generation module G_o are Siamese networks. The purpose of this design is that the real image features H^* and the generated image features H can be mapped to the same semantic space. Increasing the weight makes the generation module G_o pay more attention to reconstructing real images from real image features and weakens the generation of synthesized images; that is, the Siamese networks are prone to an imbalance between the two feature sources during training. In all, based on the results shown in Fig. 11, the suitable value range of λ3 is about [0.00001, 1].
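The Siamese design discussed above can be illustrated with the following sketch, in which a single decoder instance serves both the RIRM reconstruction branch and the generation module G_o, so that H^* and H are decoded in the same semantic space. The layer sizes are placeholders and do not reproduce the paper's architecture; h_real is assumed to be the encoded real-image feature H^*.

    import torch.nn as nn
    import torch.nn.functional as F

    class SharedDecoderBranches(nn.Module):
        def __init__(self, feat_dim=64):
            super().__init__()
            # One decoder shared by both branches (weight sharing).
            self.decoder = nn.Sequential(
                nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_dim, 3, 3, padding=1),
                nn.Tanh(),
            )

        def forward(self, h_generated, h_real, real_img):
            fake_img = self.decoder(h_generated)   # generation branch G_o
            rec_img = self.decoder(h_real)         # RIRM reconstruction branch
            # ||RIRM(I*) - I*||_1, the term weighted by lambda_3.
            rec_loss = F.l1_loss(rec_img, real_img)
            return fake_img, rec_loss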

3) Hyper-Parameters λ4 and λ5: The hyper-parameter λ4 balances the weight of (9) in the generation-stage loss of DR-GAN, and λ5 balances the weight of (8); the losses (8) and (9) form the DAL. When λ4 = 0 or λ5 = 0, the FID score increases and the generated image quality decreases, which means that the two loss terms of the DAL can effectively improve the distribution quality of the generated images. When the value of λ4 or λ5 increases, the quality of the image distribution tends to decline: if the weight is too large, the image latent distribution is over-normalized and the discriminative model becomes very powerful. As GAN theory suggests [8], an overly strong discriminator is not conducive to generator optimization. So, based on the results shown in Fig. 11, the suitable value range of λ4 and λ5 is [0.1, 2].

4) Hyper-Parameter α: In the training stage of DR-GAN, we also utilize the DAMSM loss [46] to make the generated images better conditioned on the text descriptions. In Fig. 11, we show the performance of DR-GAN under different values of the hyper-parameter α. When α = 0, the FID score increases and the generated image quality decreases, which means that the semantic matching constraint helps to improve the quality of image generation. Besides, when the value of α changes, the performance of DR-GAN changes only moderately and remains relatively good overall. Based on the results shown in Fig. 11, the suitable value range of α is about [1, 20].
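Putting the weights together, the following schematic shows how the generation-stage objective could be assembled from the weighted SDL, DAL, and DAMSM terms with the settings used above; the grouping is an assumption made for illustration and does not reproduce Eqs. (5)-(9) exactly.

    # Settings from the sensitivity analysis (Fig. 11).
    WEIGHTS = dict(lambda1=1e-3, lambda2=1e-1, lambda3=1e-5,
                   lambda4=1.0, lambda5=1.0, alpha=5.0)

    def total_generator_loss(adv_loss, sdl_terms, dal_terms, damsm_loss, w=WEIGHTS):
        # sdl_terms: terms weighted by lambda_1..lambda_3 inside the SDL (7).
        mean_term, var_term, rec_term = sdl_terms
        # dal_terms: losses (8) and (9), weighted by lambda_5 and lambda_4.
        dal_8, dal_9 = dal_terms
        sdl = (w["lambda1"] * mean_term + w["lambda2"] * var_term
               + w["lambda3"] * rec_term)
        dal = w["lambda4"] * dal_9 + w["lambda5"] * dal_8
        return adv_loss + sdl + dal + w["alpha"] * damsm_loss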
Fig. 12. Failure cases generated by DR-GAN on the CUB-Bird test set (top row) and the MS-COCO test set (bottom row).

E. Limitation and Discussion

The experiments showed the effectiveness of our DR-GAN in T2I generation. However, there are a few failure cases. Some failure cases from the CUB-Bird dataset are shown in the top row of Fig. 12. Distribution normalization in SDM could sometimes lead to missing spatial local structures/parts such as heads, feet, and necks. Some failure cases from the MS-COCO dataset are shown in the second row. As in most other existing T2I models [3], [9], [24], [29], [33], [36], [46], it is difficult for the text encoder to parse reasonable location information of objects in a scene if it is not specifically provided in the sentences. The generator in DR-GAN then tends to place some objects randomly and sometimes unreasonably.

We will explore the location relationships among objects from the text knowledge graph in our future work. We will also explore the application of SDM and DNM to other/broader GAN-based image generation tasks, such as Image-to-Image Translation [16], [21] and Virtual Try-On [1], [10].

V. CONCLUSION

We proposed a novel SDM and DNM for the GAN-based T2I model and built DR-GAN for T2I generation. The SDM helps the generator better distill the key information and filter out the non-key information for image generation. The DNM helps GANs better normalize and reduce the complexity of the image latent distribution, and thus helps GAN-based T2I methods better capture the real image distribution from the text feature distribution. Extensive experimental results and analysis demonstrated the effectiveness of DR-GAN and its better performance compared with previous outstanding methods. In addition, the proposed SDM and DNM can further help other GAN-based T2I models achieve better performance and can be used as general mechanisms in the T2I generation task.

ACKNOWLEDGMENT

Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li declare that they have no conflict of interest.

REFERENCES

[1] A. Neuberger, E. Borenstein, B. Hilleli, E. Oks, and S. Alpert, "Image based virtual try-on network from unpaired data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5184-5193.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[3] B. Li, X. Qi, T. Lukasiewicz, and P. H. S. Torr, "Controllable text-to-image generation," in Proc. NeurIPS, 2019, pp. 1-11.
[4] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, "Mode regularized generative adversarial networks," in Proc. ICLR, 2017, pp. 1-13.
[5] C. Deng, Z. Li, X. Gao, and D. Tao, "Deep multi-scale discriminative networks for double JPEG compression forensics," ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 1-20, Mar. 2019.
[6] X. Fan, Y. Yang, C. Deng, J. Xu, and X. Gao, "Compressed multi-scale feature fusion network for single image super-resolution," Signal Process., vol. 146, pp. 50-60, May 2018.
[7] I. Goodfellow et al., "Generative adversarial nets," in Proc. NeurIPS, 2014, pp. 1-9.
[8] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, "A review on generative adversarial networks: Algorithms, theory, and applications," 2020, arXiv:2001.06937.
[9] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, "Semantics disentangling for text-to-image generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2327-2336.
[10] H. Yang, R. Zhang, X. Guo, W. Liu, W. Zuo, and P. Luo, "Towards photo-realistic virtual try-on by adaptively generating↔preserving image content," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1-10.
[11] H. Zhang et al., "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947-1962, Aug. 2019.
[12] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1-12.
[13] T. Hinz, S. Heinrich, and S. Wermter, "Semantic object accuracy for generative text-to-image synthesis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 3, pp. 1552-1565, Mar. 2022.
[14] S. Hong, D. Yang, J. Choi, and H. Lee, "Inferring semantic layout for hierarchical text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7986-7994.
[15] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., vol. 37, Jul. 2015, pp. 448-456.
[16] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1125-1134.
[17] L. Jiadong, P. Wenjie, and L. Feng, "CPGAN: Full-spectrum content-parsing generative adversarial networks for text-to-image synthesis," in Proc. ECCV, 2020, pp. 1-18.
[18] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, "Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7181-7189.
[19] J. Johnson, A. Gupta, and L. Fei-Fei, "Image generation from scene graphs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1219-1228.
[20] J. Cheng, F. Wu, Y. Tian, L. Wang, and D. Tao, "RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10911-10920.
[21] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223-2232.
[22] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M. Shah, "Human semantic parsing for person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1062-1071.
[23] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[24] W. Li et al., "Object-driven text-to-image synthesis via adversarial training," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12174-12182.

[25] Y. Li, K. K. Singh, U. Ojha, and Y. J. Lee, "MixNMatch: Multifactor disentanglement and encoding for conditional image generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8039-8048.
[26] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. ECCV, 2014, pp. 740-755.
[27] L. Zhang, Q. Chen, B. Hu, and S. Jiang, "Text-guided neural image inpainting," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1302-1310.
[28] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2794-2802.
[29] M. Zhu, P. Pan, W. Chen, and Y. Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5802-5810.
[30] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," in Proc. ICLR, 2018, pp. 1-26.
[31] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Proc. NeurIPS, 2016, pp. 1-9.
[32] Y.-X. Peng et al., "Cross-media analysis and reasoning: Advances and directions," Frontiers Inf. Technol., vol. 18, no. 5, pp. 44-57, 2017.
[33] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505-1514.
[34] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and A. Lee, "Generative adversarial text to image synthesis," in Proc. ICML, 2016, pp. 1060-1069.
[35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818-2826.
[36] H. Tan, X. Liu, X. Li, Y. Zhang, and B. Yin, "Semantics-enhanced adversarial nets for text-to-image synthesis," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 10500-10509.
[37] H. Tan, X. Liu, M. Liu, B. Yin, and X. Li, "KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis," IEEE Trans. Image Process., vol. 30, pp. 1275-1290, 2021.
[38] H. Tan, X. Liu, B. Yin, and X. Li, "Cross-modal semantic matching generative adversarial networks for text-to-image synthesis," IEEE Trans. Multimedia, vol. 24, pp. 832-845, 2022.
[39] T. Qiao, J. Zhang, D. Xu, and D. Tao, "Learn, imagine and create: Text-to-image generation from prior knowledge," in Proc. NeurIPS, 2019, pp. 887-897.
[40] T. Hinz, S. Heinrich, and S. Wermter, "Generating multiple objects at spatially distinct locations," in Proc. ICLR, 2019, pp. 1-23.
[41] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," 2016, arXiv:1607.08022.
[42] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579-2605, Nov. 2008.
[43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[44] H. Wang, C. Deng, F. Ma, and Y. Yang, "Context modulated dynamic networks for actor and action video segmentation with language queries," in Proc. AAAI, 2020, pp. 12152-12159.
[45] Y. Wu and K. He, "Group normalization," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3-19.
[46] T. Xu et al., "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1316-1324.
[47] Y. Liu et al., "Describe what to change: A text-guided unsupervised image-to-image translation approach," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1357-1365.
[48] Y. Li, M. Min, D. Shen, D. Carlson, and L. Carin, "Video generation from text," in Proc. AAAI, 2018, pp. 1-8.
[49] Y. Li et al., "StoryGAN: A sequential conditional GAN for story visualization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6329-6338.
[50] M. Yuan and Y. Peng, "Text-to-image synthesis via symmetrical distillation networks," in Proc. 26th ACM Int. Conf. Multimedia, Oct. 2018, pp. 1407-1415.
[51] H. Zhang et al., "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5907-5915.
[52] Z. Zhang, Y. Xie, and L. Yang, "Photographic text-to-image synthesis with a hierarchically-nested adversarial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6199-6208.

Hongchen Tan received the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2021. He is currently a Lecturer with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing, China. Various parts of his work have been published in top conferences and journals, such as the IEEE International Conference on Computer Vision (ICCV), IEEE Transactions on Image Processing (TIP), IEEE Transactions on Neural Networks and Learning Systems (TNNLS), IEEE Transactions on Multimedia (TMM), IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), and Neurocomputing. His research interests include person re-identification, image synthesis, and referring segmentation.

Xiuping Liu received the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 1999. She is currently a Professor with the School of Mathematical Sciences, Dalian University of Technology. Her research interests include shape modeling and analyzing, and computer vision.

Baocai Yin (Member, IEEE) received the M.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 1988 and 1993, respectively. He is currently a Professor with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing, China. He is also a Researcher with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing, and the Beijing Advanced Innovation Center for Future Internet Technology, Beijing. His research interests include multimedia, image processing, computer vision, and pattern recognition.

Xin Li (Senior Member, IEEE) received the B.E. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2003, and the M.S. and Ph.D. degrees in computer science from the State University of New York at Stony Brook, Stony Brook, NY, USA, in 2005 and 2008, respectively. He is currently a Professor with the Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, USA. His research interests include geometric and visual data computing, processing, and understanding, computer vision, and virtual reality.
