DR-GAN: Distribution Regularization for Text-to-Image Generation
Abstract— This article presents a new text-to-image (T2I) generation model, named distribution regularization generative adversarial network (DR-GAN), to generate images from text descriptions through improved distribution learning. In DR-GAN, we introduce two novel modules: a semantic disentangling module (SDM) and a distribution normalization module (DNM). SDM combines the spatial self-attention mechanism (SSAM) and a new semantic disentangling loss (SDL) to help the generator distill key semantic information for image generation. DNM uses a variational auto-encoder (VAE) to normalize and denoise the image latent distribution, which can help the discriminator better distinguish synthesized images from real images. DNM also adopts a distribution adversarial loss (DAL) to guide the generator to align with the normalized real image distribution in the latent space. Extensive experiments on two public datasets demonstrate that our DR-GAN achieves competitive performance in the T2I task. The code is available at https://github.com/Tan-H-C/DR-GAN-Distribution-Regularization-for-Text-to-Image-Generation.

Index Terms— Distribution normalization, generative adversarial network, semantic disentanglement mechanism, text-to-image (T2I) generation.

I. INTRODUCTION

GENERATING photographic images from text descriptions [known as text-to-image (T2I) generation] is a challenging cross-modal generation technique and a core component in many computer vision tasks such as image editing [27], [47], story visualization [49], and multimedia retrieval [18]. Compared with image generation [16], [21], [25] and image processing [5], [6], [22] tasks within the same modality, it is difficult to build a heterogeneous semantic bridge between text and image [37], [44], [50]. Many state-of-the-art T2I algorithms [3], [9], [24], [29], [33], [39] first extract text features and then use generative adversarial networks (GANs) [7] to generate the corresponding image. Their essence is to map a text feature distribution to the image distribution. However, two factors prevent GAN-based T2I methods from capturing the real image distribution: 1) the abstractness and ambiguity of text descriptions make it difficult for the generator to capture the key semantic information for image generation [32], [50]; and 2) the diversity of visual information makes the distribution of images so complex that it is difficult for GAN-based T2I models to capture the real image distribution from the text feature distribution [8]. Thus, this work explores better distribution learning strategies to enhance GAN-based T2I models.

In multimodal perceptual information, the semantics of a text description is usually abstract and ambiguous, whereas image information is usually concrete and carries rich spatial structure. Text and image information are expressed in different patterns, which makes it difficult to achieve semantic correlation based on feature vectors or tensors. Thus, it is difficult for the generator to accurately capture key semantics from text descriptions for image generation. As a result, in the intermediate stage of generation, the image features contain many non-key semantics. Such inaccurate semantics often lead to ineffective image distribution generation, and the generated images are then often semantically inconsistent, with chaotic structure, poor details, and so on. To alleviate this issue, our first strategy is to design an information disentangling mechanism on the intermediate features, to better distill key information before performing cross-modal distribution learning.

In addition, images often contain diverse visual information, messy backgrounds, and other non-key visual content, so their latent distribution is often complex. Moreover, the distribution of images is difficult to model explicitly [8]. This means that we cannot directly and explicitly learn the target image distribution from the text features. As an outstanding image generation model, GANs [7] learn the target data distribution implicitly by sampling data from the true or fake data distribution. However, such a complex image distribution makes it difficult for the discriminator in GANs to distinguish whether the current input image is sampled from the real image distribution or the generated image distribution. So, our second strategy is to design an effective distribution normalization mechanism to normalize the image latent distribution. This mechanism aims to help the discriminator better learn the distribution decision boundary between the generated and real images.

Manuscript received 30 July 2021; revised 11 December 2021 and 14 February 2022; accepted 1 April 2022. Date of publication 20 April 2022; date of current version 1 December 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0111900, in part by the National Natural Science Foundation of China under Grant 61976040, in part by the National Science Foundation of USA under Grant OIA-1946231 and Grant CBET-2115405, and in part by the Chinese Postdoctoral Science Foundation under Grant 2021M700303. (Corresponding author: Xin Li.)

Hongchen Tan and Baocai Yin are with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing 100124, China (e-mail: [email protected]; [email protected]).

Xiuping Liu is with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]).

Xin Li is with the School of Electrical Engineering and Computer Science, and the Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70808 USA (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2022.3165573.

Digital Object Identifier 10.1109/TNNLS.2022.3165573
classifier ψ(·). E_D(·) encodes the image x into an embedding vector v. The embedding v, combined with the text embedding s, is fed to the logical classifier ψ(·), which identifies whether x is a real or generated image. As mentioned above, the diverse visual information, messy background, and other non-key visual information in images make the distribution of the embedding vectors v complicated and make the identification of x harder. Thus, we adopt a VAE module to normalize and denoise the latent distribution of the embedding vectors v. In addition to reducing the complexity of the image latent distribution, using a VAE can also push the encoded image feature vector v to record important image semantics (through reconstruction).

Our VAE module A adopts a standard VAE architecture [23]. As shown in Fig. 4, A has a variational encoder (which consists of an encoder E_D(·) and a variational sampling module ϕ(·)) and a decoder D_E(·).

where the first term is designed for our first goal, and the second term is designed for our second goal.

We denote the two loss functions L^D_{G_i} and L^D_{D_i} as the DAL terms. In the discriminator's training stage, L^D_{D_i} helps the discriminator better distinguish the synthesized image from the real image and better learn the distribution decision boundary between the generated and real image latent distributions. In the generator's training stage, L^D_{G_i} can help the generator learn and capture the real image distribution in the normalized latent space.

Our DNM module, combining the VAE and DAL, can effectively reduce the complexity of the distribution constructed by the image embedding v and enrich the high-level semantic information of the image embedding v. Such a normalized embedding v helps the discriminator better distinguish between the "Fake" image and the "True" image. Consequently, the generator can also better align the generated distribution with the real image distribution.
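To make this structure concrete, the following is a minimal PyTorch sketch of a VAE-style normalization applied to the image embedding v. The class name ImageVAE, the layer sizes, and the loss weighting are illustrative assumptions rather than the exact configuration of DR-GAN's DNM.

import torch
import torch.nn as nn

class ImageVAE(nn.Module):
    """Illustrative VAE over the image embedding v (not the paper's exact DNM)."""
    def __init__(self, embed_dim=256, latent_dim=128):
        super().__init__()
        # variational sampling module phi(.): maps v = E_D(x) to a Gaussian posterior
        self.mu = nn.Linear(embed_dim, latent_dim)
        self.logvar = nn.Linear(embed_dim, latent_dim)
        # decoder D_E(.): reconstructs the embedding from the sampled latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, v):
        mu, logvar = self.mu(v), self.logvar(v)
        # reparameterization trick: sample a normalized, denoised latent code
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        v_rec = self.decoder(z)
        # KL term pulls the latent distribution toward N(0, I) (normalization);
        # the reconstruction term keeps the important image semantics
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        rec = nn.functional.l1_loss(v_rec, v)
        return z, rec + kl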
D. Objective Functions in DR-GAN

Combining the above modules, at the i-th stage of DR-GAN, the generative loss L_{G_i} and the discriminative loss L_{D_i} are defined as

L_{G_i} = -\frac{1}{2}\,\mathbb{E}_{\hat{I}_i \sim P_{G_i}}\big[\log D_i(\hat{I}_i)\big] - \frac{1}{2}\,\mathbb{E}_{\hat{I}_i \sim P_{G_i}}\big[\log D_i(\hat{I}_i, s)\big]    (11)

where the first term is the unconditional loss and the second term is the conditional loss; I_i^* is from the realistic image distribution P_{data} at the i-th scale, and \hat{I}_i is from the distribution P_{G_i} of the generated images at the same scale.

To generate realistic images, the final objective functions in the generation training stage (L_G) and the discrimination training stage (L_D) are, respectively, defined as

L_G = \sum_{i=0}^{m-1}\big(L_{G_i} + \lambda_4 L^{D}_{G_i} + L_{SDL_i}\big) + \alpha L_{DAMSM}    (12)

L_D = \sum_{i=0}^{m-1}\big(L_{D_i} + \lambda_5 L^{D}_{D_i}\big).    (13)

The loss function L_{DAMSM} [46] is designed to measure the matching degree between images and text descriptions; the DAMSM loss makes generated images better conditioned on the text descriptions. DR-GAN has three-stage generators (m = 3), like most recent GAN-based T2I methods [3], [24], [29], [33], [39], [46].

IV. EXPERIMENTAL RESULTS

In this section, we perform extensive experiments to evaluate the proposed DR-GAN. First, we compare our DR-GAN with other SOTA GAN-based T2I methods [11], [20], [24], [29], [46]. Second, we discuss the effectiveness of each new module introduced in DR-GAN: SDM and DNM. All the experiments are performed with one RTX 2080 Ti using the PyTorch toolbox.

A. Experiment Settings

1) Datasets: We conduct experiments on two widely used datasets, the CUB-Bird [43] and MS-COCO [26] datasets.

The image features used in FID and MS are extracted by a pre-trained Inception-V3 network [35].

3) Semantic Consistency: We use the R-precision ↑ and the Human Perceptual score (H.P. score ↑) to evaluate the semantic consistency between the text description and the synthesized image.

3) R-Precision: Following [46], we also use R-precision to evaluate semantic consistency. Given a pre-trained image-to-text retrieval model, we use the generated images to query their corresponding text descriptions. First, given a generated image x̂ conditioned on sentence s and 99 randomly sampled sentences {s_i : 1 ≤ i ≤ 99}, we rank these 100 sentences with the pre-trained image-to-text retrieval model. If the ground-truth sentence s is ranked highest, we count this as a successful retrieval. For all the images in the test dataset, we perform this retrieval task once and finally count the percentage of successful retrievals as the R-precision score. Higher R-precision means greater semantic consistency.

4) Human Perceptual Score (H.P. Score): To get the H.P. score, we randomly select 2000 text descriptions from the CUB-Bird test set and 2000 text descriptions from the MS-COCO test set. Given the same text description, 30 volunteers (not including any author) are asked to rank the images generated by the different methods. The average ratio ranked as the best by the human users is calculated to evaluate the compared methods.
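As a concrete reference for the R-precision protocol described above, the following is a minimal sketch. The cosine-similarity scoring and the assumption that the features come from a pre-trained image-to-text retrieval model (e.g., a DAMSM-style encoder pair) are illustrative; this is not the paper's exact evaluation code.

import torch

def r_precision_hit(image_feat, text_feats, gt_index=0):
    """image_feat: (D,) retrieval feature of one generated image.
    text_feats: (100, D) features of the ground-truth sentence plus 99 randomly
    sampled sentences; the ground-truth sentence sits at row gt_index."""
    sims = torch.nn.functional.cosine_similarity(
        image_feat.unsqueeze(0), text_feats, dim=-1)
    # the retrieval succeeds if the ground-truth sentence is ranked first
    return int(torch.argmax(sims).item() == gt_index)

# R-precision = (number of successful retrievals) / (number of test images)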
TABLE I
IS ↑, FID ↓, MS ↑, R-Precision ↑, and Human Perceptual Score (H.P. Score) ↑ by some SOTA GAN-based T2I models and our DR-GAN on the CUB-Bird and MS-COCO test sets. † indicates the scores are computed from images generated by the open-sourced models. ∗ indicates the scores are reported in DMGAN [29]. ∗∗ indicates the scores are reported in AttnGAN+O.P.∗ [40]. Other results were reported in the original article. The bold is the best result.
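Table I reports FID, which, as noted in Section IV-A, is computed from Inception-V3 features of real and generated images. For reference, a minimal sketch of the standard Fréchet distance computation, assuming the two feature matrices have already been extracted (this is the generic FID formula, not code from the DR-GAN repository):

import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """real_feats, fake_feats: (N, 2048) Inception-V3 pooling features."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Frechet distance between Gaussians fitted to the two feature sets
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))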
Fig. 5. Images of 256 × 256 resolution are generated by our DR-GAN, DM-GAN [29], and AttnGAN [46] conditioned on text descriptions from the
MS-COCO (the upper part) and CUB-Bird (the bottom part) test sets.
B. Comparison With State-of-the-Arts

1) Image Diversity: We use IS to evaluate the image diversity. As shown in Table I, DR-GAN achieves the second-highest IS score on the CUB-Bird test set and achieves the highest IS score on the MS-COCO test set. The IS score (5.23) of RiFeGAN [20] is higher than that of ours (4.90) on the CUB-Bird dataset; but RiFeGAN [20] uses 10 sentences to train the generator on the CUB-Bird dataset, while our DR-GAN only uses one sentence following the standard T2I problem formulation. On the larger-scale and challenging MS-COCO dataset, RiFeGAN [20] uses five sentences to train the
Fig. 6. Images of 256 × 256 resolution are generated by StackGAN v2 [11], StackGAN v2+SDM+DNM (StackGAN v2∗ ), AttnGAN− [46],
AttnGAN− +SDM+DNM (AttnGAN−∗ ), DMGAN [29], and DMGAN+SDM+DNM (DMGAN∗ ) conditioned on text descriptions from CUB-Bird test set.
Authorized licensed use limited to: Mohamed bin Zayed University of Artificial Intelligence. Downloaded on February 25,2025 at 03:53:54 UTC from IEEE Xplore. Restrictions apply.
10318 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 12, DECEMBER 2023
Fig. 7. Images of 256 × 256 resolution are generated by our Baseline (Base.), Base.+SDM, Base.+DNM, and DR-GAN conditioned on text descriptions
from the CUB-Bird test set.
features from the spatial perspective of feature maps. Based on the constraints of the real image feature distribution statistics, SDM tries to filter out the non-key structural information that is not conducive to distribution learning. For the COCO data, the excessive sparsity and abstraction of the text semantics make the generator unable to generate vivid images. But with SDM, the overall layout and structure of the generated images are significantly improved compared to the Baseline.

b) Distribution normalization module (DNM): Compared with the Baseline, the introduction of DNM can better improve the visual representation and the semantic expression of details to some extent. This is because DNM normalizes the latent distribution of real and generated images, which can drive the generator to better approximate the real image distribution. Therefore, the generated images are better in terms of visual representation and detail semantics. Compared with SDM, DNM lacks direct intervention on the image features; therefore, compared with Base.+SDM, the structure of the images generated under the guidance of Base.+DNM is slightly worse.

c) Semantic disentangling module + distribution normalization module (SDM+DNM): When we introduce both SDM and DNM into the Baseline (Base.), we obtain DR-GAN. Based on the respective advantages of SDM and DNM, compared with the Baseline, our proposed DR-GAN performs better in visual semantics, structure, layout, and so on.

Besides, we present the top-3 word-guided attention maps and the three-stage generated images (Î_0: 64 × 64, Î_1: 128 × 128, Î_2: 256 × 256) from AttnGAN, DMGAN, and Base.+SDM (AttnGAN+SDM) on the CUB-Bird test dataset.
Fig. 8. Top-3 word guided attention maps and Synthesized Images at different stages from AttnGAN, DMGAN and Base.+SDM (AttnGAN+SDM) on the
CUB-Bird test dataset.
To facilitate the display, we rescale the generated images at different stages to the same size in Fig. 8. For AttnGAN and DMGAN, we can observe that when the quality of the generated image Î_0 in the initial stage is very poor, the poor Î_0 causes confusion in the word-attention areas. Due to the lack of direct intervention on the features, the confused attention and image features continue to confuse Î_1 and the attention maps in SDM2. In contrast, the introduction of the proposed SDM can gradually capture the key information and filter out the non-key information for image generation in the subsequent stages. To this end, the attention information and the resulting images gradually become more reasonable.

Finally, we use T-SNE [42] to visualize the 2-D distribution of generated images and real images. As shown in Fig. 9, in the initial stage (200 epochs on the CUB-Bird dataset), the difference in distribution between the images generated by the Baseline and the real images is very large. When we introduce SDM into the Baseline, i.e., Baseline+SDM, we observe a narrowing of the difference in distribution between the generated images and the real images. When we introduce DNM into Baseline+SDM, i.e., DR-GAN, we observe a further narrowing of the difference in the 2-D distribution between the generated images and the real images, and DNM makes the scatter plot area more compact. Through the results of the T-SNE visualization, we can see that DNM and SDM are effective for distribution learning.

4) Validity of Different Parts of SDM: The IS ↑, FID ↓, and MS ↑ of the Baseline (Base.) are 4.36 ± 0.02, 23.98, and 4.30, respectively. As shown in Table V, we set λ3 = 0 to mean that RIRM does not participate in SDM. In this case, the SD loss does not get high-quality real image features, the constraint on the whole distribution statistics deviates from the feature distribution of the real images, and there is a serious deterioration in image quality. Thus, compared with Base. and Base.+SDM, the performance of Base.+SDM (λ3 = 0) is
TABLE VI
IS ↑, FID ↓, and MS ↑ produced by Base.+DNM and Base.+DNM∗ on the CUB-Bird test set.
Fig. 9. Visualization results of the synthesized images and real images under T-SNE [42]. The green scatter points are the real images, and the red scatter points are the generated images.
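A minimal sketch of how such a T-SNE visualization can be produced, assuming real_feats and fake_feats are (N, D) feature arrays already extracted from real and generated images (the perplexity and the plotting details are illustrative choices, not the paper's exact setup):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(real_feats, fake_feats, out_path="tsne.png"):
    feats = np.concatenate([real_feats, fake_feats], axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(feats)
    n = len(real_feats)
    # green: real images, red: generated images (as in Fig. 9)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=4, c="green", label="real")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=4, c="red", label="generated")
    plt.legend()
    plt.savefig(out_path, dpi=200)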
Fig. 11. FID ↓ scores of our DR-GAN under different values of the hyper-parameters λ1 , λ2 , λ3 , λ4 , λ5 , and α on the CUB-Bird test set.
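The analysis below concerns how λ1–λ5 and α weight the individual loss terms. As a reference for how λ4, λ5, and α enter the total objectives (12) and (13), here is a minimal sketch in which the per-stage loss terms are assumed to be already-computed scalars; the variable names are placeholders rather than the repository's API.

def total_losses(L_G, L_G_DAL, L_SDL, L_D, L_D_DAL, L_DAMSM,
                 lambda4, lambda5, alpha, m=3):
    """L_G, L_G_DAL, L_SDL, L_D, L_D_DAL: lists of per-stage loss values (length m)."""
    # Eq. (12): generation-stage objective
    loss_G = sum(L_G[i] + lambda4 * L_G_DAL[i] + L_SDL[i] for i in range(m))
    loss_G = loss_G + alpha * L_DAMSM
    # Eq. (13): discrimination-stage objective
    loss_D = sum(L_D[i] + lambda5 * L_D_DAL[i] for i in range(m))
    return loss_G, loss_D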
The FID scores of our DR-GAN under different values of the hyper-parameters λ1, λ2, λ3, λ4, λ5, and α are shown in Fig. 11.

1) Hyper-Parameters λ1, λ2: In the Semantic Disentangling Loss (7), the loss (5) and the loss (6) are proposed to drive the SDM to better distill the key information from the image feature H and the word-level context feature Q for image generation. The λ1 and λ2 are important balance parameters in the SDL (7). As shown in Fig. 11: 1) the SDL has a great influence on the overall performance of DR-GAN; when λ1 = 0 or λ2 = 0, the FID score of DR-GAN increases, which means that the quality of the generated images deteriorates; and 2) as the values of λ1 or λ2 increase, the FID score also increases. A larger weight means that model training pays more attention to the regression of the distribution statistics. The statistics reflect the overall information of the distribution, and the statistics constraint is designed to assist the generator to better approximate the real image distribution; however, learning an accurate distribution still requires the GAN itself to approximate the real distribution based on implicit sampling. Therefore, the weight of the SDL should be selected appropriately. Based on the results shown in Fig. 11, the value range of λ1 is {0.001, 0.01, 0.1}, and the value range of λ2 is {0.001, 0.1}.

2) Hyper-Parameter λ3: In the SDL (7), the parameter λ3 is designed to adjust the weight of the reconstruction loss ||RIRM(I_i^*) − I_i^*||_1. The reconstruction loss ||RIRM(I_i^*) − I_i^*||_1 provides the real image features H^* for the other terms in the SDL. When λ3 = 0, the reconstruction mechanism (i.e., RIRM) does not work, so the performance of DR-GAN also drops significantly. With the removal of this loss, it is difficult for the SDM to match the valid mean and variance of the real image features. As a result, the real image features would be mixed with more semantic information irrelevant to image generation, or some important image semantic information would be suppressed. The bad real image features mislead the generator to learn a bad image distribution and make the quality of the generated images decline. Besides, we found that the performance decreased as the value of λ3 increased. This is because the decoder in RIRM and the generation module G_o are Siamese networks; the purpose of this design is that the real image features H^* and the generated image features H can be mapped to the same semantic space. Increasing the weight makes the generation module G_o pay more attention to reconstructing real images from real image features and weakens the generation of synthesized images. That is, the Siamese networks are prone to an imbalance between the two feature sources in the training process. In all, based on the results shown in Fig. 11, the value range of λ3 is about [0.00001, 1].

3) Hyper-Parameters λ4, λ5: The hyper-parameter λ4 balances the weight of (9) in the generation-stage loss of DR-GAN, and the hyper-parameter λ5 balances the weight of (8) in the discrimination-stage loss of DR-GAN; the loss (9) and the loss (8) form the DAL. When λ4 = 0 or λ5 = 0, the FID score increases and the generated image quality decreases. This means that the two loss terms of the DAL can effectively improve the distribution quality of the generated images. When the value of λ4 or λ5 increases, the quality of the image distribution tends to decline: when the weight is too large, the image latent distribution will be over-normalized, and the discriminative model becomes very powerful. As in GAN theory [8], a too-strong discriminator is not conducive to generator optimization. So, based on the results shown in Fig. 11, the value range of λ4 and λ5 is [0.1, 2].

4) Hyper-Parameter α: In the training stage of DR-GAN, we also utilize the DAMSM loss [46] to make generated images better conditioned on text descriptions. In Fig. 11,
REFERENCES
[25] Y. Li, K. K. Singh, U. Ojha, and Y. J. Lee, "MixNMatch: Multifactor disentanglement and encoding for conditional image generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8039–8048.
[26] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. ECCV, 2014, pp. 740–755.
[27] L. Zhang, Q. Chen, B. Hu, and S. Jiang, "Text-guided neural image inpainting," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1302–1310.
[28] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2794–2802.
[29] M. Zhu, P. Pan, W. Chen, and Y. Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5802–5810.
[30] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," in Proc. ICLR, 2018, pp. 1–26.
[31] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Proc. NeurIPS, 2016, pp. 1–9.
[32] Y.-X. Peng et al., "Cross-media analysis and reasoning: Advances and directions," Frontiers Inf. Technol., vol. 18, no. 5, pp. 44–57, 2017.
[33] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505–1514.
[34] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in Proc. ICML, 2016, pp. 1060–1069.
[35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[36] H. Tan, X. Liu, X. Li, Y. Zhang, and B. Yin, "Semantics-enhanced adversarial nets for text-to-image synthesis," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 10500–10509.
[37] H. Tan, X. Liu, M. Liu, B. Yin, and X. Li, "KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis," IEEE Trans. Image Process., vol. 30, pp. 1275–1290, 2021.
[38] H. Tan, X. Liu, B. Yin, and X. Li, "Cross-modal semantic matching generative adversarial networks for text-to-image synthesis," IEEE Trans. Multimedia, vol. 24, pp. 832–845, 2022.
[39] T. Qiao, J. Zhang, D. Xu, and D. Tao, "Learn, imagine and create: Text-to-image generation from prior knowledge," in Proc. NeurIPS, 2019, pp. 887–897.
[40] T. Hinz, S. Heinrich, and S. Wermter, "Generating multiple objects at spatially distinct locations," in Proc. ICLR, 2019, pp. 1–23.
[41] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv:1607.08022, Jul. 2016.
[42] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[44] H. Wang, C. Deng, F. Ma, and Y. Yang, "Context modulated dynamic networks for actor and action video segmentation with language queries," in Proc. AAAI, 2020, pp. 12152–12159.
[45] Y. Wu and K. He, "Group normalization," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[46] T. Xu et al., "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1316–1324.
[47] Y. Liu et al., "Describe what to change: A text-guided unsupervised image-to-image translation approach," in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 1357–1365.
[48] Y. Li, M. Min, D. Shen, D. Carlson, and L. Carin, "Video generation from text," in Proc. AAAI, 2018, pp. 1–8.
[49] Y. Li et al., "StoryGAN: A sequential conditional GAN for story visualization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6329–6338.
[50] M. Yuan and Y. Peng, "Text-to-image synthesis via symmetrical distillation networks," in Proc. 26th ACM Int. Conf. Multimedia, Oct. 2018, pp. 1407–1415.
[51] H. Zhang et al., "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5907–5915.
[52] Z. Zhang, Y. Xie, and L. Yang, "Photographic text-to-image synthesis with a hierarchically-nested adversarial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6199–6208.

Hongchen Tan received the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2021. He is currently a Lecturer with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing, China. Various parts of his work have been published in top conferences and journals, such as the IEEE International Conference on Computer Vision (ICCV), IEEE TRANSACTIONS ON IMAGE PROCESSING (TIP), IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (TNNLS), IEEE TRANSACTIONS ON MULTIMEDIA (TMM), IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), and Neurocomputing. His research interests include person re-identification, image synthesis, and referring segmentation.

Xiuping Liu received the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 1999. She is currently a Professor with the School of Mathematical Sciences, Dalian University of Technology. Her research interests include shape modeling and analysis, and computer vision.

Baocai Yin (Member, IEEE) received the M.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 1988 and 1993, respectively. He is currently a Professor with the Artificial Intelligence Research Institute, Beijing University of Technology, Beijing, China. He is also a Researcher with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing, and the Beijing Advanced Innovation Center for Future Internet Technology, Beijing. His research interests include multimedia, image processing, computer vision, and pattern recognition.

Xin Li (Senior Member, IEEE) received the B.E. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2003, and the M.S. and Ph.D. degrees in computer science from the State University of New York at Stony Brook, Stony Brook, NY, USA, in 2005 and 2008, respectively. He is currently a Professor with the Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, USA. His research interests include geometric and visual data computing, processing, and understanding, computer vision, and virtual reality.