
Semantic Object Accuracy for Generative Text-to-Image Synthesis

Tobias Hinz, Stefan Heinrich, and Stefan Wermter

Abstract—Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking
images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain.
Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics only judge image quality
but not the conformity between the image and its caption. To address these challenges we introduce a new model that explicitly models
individual objects within an image and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates
images given an image caption. The SOA uses a pre-trained object detector to evaluate if a generated image contains objects that are
mentioned in the image caption, e.g., whether an image generated from “a car driving down the street” contains a car. We perform a
user study comparing several text-to-image models and show that our SOA metric ranks the models the same way as humans,
whereas other metrics such as the Inception Score do not. Our evaluation also shows that models which explicitly model objects
outperform models which only model global image characteristics.

Index Terms—Text-to-image synthesis, generative adversarial network (GAN), evaluation of generative models, generative models

The authors are with the Knowledge Technology Group, University of Hamburg, 22527 Hamburg, Germany. E-mail: {hinz, heinrich, wermter}@informatik.uni-hamburg.de.
Manuscript received 29 Oct. 2019; revised 30 Apr. 2020; accepted 28 Aug. 2020. Date of publication 2 Sept. 2020; date of current version 3 Feb. 2022. (Corresponding author: Tobias Hinz.) Recommended for acceptance by T. Berg. Digital Object Identifier no. 10.1109/TPAMI.2020.3021209.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

1 INTRODUCTION

Generative adversarial networks (GANs) [1] are capable of generating realistic-looking images that adhere to characteristics described in a textual manner, e.g., an image caption. For this, most networks are conditioned on an embedding of the textual description. Often, the textual description is used on multiple levels of resolution, e.g., first to obtain a coarse layout of the image at lower levels and then to improve the details of the image on higher resolutions. This approach has led to good results on simple, well-structured data sets containing a specific class of objects (e.g., faces, birds, or flowers) at the image center.

Once images and textual descriptions become more complex, e.g., by containing more than one object and having a large variety in backgrounds and scenery settings, the image quality drops drastically. This is likely because, until recently, almost all approaches only condition on an embedding of the complete textual description, without paying attention to individual objects. Recent approaches have started to tackle this by either relying on specific scene layouts [2] or by explicitly focusing on individual objects [3], [4]. In this work, we extend this approach by additionally focusing specifically on salient objects within the generated image. However, generating complex scenes containing multiple objects from a variety of classes is still a challenging problem.

The most commonly used evaluation metrics for GANs, the Inception Score (IS) [5] and the Fréchet Inception Distance (FID) [6], are not designed to evaluate images that contain multiple objects and depict complex scenes. In fact, both of these metrics depend on an image classifier (the Inception-Net), which is pre-trained on ImageNet, a data set whose images almost always contain only a single object at the image center. They also do not evaluate the consistency between image description and generated image and, therefore, can not evaluate whether a model generates images that actually depict what is described in the caption. Even evaluation metrics specifically designed for text-to-image synthesis evaluation, such as the R-precision [7], often fail to evaluate more detailed aspects of an image, such as the quality of individual objects.

As such, our contributions are twofold: first, we introduce a novel GAN architecture called OP-GAN that focuses specifically on individual objects while simultaneously generating a background that fits with the overall image description. Our approach relies on an object pathway similar to [3], which iteratively attends to all objects that need to be generated given the current image description. In parallel, a global pathway generates the background features which later on get merged with the object features. Second, we introduce an evaluation metric specifically for text-to-image synthesis tasks which we call Semantic Object Accuracy (SOA). In contrast to most current evaluation metrics, our metric focuses on individual objects and parts of an image and also takes the caption into consideration when evaluating an image.

Image descriptions often explicitly or implicitly mention what kind of objects are seen in an image, e.g., an image described by the caption “a person holding a cell phone” should depict both a person and a cell phone. To evaluate this, we sample all image captions from the COCO validation set that explicitly mention one of the 80 main object categories (e.g., “person”, “dog”, “car”, etc.) and use them to generate images. We then use a pre-trained object detector [8] and check whether it detects the explicitly mentioned objects
within the generated images. We perform a user study over several current text-to-image models and show that SOA is highly compatible with human evaluation whereas other metrics, such as the Inception Score, are not.

We evaluate several variations of our proposed model as well as several state-of-the-art approaches that provide pre-trained models. Our results show that current architectures are not able to generate images that contain objects of the same quality as the original images. While some models already achieve results close to or better than real images on scores such as the IS and R-precision, none of the models comes close to generating images that achieve SOA scores close to the real images. However, our results and user study also show that models that attend to individual objects in one way or another tend to perform better than models which only focus on global image semantics.

2 RELATED WORK

Modern architectures are able to synthesize realistic, high-resolution images of many domains. In order to generate images of high resolution many GAN [1] architectures use multiple discriminators at various resolutions [9]. Additionally, most GAN architectures use some form of attention for improved image synthesis [7] as well as matching aware discriminators [10] which identify whether images correspond to a given textual description.

Originally, most GAN approaches for text-to-image synthesis encoded the textual description into a single vector which was used as a condition in a conditional GAN (cGAN) [9], [10]. However, this faces limitations when the image content becomes more complex as, e.g., in the COCO data set [11]. As a result, many approaches now use attention mechanisms to attend to specific words of the sentence [7], use intermediate representations such as scene layouts [2], condition on additional information such as object bounding boxes [3], or perform interactive image refinement [12]. Other approaches generate images directly from semantic layouts without additional textual input [13], [14] or perform a translation from text to images and back [15], [16].

Direct Text-to-Image Synthesis. Approaches that do not use intermediate representations such as scene layouts use only the image caption as conditional input. [10] use a GAN to generate images from captions directly and without any attention mechanism. Captions are embedded and used as conditioning vector and they introduce the widely adopted matching aware discriminator. The matching aware discriminator is trained to distinguish between real and matching caption-image pairs (“real”), real but mismatching caption-image pairs (“fake”), and matching captions with generated images (“fake”). [17] modify the sampling procedure during training to obtain a curriculum of mismatching caption-image pairs and introduce an auxiliary classifier that specifically predicts the semantic consistency of a given caption-image pair. [9], [18] use multiple generators and discriminators and are one of the first ones to achieve good image quality at resolutions of 256 × 256 on complex data sets. [19] have a similar architecture as [18] with multiple discriminators but only use one generator, while [20] generate realistic high-resolution images from text with a single discriminator and generator.

[7] extend [9] and are the first ones to introduce an attention mechanism to the text-to-image synthesis task with GANs. The attention mechanism attends to specific words in the caption and conditions different image regions on different words to improve the image quality. [21] extend this and also consider semantics from the text description during the generation process. [22] introduce a dynamic memory part that selects “bad” parts of the initial image and tries to refine them based on the most relevant words. [23] refine the attention module by having spatial and channel-wise word-level attention and introduce a word-level discriminator to provide fine-grained feedback based on individual words and image regions. [24] decompose the text-to-image process into three distinct phases by first learning a prior over the text-image space, then sampling from this prior, and lastly using the prior to generate the image.

Text-to-Image Synthesis with Layouts. When using more complex data sets that contain multiple objects per image, generating an image directly becomes difficult. Therefore, many approaches use additional information such as bounding boxes for objects or intermediate representations such as scene graphs or scene layouts which can be generated automatically [25], [26], [27]. [28] and [29] build on [10] by additionally conditioning the generator on bounding boxes or keypoints of relevant objects. [30] decompose textual descriptions into basic visual primitives to generate images in a compositional manner. [2] introduce the concept of generating a scene graph based on a caption. This scene graph is then used to generate an image layout and finally the image. Similar to [2], [31] use the caption to infer a scene layout which is used to generate images. [32] predict convolution kernels conditioned on the semantic layout, making it possible to control the generation process based on semantic information at different locations.

Given a coarse image layout (bounding boxes and object labels), [33] generate images by disentangling each object into a specified part (e.g., object label) and an unspecified part (appearance). [3] generate images conditioned on bounding boxes for the individual foreground objects by introducing an object pathway that generates individual objects. [4] update the grid-based attention mechanism [7] by combining attention with scene layouts. Additionally, an object discriminator is introduced which focuses on individual objects and provides feedback whether the object is at the right location. [34] refine the grid-based attention mechanism between word phrases and specific image regions of various sizes based on an initial set of bounding boxes. [35] introduce a new feature normalization method and fine-grained mask maps to generate visually different images from a given layout. [36] generate images from scene graphs and allow the model to crop objects from other images to paste them into the generated image. [37] generate a visual-relation scene layout based on the caption. For this, they introduce a dedicated module which generates bounding boxes for objects for a given caption in order to condition the network during the image generation process.

Semantic Image Manipulation. Finally, there are methods that allow humans to directly describe the image in an iterative process or that allow for direct semantic manipulation of images. [12] condition the generation process on a dialogue describing the image instead of a single caption.

Fig. 1. Overview of our model architecture called OP-GAN. The top row shows a high-level summary of our architecture, while the bottom two rows
show details of the individual generators and discriminators.

[38] facilitate semantic image manipulation by allowing users to modify image layouts which are then used to generate images. [39] allow users to input object instance masks into an existing image represented by a semantic layout. [40] generate images iteratively from consecutive textual commands, [41] provide interactive image editing based on a current image and instructions on how to update the image, and [42] generate individual images for a sequence of sentences. [43] do interactive image generation but do not use text as direct input but instead update a scene graph from text over the course of the interaction. [44], [45], and [46] modify visual attributes of individual objects in an image while leaving text-irrelevant parts of the image unchanged.

3 APPROACH

A traditional generative adversarial network (GAN) [1] consists of two networks: a generator G which generates new data points from randomly sampled inputs, and a discriminator D which tries to distinguish between generated and real data samples. In conditional GANs (cGANs) [47] both the discriminator and the generator are conditioned on additional information, e.g., a class label or textual information. This has been shown to improve performance and leads to more control over the data generating process. For a conventional cGAN with generator G, discriminator D, condition c (e.g., a class label), data point x, and a randomly sampled noise vector z the training objective V is:

\min_G \max_D V(D, G) = \mathbb{E}_{(x,c) \sim p_{data}}[\log D(x, c)] + \mathbb{E}_{z \sim p_z, c \sim p_{data}}[\log(1 - D(G(z, c), c))].   (1)

We use the AttnGAN [7] as our baseline architecture and add our object-centric modifications to it. The AttnGAN is a conditional GAN for text-to-image synthesis that uses attention and a novel additional loss to improve the quality of the generated images. It consists of a generator and three discriminators as shown in the top row of Fig. 1. Attention is used such that different words of the caption have more or less influence on different regions of the image. This means that, for example, the word “sky” has more influence on the generation of the top half of the image than the word “grass” even if both words are present in the image caption. [7] also introduce the Deep Attentional Multimodal Similarity Model (DAMSM) which computes the similarity between images and captions. This DAMSM is used during training to provide additional, fine-grained feedback to the generator about how well the generated image matches its caption. We adapt the AttnGAN architecture with multiple object pathways which are learned end-to-end in both the discriminator and the generator, see B and C in Fig. 1.

These object pathways are conditioned on individual object labels (e.g., “person”, “car”, etc.) and the same object pathway is applied multiple times at a given image resolution at different locations and for different objects.

only use one object pathway in the generator at a small reso- conditional, and a caption-image matching part. The uncon-
lution and only one discriminator was equipped with an ditional loss is
object pathway. In our approach, the generator contains
three object pathways at various resolutions (16  16, 64  Luncon
G ¼ Eð^xÞpG ½log Dð^
xÞÞ; (3)
64, and 128  128) to further refine object features at higher
resolutions and each of our three discriminators is equipped the conditional loss is
with its own object pathway, see D in Fig. 1.
For a given image caption ’ we have several objects G ¼ Eð^
Lcon xÞpG ;ðcÞpdata ½log Dð^
x; cÞÞ; (4)
which are associated with this caption and which we repre-
sent with one-hot vectors s i ; i ¼ 1:::n (e.g., s 0 ¼ person, s 1 ¼ and the caption-image matching loss is LDAMSM
G [7] which
car, etc.). Each object pathway at a given resolution is measures text-image similarity at the word level and is cal-
applied iteratively for each of the objects s i . The location is culated with the pre-trained models provided by [7]. The
determined by a bounding box describing the object’s loca- complete loss for the generator then is:
tion and size. Each object pathway starts with an “empty”
zero-tensor r and the features that are generated (generator) LG ¼ Luncon
G þ Lcon
G þ LG
DAMSM
; (5)
or extracted (discriminator) are added onto r at the location
of the specific object’s bounding box. After the object path- where we set  ¼ 50 as in the original implementation.
way has processed each object, r contains features at each As in our baseline architecture, we employ three discrimi-
object location and is zero everywhere else. nators at three spatial resolutions: 64  64, 128  128, and
For the generator, we first concatenate the image 256  256. Each discriminator possesses a global and an
caption’s embedding ’, the one-hot label s i , and a randomly object pathway which extract features in parallel (D in
sampled noise vector z. We use this concatenated vector to Fig. 1). In the object pathway we use an STN to extract the
obtain the final conditioning label ii for the current object s i : features of object s i and concatenate them with the one-hot
vector s i describing the object. The object pathway then
ii ¼ Fð’; z; s i Þ; (2) applies multiple convolutional layers before adding the
extracted features onto r at the location of the bounding box.
where F is a fully connected layer followed by a non-linear- The global pathway in each of the discriminators works
ity (A in Fig. 1). on the full input image and applies convolutional layers
The generator’s first object pathway (B.2 in Fig. 1) takes with stride two to decrease the spatial resolution (D.1).
this conditioning label ii and uses it to generate features for Once the spatial resolution reaches that of the tensor r we
the given object at a spatial resolution of 16  16. The fea- concatenate the two tensors (full image features and object
tures are then transformed onto r into the location of the features r) along the channel axis and use convolutional
respective bounding box with a spatial transformer network layers with stride two to further reduce the spatial dimen-
(STN) [48]. This procedure is repeated for each object s i sion until we reach a resolution of 4  4.
associated with the given caption ’. We calculate both a conditional (image and image cap-
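
To make the role of the STN concrete, the placement step can be sketched as follows. This is a minimal PyTorch sketch under our own assumptions (a normalized (x0, y0, width, height) bounding-box format and explicit tensor shapes); it is not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def paste_object_features(obj_feat, bbox, canvas_size):
        # obj_feat: (N, C, h, w) features produced by the object pathway for one object.
        # bbox:     (N, 4) normalized (x0, y0, width, height) in [0, 1].
        # Returns an (N, C, canvas_size, canvas_size) tensor that is zero outside the box.
        n, c, _, _ = obj_feat.shape
        x0, y0, w, h = bbox.unbind(dim=1)
        cx = 2.0 * (x0 + w / 2.0) - 1.0  # box center in the [-1, 1] grid_sample coordinates
        cy = 2.0 * (y0 + h / 2.0) - 1.0
        # Inverse affine transform: canvas positions inside the box sample the full
        # object feature map; positions outside the box sample zeros.
        theta = torch.zeros(n, 2, 3, device=obj_feat.device)
        theta[:, 0, 0] = 1.0 / w
        theta[:, 0, 2] = -cx / w
        theta[:, 1, 1] = 1.0 / h
        theta[:, 1, 2] = -cy / h
        grid = F.affine_grid(theta, (n, c, canvas_size, canvas_size), align_corners=False)
        return F.grid_sample(obj_feat, grid, padding_mode="zeros", align_corners=False)

    # The canvas r then accumulates the placed features of all objects of a caption,
    # e.g. r = sum of paste_object_features(object_pathway(l_i), bbox_i, 16) over all i.
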
The global pathway in the first generator also gets the locations and labels ℓ_i for the individual objects. It spatially replicates these labels at the locations of the respective bounding boxes and then applies convolutional layers to the resulting layout to obtain a layout encoding (B.1 in Fig. 1). This layout encoding, the image caption φ, and the noise vector z are used to generate coarse features for the image at a low resolution.

At higher levels in the generator, the object pathways are conditioned on the object features of the current object and the one-hot label s_i for that object (C.2 in Fig. 1). For this, we again use an STN to extract the features at the bounding box location of the object s_i and resize the features to a spatial resolution of 16 × 16 (second object pathway) or 32 × 32 (third object pathway). We obtain a conditioning label in the same manner as for the first object pathway (Equation (2)), replicate it spatially to the same dimension as the extracted object features, and concatenate it with the object features along the channel axis. Following this, we apply multiple convolutional layers and upsampling to update the features of the given object. Finally, as in the first object pathway, we use an STN to transform the features into the bounding box location and add them onto r. The global pathway in the higher layers (C.1 in Fig. 1) stays unchanged from the baseline architecture [7].

Our final loss function for the generator is the same as in the original AttnGAN and consists of an unconditional, a conditional, and a caption-image matching part. The unconditional loss is

L^{uncon}_G = -\mathbb{E}_{\hat{x} \sim p_G}[\log D(\hat{x})],   (3)

the conditional loss is

L^{con}_G = -\mathbb{E}_{\hat{x} \sim p_G, c \sim p_{data}}[\log D(\hat{x}, c)],   (4)

and the caption-image matching loss is L^{DAMSM}_G [7], which measures text-image similarity at the word level and is calculated with the pre-trained models provided by [7]. The complete loss for the generator then is:

L_G = L^{uncon}_G + L^{con}_G + \lambda L^{DAMSM}_G,   (5)

where we set λ = 50 as in the original implementation.

As in our baseline architecture, we employ three discriminators at three spatial resolutions: 64 × 64, 128 × 128, and 256 × 256. Each discriminator possesses a global and an object pathway which extract features in parallel (D in Fig. 1). In the object pathway we use an STN to extract the features of object s_i and concatenate them with the one-hot vector s_i describing the object. The object pathway then applies multiple convolutional layers before adding the extracted features onto r at the location of the bounding box.

The global pathway in each of the discriminators works on the full input image and applies convolutional layers with stride two to decrease the spatial resolution (D.1). Once the spatial resolution reaches that of the tensor r we concatenate the two tensors (full image features and object features r) along the channel axis and use convolutional layers with stride two to further reduce the spatial dimension until we reach a resolution of 4 × 4.

We calculate both a conditional (image and image caption) and an unconditional (only image) loss for each of the discriminators. The conditional input c during training consists of the image caption embedding φ and the information about objects s_i (bounding boxes and object labels) associated with the image x, i.e. c = {φ, s_i}. In the unconditional case the discriminators are trained to classify images as real or generated without any influence of the image caption by minimizing the following loss:

L^{uncon}_{D_i} = -\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \mathbb{E}_{\hat{x} \sim p_G}[\log(1 - D(\hat{x}))].   (6)

In order to optimize the conditional loss we concatenate the extracted features with the image caption embedding φ along the channel axis and minimize

L^{con}_{D_i} = -\mathbb{E}_{(x,c) \sim p_{data}}[\log D(x, c)] - \mathbb{E}_{\hat{x} \sim p_G, c \sim p_{data}}[\log(1 - D(\hat{x}, c))]   (7)

for each discriminator. Finally, to specifically train the discriminators to check for caption-image consistency we use the matching aware discriminator loss [10] with mismatching caption-image pairs and minimize

L^{cls}_{D_i} = -\mathbb{E}_{(x,s) \sim p_{data}, \varphi \sim p_{data}}[\log(1 - D(x, c))],   (8)

where image x and caption φ are sampled individually and randomly from the data distribution and are, therefore, unlikely to align.

We introduce an additional loss term similar to the matching aware discriminator loss L^{cls}_{D_i} which works on individual objects. Instead of using mismatching image-caption pairs, we use correct image-caption pairs, but with incorrect bounding boxes, and minimize:

L^{obj}_{D_i} = -\mathbb{E}_{(x,\varphi) \sim p_{data}, s \sim p_{data}}[\log(1 - D(x, c))].   (9)

Thus, the complete objective we minimize for each individual discriminator is:

L_{D_i} = L^{uncon}_{D_i} + L^{con}_{D_i} + L^{cls}_{D_i} + L^{obj}_{D_i}.   (10)

We leave all other training parameters as in the original implementation [7] and the training procedure itself also stays the same.
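
Assuming a discriminator that outputs probabilities, the generator and discriminator objectives of Equations (3)-(10) can be sketched as follows. This is only an illustration of the loss terms (in practice one would use a numerically stable binary cross-entropy on logits), not the training code released by the authors.

    import torch

    def generator_loss(d_fake, d_fake_cond, damsm_loss, lam=50.0):
        # Eqs. (3)-(5): unconditional + conditional adversarial terms plus the
        # weighted DAMSM word-level matching loss.
        eps = 1e-8
        l_uncon = -torch.log(d_fake + eps).mean()
        l_con = -torch.log(d_fake_cond + eps).mean()
        return l_uncon + l_con + lam * damsm_loss

    def discriminator_loss(d_real, d_fake, d_real_cond, d_fake_cond,
                           d_mismatched_caption, d_wrong_bbox):
        # Eqs. (6)-(10): real/fake terms (unconditional and conditional), the
        # matching-aware term with mismatching captions, and the object term
        # with correct captions but incorrect bounding boxes.
        eps = 1e-8
        l_uncon = -torch.log(d_real + eps).mean() - torch.log(1 - d_fake + eps).mean()
        l_con = -torch.log(d_real_cond + eps).mean() - torch.log(1 - d_fake_cond + eps).mean()
        l_cls = -torch.log(1 - d_mismatched_caption + eps).mean()
        l_obj = -torch.log(1 - d_wrong_bbox + eps).mean()
        return l_uncon + l_con + l_cls + l_obj
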
4 EVALUATION OF TEXT-TO-IMAGE MODELS

Quantitatively evaluating generative models is difficult [49]. While there are several evaluation metrics that are commonly used to evaluate GANs, many of them have known weaknesses and are not designed specifically for text-to-image synthesis tasks. In the following, we first discuss some of the common evaluation metrics for GANs, their weaknesses, and why they might be inadequate for evaluating text-to-image synthesis models. Following this, we introduce our novel evaluation metric, Semantic Object Accuracy (SOA), and describe how it can be used to evaluate text-to-image models in more detail.

4.1 Current Evaluation Metrics

Inception Score and Fréchet Inception Distance. Most GAN approaches are trained on relatively simple images which only contain one object at the center (e.g., ImageNet, CelebA, etc.). These methods are evaluated with metrics such as the Inception Score (IS) [5] and Fréchet Inception Distance (FID) [6], which use an Inception-Net usually pre-trained on ImageNet. The IS evaluates roughly how distinctive an object in each image is (i.e. ideally the classification layer of the Inception-Net has small entropy) and how many different objects the GAN generates overall (i.e. high entropy in the output of different images). The FID measures how similar generated images are to a control set of images, usually the validation set, by calculating the distance in feature space between generated and real images. Consequently, the IS should be as high as possible, while the FID should be as small as possible.
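
For reference, the two scores are commonly computed as (standard definitions, not specific to this paper):

    \mathrm{IS} = \exp\!\Big(\mathbb{E}_{x \sim p_G}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\Big), \qquad
    \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),

where p(y | x) is the Inception-Net class posterior for image x, p(y) is its marginal over the generated images, and (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of Inception features of real and generated images, respectively.
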
Both evaluation metrics have known weaknesses [50], [51]. For example, the IS does not measure the similarity between objects of the same class, so a network that only generates one “perfect” sample for each class can achieve a very good IS despite showing an intra-class mode dropping behavior. Li et al. [4] also note that the IS overfits within the context of text-to-image synthesis and can be “gamed” by increasing the batch size at the end of the training. Furthermore, the IS uses the output of the classification layer of an Inception-Net pre-trained on the ImageNet data set. This might not be the best approach for a more complex data set in which each image contains multiple objects at distinct locations throughout the image, as opposed to the ImageNet data set which consists of images usually depicting one object in the image center. Fig. 2 shows some exemplary failure cases of the IS on images sampled from the COCO data set.

Fig. 2. Examples when IS fails for COCO images. The top row shows images for which the Inception-Net has very high entropy in its output layer, possibly because the images contain more than one object and are often not centered. The second row shows images containing different objects and scenes which were nonetheless all assigned to the same class by the Inception-Net, thereby negatively affecting the overall predicted diversity in the images.

The FID relies on representative ground truth data to compare the generated data against and also assumes that features are of Gaussian distribution, which is often not the case. For more complex data sets the FID also still suffers from the problem that the image statistics are obtained with a network pre-trained on ImageNet which might not be a representative data set. Finally, neither the IS nor the FID take the image caption into account during their evaluation.

VS Similarity and R-Precision. [19] introduce the visual-semantic similarity (VS similarity) metric which measures the distance between a generated image and its caption. Two models are trained to embed images and captions respectively and then minimize the cosine distance between embeddings of matching image-caption pairs while maximizing the cosine distance between mismatching image-caption pairs. A good model then achieves high VS similarity between a generated image and its associated caption.

[7] use the R-precision metric to evaluate how well an image matches a given description or caption. The R-precision score is similar to VS similarity, but instead of scoring the VS similarity between a given image and caption it instead performs a ranking of the similarity between the real caption and randomly sampled captions for a given generated image. For this, first, an image is generated conditioned on a given caption. Then, another 99 captions are chosen randomly from the data set. Both the generated image and the 100 captions are then encoded with the respective image and text encoder. Similar to VS similarity the cosine distance between the image embedding and each caption embedding is used as a proxy for the similarity between the given image and caption. The 100 captions are then ordered in descending similarity and the top k (usually k=1) most similar captions are used to calculate the R-precision. Intuitively, R-precision calculates if the real caption is more similar to the generated image (in feature space) than 99 randomly sampled captions.
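
A minimal sketch of this ranking is given below; this is our own illustration, with the image and text encoders assumed to be the pre-trained ones from [7] (not reproduced here) and the function name being hypothetical.

    import torch
    import torch.nn.functional as F

    def r_precision_hit(image_emb, caption_embs, true_idx=0, k=1):
        # image_emb:    (d,) embedding of one generated image.
        # caption_embs: (100, d) embeddings; row true_idx is the caption the image
        #               was generated from, the other 99 rows are random captions.
        sims = F.cosine_similarity(image_emb.unsqueeze(0), caption_embs, dim=1)
        topk = sims.topk(k).indices
        return float((topk == true_idx).any())

    # R-precision is the average of this 0/1 indicator over all generated images.
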
The drawback of both metrics is that they do not evaluate the quality of individual objects. For example, the real caption could state that “a person stands on a snowy hill” while the 99 random captions do not mention “snow” (which usually covers most of the background in the generated image) or “person” (but e.g., giraffe, car, bedroom, etc.). In this case, an image with only white background (snow) would already make the real caption rank very highly in the R-precision metric and achieve a high VS similarity. See Fig. 3 for a visualization of this. As such, this metric does not focus on the quality of individual objects but rather concentrates on global background and salient features.

Fig. 3. Examples when R-precision fails for COCO images. The top row shows images from the COCO data set. The middle row shows the correct caption and the bottom row gives examples for characteristics of captions that are rated as being more similar than the original caption.

Classification Accuracy Score. [52] introduce the Classification Accuracy Score (CAS) to evaluate conditional image generation models, similar to [53]. For this, a classifier is trained on images generated by the conditional generative model. The classifier's performance is then evaluated on the original test set of the data set that was used to train the generative model. If the classifier achieves high accuracy on the test set, this indicates that the data it was trained on is representative of the real distribution. The authors find that neither the IS, the FID, nor combinations thereof are predictive of the CAS, further indicating that the IS and FID are only of limited use for evaluating image quality.

Caption Generation. [31] suggest evaluating text-to-image models by comparing original captions with captions obtained from generated images. The intuition is that if the generated image is relevant to its caption, then it should be possible to infer the original text from it. To this end, [31] use a pre-trained caption generator [54] to generate captions for each synthesized image and compare these to the original ones through standard language similarity metrics, i.e. BLEU, METEOR, and CIDEr. Except for CIDEr, these metrics were originally developed to evaluate machine translation and text summarization methods and were only later adopted for the evaluation of image captions.

One challenge with this caption generation approach is that often many different captions are valid for a given image. Even if two captions are not similar, this does not necessarily imply that they do not describe the same image [55]. Furthermore, it has been shown that metrics such as BLEU, METEOR, and CIDEr are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for two sentences to convey the same meaning [55], [56], [57], and do also not necessarily correlate with human judgments of captions [54], [58]. Finally, there is no requirement that captions, either real or generated, need to focus on specific objects. Instead, captions can also describe the general layout of a given scene (e.g., a busy street with lots of traffic) without explicitly mentioning specific objects. Some of these limitations might potentially be overcome in the future by novel image caption evaluation metrics that focus more on objects and semantic content in the scene [55], [57], [59].

Other Approaches. In contrast to the IS, which measures the diversity of a whole set of images, the diversity score [33] measures the perceptual difference between a pair of images in feature space. This metric can be useful when images are generated from conditional inputs (e.g., labels or scene layouts) to examine whether a model can generate diverse outputs for a given condition. However, the metric does not convey anything directly about the quality of the generated images or their congruence with any conditional information. [14], [60], [61] run a semantic segmentation network on generated images and compare the predicted segmentation mask to the ground truth segmentation mask used as input for the model. However, this metric needs a ground truth semantic segmentation mask and does not provide information about specific objects within the image.

4.2 Semantic Object Accuracy (SOA)

So far, most evaluation metrics are designed to evaluate the holistic image quality but do not evaluate individual areas or objects within an image. Furthermore, except for Caption Generation and R-precision, none of the scores take the image caption into account when evaluating generated images. To address the challenges and issues mentioned above we introduce a novel evaluation metric based on a pre-trained object detection network (code for the evaluation metric and all experiments: https://github.com/tohinz/semantic-object-accuracy-for-generative-text-to-image-synthesis). The pre-trained object detector evaluates images by checking if it recognizes objects that the image should contain based on the caption. For example, if the image caption is “a person is eating a pizza” we can infer that the image should contain both a person and a pizza and the object detector should be able to recognize both objects within the image. Since this evaluation measures directly whether objects specifically mentioned in the caption are recognizable in an image we call this metric Semantic Object Accuracy (SOA).

Some previous works have used similar approaches to evaluate the quality of the generated images. [3] evaluate how often expected objects (based on the caption) are detected by an object detector. However, only a subset of the captions is evaluated and the evaluated captions contain false positives (e.g., captions containing the phrase “hot dog” are evaluated based on the assumption that the image should contain a dog). [15] introduce a detection score that calculates (roughly) whether a pre-trained object detector detects an object in a generated image with high certainty. However, no information from the caption is taken into account, meaning any detection with high confidence is “good” even if the detected object does not make sense in the context of the caption. [62] use a pre-trained object detector to calculate the mean average precision and report precision-recall curves. However, the evaluation is done on synthetic data sets and without textual information as conditional input.

[33] use classification accuracy as an evaluation metric in which they report the object classification accuracy in generated images. For this, they use a ResNet-101 model which is trained on real objects cropped and resized from the original data. However, in order to calculate the score, the size and location of each object in the generated image must be known, so this evaluation is not directly applicable to approaches that do not use scene layouts or similar representations. [37] use recall and intersection-over-union (IoU) to evaluate the bounding boxes in their generated scene layout but do not apply these evaluations to generated images directly.

SOA. Since we work with the COCO data set we filter all captions in the validation set for specific keywords that are related to the available labels for objects (e.g., person, car, zebra, etc.). For each of the 80 available labels in the COCO data set we find all captions that imply the existence of the respective object and generate three images for each of the captions. The supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2020.3021209, gives a detailed overview of how exactly the captions were chosen for each label. We then run the YOLOv3 network [8] pre-trained on the COCO data set on each of the generated images and check whether it recognizes the given object. We report the recall as a class average (SOA-C), i.e. in how many images per class the YOLOv3 on average detects the given object, and as an image average (SOA-I), i.e. on average in how many images a desired object was detected. Specifically, the SOA-C is calculated as

\mathrm{SOA\text{-}C} = \frac{1}{|C|} \sum_{c \in C} \frac{1}{|I_c|} \sum_{i_c \in I_c} \mathrm{YOLOv3}(i_c),   (11)

for object classes c ∈ C and images i_c ∈ I_c that are supposed to contain an object of class c. The SOA-I is calculated as

\mathrm{SOA\text{-}I} = \frac{1}{\sum_{c \in C} |I_c|} \sum_{c \in C} \sum_{i_c \in I_c} \mathrm{YOLOv3}(i_c),   (12)

and

\mathrm{YOLOv3}(i_c) = \begin{cases} 1 & \text{if YOLOv3 detected an object of class } c \\ 0 & \text{otherwise.} \end{cases}   (13)

Since many images can also contain objects that are not specifically mentioned in the caption (for example an image described by “lots of cars are on the street” could still contain persons, dogs, etc.) we do not calculate a false negative rate but instead only focus on the recall, i.e. the true positives.
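
Once the per-image detection flags are available, the two averages of Equations (11) and (12) reduce to the following computation (a sketch with our own function name and input format):

    def soa_scores(detections):
        # detections: dict mapping a COCO class label c to a list of 0/1 flags,
        # one flag per generated image whose caption implies an object of class c;
        # the flag is 1 if YOLOv3 detected at least one object of class c (Eq. 13).
        per_class = {c: sum(flags) / len(flags) for c, flags in detections.items() if flags}
        soa_c = sum(per_class.values()) / len(per_class)              # Eq. (11)
        soa_i = (sum(sum(flags) for flags in detections.values())
                 / sum(len(flags) for flags in detections.values()))  # Eq. (12)
        return soa_c, soa_i

    # Hypothetical example: soa_scores({"person": [1, 1, 0], "zebra": [1, 0]})
    # returns (0.583..., 0.6).
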
SOA-Intersection Over Union. Several approaches (e.g., [3], [4], [31], [33], [37]) use additional conditioning information such as scene layouts or bounding boxes. For these approaches, our evaluation metric can also calculate the intersection over union (IoU) between the locations at which different objects should be and the locations at which they are detected, which we call SOA-IoU. To calculate the IoU we use every image in which the YOLOv3 network detected the respective object. Since many images contain multiple instances of a given object we calculate the IoU between each predicted bounding box for the given object and each ground truth bounding box. The final IoU for a given image and object is then the maximum of these values, i.e. the reported IoU is an upper bound on the actual IoU.
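
The per-image upper bound can be computed as sketched below (again our own helper functions, with boxes assumed to be given as corner coordinates):

    def iou(box_a, box_b):
        # Boxes are given as (x0, y0, x1, y1).
        ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def soa_iou_for_image(detected_boxes, gt_boxes):
        # Maximum IoU over all pairs of detected and ground-truth boxes of the
        # class in question, i.e. an upper bound on the actual IoU.
        return max(iou(d, g) for d in detected_boxes for g in gt_boxes)
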

Overall this approach allows a more fine-grained evaluation of the image content since we can now focus on individual objects and their features. To get a better idea of the overall performance of a model we calculate both the class average recall/IoU (SOA-C/SOA-IoU-C) and image average recall/IoU (SOA-I/SOA-IoU-I). Additionally, we report the SOA-C for the forty most and least common labels (SOA-C-Top40 and SOA-C-Bot40) to see how well the model can generate objects of common and less common classes.

5 EXPERIMENTS

We perform multiple experiments and ablation studies. In a first step, we add the object pathway (OP) on multiple layers of the generator and to each discriminator and call this model OPv2. We also train this model with the additional bounding box loss we introduced in Section 3. When the model is trained with the additional bounding box loss we refer to it as BBL.

Different approaches differ in how many objects per image are used during training. If an image layout is used, typically all objects (foreground and background) are used as conditioning information. Other approaches limit the number of objects used during training [2], [3]. To examine the effect of training with different numbers of objects per image we train our approach with either a maximum of three objects per image (standard) or with up to ten objects per image, which we refer to as many objects (MO). When training with a maximum of three objects per image we sample randomly from the training set at train time, i.e. each batch contains images which contain zero to three objects. If an image contains more than three objects we choose the three largest ones in terms of the area of the bounding box. When training with up to ten objects per image we slightly change our sampling strategy so that each batch consists of images that contain the same number of objects. This means that, e.g., each image in a batch contains exactly four objects, while in the next batch each image might contain exactly seven objects. This increases the training efficiency as most of the images contain fewer than five objects.
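
One way to realize such a sampling strategy is to bucket the training images by their (capped) object count and draw each batch from a single bucket; the following sketch is our own illustration, not the released training code.

    import random
    from collections import defaultdict

    def batches_by_object_count(num_objects_per_image, batch_size, max_objects=10):
        # num_objects_per_image: list with the number of annotated objects per image.
        # Yields batches (lists of image indices) in which every image has the same
        # (capped) number of objects.
        buckets = defaultdict(list)
        for idx, n in enumerate(num_objects_per_image):
            buckets[min(n, max_objects)].append(idx)
        for indices in buckets.values():
            random.shuffle(indices)
            for start in range(0, len(indices), batch_size):
                yield indices[start:start + batch_size]
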
As a result of the different settings we perform the following experiments:

1) OPv2: apply the object pathway (OP) on multiple layers of the generator and on all discriminators, training without the bounding box loss and with a maximum of three objects per image.
2) OPv2 + BBL: same as OPv2 but with the bounding box loss added to the discriminator loss term.
3) OPv2 + MO: same as OPv2 but with a maximum of ten objects per image.
4) OPv2 + BBL + MO (OP-GAN): combination of all three approaches.

We train each model three times on the 2014 split of the COCO data set. At test time we use bounding boxes generated by a network [4] as the conditioning information. Therefore, except for the image caption no other ground truth information is used at test time.

TABLE 1
Inception Score (IS), Fréchet Inception Distance (FID), R-Precision, Caption Generation With CIDEr, and Semantic Object Accuracy on Class (SOA-C) and Image Average (SOA-I) on the MS-COCO Data Set

Model                   | IS ↑         | FID ↓        | R-precision (k=1) ↑ | CIDEr ↑       | SOA-C ↑      | SOA-I ↑
Original Images         | 34.88 ± 0.01 | 6.09 ± 0.05  | 68.58 ± 0.08        | 0.795 ± 0.003 | 74.97        | 80.84
AttnGAN [7]†            | 23.61 ± 0.21 | 33.10 ± 0.11 | 83.80               | 0.695 ± 0.005 | 25.88        | 39.01
[34]                    | 23.74 ± 0.36 | –            | 86.44 ± 3.38        | –             | –            | –
ControlGAN [23]         | 24.06 ± 0.60 | –            | 82.43               | –             | –            | –
AttnGAN + OP [3]†       | 24.76 ± 0.43 | 33.35 ± 1.15 | 82.44               | 0.689 ± 0.008 | 25.46        | 40.48
MirrorGAN [16]          | 26.47 ± 0.41 | –            | 74.52               | –             | –            | –
Obj-GAN [4]†            | 24.09 ± 0.28 | 36.52 ± 0.13 | 87.84 ± 0.08        | 0.783 ± 0.002 | 27.14        | 41.24
HfGAN [20]              | 27.53 ± 0.25 | –            | –                   | –             | –            | –
DM-GAN [22]†            | 32.32 ± 0.23 | 27.34 ± 0.11 | 91.87 ± 0.28        | 0.823 ± 0.002 | 33.44        | 48.03
SD-GAN [21]             | 35.69 ± 0.50 | –            | –                   | –             | –            | –
OP-GAN (Best Model)     | 27.88 ± 0.12 | 24.70 ± 0.09 | 89.01 ± 0.26        | 0.819 ± 0.004 | 35.85        | 50.47
OPv2, 0 obj             | 26.80 ± 1.01 | 30.01 ± 1.81 | 83.87 ± 1.22        | 0.760 ± 0.004 | 26.04 ± 1.47 | 37.56 ± 1.27
OPv2, 1 obj             | 27.68 ± 0.47 | 26.18 ± 0.27 | 87.37 ± 0.60        | 0.798 ± 0.013 | –            | –
OPv2, 3 obj             | 27.78 ± 0.50 | 26.45 ± 0.40 | 87.74 ± 1.08        | 0.805 ± 0.011 | –            | –
OPv2, 10 obj            | 27.66 ± 0.34 | 26.52 ± 0.44 | 87.73 ± 0.98        | 0.806 ± 0.006 | 33.82 ± 0.69 | 48.39 ± 1.01
OPv2 + BBL, 0 obj       | 24.60 ± 1.25 | 33.03 ± 0.76 | 81.27 ± 1.45        | 0.735 ± 0.029 | 24.00 ± 2.13 | 34.01 ± 2.89
OPv2 + BBL, 1 obj       | 26.34 ± 0.55 | 26.59 ± 1.04 | 86.42 ± 0.60        | 0.783 ± 0.006 | –            | –
OPv2 + BBL, 3 obj       | 26.52 ± 0.47 | 26.74 ± 1.08 | 87.08 ± 0.60        | 0.793 ± 0.013 | –            | –
OPv2 + BBL, 10 obj      | 26.48 ± 0.58 | 26.83 ± 1.10 | 86.80 ± 0.56        | 0.794 ± 0.015 | 33.19 ± 0.40 | 48.24 ± 0.68
OPv2 + MO, 0 obj        | 24.32 ± 1.65 | 35.36 ± 1.95 | 79.75 ± 1.87        | 0.695 ± 0.015 | 21.15 ± 1.47 | 30.24 ± 2.36
OPv2 + MO, 1 obj        | 27.36 ± 0.49 | 25.06 ± 1.11 | 88.33 ± 0.81        | 0.789 ± 0.008 | –            | –
OPv2 + MO, 3 obj        | 27.65 ± 0.37 | 24.96 ± 1.12 | 89.13 ± 0.42        | 0.807 ± 0.014 | –            | –
OPv2 + MO, 10 obj       | 27.59 ± 0.43 | 24.94 ± 1.09 | 89.14 ± 0.41        | 0.805 ± 0.013 | 33.46 ± 1.01 | 47.93 ± 1.56
OPv2 + BBL + MO, 0 obj  | 21.84 ± 0.83 | 45.79 ± 1.16 | 72.71 ± 1.75        | 0.626 ± 0.025 | 16.55 ± 1.81 | 22.76 ± 2.17
OPv2 + BBL + MO, 1 obj  | 27.61 ± 0.67 | 26.19 ± 0.82 | 87.85 ± 0.25        | 0.791 ± 0.009 | –            | –
OPv2 + BBL + MO, 3 obj  | 28.04 ± 0.65 | 25.91 ± 1.03 | 88.90 ± 0.24        | 0.810 ± 0.009 | –            | –
OPv2 + BBL + MO, 10 obj | 27.90 ± 0.79 | 25.80 ± 1.01 | 89.00 ± 0.17        | 0.814 ± 0.007 | 34.51 ± 1.12 | 48.90 ± 0.72

Results of our models are obtained with generated bounding boxes. Scores for models marked with † were calculated with a pre-trained model provided by the respective authors.

6 EVALUATION AND ANALYSIS

Tables 1 and 2 give an overview of our results for the COCO data set. The first half of the table shows the results on the original images from the data set and from related literature while the second half shows our results. To make a direct comparison we calculated the IS, FID, CIDEr, and R-precision scores ourselves for all models which are provided by the authors. As such, the values from AttnGAN [7], AttnGAN + OP [3], Obj-GAN [4], and DM-GAN [22] are the ones most directly comparable to our reported values since they were calculated in the same way.

Note that there is some inconsistency in how the FID is calculated in prior works. Some approaches, e.g., [4], compare the statistics of the generated images only with the statistics of the respective “original” images (i.e. images corresponding to the captions that were used to generate a given image). We, on the other hand, generate 30,000 images from 30,000 randomly sampled captions and compare their statistics with the statistics of the full validation set. Many of the recent publications also do not report the FID or R-precision. This makes a direct comparison difficult as we show that the IS is likely the least meaningful score of the three since it easily overfits [4] and due to the reasons mentioned in Section 4. We calculate each of the reported values of our models three times for each trained model (nine times in total) and report the average and standard deviation. To calculate the SOA scores we generate three images for each caption in the given class, except for the “person” class, for which we randomly sample 30,000 captions (from over 60,000) and generate one image for each of the 30,000 captions.

6.1 Quantitative Results

Overall Results. As Table 1 shows, all our models outperform the baseline AttnGAN in all metrics. The IS is improved by 16-19 percent, the R-precision by 6-7 percent, the SOA-C by 28-33 percent, the SOA-I by 22-25 percent, the FID by 20-25 percent, and CIDEr by 15-18 percent. This was achieved by adding our object pathways to the baseline model without any further tuning of the architecture, hyperparameters, or the training procedure. Our approach also outperforms all other approaches based on FID, SOA-C, and SOA-I. While there are two approaches that report an IS higher than our models, it has previously been observed that this score is likely the least meaningful for this task and can be gamed to achieve higher numbers [4], [51]. Our user study also shows that the IS is the score that has the least predictive value for human evaluation.

We also calculated each score using the original images of the COCO data set. For the IS we sampled three times 30,000 images from the validation set and resized them to 256 × 256 pixels. These images were also used to calculate the CIDEr score. To calculate the FID we randomly sampled three times 30,000 images from the training set and compared them to the statistics of the validation set. The R-precision was calculated on three times 30,000 randomly sampled images and the corresponding captions from the validation set, and the SOA-C and SOA-I were calculated on the real images corresponding to the originally chosen captions.
TABLE 2
Comparison of the Recall Values for the Different Models

Model            | SOA-C / IoU           | SOA-I / IoU           | SOA-C-Top40 / IoU     | SOA-C-Bot40 / IoU
Original Images  | 74.97 / 0.550         | 80.84 / 0.570         | 78.77 / 0.546         | 71.18 / 0.554
AttnGAN [7]      | 25.88 / –             | 39.01 / –             | 37.47 / –             | 14.29 / –
AttnGAN + OP [3] | 25.46 / 0.236         | 40.48 / 0.311         | 39.77 / 0.308         | 11.15 / 0.164
Obj-GAN [4]      | 27.14 / 0.513         | 41.24 / 0.598         | 39.88 / 0.587         | 14.40 / 0.438
DM-GAN [22]      | 33.44 / –             | 48.03 / –             | 47.73 / –             | 19.15 / –
OPv2             | 33.82 (26.04) / 0.207 | 48.39 (37.56) / 0.270 | 48.34 (36.53) / 0.260 | 19.31 (15.55) / 0.152
OPv2 + BBL       | 33.19 (24.00) / 0.210 | 48.24 (34.01) / 0.270 | 47.96 (32.96) / 0.261 | 18.43 (15.04) / 0.159
OPv2 + MO        | 33.46 (21.15) / 0.214 | 47.93 (30.24) / 0.275 | 47.84 (28.15) / 0.264 | 19.07 (14.15) / 0.163
OPv2 + BBL + MO  | 34.51 (16.55) / 0.217 | 48.90 (22.76) / 0.278 | 49.70 (22.19) / 0.269 | 19.32 (10.91) / 0.165

We used generated bounding boxes to calculate the values. Numbers in brackets show scores when the object pathway was not used at test time.

As we can see, the IS is close to the current state of the art models with a value of 34.88. It is possible to achieve a much higher IS on other, simpler data sets, e.g., IS > 100 on the ImageNet data set [63]. This indicates that the IS is indeed not a good evaluation metric, especially for complex images consisting of multiple objects at various locations. The difference between the R-precision on real and generated images is even larger. On the original images, the R-precision score is only 68.58, which is much worse than what current models can achieve (> 88).

One reason for this might be that the R-precision calculates the cosine similarity between an image embedding and a caption embedding and measures how often the caption that was used to generate an image is more similar than 99 other, randomly sampled captions. However, the same encoders that are used to calculate the R-precision are also used during training to minimize the cosine distance between an image and its matching caption. As a result, the model might already overfit to this metric through the training procedure. Our observation is that the models tend to heavily focus on the background to make it match a specific word in the caption (e.g., images tend to be very white when the caption mentions “snow” or “ski”, very blue when the caption mentions “surf” or “beach”, very green when the caption mentions “grass” or “savanna”, etc.). This matching might lead to a high R-precision score since it leads, on average, to a large cosine similarity. Real images do not always reflect this, since a large part of the image might be occupied by a person or an animal, essentially “blocking out” the background information. We see a similar trend for the CIDEr evaluation where many models achieve a score similar to the score reached by real images. Regardless of what the actual reason is, the question remains whether evaluation metrics like the IS, R-precision, and CIDEr are meaningful and helpful when models that can not (as of now) generate images that would be confused as “real” achieve scores comparable to or better than real images.

The FID and the SOA values are the only two evaluation metrics (that we used) for which none of the current state of the art models can come close to the values obtained with the original images. The FID is still much smaller on the real data (6.09) compared to what current models can achieve (> 24 for the best models). While the FID still uses a network pre-trained on ImageNet, it compares activations of convolutional layers for different images and is, therefore, likely still more meaningful and less dependent on specific object settings than the IS. Similarly, the SOA-C (SOA-I) on real data is 74.97 (80.84), while current models achieve values of around 30-36 (40-50). Since the network used to calculate the SOA values is not part of the training loop the models can not easily overfit to this evaluation metric like they can for the R-precision. Furthermore, the results of the SOA evaluation confirm the impression that none of the models is able to generate images with multiple distinct objects of a quality similar to real images.

Impact of the Object Pathway. To get a clearer understanding of how the evaluation metrics might be impacted by the object pathway we calculate our scores for a different number of generated objects. More specifically, we only apply the object pathway for a maximum given number of objects (0, 1, 3, or 10) per image. Intuitively, we would assume that without the application of the object pathway the IS should decrease and the FID should increase, since the object pathway is not used to generate any object features and the images should, therefore, consist mostly of background. Additionally, we can get an intuition of how important the object pathway is for the overall performance of the network by looking at how it affects the R-precision, SOA-C, and SOA-I.

As Table 1 shows, all models perform markedly worse when the object pathway is not used (0 obj). We find that the models trained with up to ten objects per image seem to rely more heavily on the object pathway than models trained with three objects per image. For models trained with only three objects per image (OPv2 and OPv2 + BBL) the IS decreases by around 1-2, the R-precision decreases by around 4-5, the SOA-C (SOA-I) decreases by around 7-9 (11-14), CIDEr decreases by around 6-8 percent, and the FID increases by around 4-7. On the other hand, models trained with up to 10 objects suffer much more when the object pathway is removed, with the IS decreasing by around 3-6, the R-precision decreasing by around 9-15, the SOA-C (SOA-I) decreasing by around 12-18 (17-28), CIDEr decreasing by around 16-30 percent, and the FID increasing by around 10-20. These results indicate that the object pathways are an important part of the model and are responsible for at least some of the improvements compared to the baseline architecture.

Impact of Bounding Box Loss. Adding the bounding box (e.g., bus or elephant), are usually modeled better than objects
loss to the object pathways has a small negative effect on all that are usually small (spoon or baseball glove). The final and
scores, but does slightly improve the IoU scores (see more subtle characteristic is the surface texture of an object.
Table 2). Note that the weighting of the bounding box loss Objects with highly distinct surface textures (e.g., zebra,
in the overall loss term was not optimized but simply giraffe, pizza, etc.) achieve high SOA scores because the
weighted with the same strength as the matching aware dis- object detection network relies on these textures to detect
criminator loss Lcls D . It is possible that the positive effect of objects. However, while the models are able to correctly
the bounding box loss could be increased by weighting it match the surface texture (e.g., black and white stripes for a
differently. zebra) they are still not capable of generating a realistic-look-
Impact of Training on Many Objects. Training the model with up to ten objects per image has only minor effects on the IS and SOA scores, but improves the FID and R-precision. However, we observe that the models trained with only three objects per image slightly decrease in their performance once the object pathway is applied multiple times. Usually, the models trained on only three objects achieve their best performance when applying the object pathway three times, as at training time. Once the model is trained on up to ten objects, though, we do not observe this behavior anymore and instead achieve comparable or even better results when applying the object pathway ten times per image.

SOA Scores. Table 2 shows the results for the SOA and SOA-IoU. The SOA-I values are consistently higher than the SOA-C values. Since the SOA-I is calculated as an image average (instead of a class average like the SOA-C) it is skewed by objects that often occur in captions and images (e.g., persons, cats, dogs, etc.). The SOA values for the 40 most and 40 least common objects show that the models perform much better on the more common objects. In fact, most models perform about two times better on the common objects, showing their problem in generating objects that are not often observed during training. For a detailed overview of how each model performed on the individual labels please refer to the supplementary material, available online.
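To make the difference between the two aggregations concrete, the following sketch shows how SOA-C (class average) and SOA-I (image average) can be computed from per-image detection results; the record format and the toy numbers are assumptions for illustration, while the actual evaluation runs the pre-trained YOLOv3 detector on images generated from captions that mention the corresponding label.

from collections import defaultdict

# Illustrative sketch of the two SOA aggregations. Each record states whether
# the pre-trained detector found the caption's object class in the generated
# image; the record format and the toy data are assumptions for this example.
records = [
    ("person", True), ("person", True), ("person", True), ("person", True),
    ("zebra", True), ("toaster", False),
]

def soa_c(records):
    """Class average: detection rate per class, then averaged over classes."""
    per_class = defaultdict(list)
    for label, detected in records:
        per_class[label].append(detected)
    class_rates = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_rates) / len(class_rates)

def soa_i(records):
    """Image average: averaged over all images directly, which gives frequent
    classes (e.g., person) a larger influence on the final score."""
    return sum(detected for _, detected in records) / len(records)

print(f"SOA-C: {soa_c(records):.2f}")  # 0.67: mean of per-class rates (1.0, 1.0, 0.0)
print(f"SOA-I: {soa_i(records):.2f}")  # 0.83: 5 of 6 images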
When we look at the IoU scores we see that the Obj-GAN [4] achieves by far the best IoU scores (around 0.5), albeit at the cost of lower SOA scores. Our models usually achieve an IoU of around 0.2-0.3 on average. Training with up to ten objects per image and using the bounding box loss slightly increases the IoU. However, similar to previous work [3], [4], we find that the AttnGAN architecture tends to place salient object features at many locations of the image, which affects the IoU scores negatively.
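For completeness, the SOA-IoU compares where generated objects end up with where they were supposed to be; the sketch below shows a standard intersection-over-union computation between two boxes. The corner-based (x1, y1, x2, y2) box format is an assumption, and the pairing of detected boxes with conditioning boxes is not shown here.

# Standard IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
# Only the overlap measure itself is illustrated.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a detected box vs. the conditioning box it was generated for.
print(round(iou((10, 10, 60, 60), (30, 20, 80, 70)), 3))  # -> 0.316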
When looking at the SOA for individual objects (see Fig. 5) we find that there are objects for which we can achieve very high SOA values (e.g., person, cat, dog, zebra, pizza, etc.). Interestingly, we find that all tested methods perform "good" or "bad" at the same objects. For example, all models perform reasonably well on objects such as person and pizza (many examples in the training set) as well as, e.g., plane and traffic light (few examples in the training set). Conversely, all models fail on objects such as table and skateboard (many examples in the training set) as well as, e.g., hair drier and toaster (few examples in the training set).

We found that objects need to have three characteristics to achieve a high SOA, and the highest SOA scores are achieved when objects possess all three characteristics. The first important characteristic is easily predictable: the higher the occurrence of an object in the training data, the better (on average) the final performance on this object. Second, large objects, i.e., objects that usually cover a large part of the image (e.g., bus or elephant), are usually modeled better than objects that are usually small (spoon or baseball glove). The final and more subtle characteristic is the surface texture of an object. Objects with highly distinct surface textures (e.g., zebra, giraffe, pizza, etc.) achieve high SOA scores because the object detection network relies on these textures to detect objects. However, while the models are able to correctly match the surface texture (e.g., black and white stripes for a zebra) they are still not capable of generating realistic-looking shapes for many objects. As a result, many of these objects possess the "correct" surface texture but their shape is more a general "blob" consisting of the texture and not a distinct form (e.g., a snout and four legs for a zebra). See Fig. 6 for a visualization of this.

This is one of the weaknesses of the SOA score, as it might give the wrong impression that an 80 percent object detection rate means that in 80 percent of the cases the object is recognizable and of real-world quality. This is not the case, as the SOA scores are calculated with a pre-trained object detector which might focus more on texture and less on the shapes of objects [64]. Consequently, the results of the SOA are more aptly interpreted as cases where a model was able to generate features that an independently pre-trained object detector would classify as a given object. The overall quality of the metric is, therefore, strongly dependent on the object detector, and future improvements in this area might also lead to more meaningful interpretations of the SOA scores.

Fig. 4 shows images generated by our different models. All images shown in this paper were generated without ground truth bounding boxes and instead use generated bounding boxes [4]. The first column shows the respective image from the data set, while the next four columns show the generated images. We can see that all models are capable of generating recognizable foreground objects. It is often difficult to find qualitative differences in the images generated by the different models. However, we find that the models using the bounding box loss usually improve the generation of rare objects. Training with ten objects per image usually leads to a slightly better image quality overall, especially for images that contain many objects.

As we saw in the quantitative evaluation, the object pathway can have a large impact on the image quality. Fig. 7 shows what happens when (some of) the object pathways are not used in the full model (OPv2 + BBL + MO). Again, the first column shows the original image from the data set and the second column shows images generated without the use of any of the object pathways. The next three columns show generated images when we consecutively use the object pathways, starting with the lowest object pathway and iteratively adding the next object pathway until we reach the full model. When no object pathway is used (first column) we clearly see that only background information is generated. Once the first object pathway is added we also get foreground objects, and their quality gets slightly better by adding the higher-level object pathways.
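As a rough intuition for what switching an object pathway on adds to the generation process, the following numpy sketch is a deliberately simplified illustration rather than our actual layers: it only pastes (resized) object feature patches onto a global feature map inside each bounding box, one object at a time, whereas the real object pathways are learned convolutional branches applied at several resolutions; the shapes, the nearest-neighbour resizing, and the random features are assumptions for this example.

import numpy as np

# Highly simplified illustration of the idea behind an object pathway: object
# features are added onto the global feature map inside each bounding box,
# one object at a time. Not the OP-GAN implementation.
def apply_object_pathway(global_features, object_features, boxes):
    """global_features: (C, H, W) feature map from the global pathway
    object_features:    list of (C, h, w) feature patches, one per object
    boxes:              list of (x, y, w, h) boxes in relative [0, 1] coordinates
    """
    C, H, W = global_features.shape
    out = global_features.copy()
    for feat, (x, y, w, h) in zip(object_features, boxes):
        # target region of the feature map covered by this bounding box
        x0, y0 = int(x * W), int(y * H)
        tw, th = max(1, int(w * W)), max(1, int(h * H))
        # nearest-neighbour resize of the object features to the box size
        rows = np.arange(th) * feat.shape[1] // th
        cols = np.arange(tw) * feat.shape[2] // tw
        resized = feat[:, rows][:, :, cols]
        # add the object features at the box location (clipped at the border)
        out[:, y0:y0 + th, x0:x0 + tw] += resized[:, : H - y0, : W - x0]
    return out

# Example: one 16x16 global feature map, two objects.
g = np.zeros((8, 16, 16))
objs = [np.random.randn(8, 4, 4), np.random.randn(8, 4, 4)]
boxes = [(0.1, 0.2, 0.3, 0.4), (0.5, 0.5, 0.4, 0.3)]
print(apply_object_pathway(g, objs, boxes).shape)  # (8, 16, 16)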
Fig. 4. Comparison of images generated by different variations of our models.

Fig. 5. Comparison of SOA scores: SOA per class with the degree of a bin reflecting the relative frequency of that class.
User Study. In order to further validate our results, we performed a user study on Amazon Mechanical Turk. Similar to other approaches [9], [21], [31], we sampled 5,000 random captions from the COCO validation set. For each caption, we generated one image with each of the following models: our OP-GAN, the AttnGAN [7], the AttnGAN-OP [3], the Obj-GAN [4], and the DM-GAN [22]. We showed each user a given caption and the respective five images (without time limit) from the models in random order and asked them to choose the image that depicts the given caption best. We evaluated each image caption twice, for a total of 10,000 evaluations with the help of 200 participants. Table 3 shows how often each model was chosen as having produced the best image given a caption (variance was estimated by bootstrap [65]).
TABLE 3
Human Evaluation Results (Ratio of 1st by Human Ranking) of Five Models on the MS-COCO Data Set Given a Caption

  AttnGAN-OP [3]   14.65% ± 0.35
  AttnGAN [7]      16.80% ± 0.43
  Obj-GAN [4]      20.96% ± 0.33
  DM-GAN [22]      22.42% ± 0.41
  OP-GAN (ours)    25.17% ± 0.43

Fig. 6. Generated images and objects recognized by the pre-trained object detector (YOLOv3) which was used to calculate the SOA scores. The results highlight that, like most other CNN-based object detectors, YOLOv3 focuses much more on texture and less on actual shapes.
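The ± values in Table 3 come from the bootstrap estimate of each model's win rate; the sketch below shows one way such an estimate can be computed (the number of resamples and reporting the standard deviation of the resampled rates are assumptions for illustration).

import random

# Minimal sketch of a bootstrap estimate for a model's win rate in the user
# study: resample the per-evaluation outcomes with replacement and look at the
# spread of the recomputed win rate. The resample count is an assumption.
def bootstrap_win_rate(wins, n_resamples=1000, seed=0):
    """wins: list of 0/1 outcomes, one per evaluation (1 = model was chosen)."""
    rng = random.Random(seed)
    n = len(wins)
    rates = []
    for _ in range(n_resamples):
        resample = [wins[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    mean = sum(rates) / len(rates)
    std = (sum((r - mean) ** 2 for r in rates) / len(rates)) ** 0.5
    return mean, std

# Example with synthetic outcomes for 10,000 evaluations and a ~25% win rate.
outcomes = [1 if random.random() < 0.25 else 0 for _ in range(10_000)]
mean, std = bootstrap_win_rate(outcomes)
print(f"win rate: {100 * mean:.2f}% +/- {100 * std:.2f}")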
Fig. 7. Comparison of images generated by our model (OP-GAN) with OPs switched on and off.

Fig. 8. Comparison of images generated by our model (OP-GAN) with images generated by other current models.
This evaluation reveals that the human ranking closely reflects the ranking obtained through the SOA and FID scores. A notable exception are the two worst-performing models (AttnGAN and AttnGAN-OP): we measure them to perform similarly according to the SOA and FID scores, but they obtain different results in the user study. We find that the IS score is not predictive of the performance in the user study. The R-precision and CIDEr are somewhat predictive, but predict a different ranking of the top-three performing models. Overall, we find that our OP-GAN performs best according to both the SOA scores and the human evaluation. As hypothesized in Section 4, we also observe that the FID and SOA scores are the best predictors for a model's performance in a human user evaluation.

6.2 Qualitative Results

Fig. 8 shows examples of images generated by our model (OPv2 + BBL + MO) and those generated by several other models [3], [4], [7], [22]. We observe that our model often generates images with foreground objects that are more recognizable than the ones generated by other models. For more common objects (e.g., person, bus or plane) all models manage to generate features that resemble the object but in most cases do not generate a coherent representation from these features and instead distribute them throughout the image. As a result, we notice features that are associated with an object but that do not necessarily form one distinct and coherent appearance of that object. Our model, on the other hand, is often able to generate one (or multiple) coherent
object(s) from the features; see, e.g., the generated images containing a bus, cattle, or the plane.

When generating rare objects (e.g., cake or hot dog) we observe that our model generates a much more distinct object than the other models. Indeed, most models fail completely to generate rare objects and instead only generate colors associated with these objects. Finally, when we inspect more complex scenes we see that our model is also capable of generating multiple diverse objects within an image. As opposed to the other images for "room showing a sink and some drawers", we can recognize a sink-like shape and drawers in the image generated by our model. Similarly, our model can also generate an image containing a reasonable shape of a banana and a cup of coffee, whereas the other models only seem to generate the texture of a banana without the shape and completely ignore the cup of coffee.

7 CONCLUSION

In this paper, we introduced a novel GAN architecture (OP-GAN) that specifically models individual objects based on a textual image description. This is achieved by adding object pathways to both the generator and the discriminator, which learn features for individual objects at different resolutions and scales. Our experiments show that this consistently improves the baseline architecture based on quantitative and qualitative evaluations.

We also introduce a novel evaluation metric named Semantic Object Accuracy (SOA) which evaluates how well a model can generate individual objects in images. This new SOA evaluation allows us to evaluate text-to-image synthesis models in more detail and to detect failure and success modes for individual objects and object classes. A user study with 200 participants shows that the SOA score is consistent with the ranking obtained by human evaluation, whereas other scores such as the Inception Score are not. Evaluation of several state-of-the-art approaches using SOA shows that no current approach is able to generate realistic foreground objects for the 80 classes in the COCO data set. While some models achieve high accuracy for several of the most common objects, all of them fail when it comes to modeling rare objects or objects that do not have an easily recognizable surface structure. However, using the SOA as an evaluation metric on text-to-image models provides more detailed information about how well they perform for different object classes or image captions and is well aligned with human evaluation.

ACKNOWLEDGMENTS

The authors would like to thank the German Research Foundation (DFG) for its partial support under project CML (TRR 169). They would also like to thank the NVIDIA Corporation for their support through the GPU Grant Program.

REFERENCES

[1] I. Goodfellow et al., "Generative adversarial nets," in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 2672-2680.
[2] J. Johnson, A. Gupta, and L. Fei-Fei, "Image generation from scene graphs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1219-1228.
[3] T. Hinz, S. Heinrich, and S. Wermter, "Generating multiple objects at spatially distinct locations," in Proc. Int. Conf. Learn. Representations, 2019.
[4] W. Li et al., "Object-driven text-to-image synthesis via adversarial training," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12174-12182.
[5] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 2234-2242.
[6] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Advances Neural Inf. Process. Syst., 2017, pp. 6626-6637.
[7] T. Xu et al., "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1316-1324.
[8] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[9] H. Zhang et al., "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947-1962, Aug. 2019.
[10] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1060-1069.
[11] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Europ. Conf. Comput. Vis., 2014, pp. 740-755.
[12] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio, "ChatPainter: Improving text to image generation using dialogue," in Proc. Int. Conf. Learn. Representations Workshop, 2018.
[13] L. Karacan, Z. Akata, A. Erdem, and E. Erdem, "Learning to generate images of outdoor scenes from attributes and semantic layouts," 2016, arXiv:1612.00215.
[14] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, "Semantic image synthesis with spatially-adaptive normalization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2337-2346.
[15] S. Sah et al., "Semantically invariant text-to-image generation," in Proc. IEEE Conf. Image Process., 2018, pp. 3783-3787.
[16] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1505-1514.
[17] M. Cha, Y. L. Gwon, and H. Kung, "Adversarial learning of semantic relevance in text to image synthesis," in Proc. AAAI Conf. Artificial Intell., 2019, pp. 3272-3279.
[18] H. Zhang et al., "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5907-5915.
[19] Z. Zhang, Y. Xie, and L. Yang, "Photographic text-to-image synthesis with a hierarchically-nested adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6199-6208.
[20] X. Huang, M. Wang, and M. Gong, "Hierarchically-fused generative adversarial network for text to realistic image synthesis," in Proc. IEEE Conf. Computer and Robot Vis., 2019, pp. 73-80.
[21] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, "Semantics disentangling for text-to-image generation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2327-2336.
[22] M. Zhu, P. Pan, W. Chen, and Y. Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5802-5810.
[23] B. Li, X. Qi, T. Lukasiewicz, and P. Torr, "Controllable text-to-image generation," in Proc. Advances Neural Inf. Process. Syst., 2019, pp. 2065-2075.
[24] T. Qiao, J. Zhang, D. Xu, and D. Tao, "Learn, imagine and create: Text-to-image generation from prior knowledge," in Proc. Advances Neural Inf. Process. Syst., 2019, pp. 885-895.
[25] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu, "LayoutGAN: Generating graphic layouts with wireframe discriminators," in Proc. Int. Conf. Learn. Representations, 2019.
[26] A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori, "LayoutVAE: Stochastic scene layout generation from a label set," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9894-9903.
[27] B. Li, B. Zhuang, M. Li, and J. Gu, "Seq-SG2SL: Inferring semantic layout from scene graph through sequence to sequence learning," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7434-7442.
[28] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas, "Generating interpretable images with controllable structure," 2016. [Online]. Available: https://openreview.net/forum?id=Hyvw0L9el&noteId=Hyvw0L9el
[29] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 217-225.
[30] A. Raj, C. Ham, H. Alamri, V. Cartillier, S. Lee, and J. Hays, "Compositional generation of images," in Proc. Advances Neural Inf. Process. Syst. ViGIL, 2017.
[31] S. Hong, D. Yang, J. Choi, and H. Lee, "Inferring semantic layout for hierarchical text-to-image synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7986-7994.
[32] X. Liu, G. Yin, J. Shao, X. Wang, and H. Li, "Learning to predict layout-to-image conditional convolutions for semantic image synthesis," in Proc. Advances Neural Inf. Process. Syst., 2019, pp. 568-578.
[33] B. Zhao, L. Meng, W. Yin, and L. Sigal, "Image generation from layout," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8584-8593.
[34] W. Huang, R. Y. D. Xu, and I. Oppermann, "Realistic image generation using region-phrase attention," in Proc. Asian Conf. Mach. Learn., 2019, pp. 284-299.
[35] W. Sun and T. Wu, "Image synthesis from reconfigurable layout and style," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 10530-10539.
[36] Y. Li, T. Ma, Y. Bai, N. Duan, S. Wei, and X. Wang, "PasteGAN: A semi-parametric method to generate image from scene graph," in Proc. Advances Neural Inf. Process. Syst., 2019, pp. 3948-3958.
[37] D. M. Vo and A. Sugimoto, "Visual-relation conscious image generation from structured-text," 2019, arXiv:1908.01741.
[38] S. Hong, X. Yan, T. Huang, and H. Lee, "Learning hierarchical semantic image manipulation through structured representations," in Proc. Advances Neural Inf. Process. Syst., 2018, pp. 2712-2722.
[39] D. Lee, M.-Y. Liu, M.-H. Yang, S. Liu, J. Gu, and J. Kautz, "Context-aware synthesis and placement of object instances," in Proc. Advances Neural Inf. Process. Syst., 2018, pp. 10413-10423.
[40] A. El-Nouby et al., "Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 10304-10312.
[41] Y. Cheng, Z. Gan, Y. Li, J. Liu, and J. Gao, "Sequential attention GAN for interactive image editing via dialogue," 2018, arXiv:1812.08352.
[42] Y. Li et al., "StoryGAN: A sequential conditional GAN for story visualization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6329-6338.
[43] G. Mittal, S. Agrawal, A. Agarwal, S. Mehta, and T. Marwah, "Interactive image generation using scene graphs," in Proc. Int. Conf. Learn. Representations Workshop, 2019.
[44] S. Nam, Y. Kim, and S. J. Kim, "Text-adaptive generative adversarial networks: Manipulating images with natural language," in Proc. Advances Neural Inf. Process. Syst., 2018, pp. 42-51.
[45] X. Zhou, S. Huang, B. Li, Y. Li, J. Li, and Z. Zhang, "Text guided person image synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3663-3672.
[46] S. Yu, H. Dong, F. Liang, Y. Mo, C. Wu, and Y. Guo, "SimGAN: Photo-realistic semantic image manipulation using generative adversarial networks," in Proc. IEEE Conf. Image Process., 2019, pp. 734-738.
[47] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, arXiv:1411.1784.
[48] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Advances Neural Inf. Process. Syst., 2015, pp. 2017-2025.
[49] L. Theis, A. van den Oord, and M. Bethge, "A note on the evaluation of generative models," in Proc. Int. Conf. Learn. Representations, 2016.
[50] A. Borji, "Pros and cons of GAN evaluation measures," Comput. Vis. Image Understanding, vol. 179, pp. 41-65, 2019.
[51] S. Barratt and R. Sharma, "A note on the inception score," in Proc. Int. Conf. Mach. Learn. Workshop, 2018.
[52] S. Ravuri and O. Vinyals, "Classification accuracy score for conditional generative models," in Proc. Advances Neural Inf. Process. Syst., 2019, pp. 12268-12279.
[53] K. Shmelkov, C. Schmid, and K. Alahari, "How good is my GAN?" in Proc. Europ. Conf. Comput. Vis., 2018, pp. 213-229.
[54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 652-663, Apr. 2017.
[55] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in Proc. Europ. Conf. Comput. Vis., 2016, pp. 382-398.
[56] J. Giménez and L. Màrquez, "Linguistic features for automatic evaluation of heterogeneous MT systems," in Proc. ACL Workshop Statist. Mach. Transl., 2007, pp. 256-264.
[57] P. S. Madhyastha, J. Wang, and L. Specia, "VIFIDEL: Evaluating the visual fidelity of image descriptions," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 6539-6550.
[58] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem, "Re-evaluating automatic metrics for image captioning," in Proc. Conf. Europ. Chapter Assoc. Comput. Linguistics, 2017, pp. 199-209.
[59] P. Agarwal, A. Betancourt, V. Panagiotou, and N. Díaz-Rodríguez, "Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models," in Proc. ICLR Workshop Mach. Learn. Real Life, 2020.
[60] Q. Chen and V. Koltun, "Photographic image synthesis with cascaded refinement networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1520-1529.
[61] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8798-8807.
[62] Z. Deng, J. Chen, Y. Fu, and G. Mori, "Probabilistic neural programmed networks for scene generation," in Proc. Advances Neural Inf. Process. Syst., 2018, pp. 4028-4038.
[63] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," in Proc. Int. Conf. Learn. Representations, 2019.
[64] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness," in Proc. Int. Conf. Learn. Representations, 2019.
[65] B. Efron, "Bootstrap methods: Another look at the jackknife," in Breakthroughs in Statistics. Berlin, Germany: Springer, 1992, pp. 569-593.

Tobias Hinz received the bachelor's degree in business informatics from the University of Mannheim, in 2014, and the master's degree in intelligent adaptive systems from the University of Hamburg, Germany, in 2016. He is currently working toward the PhD degree at the Knowledge Technology Group, University of Hamburg, and since 2019 he has been a research associate with the international research centre Crossmodal Learning (TRR-169). His current research interests include generative models, computer vision, and scene understanding. In particular, he is interested in how to learn representations of complex visual scenes in an unsupervised manner.

Stefan Heinrich received the Diplom (German MSc) degree in computer science and cognitive psychology from the University of Paderborn, and the PhD degree in computer science from the Universität Hamburg, Germany. He is currently a postdoctoral researcher at the International Research Center for Neurointelligence of the University of Tokyo and was previously appointed as a postdoctoral research associate with the international collaborative research centre Crossmodal Learning (TRR-169). His research interests are located in between artificial intelligence, cognitive psychology, and computational neuroscience. Here, he aims to explore computational principles in the brain, such as timescales, compositionality, and uncertainty, to foster our fundamental understanding of the brain's mechanisms but also to exploit them in developing machine learning methods for intelligent systems.

Stefan Wermter is currently a full professor at the University of Hamburg, Germany, and director of the Knowledge Technology Research Group. His main research interests include the fields of neural networks, hybrid knowledge technology, neuroscience-inspired computing, cognitive robotics, and human-robot interaction. He has been an associate editor of the journal 'Transactions on Neural Networks and Learning Systems', is associate editor of 'Connection Science' and the 'International Journal for Hybrid Intelligent Systems', and is on the editorial board of the journals 'Cognitive Systems Research', 'Cognitive Computation', and the 'Journal of Computational Intelligence'. Currently, he serves as a co-coordinator of the International Collaborative Research Centre on Crossmodal Learning (TRR-169) and is the coordinator of the European Training Network SECURE on safety for cognitive robots. He is also the elected President of the European Neural Network Society for 2020-2022.
