LLM Paper 4

The document discusses improving consistency between text prompts and generated images from text-to-image models. It proposes a framework called OPT2I that uses a large language model to iteratively generate revised prompts with the goal of maximizing a consistency score, without requiring access to the model weights. The framework is shown to boost consistency scores on two datasets while preserving image quality and diversity metrics.

Uploaded by

Fareed Khan

Improving Text-to-Image Consistency via Automatic Prompt Optimization


Oscar Mañas1,2,3,∗, Pietro Astolfi1,∗, Melissa Hall1, Candace Ross1, Jack Urbanek1, Adina Williams1,
Aishwarya Agrawal2,3,5, Adriana Romero-Soriano1,2,4,5, Michal Drozdzal1

1 FAIR at Meta, 2 Mila, 3 Université de Montréal, 4 McGill University, 5 Canada CIFAR AI Chair

∗ Contributed equally

arXiv:2403.17804v1 [cs.CV] 26 Mar 2024

Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing
models which are able to generate aesthetically appealing, photorealistic images. Despite the progress,
these models still struggle to produce images that are consistent with the input prompt, oftentimes
failing to capture object quantities, relations and attributes properly. Existing solutions to improve
prompt-image consistency suffer from the following challenges: (1) they oftentimes require model
fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable
trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper,
we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which
leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our
framework starts from a user prompt and iteratively generates revised prompts with the goal of
maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts,
shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while
preserving the FID and increasing the recall between generated and real data. Our work paves the
way toward building more reliable and robust T2I systems by harnessing the power of LLMs.

Date: March 27, 2024


Correspondence: Oscar Mañas at [email protected]

1 Introduction

In recent years, we have witnessed remarkable progress in text-to-image (T2I) generative models (Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022b; Podell et al., 2023; Dai et al., 2023b). The photorealistic quality and aesthetics of generated images has positioned T2I generative models at the center of the current AI revolution. However, progress in image quality has come at the expense of model representation diversity and prompt-image consistency (Hall et al., 2023). From a user’s perspective, the fact that not all the elements of the input text prompt are properly represented in the generated image is particularly problematic as it induces a tedious and time-consuming trial-and-error process with the T2I model to refine the initial prompt in order to generate the originally intended image. Common consistency failure modes of the T2I generative models include missing objects, wrong object cardinality, missing or mixed object attributes, and non-compliance with requested spatial relationships among objects in the image (Wu et al., 2023a).

To improve prompt-image consistency, researchers have explored avenues such as adjusting the sampling guidance scale (Ho and Salimans, 2022), modifying cross-attention operations (Feng et al., 2022; Epstein et al., 2023; Liu et al., 2022; Chefer et al., 2023; Wu et al., 2023a), fine-tuning models (Lee et al., 2023; Wu et al., 2023c; Sun et al., 2023), leveraging additional input modalities such as layouts (Cho et al., 2023b; Lian et al., 2023), and selecting images post-hoc (Karthik et al., 2023). Most of these approaches require access to the model’s weights and are not applicable when models are only accessible through an API interface, e.g., (Betker et al., 2023). The only two approaches that are applicable to API-accessible models are guidance scale modification and post-hoc image selection. However, these approaches rely on a single text prompt – the one provided by the user – so their ability to generate diverse images is limited to resampling of the input noise, i.e., changing the random seed. Perhaps more importantly, both these approaches have unfavorable trade-offs, as they improve prompt-image consistency at the expense of image quality and diversity – e.g., high guidance scales lead to reduced image quality and diversity, while post-hoc selecting the most consistent images

[Figure 1: diagram with components T2I, Scorer, and LLM (T2I prompt optimizer)]

User prompt: A raccoon wearing formal clothes, wearing a tophat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of Vincent Van Gogh
Scorer (user prompt):
1. Is there a raccoon? Yes
2. Is the raccoon holding a cane? Yes
3. Is the raccoon wearing a tophat? Yes
4. Is there a garbage bag? No
5. Is the raccoon holding a garbage bag? No
...
Score: 0.6667

Optimized prompt: A cultured raccoon, wearing formal attire and a tophat, proudly holds a cane, while a nearby garbage bag is artistically included in the scene. The painting reflects the vibrant colors and textures of Vincent Van Gogh's oil paintings.
Scorer (optimized prompt):
1. Is there a raccoon? Yes
2. Is the raccoon holding a cane? Yes
3. Is the raccoon wearing a tophat? Yes
4. Is there a garbage bag? Yes
5. Is the raccoon holding a garbage bag? Yes
...
Score: 1.0000

Figure 1 Overview of our backpropagation-free text-to-image optimization by prompting approach that rewrites user prompts with the goal of improving prompt-image consistency. Our framework is composed of a text-to-image generative model (T2I), a large language model (LLM) and a consistency objective (Scorer). The LLM iteratively leverages a history of prompt-score pairs to suggest revised prompts. In the depicted example, our system improves the consistency score by over 30% in terms of Davidsonian Scene Graph score.
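
The loop summarized in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the callables `generate`, `score` and `propose`, the task-instruction wording, and the toy stand-ins at the bottom are all hypothetical. The default values (30 iterations, 5 revised prompts per iteration, a history of 5 entries, 4 seeds) follow the settings reported in the paper's experimental section, and the history handling (top-k entries, listed by increasing score, scores shown as integers between 0 and 100) follows its description of the meta-prompt.

```python
import random
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[str, float]]  # (prompt, mean consistency in [0, 1])


def mean_consistency(prompt: str, user_prompt: str,
                     generate: Callable[[str, int], str],
                     score: Callable[[str, str], float],
                     seeds: Sequence[int]) -> float:
    """Average the score over several seeds; images are always scored
    against the *user* prompt, not against the revised prompt."""
    return sum(score(user_prompt, generate(prompt, s)) for s in seeds) / len(seeds)


def build_meta_prompt(task: str, history: History) -> str:
    """List past prompts by increasing score, scores shown as integers in 0..100."""
    lines = [task]
    for prompt, s in sorted(history, key=lambda ps: ps[1]):
        lines.append(f"prompt: {prompt}")
        lines.append(f"score: {round(100 * s)}")
    return "\n".join(lines)


def optimize_prompt(user_prompt, generate, score, propose, *,
                    max_iters=30, prompts_per_iter=5, top_k=5,
                    seeds=(0, 1, 2, 3), target=1.0):
    """Return the (prompt, score) pair with the best mean consistency found."""
    # Hypothetical instruction text; the paper's meta-prompts are in its Appendix A.
    task = "Suggest paraphrases of the prompt that reach a higher score."
    init = mean_consistency(user_prompt, user_prompt, generate, score, seeds)
    history: History = [(user_prompt, init)]
    for _ in range(max_iters):
        if max(s for _, s in history) >= target:
            break  # perfect/target consistency reached
        revised = propose(build_meta_prompt(task, history), prompts_per_iter)
        scored = [(p, mean_consistency(p, user_prompt, generate, score, seeds))
                  for p in revised]
        # Keep the top-k prompts seen so far (the user prompt competes too).
        history = sorted(history + scored, key=lambda ps: ps[1], reverse=True)[:top_k]
    return max(history, key=lambda ps: ps[1])


# --- Toy stand-ins (assumptions, for illustration only) ---------------------
STOPWORDS = {"a", "with", "and"}


def toy_generate(prompt: str, seed: int) -> str:
    """Toy 'T2I model': it only 'renders' the first four words of the prompt."""
    return " ".join(prompt.split()[:4])


def toy_score(user_prompt: str, image: str) -> float:
    """Toy scorer: fraction of the user prompt's content words present in the 'image'."""
    wanted = set(user_prompt.split()) - STOPWORDS
    return len(wanted & set(image.split())) / len(wanted)


def make_toy_llm(user_prompt: str, rng: random.Random):
    """Toy 'LLM': ignores the meta-prompt and proposes random word orders."""
    words = user_prompt.split()

    def propose(meta_prompt: str, n: int) -> List[str]:
        return [" ".join(rng.sample(words, len(words))) for _ in range(n)]

    return propose


if __name__ == "__main__":
    user = "a raccoon with a cane and a bag"
    llm = make_toy_llm(user, random.Random(0))
    print(optimize_prompt(user, toy_generate, toy_score, llm))
```

Because the user prompt stays in the history unless `top_k` better prompts displace it, the returned score can never fall below the initial one. A real setup would replace the three toy callables with calls to a T2I model, a consistency scorer such as DSG, and an LLM.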

decreases significantly the representation diversity.

At the same time, the natural language processing (NLP) community has explored large language models (LLMs) as prompt optimizers for NLP tasks (Pryzant et al., 2023; Yang et al., 2023; Guo et al., 2023; Fernando et al., 2023), showing that LLMs can relieve humans from the manual and tedious task of prompt-engineering. One particularly interesting approach is in-context learning (ICL) (Dong et al., 2023), where an LLM can learn to solve a new task from just a handful of in-context examples provided in the input prompt. Notably, LLMs can accomplish this without requiring parameter updates – e.g., LLMs can solve simple regression tasks when provided with a few input-output examples (Mirchandani et al., 2023). ICL enables rapid task adaptation by changing the problem definition directly in the prompt. However, ICL commonly relies on predefined datasets of examples (input-output pairs), which might be challenging to obtain. To the best of our knowledge, ICL has not yet been explored in the context of T2I generative models, which pose unique challenges: (1) the in-context examples of prompts and the associated scores are not readily available, and (2) the LLM needs to ground the prompt in the generated images in order to understand how to refine it. Recently, and once again in the NLP domain, optimization-by-prompting (OPRO) (Yang et al., 2023) has extended previous ICL approaches by iteratively optimizing instruction prompts to maximize task accuracy based on a dataset of input-output examples, leveraging feedback from past instructions and their respective task accuracies.

In this work, we propose the first ICL-based method to improve prompt-image consistency in T2I models. In particular, we leverage optimization-by-prompting and introduce a novel inference-time optimization framework for T2I prompts, which constructs a dataset of in-context examples on-the-fly. Our framework, OPtimization for T2I generative models (OPT2I), involves a pre-trained T2I model, an LLM, and an automatic prompt-image consistency score – e.g., CLIPScore (Hessel et al., 2021) or Davidsonian Scene Graph score (DSG) (Cho et al., 2023a). Through ICL, the LLM iteratively improves a user-provided text prompt by suggesting alternative prompts that lead to images that are more aligned with the user’s intention, as measured by the consistency score. At each optimization iteration, the in-context examples are updated to include the best solutions found so far. Intuitively, we expect our method to explore the space of possible prompt paraphrases and gradually discover the patterns in its previously suggested prompts that lead to highly-consistent images. Crucially, the optimized prompts will produce more consistent images in expectation, across multiple input noise samples. OPT2I is designed to be a versatile approach that works as a plug-and-play solution with diverse T2I models, LLMs, and scoring functions since it does not require any parameter updates. An overview of our framework is presented in Figure 1.

Through extensive experiments, we show that OPT2I consistently outperforms paraphrasing baselines (e.g., random paraphrasing and Promptist (Hao et al., 2022)), and boosts the prompt-image consistency by up to 12.2% and 24.9% on MSCOCO (Lin et al., 2014) and PartiPrompts (Yu et al., 2022) datasets, respectively. Notably, we achieve this improvement while preserving the Fréchet Inception Distance (FID) (Heusel et al., 2017) and increasing the recall between generated and real data. Moreover, we observe that OPT2I achieves consistent improvements for diverse T2I models and is robust to the choice of LLM. Our qualitative results reveal that the optimized prompts oftentimes emphasize the elements of the initial prompt that do not appear in the generated images by either providing additional details about those or reordering elements in the prompt to place the ignored elements at the beginning. In summary, our contributions are:

• We propose OPT2I, a training-free T2I optimization-by-prompting framework that provides refined prompts for a T2I model that improve prompt-image consistency.
• OPT2I is a versatile framework as it is not tied to any particular T2I model and is robust to both the choice of LLM as well as consistency metric.
• We show that OPT2I consistently outperforms paraphrasing baselines and can boost the prompt-image consistency by up to 24.9%.

2 OPT2I: Optimization by prompting for T2I

Figure 2 depicts our T2I optimization-by-prompting framework, which is composed of a pre-trained T2I model and an LLM that work together to optimize a prompt-image consistency score. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing the chosen consistency score. More concretely, we start by feeding the user prompt to the T2I model to generate multiple images. Next, we compute the consistency score between each generated image and the user prompt, and average the scores. We then initialize the meta-prompt history with the user prompt and its associated consistency score. Finally, we feed the meta-prompt to the LLM, which proposes a set of revised prompts. To start a new optimization step, we feed the revised prompts back to the T2I model, which generates new images. Note that the consistency score is always computed w.r.t. the user prompt. At each optimization step, the meta-prompt history is updated to include the top-k most consistent prompts (among revised prompts and user prompt), along with their consistency score. The prompt optimization process terminates when a maximum number of iterations is reached, or when a perfect/target consistency score is achieved. Finally, the best prompt is selected, namely, the optimized prompt.

[Figure 2: diagram with elements User prompt, T2I, Generated image, Consistency Metric, Meta-prompt (Task description + Prompt history), LLM, Revised prompt, Prompt optimization]

Figure 2 Our prompt optimization framework, OPT2I, composed of (1) a text-to-image (T2I) generative model that generates images from text prompts, (2) a consistency metric that evaluates the fidelity between the generated images and the user prompt, and (3) a large language model (LLM) that leverages task description and a history of prompt-score tuples to provide revised prompts. At the beginning, the revised prompt is initialized with the user prompt.

2.1 Problem formulation

We assume access to an LLM, $f$, and a pre-trained T2I generative model, $g$. Given a text prompt, $p$, we can generate an image $I$ conditioned on the prompt with our T2I generator, $I = g(p)$. Let us define the set of all possible paraphrases from $p$ that can be obtained with an LLM as $P = \{p_i\}$, and let us introduce a prompt-image consistency score, $S(p, I)$. Our objective is to find a prompt paraphrase $\hat{p} \in P$ that maximizes the expected consistency score of

sampled images:

$$\hat{p} = \operatorname*{argmax}_{p_i \sim P} \; \mathbb{E}_{I \sim g(p_i)} \left[ S(p_0, I) \right],$$

where $p_0$ denotes the user prompt. We approach the optimization problem with ICL by iteratively searching for revised prompts generated by the LLM,

$$P_t = f\left(C\left(\{p_0\} \cup P_1, \ldots, \cup P_{t-1}\right)\right),$$

where $P_i$ is the set of prompts generated at iteration $i$, $t$ is the current iteration, and $C$ is a function that defines the context of prompt-score pairs fed to the LLM.

2.2 Meta-prompt design

We adapt LLMs for T2I prompt optimization via ICL. We denote meta-prompt the prompt which instructs the LLM to optimize prompts for T2I models. Our meta-prompt is composed of a task instruction and a history of past revised prompt-score pairs. The list of meta-prompts used can be found in Appendix A.

The meta-prompt provides context about T2I models, the consistency metric and the optimization problem. Additionally, it contains a history of prompt-score pairs that provides context about which paraphrases worked best in the past for a particular T2I model, encouraging the LLM to build upon the most successful prompts and removing the need for explicitly specifying how to modify the T2I prompt. The consistency score is normalized to an integer between 0 and 100, and we only keep in the history the top-k scoring prompts found so far, sorted by increasing score.

2.3 Optimization objective

A critical part of our framework is feeding visual feedback to the LLM. The visual feedback is captured by the consistency score, which determines how good the candidate prompt is at generating consistent images. Although OPT2I can work with any – even non-differentiable – consistency score, we argue that the consistency score must be detailed enough for the LLM to infer how to improve the candidate prompt. While CLIPScore (Hessel et al., 2021) is arguably the most popular metric for measuring prompt-image consistency, in our initial experiments we found that scoring a prompt with a single scalar is too coarse for our purposes. Thus, we opt for two metrics that provide finer-grained information about the prompt-image consistency: (1) Davidsonian Scene Graph (DSG) score (Cho et al., 2023a), and (2) our proposed decomposed CLIPScore.

DSG assesses prompt-image consistency based on a question generation and answering approach, similar to TIFA (Hu et al., 2023). In particular, DSG generates atomic and unique binary questions from the user prompt that are organized into semantic dependency graphs. For example, “a bike lying on the ground, covered in snow” is decomposed into: (1) “Is there a bike?”; (2) “Is the bike lying on the ground?”; (3) “Is the bike covered in snow?”. In this case, questions (2) and (3) depend on (1), and so (1) is used to validate (2) and (3). These questions are then answered by an off-the-shelf VQA model based on the generated image. We include the resulting question-answer pairs in our meta-prompt. A global score per prompt-image pair is computed by averaging across answer scores.

Decomposed CLIPScore computes a partial consistency score for each noun phrase present in the user prompt. For example, “a bike lying on the ground, covered in snow” is decomposed into “a bike”, “the ground” and “snow”. Each noun phrase is then scored against the generated image using CLIPScore, resulting in a list of pairs of noun phrases and their associated scores, which are included in our meta-prompt. A global score per prompt-image pair is computed by averaging across subscores. We provide examples of decomposed CLIPScore and DSG outputs in Appendix A.

2.4 Exploration-exploitation trade-off

During the optimization process, OPT2I requires controllability over the LLM’s exploration-exploitation trade-off, as the LLM could either focus on exploring possible revised prompts or on exploiting the context provided in the meta-prompt history. On the one hand, too much exploration would hamper the optimization as it could be hard to find a high quality solution. On the other hand, too much exploitation would limit the exploration to prompts that are very similar to the ones already presented in the meta-prompt history. We control this balance by adjusting the number of generated revised prompts per iteration and the LLM sampling temperature. Moreover, as our objective is to find prompts that work well across different T2I input noise samples, we generate multiple images per prompt at each iteration.

3 Experiments

We first introduce our experimental setting. Next, we validate the effectiveness of OPT2I in improving prompt-image consistency, compare it to paraphrasing baselines (random paraphrasing and Promptist),

and show some qualitative results. Then, we explore the trade-offs with image quality and diversity. And finally, we ablate OPT2I components and touch on post-hoc image selection.

3.1 Experimental setting

Benchmarks. We run experiments using prompts from MSCOCO (Lin et al., 2014) and PartiPrompts (P2) (Yu et al., 2022). For MSCOCO, we use the 2000 captions from the validation set as in (Hu et al., 2023). These captions represent real-world scenes containing common objects. PartiPrompts, instead, is a collection of 1600 artificial prompts, often unrealistic, divided into categories to stress different capabilities of T2I generative models. We select our PartiPrompts subset by merging the first 50 prompts from the most challenging categories: “Properties & Positioning”, “Quantity”, “Fine-grained Detail”, and “Complex”. This results in a set of 185 complex prompts.

Baselines. We compare OPT2I against a random paraphrasing baseline, where the LLM is asked to generate diverse paraphrases of the user prompt, without any context about the consistency of the images generated from it. The meta-prompt used to obtain paraphrases is provided in Appendix A. We also compare to Promptist (Hao et al., 2022), which relies on a dataset of initial and target prompts to finetune an LLM to rewrite user prompts with the goal of improving image aesthetics, while trying to maintain prompt-image consistency.

Evaluation metrics. We measure the quality of a T2I prompt by averaging prompt-image consistency scores across multiple image generations (i.e., multiple random seeds for the initial noise). For each generated image, we compute its consistency with the user prompt. We consider our proposed decomposed CLIPScore (dCS) and the recent DSG score (Cho et al., 2023a) as consistency metrics (see Section 2.3 for more details). For DSG score, we use Instruct-BLIP (Dai et al., 2023a) as the VQA model. To assess the trade-off between prompt-image consistency, image quality and diversity, we additionally compute FID score (Heusel et al., 2017), precision and recall metrics (Naeem et al., 2020).

LLMs and T2I models. For the T2I model, we consider (1) a state-of-the-art latent diffusion model, namely LDM-2.1 (Rombach et al., 2022a), which uses a CLIP text encoder for conditioning, and (2) a cascaded pixel-based diffusion model, CDM-M, which instead relies on the conditioning from a large language model, T5-XXL (Raffel et al., 2020), similarly to (Saharia et al., 2022). For the LLM, we experiment with the open source Llama-2-70B-chat (Llama-2) (Touvron et al., 2023) and with GPT-3.5-Turbo-0613 (GPT-3.5) (Brown et al., 2020).

Implementation details. Unless stated otherwise, OPT2I runs for at most 30 iterations generating 5 new revised prompts per iteration, while the random paraphrasing baseline generates 150 prompts at once (see Section 3.4 for a discussion about the relationship between #iters and #prompts/iter). We instruct the LLM to generate revised prompts by enumerating them in a single response to prevent duplication, rather than making multiple calls with a sampling temperature greater than 0. In the optimization meta-prompt, we set the history length to 5. To speed up image generation, we use DDIM (Song et al., 2020) sampling. We perform 50 inference steps with LDM-2.1, while for CDM-M we perform 100 steps with the low-resolution generator and 50 steps with the super-resolution network – following the default parameters in both cases. The guidance scale for both T2I models is kept to the suggested value of 7.5. Finally, in our experiments, we fix the initial random seeds across iterations and prompts wherever possible, i.e., we fix 4 random seeds for sampling different prior/noise vectors to generate 4 images from the same prompt. However, we note that CDM-M does not allow batched generation with fixed seeds. Experiments without seed-fixing are reported in Appendix B as we observe no substantial differences.

3.2 Main results

T2I optimization by prompting. In Figure 3, we plot the prompt optimization curves with LDM-2.1/CDM-M as T2I models, Llama-2/GPT-3.5 as LLM, and decomposed CLIPscore (dCS)/DSG as the scorer for prompts from MSCOCO and PartiPrompts. Each data point corresponds to the mean/max relative improvement in consistency score (w.r.t. the user prompt) achieved by revised prompts generated in that iteration, averaged across the full dataset of prompts. The optimization curves show an overall upward trend, which confirms that the LLM in OPT2I is capable of optimizing T2I prompts. These improvements are especially noticeable in the max consistency score. The initial dip in mean consistency score is expected due to the initial exploration, since the LLM has limited context provided only by the user prompt (1-shot ICL). As the optimization progresses, the LLM generates more consistent revised prompts at each iteration, as its context is increasingly enriched with the performance of previous revised prompts. Notably, achieving a positive mean
[Figure 3: four panels, (a) dCS (MSCOCO), (b) dCS (P2), (c) DSG (MSCOCO), (d) DSG (P2), each with a max plot and a mean plot. x-axis: iteration (0 to 30); y-axis: Relative consistency (%). Legend: Paraphr w/ Llama-2, LDM-2.1; Paraphr w/ GPT-3.5, LDM-2.1; Paraphr w/ Llama-2, CDM-M; OPT2I w/ Llama-2, LDM-2.1; OPT2I w/ GPT-3.5, LDM-2.1; OPT2I w/ Llama-2, CDM-M]

Figure 3 OPT2I curves with different consistency objectives (dCS vs. DSG), LLMs, and T2I models. Each plot tracks either the max or the mean relative improvement in consistency across revised prompts per iteration.

relative consistency starting from a single in-context example is a very challenging task (Wei et al., 2023), and OPT2I achieves this goal for all configurations except for Llama-2 and GPT-3.5 in the dCS (P2) setting (Figure 3b), where it only gets close to 0. As mentioned in Section 3.1, PartiPrompts is a hard benchmark containing highly detailed and complex prompts, so it is perhaps unsurprising that decomposed CLIPScore falls short (with improvements < 7% in the max case, and < 2% in the mean case) – given the imperfect decomposition into noun-phrases. We also explored a hierarchical version of decomposed CLIPscore leveraging constituency trees, which did not show any improvement over our noun-phrase based decomposition, further reinforcing the criticism that CLIP behaves as a bag-of-words and is unable to properly capture object attributes and relations (Yuksekgonul et al., 2022; Yamada et al., 2022). Instead, using a more detailed consistency score during the prompt optimization process, such as DSG, results in more significant improvements (< 17% in the max case, and < 5% in the mean case).

Table 1 Relative improvement in T2I consistency between the user prompt and the best prompt found, averaged across all prompts in the dataset. Every method generates the same total number of prompts (and images).

Method        Objective  LLM      T2I      Dataset: MSCOCO  Dataset: P2
Paraphrasing  dCS        Llama-2  LDM-2.1  +9.96            +8.00
OPT2I         dCS        Llama-2  LDM-2.1  +11.88           +10.34
Paraphrasing  DSG        Llama-2  LDM-2.1  +10.67           +19.22
OPT2I         DSG        Llama-2  LDM-2.1  +11.21           +22.24
Paraphrasing  dCS        GPT-3.5  LDM-2.1  +9.81            +8.06
OPT2I         dCS        GPT-3.5  LDM-2.1  +10.35           +9.33
Paraphrasing  DSG        GPT-3.5  LDM-2.1  +10.53           +19.09
OPT2I         DSG        GPT-3.5  LDM-2.1  +10.85           +19.77
Paraphrasing  dCS        Llama-2  CDM-M    +11.00           +10.29
OPT2I         dCS        Llama-2  CDM-M    +12.21           +12.13
Paraphrasing  DSG        Llama-2  CDM-M    +10.53           +22.24
OPT2I         DSG        Llama-2  CDM-M    +11.07           +24.93

Comparison to paraphrasing baselines. Table 1 shows our proposed OPT2I framework is robust to the choice of LLM, T2I model and optimization/evaluation objective. In particular, for dCS, we report relative improvement as $\mathrm{score}_{\mathrm{best}} / \mathrm{score}_{\mathrm{init}} - 1$. For DSG score, since the initial score can be zero and it is already a percentage, we instead report $\mathrm{score}_{\mathrm{best}} - \mathrm{score}_{\mathrm{init}}$. We observe that OPT2I consistently outperforms the random paraphrasing baseline across different LLMs and T2I models. Additionally, we can see that both paraphrasing and optimization get around a 10% boost in consistency improvement when using DSG as optimization objective instead of dCS for P2 but not for MSCOCO. This highlights again that more complex prompts, such as those from PartiPrompts, benefit from a more accurate consistency metric. We note that some prompts might already have a fairly high initial consistency score (see App. B.2), so there is little room for improvement. For instance, prompts from MSCOCO evaluated with DSG have an average initial score of 86.54%, which means that the improvement in this setting has an upper bound of 13.46%.

Table 2 Distributional metrics on the MSCOCO dataset.

Prompt   Obj.  LLM      T2I      FID (↓) IV3  FID (↓) CLIP  Prec. (↑) IV3  Prec. (↑) CLIP  Rec. (↑) IV3  Rec. (↑) CLIP
Initial  -     -        LDM-2.1  34.6         16.0          81.6           85.9            47.2          22.8
OPT2I    dCS   Llama-2  LDM-2.1  34.6         14.9          70.7           81.0            55.6          30.0
OPT2I    DSG   Llama-2  LDM-2.1  33.3         15.4          79.0           84.3            54.4          27.0
OPT2I    dCS   GPT-3.5  LDM-2.1  34.1         14.3          74.9           84.0            53.9          27.3
OPT2I    DSG   GPT-3.5  LDM-2.1  33.4         15.6          80.3           85.4            50.5          21.7
Initial  -     -        CDM-M    41.2         15.2          82.2           85.6            38.8          26.0
OPT2I    dCS   Llama-2  CDM-M    39.8         15.2          77.1           80.9            45.4          29.5
OPT2I    DSG   Llama-2  CDM-M    39.9         15.2          79.6           82.5            39.9          25.0

In addition to random paraphrasing, we compare OPT2I to Promptist (Hao et al., 2022) on MSCOCO prompts by generating images from initial/best prompts (4 images/prompt) with SD-1.4 (Promptist’s reference model) and LDM-2.1, evaluating consistency with DSG score. We observe Promptist decreases the consistency score by −3.56%/−3.29% on SD-1.4/LDM-2.1, while OPT2I (Llama-2) improves consistency by +14.16%/+11.21%. This aligns with the results reported in (Hao et al., 2022), which show that optimizing prompts primarily for aesthetics actually decreases prompt-image consistency.

Qualitative results. In Figure 4, we provide examples of images generated from user and optimized prompts with OPT2I for different LLMs and T2I models. We observe OPT2I is capable of finding paraphrases of the user prompt which considerably improve the consistency between the generated images and the initial, user-provided prompt, as measured by DSG in this case. These examples suggest the optimized prompts are capable of steering the T2I model towards generating visual elements that were ignored with the initial phrasing. From our qualitative analysis, we observed the LLM uses several strategies to emphasize the missing visual elements, such as providing a more detailed description of those elements (e.g., “a flower” → “a vibrant flower arrangement”, “a vase filled with fresh blossoms”) or placing them at the beginning of the sentence (e.g., “four teacups surrounding a kettle” → “surround a kettle placed at the center with four teacups”). We note a perfect consistency score does not ensure perfectly aligned images (e.g., for the user prompt “four teacups surrounding a kettle”, all optimized prompts reach a DSG score of 100% while the cardinality of teacups remains incorrect), which highlights the limitations of current prompt-image consistency scores. We also observe that prompts optimized by Llama-2 tend to be longer than those from GPT-3.5 (see App. B.5), and that images generated by CDM-M from user prompts are generally more consistent than those generated by LDM-2.1, which we attribute to the use of a stronger text encoder (T5-XXL instead of CLIP).

3.3 Trade-offs with image quality and diversity

Following common practice in the T2I community, we evaluate the quality of OPT2I generations by computing image generation metrics such as FID, precision (P), and recall (R). We use the 2000 prompts from the MSCOCO validation set that are included in the TIFAv1 benchmark (Hu et al., 2023), and generate 4 images for each initial and best prompt. To ensure robust conclusions, we use two feature extractors in our metrics: Inception-v3 (IV3) (Szegedy et al., 2016) and CLIP (Radford et al., 2021). Results in Table 2 show that the FID of prompts optimized with OPT2I is either on-par or better compared to that of initial prompts, validating that our method does not trade image quality for consistency. Hence, we conclude FID is not affected by our optimization strategy. However, in terms of precision and recall, we observe that optimized prompts reach higher recall at the expense of lower precision compared to the user prompt. This is explainable as rephrasing the input prompt allows the T2I model to generate more diverse images (higher recall), which may occasionally fall outside of the manifold of natural images (lower precision). This phenomenon can be observed in Fig. 12 (Appendix B), where optimizing for consistency leads to a change of artistic style.

3.4 Ablations

We perform ablations with Llama-2 and LDM-2.1 on PartiPrompts using default parameters unless otherwise specified. Figure 5 illustrates the trade-off between exploration and exploitation, implemented

[Figure 4: image grids; the prompts and scores shown in each panel are listed below. The value in parentheses after each prompt is the mean of its four image scores.]

(a) LLM: Llama-2, T2I: LDM-2.1
Initial prompt #1: A horse and several cows feed on hay. (0.4643)
Image scores: 0.4286 0.7143 0.1429 0.5714
Optimized prompt #1: A horse and multiple cows feed on a stack of hay together. (0.8929)
Image scores: 0.7143 0.8571 1.0000 1.0000
Initial prompt #2: A bowl full of tomatoes sitting next to a flower. (0.6667)
Image scores: 0.6667 0.6667 0.6667 0.6667
Optimized prompt #2: A ceramic bowl, overflowing with plump tomatoes, is placed beside a vibrant flower arrangement, creating a visually appealing contrast. (1.0000)
Image scores: 1.0000 1.0000 1.0000 1.0000

(b) LLM: Llama-2, T2I: CDM-M
Initial prompt #1: A horse and several cows feed on hay. (0.5000)
Image scores: 0.2857 0.5714 0.2857 0.8571
Optimized prompt #1: A horse is seated on a stack of hay, surrounded by a herd of cows feasting on the hay, creating a festive atmosphere. (0.8571)
Image scores: 0.8571 0.8571 0.7143 1.0000
Initial prompt #2: A bowl full of tomatoes sitting next to a flower. (0.6667)
Image scores: 0.6667 0.6667 0.6667 0.6667
Optimized prompt #2: A bowl of tomatoes, topped with a beautiful flower, sitting next to a vase filled with fresh blooms. (1.0000)
Image scores: 1.0000 1.0000 1.0000 1.0000

(c) LLM: Llama-2, T2I: LDM-2.1
Initial prompt #1: A raccoon wearing formal clothes, wearing a tophat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of pointilism. (0.4722)
Image scores: 0.5556 0.5556 0.1111 0.6667
Optimized prompt #1: A sophisticated raccoon, attired in formal wear and a tophat, poses with poise and panache, holding a cane and a garbage bag, against a stunning pointilism-style oil painting, showcasing the raccoon's cultured appearance and the painting's elaborate details. (0.7500)
Image scores: 0.5556 0.8889 0.6667 0.8889
Initial prompt #2: four teacups surounding a kettle (0.5000)
Image scores: 0.0000 0.5000 0.7500 0.7500
Optimized prompt #2: Four teacups encircle a kettle, forming a cohesive and picturesque tea setup. (1.0000)
Image scores: 1.0000 1.0000 1.0000 1.0000

(d) LLM: GPT-3.5, T2I: LDM-2.1
Initial prompt #1: A raccoon wearing formal clothes, wearing a tophat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of pointilism. (0.4722)
Image scores: 0.5556 0.5556 0.1111 0.6667
Optimized prompt #1: An oil painting in the style of pointillism depicting a raccoon elegantly dressed in formal clothes, with a top hat and cane, holding a garbage bag. (0.6944)
Image scores: 0.5556 0.4444 0.8889 0.8889
Initial prompt #2: four teacups surounding a kettle (0.5000)
Image scores: 0.0000 0.5000 0.7500 0.7500
Optimized prompt #2: Surround a kettle placed at the center with four teacups. (1.0000)
Image scores: 1.0000 1.0000 1.0000 1.0000

Figure 4 Selected qualitative results for prompts from MSCOCO (a-b) and P2 (c-d) datasets, using DSG as consistency metric. For each setup, we display four rows (from the top): initial prompt #1, optimized prompt #1, initial prompt #2, and optimized prompt #2. Each column corresponds to a different T2I model random seed. We report the average consistency score across seeds in parentheses.

Figure 5 Cumulative max relative dCS as a function of #revised prompts = #iterations · #prompts/iter. [Plot: x-axis #revised prompts (0 to 150), y-axis relative dCS (%); curves: Paraphr (it=1, p/it=150) and OPT2I with it=1, p/it=150; it=5, p/it=30; it=15, p/it=10; it=30, p/it=5; it=150, p/it=1.]

Figure 6 Average DSG score for the top-k most consistent images among 600. [Plot: x-axis k (4 to 600), y-axis average DSG score; curves: Initial, Paraphr., OPT2I.]

as the number of revised prompts per iteration (#prompts/iter) and the number of optimization iterations (#iterations), respectively. Generating more prompts at each iteration increases the exploration of multiple solutions given the same context, while by increasing the number of iterations, the LLM can exploit more frequent feedback from the T2I model and the consistency score. We observe that increasing the number of iterations leads to a higher consistency improvement. In other words, more exploitation is beneficial with a fixed budget of 150 prompt generations. However, pushing exploitation too much, i.e., #it = 150 and #p/it = 1, becomes harmful.

In Table 3, we report relative consistency (dCS) when ablating for different task instructions in the meta-prompt. We explore four instruction additions to be combined with our base meta-prompt. Conciseness encourages exploring paraphrases beyond just adding details/adjectives; Prioritize encourages focusing on missing/low-score elements; Reasoning encourages reasoning about the in-context examples; and Structure asks for simple vocabulary informative of the structure of images, e.g., foreground and background (full meta-prompts are provided in Appendix A). We observe that Prioritize achieves slightly better performance over Reasoning and Structure, yet the LLM remains fairly robust to specific meta-prompt phrasing.

Table 3 Meta-prompt ablations.

Conciseness  Prioritize  Reasoning  Structure  dCS (%)
✓            ✗           ✗          ✗          +9.68
✓            ✓           ✗          ✗          +10.34
✓            ✓           ✓          ✗          +10.23
✓            ✓           ✗          ✓          +9.99

3.5 Post-hoc image selection

We emphasize OPT2I aims to optimize prompts to generate more consistent images in expectation. However, for the sake of completeness, we also evaluate the setting where we generate the same number of images from the initial prompt and select the most consistent ones. In particular, we generate 600 images from PartiPrompts using either random image sampling from the initial prompt, paraphrasing, or OPT2I, and select the top-k most consistent images based on DSG score. In Fig. 6, we observe that OPT2I consistently outperforms both baselines. Interestingly, sampling from the initial prompt outperforms paraphrasing, which might be due to random paraphrases deviating too much from the user’s intent.

4 Related work

Improving consistency in T2I models. Several recent works propose extensions to T2I models to improve their faithfulness to user prompts. Some studies focus on improving the guidance with cross-attention (Feng et al., 2022; Epstein et al., 2023; Liu et al., 2022; Chefer et al., 2023; Wu et al., 2023a). Other studies first convert a textual prompt into a layout before feeding it to a layout-to-image generative model (Cho et al., 2023b; Lian et al., 2023). Recent works also finetune T2I models on human (Lee et al., 2023; Wu et al., 2023c; Wallace et al., 2023) or AI model (Sun et al., 2023) feedback, or perform post-hoc image selection (Karthik et al., 2023). In contrast, OPT2I acts exclusively at the level of input prompt in text space, without accessing model weights, making it applicable to a wider range of T2I models, including those only accessible through an API.

9
LLMs as prompt optimizers. Several recent works explore the role of LLMs as prompt optimizers for NLP tasks. Some use LLMs to directly optimize the task instruction for ICL (Zhou et al., 2022; Pryzant et al., 2023; Yang et al., 2023). Other studies use LLMs to mutate prompts for evolutionary algorithms (Guo et al., 2023; Fernando et al., 2023). A crucial difference between these works and our method is that they optimize a task instruction prompt by using a training set, which is subsequently applied across test examples, while we perform multimodal inference-time optimization on individual T2I prompts. More similar to our work, other studies rewrite prompts for T2I models using an LLM. (Hao et al., 2022) finetunes an LLM with reinforcement learning to improve image aesthetics, while (Valerio et al., 2023) focuses on filtering out non-visual prompt elements. In contrast, OPT2I aims to improve prompt-image consistency via optimization-by-prompting.

Evaluating prompt-image consistency. Several metrics have been proposed to evaluate prompt-image consistency. CLIPScore (Hessel et al., 2021) is the de facto standard for measuring the compatibility of image-caption pairs, used both for image captioning and text-conditioned image generation. However, CLIPScore provides a single global score, which can be too coarse to understand failures in the generated images. Consequently, subsequent metrics such as TIFA (Hu et al., 2023), VQ2 (Yarom et al., 2023) or DSG (Cho et al., 2023a) propose generating pairs of questions and answers from T2I prompts and using off-the-shelf VQA models to evaluate each of them on the generated images, providing a fine-grained score. Other recent studies suggest directly learning a prompt-image consistency metric from human feedback (Xu et al., 2023; Wu et al., 2023b; Kirstain et al., 2023). However, none of these metrics are without flaws and human judgment remains the most reliable way of evaluating prompt-image consistency.

5 Conclusions

In this paper, we introduced the first T2I optimization-by-prompting framework to improve prompt-image consistency. Through extensive evaluations, we showed that OPT2I can be effectively applied to different combinations of LLM, T2I models and consistency metrics, consistently outperforming paraphrasing baselines and yielding prompt-image consistency improvements of up to 24.9% over the user prompt, while maintaining the FID between generated and real images. By contrasting MSCOCO and PartiPrompts results, we highlighted the importance of the choice of consistency score: complex prompts in PartiPrompts appear to significantly benefit from more detailed scores such as DSG. Qualitatively, we observed that optimizing prompts for prompt-image consistency oftentimes translates into emphasizing initially ignored elements in the generated images, by either providing additional details about those or rewording the prompt such that the ignored elements appear at the beginning. Interestingly, such prompt modifications steer the generated images away from the learned modes, resulting in a higher recall w.r.t. the real data distribution.

Limitations. One limitation of our method is that it expects prompt-image consistency scores to work reasonably well. However, this assumption might not hold in some cases. For instance, it has been shown that CLIP (used for CLIPScore) sometimes behaves like a bag-of-words (Yuksekgonul et al., 2022; Yamada et al., 2022). VQA-based prompt-image consistency metrics such as TIFA or DSG also suffer from limitations in generating questions (e.g., the question “Is the horse on the hay?” is generated from the prompt “A horse and several cows feed on hay.”) or in answering them with a VQA model (e.g., for the prompt “A bowl full of tomatoes sitting next to a flower.”, the VQA model answers that there is a flower when it is in fact a bouquet made of tomatoes). Moreover, using these metrics as optimization objectives might exacerbate their failure modes by finding prompts which generate images that fulfill the requirements for a high score in an adversarial way. This highlights the need for further research in developing more robust prompt-image consistency metrics which can be used as optimization objectives in addition to evaluation.

Another limitation of our approach is its runtime, which is a consequence of performing inference-time optimization. For instance, running the optimization process with Llama-2, LDM-2.1 and DSG score, generating 5 prompt paraphrases per iteration and 4 images per prompt with 50 diffusion steps, takes 7.34/20.27 iterations on average for COCO/PartiPrompts, which translates to ∼ 10/28 minutes when using NVIDIA V100 GPUs. However, we emphasize that (1) OPT2I is designed to be a versatile approach that works as a plug-and-play solution with diverse T2I models and LLMs since it does not require any parameter updates nor training data, and (2) optimizing T2I prompts with our automatic framework relieves humans from the manual and tedious task of prompt-engineering.
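As a rough back-of-the-envelope reading of the runtime figures reported in the Limitations paragraph, the sketch below derives how many images are generated per user prompt and the approximate wall-clock time per iteration. The function name `optimization_cost` is hypothetical, and the calculation assumes that cost scales linearly with the number of iterations and ignores the images generated for the initial prompt; neither assumption is stated in the paper.

```python
def optimization_cost(avg_iterations, total_minutes,
                      prompts_per_iter=5, images_per_prompt=4):
    """Estimate generated images and minutes per iteration for one user prompt.

    Assumption (ours): every iteration generates the same number of images
    and takes the same time, so totals scale linearly with iterations.
    """
    images = avg_iterations * prompts_per_iter * images_per_prompt
    minutes_per_iter = total_minutes / avg_iterations
    return images, minutes_per_iter


# Averages reported in the text: 7.34 iterations (~10 min) for COCO,
# 20.27 iterations (~28 min) for PartiPrompts.
coco_images, coco_min_per_iter = optimization_cost(7.34, 10)
p2_images, p2_min_per_iter = optimization_cost(20.27, 28)

print(f"COCO: ~{coco_images:.0f} images, ~{coco_min_per_iter:.2f} min/iteration")
print(f"P2:   ~{p2_images:.0f} images, ~{p2_min_per_iter:.2f} min/iteration")
```

Under these assumptions both datasets land at a similar per-iteration cost, which suggests the longer PartiPrompts runtime is driven by the number of iterations rather than by costlier iterations.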

References

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. arXiv preprint arXiv:2310.18235, 2023a.
Jaemin Cho, Abhay Zala, and Mohit Bansal. Visual programming for text-to-image generation and evaluation. arXiv preprint arXiv:2305.15328, 2023b.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023a.
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023b.
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A survey on in-context learning, 2023.
Dave Epstein, Allan Jabri, Ben Poole, Alexei A Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.
Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2022.
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023.
Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.
Melissa Hall, Candace Ross, Adina Williams, Nicolas Carion, Michal Drozdzal, and Adriana Romero Soriano. Dig in: Evaluating disparities in image generations with indicators for geographic diversity, 2023.
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20406–20417, October 2023.
Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. If at first you don’t succeed, try, try again: Faithful diffusion-based text-to-image generation by selection. arXiv preprint arXiv:2305.13308, 2023.
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
Suvir Mirchandani, Fei Xia, Pete Florence, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng, et al. Large language models as general pattern machines. In 7th Annual Conference on Robot Learning, 2023.
Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.
Large Model Systems Organization. Chatbot arena leaderboard. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard.
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022a.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022b.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. arXiv preprint arXiv:2311.17946, 2023.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Rodrigo Valerio, Joao Bordalo, Michal Yarom, Yonatan Bitton, Idan Szpektor, and Joao Magalhaes. Transferring visual attributes from natural language to verified image generation. arXiv preprint arXiv:2305.15026, 2023.
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908, 2023.
Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.
Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, and Shiyu Chang. Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7766–7776, 2023a.
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023b.
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023c.
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
Yutaro Yamada, Yingtian Tang, and Ilker Yildirim. When are lemons purple? the concept association bias of clip. arXiv preprint arXiv:2212.12043, 2022.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, and Idan Szpektor. What you see is what you read? improving text-image alignment evaluation. arXiv preprint arXiv:2305.10400, 2023.
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, 2022.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.

A Additional method details

A.1 Meta-prompts
Meta-prompts 1-5 include all the prompts used in our
work. The teal text is fed to the LLM as a system
prompt, and the purple text denotes placeholders to
be dynamically filled.

Decompose the following sentence into individual noun phrases. Ignore prefixes such as ’a photo of’, ’a picture of’, ’a portrait of’, etc. Your response should only be a list of comma separated values, eg: ’foo, bar, baz’

{prompt}

Prompt 1 Meta-prompt used for decomposing prompts into noun phrases for dCS.

Generate {num_solutions} paraphrases of the following image description while keeping the semantic meaning: "{user_prompt}". Respond with each new prompt in between <PROMPT> and </PROMPT>, eg:

1. <PROMPT>paraphrase 1</PROMPT>
2. <PROMPT>paraphase 2</PROMPT>
...
{num_solutions}. <PROMPT>paraphrase {num_solutions}</PROMPT>

Prompt 2 Meta-prompt used for the paraphrasing baseline.
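Since the meta-prompts ask the LLM to wrap every candidate in <PROMPT> tags, the response has to be parsed before the candidates can be sent to the T2I model. The sketch below shows one way this could be done with a regular expression; the helper name `extract_prompts` and the de-duplication step are our assumptions for illustration, not the authors' implementation.

```python
import re
from typing import List

# Non-greedy match so that each <PROMPT>...</PROMPT> pair is captured separately.
PROMPT_TAG = re.compile(r"<PROMPT>(.*?)</PROMPT>", flags=re.DOTALL)


def extract_prompts(llm_response: str) -> List[str]:
    """Return the prompts found between <PROMPT> tags, in order, without duplicates."""
    seen = set()
    prompts = []
    for match in PROMPT_TAG.findall(llm_response):
        candidate = " ".join(match.split())  # collapse line breaks and extra spaces
        if candidate and candidate not in seen:
            seen.add(candidate)
            prompts.append(candidate)
    return prompts


example_response = (
    "1. <PROMPT>Four teacups encircle a kettle.</PROMPT>\n"
    "2. <PROMPT>Surround a kettle placed at the center with four teacups.</PROMPT>\n"
)
parsed = extract_prompts(example_response)
print(parsed)
```

A response with a missing or malformed closing tag simply yields fewer candidates, so a caller may want to re-query the LLM when fewer than {num_solutions} prompts are returned.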

A.2 Examples of prompt decompositions


Table 4 shows a few examples of outputs generated
by dCS (noun phrases) and DSG (binary questions)
from the input prompt. These outputs are then used
to compute a fine-grained consistency score w.r.t. the
generated image, either with a multimodal encoder
(dCS) or a VQA model (DSG).
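To make the aggregation concrete, the following minimal sketch turns per-question answers into a per-image score and then averages over images generated with different seeds. It is a simplification under our own assumptions: the score is taken as the plain fraction of questions answered "yes", and the question dependencies that DSG defines are ignored. The function names are hypothetical.

```python
from statistics import mean


def dsg_style_score(answers):
    """Fraction of binary questions answered 'yes' for a single image.

    Simplifying assumption: every question counts equally and question
    dependencies are not modeled.
    """
    if not answers:
        raise ValueError("at least one question is required")
    return sum(bool(a) for a in answers) / len(answers)


def average_over_seeds(per_image_scores):
    """Average consistency score of one prompt across its generated images."""
    return mean(per_image_scores)


# Hypothetical VQA answers for one image: 3 of 7 questions answered 'yes'.
single_image = dsg_style_score([True, True, True, False, False, False, False])

# Per-seed scores of the form reported for one prompt in Figure 4.
prompt_score = average_over_seeds([0.4286, 0.7143, 0.1429, 0.5714])
print(round(single_image, 4), round(prompt_score, 4))
```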

You are an expert prompt optimizer for text-to-image models. Text-to-image models take a text prompt as
input and generate images depicting the prompt as output. You translate prompts written by humans into
better prompts for the text-to-image models. Your answers should be concise and effective.

Your task is to optimize this initial prompt written by a human: "{user_prompt}". Below are some
previous prompts with a decomposition of their visual elements. Each element is paired with a score
indicating its presence in the generated image. The prompts are arranged in ascending order based on
their scores, which range from 0 to 100. Higher scores indicate higher likelihood of presence.

1. {revised_prompt_1}
score: {avg_score_1}
visual elements:
{subprompt_1_1} {clip_score_1_1}
{subprompt_1_2} {clip_score_1_2}
(... more questions ...)

(... more examples ...)

Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Prioritize optimizing for object with lowest scores. Favor
substitutions and reorderings over additions. Respond with each new prompt in between <PROMPT> and
</PROMPT>, eg:

1. <PROMPT>paraphrase 1</PROMPT>
2. <PROMPT>paraphase 2</PROMPT>
...
{num_solutions}. <PROMPT>paraphrase {num_solutions}</PROMPT>

Prompt 3 Meta-prompt used for OPT2I with dCS as scorer.
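The history section of the meta-prompt lists previous prompts in ascending order of score, each followed by its visual elements and their scores. A minimal sketch of how such a block could be assembled is given below. The function name `format_history`, the dictionary layout, the `max_examples` cap, and the example scores are all illustrative assumptions rather than details taken from the paper.

```python
def format_history(history, max_examples=5):
    """Format scored prompts as the in-context examples of the meta-prompt.

    history: list of dicts with keys 'prompt', 'score' (0-100) and
    'elements' (list of (name, score) pairs). Layout is hypothetical.
    """
    # Keep the highest-scoring entries and list them in ascending order,
    # so the best prompt appears last, closest to the task instruction.
    best = sorted(history, key=lambda entry: entry["score"])[-max_examples:]
    blocks = []
    for index, entry in enumerate(best, start=1):
        lines = [
            f"{index}. {entry['prompt']}",
            f"score: {entry['score']:.0f}",
            "visual elements:",
        ]
        lines += [f"{name} {score:.0f}" for name, score in entry["elements"]]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Made-up scores, for illustration only.
history = [
    {"prompt": "Four teacups encircle a kettle.", "score": 90.0,
     "elements": [("four teacups", 85.0), ("kettle", 95.0)]},
    {"prompt": "four teacups surounding a kettle", "score": 50.0,
     "elements": [("four teacups", 25.0), ("kettle", 75.0)]},
]
history_text = format_history(history)
print(history_text)
```

Capping the number of examples keeps the meta-prompt within the LLM's context window as the optimization history grows.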

You are an expert prompt optimizer for text-to-image models. Text-to-image models take a text prompt as
input and generate images depicting the prompt as output. You translate prompts written by humans into
better prompts for the text-to-image models. Your answers should be concise and effective.

Your task is to optimize this initial prompt written by a human: "{user_prompt}". Below are some
previous prompts with the consistency of each prompt’s visual elements in the generated image via a set
of binary questions. The prompts are arranged in ascending order based on their overall consistency
score, which ranges from 0 to 100 (higher is better).

1. {revised_prompt_1}
overall score: {dsg_score_1}
evaluation questions:
{question_1_1} {vqa_score_1_1}
{question_1_2} {vqa_score_1_2}
(... more questions ...)

(... more examples ...)

Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Focus on optimizing for the visual elements that are not
consistent. Favor substitutions and reorderings over additions. Respond with each new prompt in between
<PROMPT> and </PROMPT>, eg:

1. <PROMPT>paraphrase 1</PROMPT>
2. <PROMPT>paraphase 2</PROMPT>
...
{num_solutions}. <PROMPT>paraphrase {num_solutions}</PROMPT>

Prompt 4 Meta-prompt used for OPT2I with DSG as scorer.

Conciseness:

(... more examples ...)

Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Favor substitutions and reorderings over additions. Respond
with each new prompt in between <PROMPT> and </PROMPT>, eg:
...

Conciseness + prioritize:

(... more examples ...)

Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Prioritize optimizing for object with lowest scores. Favor
substitutions and reorderings over additions. Respond with each new prompt in between <PROMPT> and
</PROMPT>, eg:
...

Conciseness + prioritize + reasoning:

(... more examples ...)

Briefly reason (max two sentences) about the prompts above to understand why certain objects have higher
or lower scores in certain prompts. Then, based on this reasoning, generate {num_solutions} paraphrases
of the initial prompt which keep the semantic meaning and that have higher scores than all the prompts
above. Prioritize optimizing for objects with lowest scores while keeping high scores for the other
objects. Favor substitutions and reorderings over additions. Respond with each new prompt in between
<PROMPT> and </PROMPT>, eg:
...

Conciseness + prioritize + structure:

(... more examples ...)

Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. PRIORITIZE optimizing for objects with lowest scores while
keeping high scores for the other objects. FAVOR substitutions and reorderings over additions. USE
simple words/concepts, understable from a text-to-image model, e.g., distinguish foreground and
background. Respond with each new prompt in between <PROMPT> and </PROMPT>, eg:
...

Prompt 5 Meta-prompt ablations. Modifications w.r.t. the base meta-prompt are denoted with different colors.

Table 4 Prompt decompositions into noun phrases (dCS) and binary questions (DSG).

Prompt: “A ginger cat is sleeping next to the window.”
dCS: “ginger cat”, “window”
DSG: “Is there a cat?”, “Is the cat ginger?”, “Is the cat sleeping?”, “Is there a window?”, “Is the cat next to the window?”

Prompt: “Many electronic wires pass over the road with few cars on it.”
dCS: “electronic wires”, “road”, “cars”
DSG: “Are there many electronic wires?”, “Is there a road?”, “Are there few cars?”, “Do the electronic wires pass over the road?”, “Are the cars on the road?”

Prompt: “there is a long hot dog that has toppings on it”
dCS: “long hot dog”, “toppings”
DSG: “Is there a hot dog?”, “Is the hot dog long?”, “Is the hot dog hot?”, “Are there toppings on the hot dog?”

Prompt: “the mona lisa wearing a cowboy hat and screaming a punk song into a microphone”
dCS: “the mona lisa”, “a cowboy hat”, “a punk song”, “a microphone”
DSG: “Is there the Mona Lisa?”, “Is the Mona Lisa wearing a cowboy hat?”, “Is the Mona Lisa holding a microphone?”, “Is the hat a cowboy hat?”, “Is the Mona Lisa screaming?”, “Is the song the Mona Lisa is singing punk?”, “Is the Mona Lisa screaming into the microphone?”

Prompt: “a photograph of a bird wearing headphones and speaking into a microphone in a recording studio”
dCS: “photograph”, “bird”, “headphones”, “microphone”, “recording studio”
DSG: “Is this a photograph?”, “Is there a bird?”, “Does the bird have headphones?”, “Does the bird have a microphone?”, “Is there a recording studio?”, “Is the bird speaking into the microphone?”, “Is the bird wearing headphones?”, “Is the bird in the recording studio?”

Prompt: “concentric squares fading from yellow on the outside to deep orange on the inside”
dCS: “concentric squares”, “yellow”, “outside”, “deep orange”, “inside”
DSG: “Are there squares?”, “Are the squares concentric?”, “Is the outer square yellow?”, “Is the inner square deep orange?”, “Is the inner square inside the outer square?”, “Is the outer square on the outside?”, “Is the inner square on the inside?”

B Additional results

B.1 1-shot in-context learning as baseline

In this experiment, we compare OPT2I with 1-shot in-context learning (ICL), which we implement by running OPT2I with #iter = 1 and #prompts/iter = 150. Note that, in this setting, the LLM only receives feedback about the performance of the user prompt. We maintain the same experimental setting described in Section 3, except we use 200 prompts for MSCOCO, and report the results in Table 5. First, we notice that 1-shot ICL achieves higher prompt-image consistency than random paraphrasing in all settings except when using GPT-3.5, which performs on-par or slightly worse (marked with * in Table 5, see also the discussion in Section B.5). Second, and more importantly, we observe that OPT2I outperforms the 1-shot ICL baseline regardless of the consistency objective, LLM, or T2I model adopted. These results reinforce our previous claims: (1) the iterative procedure allows OPT2I to keep improving revised prompts over time, and (2) 1-shot ICL is challenging due to the limited feedback provided to the LLM about how to improve the user prompt, and thus only minor improvements in prompt-image consistency can be obtained over random paraphrasing.

Table 5 Comparison with 1-shot in-context learning (ICL). We report relative improvement (%) in prompt-image consistency between the user prompt and the best prompt found, averaged across all prompts in the dataset.

Method        Objective  LLM      T2I      MSCOCO   P2
Paraphrasing  dCS        Llama-2  LDM-2.1  +9.86    +8.00
1-shot ICL    dCS        Llama-2  LDM-2.1  +10.67   +8.74
OPT2I         dCS        Llama-2  LDM-2.1  +11.68   +10.34
Paraphrasing  DSG        Llama-2  LDM-2.1  +9.92    +19.22
1-shot ICL    DSG        Llama-2  LDM-2.1  +10.14   +19.69
OPT2I         DSG        Llama-2  LDM-2.1  +11.02   +22.24
Paraphrasing  dCS        GPT-3.5  LDM-2.1  +9.64    +8.06
1-shot ICL    dCS        GPT-3.5  LDM-2.1  +9.21*   +8.72
OPT2I         dCS        GPT-3.5  LDM-2.1  +10.56   +9.33
Paraphrasing  DSG        GPT-3.5  LDM-2.1  +10.21   +19.09
1-shot ICL    DSG        GPT-3.5  LDM-2.1  +10.09*  +18.94*
OPT2I         DSG        GPT-3.5  LDM-2.1  +11.19   +19.77
Paraphrasing  dCS        Llama-2  CDM-M    +11.38   +10.29
1-shot ICL    dCS        Llama-2  CDM-M    +12.19   +11.34
OPT2I         dCS        Llama-2  CDM-M    +12.65   +12.13
Paraphrasing  DSG        Llama-2  CDM-M    +9.86    +22.24
1-shot ICL    DSG        Llama-2  CDM-M    +10.10   +22.25
OPT2I         DSG        Llama-2  CDM-M    +10.15   +24.93

B.2 Filtering out already consistent user prompts

When adopting the DSG objective, user prompts might already achieve a perfect consistency score initially, i.e., $S_{\mathrm{DSG}}(p_0, g(p_0)) = 1$. We observe this phenomenon happens with higher frequency on the simpler MSCOCO prompts (∼40%), and with less frequency on the more complex PartiPrompts prompts (∼10%). Since $S_{\mathrm{DSG}}(p_0, g(p_0))$ can be computed beforehand, we can avoid optimizing for those user prompts that have already a perfect initial $S_{\mathrm{DSG}}$ and better showcase the optimization performance of OPT2I. We provide the updated optimization curves in Figure 7, and report the final results in Table 6. In both cases, we highlight results obtained by filtering out “perfect” user prompts with full colors, and contrast them against results obtained with all prompts in faded colors (equivalent to Figure 3).

In Table 6, we observe a higher relative improvement for both MSCOCO and PartiPrompts in all configurations when filtering out “perfect” user prompts, which is more prominent for MSCOCO because the number of excluded prompts is higher. In Figure 7, we observe similar consistent and considerable increases of all optimization curves when considering both mean and max consistency improvement. In the mean case, we remark a reduction in the initial dip in relative consistency, especially in MSCOCO, where OPT2I reaches a positive relative consistency much earlier, i.e., it = [6, 2, 2] vs. it = [23, 8, 5] with Llama-2, GPT-3.5, and CDM-M, respectively.
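The filtering step described in Section B.2 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `filter_imperfect_prompts`, the tolerance, and the second example prompt are assumptions, and the initial DSG scores are assumed to have been computed beforehand for each user prompt.

```python
from typing import Dict, List


def filter_imperfect_prompts(
    initial_scores: Dict[str, float], perfect: float = 1.0, tol: float = 1e-9
) -> List[str]:
    """Keep only user prompts whose initial DSG score is strictly below `perfect`.

    `initial_scores` maps each user prompt p0 to a precomputed S_DSG(p0, g(p0)).
    """
    return [p for p, s in initial_scores.items() if s < perfect - tol]


# The first score is taken from Figure 11; the second prompt is hypothetical.
initial_scores = {
    "There is a pizza on the cutting board": 0.5,
    "a hypothetical prompt that is already consistent": 1.0,
}
to_optimize = filter_imperfect_prompts(initial_scores)
```

Only the prompts returned by the filter would be passed on to the optimization loop; the already perfect ones have no headroom under DSG.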

Table 6 Relative prompt-image consistency improvement between the user prompt and the best prompt found, averaged
across prompts.

Columns marked "< 1" use only user prompts with $S_{\mathrm{DSG}}(p_0, g(p_0)) < 1$; columns marked "All" use all prompts.

Method        LLM      T2I      MSCOCO (< 1)  MSCOCO (All)  P2 (< 1)  P2 (All)
Paraphrasing  Llama-2  LDM-2.1  +17.74        +10.67        +21.95    +19.22
OPT2I         Llama-2  LDM-2.1  +18.63        +11.21        +25.39    +22.24
Paraphrasing  GPT-3.5  LDM-2.1  +17.52        +10.53        +21.80    +19.09
OPT2I         GPT-3.5  LDM-2.1  +18.05        +10.85        +22.58    +19.77
Paraphrasing  Llama-2  CDM-M    +18.65        +10.53        +26.54    +22.24
OPT2I         Llama-2  CDM-M    +19.61        +11.07        +29.19    +24.93
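To make the reported quantity concrete, the sketch below computes a relative improvement per prompt and averages it across prompts. The normalization is an assumption: this section does not spell out the formula, so the code assumes a percentage change with respect to the user prompt's score. The function names are hypothetical.

```python
from typing import List, Tuple


def relative_improvement(initial: float, best: float) -> float:
    """Percentage change of the best score w.r.t. the user prompt's score (assumed definition)."""
    if initial <= 0:
        raise ValueError("initial score must be positive for a relative change")
    return 100.0 * (best - initial) / initial


def mean_relative_improvement(pairs: List[Tuple[float, float]]) -> float:
    """Average the per-prompt relative improvement over (initial, best) score pairs."""
    return sum(relative_improvement(i, b) for i, b in pairs) / len(pairs)


# One prompt improves from 0.5 to 1.0; the other is already perfect and cannot improve.
with_perfect = mean_relative_improvement([(0.5, 1.0), (1.0, 1.0)])
without_perfect = mean_relative_improvement([(0.5, 1.0)])
```

Under this assumed definition, prompts that start at a perfect score contribute zero to the average, which matches the direction of the gap between the "< 1" and "All" columns.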

[Figure 7: two panels, (a) DSG (MSCOCO) and (b) DSG (P2). X-axis: iteration (0 to 30). Y-axis: relative consistency (%), with separate max and mean plots. Legend: Paraphrasing, Llama-2_SD2.1, GPT-3.5_SD2.1, Llama-2_IF-M, each also shown in a "**" variant.]

Figure 7 OPT2I optimization curves obtained with prompts having $S_{\mathrm{DSG}}(p_0, g(p_0)) < 1$, marked by “**” and full colors. In contrast, faded-color curves consider all prompts. Each plot tracks either the max or the mean relative improvement in consistency across revised prompts per iteration.
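The two statistics tracked by each plot can be illustrated with a short sketch. It assumes the relative improvements of the revised prompts are already available as one list per iteration; the function name and the toy numbers are hypothetical, and the actual curves may additionally be averaged over user prompts.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def iteration_curves(
    scores_per_iter: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (max_curve, mean_curve) across revised prompts for each iteration."""
    max_curve = [max(scores) for scores in scores_per_iter]
    mean_curve = [mean(scores) for scores in scores_per_iter]
    return max_curve, mean_curve


# Toy relative improvements (%) for three revised prompts over two iterations.
toy_scores = [[-2.0, 1.0, 4.0], [0.0, 3.0, 6.0]]
max_curve, mean_curve = iteration_curves(toy_scores)
```

The mean curve can dip below zero early on when most revised prompts score worse than the user prompt, even while the max curve is already positive.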

B.3 Impact of seed-fixing and #images/prompt

In this experiment, we ablate the impact of fixing the random seed of the initial noise for the diffusion model throughout the optimization process when optimizing different numbers of images/prompt. We use our default configuration with Llama-2 and LDM-2.1 on MSCOCO. In Figure 8a, we show the optimization curves obtained when optimizing 1, 4 (default), and 10 images/prompt with fixed image seed. As expected, we observe no meaningful differences in mean consistency improvement. In contrast, the max consistency improvement shows a clear distinction between optimizing a single image (single seed) and optimizing 4 or 10, with the former achieving more substantial improvements. We argue that when optimizing a single image seed, OPT2I is more sensitive to changes in the prompts, i.e., there is a higher variance among the scores of revised prompts. We then contrast the optimization curves with fixed seed (8a) against the non-fixed seed ones (8b). Our hypothesis is that, when not fixing the seed, generating too few images/prompt leads to unstable and unreliable feedback for the LLM due to the high variance of the generations. Indeed, looking at the optimization curves, we notice that optimizing a single image without fixing the seed is more difficult for OPT2I, which results in a noisy and less steep trajectory, especially in the mean case. In contrast, when OPT2I optimizes 4 or 10 images/prompt with no fixed seed, both the max and mean curves remain similar to those obtained with a fixed seed. This supports our choice of generating 4 images/prompt, as it provides enough diversity in the generations while being substantially more computationally efficient than generating 10.

[Figure 8: two panels, (a) fixed seed and (b) non-fixed seed. X-axis: iteration (0 to 30). Y-axis: relative consistency (%), with separate max and mean plots. Legend: OPT2I with #imgs/p = 1, 4, and 10, using seed=0 in (a) and no seed in (b).]

Figure 8 OPT2I optimization curves with (a) fixed or (b) non-fixed seed. Each curve optimizes a different number of images per prompt. Y-axis is aligned between (a) and (b). Curves obtained with Llama-2 and LDM-2.1 on 200 out of the 2000 prompts from MSCOCO.

B.4 Stratified PartiPrompts results

Figure 9 shows the relative improvement in consistency score (dCS or DSG) on prompts from PartiPrompts (P2), broken down by challenge aspect. Note that we only sampled prompts from four of the most difficult dimensions in P2: “Complex”, “Fine-grained Detail”, “Quantity”, and “Properties & Positioning”. Intuitively, this plot shows what kinds of prompts are easier to optimize for OPT2I when using different LLMs, T2I models and consistency scores.

The most significant improvement in consistency is observed for prompts related to “Properties & Positioning” when using Llama-2 in conjunction with CDM-M and dCS. Similarly, the combination of Llama-2, CDM-M, and DSG yields the best results for prompts about “Quantity”. For other challenges, CDM-M continues to provide the most substantial consistency improvement, although the margin is narrower compared to LDM-2.1. Interestingly, GPT-3.5 shows the smallest improvement in consistency for prompts about “Quantity”, regardless of whether dCS or DSG metrics are used. Consistency improvements for prompts from the “Complex” and “Fine-grained Detail” challenges are comparable, which is expected due to their inherent similarities.

[Figure 9: two bar plots, relative dCS (%) and relative DSG (%), over the categories Overall, Complex, Fine-grained Detail, Quantity, and Properties & Positioning. Legend: Llama-2, LDM-2.1; GPT-3.5, LDM-2.1; Llama-2, CDM-M.]

Figure 9 Relative improvement in prompt-image consistency between the user prompt and the best prompt found, averaged across PartiPrompts prompts and broken down by challenge aspect.

B.5 Why is GPT-3.5 not as good as Llama-2?

In Figure 3 and Table 1, we observe that OPT2I achieves worse results when using GPT-3.5 as the LLM. Notably, the optimization curves with GPT-3.5 are flatter than when using Llama-2. This result is rather surprising, as current leaderboards [?] indicate that GPT-3.5 generally outperforms Llama-2 on a wide variety of NLP tasks. So, in this experiment, we aim to shed light on the possible causes. Given the closed (and expensive) access to GPT-3.5, our initial exploration of the meta-prompt structure and phrasing was based on Llama-2, and later on we used the exact same prompt with GPT-3.5. Hence, one hypothesis for the observed phenomenon is that our meta-prompt is better optimized for Llama-2. Another hypothesis is that each LLM has a different balance point between exploration and exploitation for the same sampling temperature of 1.0. In particular, given the flatter optimization curves drawn by GPT-3.5, we conjecture that it explores less diverse prompts than Llama-2. To verify this, we analyze some text properties of the revised prompts generated by both LLMs.

Figure 10a tracks the length (in number of characters) of the revised prompts at each iteration, and Figure 10b tracks CLIP text similarity between revised prompts and the user prompt along the optimization process, both averaged over the revised prompts generated at the same iteration and over all prompts in the dataset. We observe that when using Llama-2 for OPT2I, the revised prompts generated at each iteration are longer and more semantically dissimilar to the user prompt compared to those generated by GPT-3.5. This means that OPT2I benefits from greater prompt diversity to find the best T2I prompts that lead to more consistent images, which is better achieved with Llama-2. Additionally, we note that both the prompt length and the semantic similarity with the user prompt start plateauing around the maximum number of iterations we set, which further validates our selected value of 30. We leave as future work ablating for the sampling temperature with both LLMs.

[Figure 10: two panels, (a) Prompt length and (b) Text similarity w/ user prompt. X-axis: iterations (0 to 30). Y-axis: length (chars) in (a), similarity in (b). Legend: llama-2, gpt-3.5.]

Figure 10 Text analysis of revised prompts generated by Llama-2 and GPT-3.5.

B.6 Additional qualitative examples

Figures 11 and 12 show some additional selected examples of user prompts and revised prompts throughout the optimization process, along with the generated images and consistency scores. In particular, we select revised prompts such that the consistency score of the generated images (w.r.t. the user prompt) is strictly higher than the best score found so far, i.e., the leaps in prompt-image consistency.

Figure 11 shows revised prompts generated with DSG as the scorer. Since DSG is computed as an average of binary scores, it is coarser than CLIPScore and thus there are fewer leaps in consistency. Overall, we observe that the intermediate revised prompt manages to increase the consistency score in some of the generated images but not for all of them. The best prompt, however, usually manages to improve all 4 generated images.

Figure 12 shows revised prompts generated with dCS as the scorer. In this case, we can see a gradual increase in average dCS, which visually translates to generated images which are more consistent with the user prompt on average. The strong effect of the initial latent noise in the image structure is evident, yet substantial modifications in the format of the input prompt used to condition the generation lead to significant changes in how the T2I model interprets the structure determined by the initial noise (e.g., between rows 2-3 and 4-5 in the squirrel example). We also note that dCS (CLIPScore averaged over sub-prompts) can occasionally fall short as an evaluation metric for image-text consistency. This is primarily because dCS tends to focus on the presence of visual elements, overlooking other aspects such as spatial relationships. In the toilet example, for instance, we observe how the generated images progressively become more consistent up to a certain point (around row 6). Beyond this point, the revised prompts and the generated images start degenerating (e.g., by overemphasizing certain elements), while dCS continues to improve. This highlights that, while dCS may serve as a useful fine-grained consistency metric to provide visual feedback for T2I prompt optimization with an LLM, it may not be as effective for evaluation purposes.
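The description of dCS as CLIPScore averaged over sub-prompts can be sketched as follows. This is only an illustration under stated assumptions: `clip_score` is a placeholder callable standing in for a real CLIPScore implementation, the toy scorer merely checks whether an element is present, and the sub-prompt decomposition shown is hypothetical.

```python
from typing import Any, Callable, Sequence


def decomposed_clipscore(
    image: Any,
    sub_prompts: Sequence[str],
    clip_score: Callable[[Any, str], float],
) -> float:
    """Average a per-sub-prompt image-text score over all sub-prompts."""
    if not sub_prompts:
        raise ValueError("at least one sub-prompt is required")
    return sum(clip_score(image, s) for s in sub_prompts) / len(sub_prompts)


def toy_presence_score(image: set, text: str) -> float:
    """Toy stand-in for CLIPScore: 1.0 if the element is 'visible', else 0.0."""
    return 1.0 if text in image else 0.0


# Hypothetical image content and decomposition for the toilet example.
toy_image = {"toilet", "seashells"}
toy_sub_prompts = ["toilet", "seashells", "on top of it"]
toy_dcs = decomposed_clipscore(toy_image, toy_sub_prompts, toy_presence_score)
```

With a presence-style scorer like this toy one, a spatial relation such as "on top of it" is just one more term in the average, so an image with the right elements scores well even when that relation is missed.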

[Figure 11 content: each prompt is followed by its average DSG score in parentheses and by the DSG scores of its 4 generated images. Images are not reproduced.]

MSCOCO (left), example 1:
User prompt: There is a pizza on the cutting board (0.5000)
Per-image scores: 1.0000 0.3333 0.3333 0.3333
Revised prompt: A cutting board holds a delicious pizza. (0.8333)
Per-image scores: 1.0000 1.0000 0.3333 1.0000
Revised prompt: On a clean cutting board, a mouthwatering pizza awaits slicing. (1.0000)
Per-image scores: 1.0000 1.0000 1.0000 1.0000

MSCOCO (left), example 2:
User prompt: A traffic light and a signpost at a crossroads intersection near a waterway. (0.3889)
Per-image scores: 0.4444 0.4444 0.2222 0.4444
Revised prompt: A traffic light and signpost standing at a crossroads intersection, surrounded by a waterway. (0.6667)
Per-image scores: 0.8889 0.8889 0.4444 0.4444
Revised prompt: A traffic light and signpost are positioned at a crossroads intersection, with a beautiful waterway flowing nearby, producing a striking visual. (0.9167)
Per-image scores: 0.8889 1.0000 0.8889 0.8889

PartiPrompts (right), example 1:
User prompt: a cute wooden owl statue holding a large globe of the Earth above its head (0.3929)
Per-image scores: 0.2857 1.0000 0.1429 0.1429
Revised prompt: An owl statue made of wood, with a charming expression, holds a large Earth globe above its head, boasting a precision-crafted surface. (0.7857)
Per-image scores: 1.0000 1.0000 0.1429 1.0000
Revised prompt: An enchanting wooden owl statue, with an endearing expression, cradles a massive globe showcasing the Earth's diverse terrain, as it sits atop a pedestal, ensuring a striking visual harmony, with the globe resting on its head. (1.0000)
Per-image scores: 1.0000 1.0000 1.0000 1.0000

PartiPrompts (right), example 2:
User prompt: A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a stump (0.5000)
Per-image scores: 0.3750 0.6250 0.3750 0.6250
Revised prompt: Atop a tree stump, a rebellious squirrel wears a leather jacket with metal studs and holds a microphone, shouting into the wind with punk rock fervor. (0.7188)
Per-image scores: 1.0000 0.7500 0.5000 0.6250
Revised prompt: A microphone-wielding squirrel, draped in a black leather jacket with silver studs, passionately shouts into the wind while standing on a stump, epitomizing the DIY ethos and raw energy of punk rock. (0.9375)
Per-image scores: 1.0000 0.8750 0.8750 1.0000

Figure 11 Selected examples of initial prompts from MSCOCO (left) and PartiPrompts (right) and revised prompts
across the optimization, along with the generated images. The optimizer refines prompts for LDM-2.1 using Llama-2
as LLM and DSG as scorer. We report DSG score averaged across images.
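The averaging reported in Figure 11 can be reproduced with a small sketch. It treats the DSG score of one image as the fraction of binary questions answered "yes" and then averages over images. This is a simplification: any dependency handling between questions is ignored, the function names are hypothetical, and the count of 7 questions for the owl prompt is inferred from the per-image scores (0.2857 ≈ 2/7, 0.1429 ≈ 1/7) rather than stated in the text.

```python
from typing import List, Sequence


def dsg_image_score(answers: Sequence[bool]) -> float:
    """Fraction of binary questions answered 'yes' for one generated image."""
    return sum(answers) / len(answers)


def dsg_prompt_score(per_image_answers: Sequence[Sequence[bool]]) -> float:
    """Average the per-image DSG scores over all images generated for a prompt."""
    scores = [dsg_image_score(a) for a in per_image_answers]
    return sum(scores) / len(scores)


def yes_no(num_yes: int, num_questions: int = 7) -> List[bool]:
    """Build a toy answer vector with `num_yes` positive answers."""
    return [True] * num_yes + [False] * (num_questions - num_yes)


# Owl user prompt in Figure 11: per-image scores 0.2857, 1.0000, 0.1429, 0.1429.
owl_answers = [yes_no(2), yes_no(7), yes_no(1), yes_no(1)]
owl_score = dsg_prompt_score(owl_answers)
```

Because each image score can only take values k/7 under this assumption, the score moves in coarse steps, which is consistent with the smaller number of leaps observed with DSG.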

[Figure 12 content: each prompt is followed by its average dCS score in parentheses. The first prompt of each example is the user prompt and the rest are revised prompts in order. Images are not reproduced.]

MSCOCO (left):
Small white toilet with seashells sitting on top of it. (0.1899)
A clean white toilet adorned with seashells on its lid. (0.1985)
A small, pristine toilet crowned with seashells. (0.2341)
A miniature white toilet graces a seashell-covered pedestal. (0.2480)
A miniature toilet, flawlessly white, is adorned with an array of vibrant seashells, elevating its charm. (0.2507)
A miniature toilet, flawlessly white, is covered in a diverse array of seashells, elevating its charm and creating a captivating display. (0.2527)
A small, immaculate white toilet, covered in a diverse array of seashells, perches on a pedestal, boasting an eye-catching exhibition. (0.2605)
A miniature white toilet, adorned with an array of seashells, perches on a pedestal, boasting an unforgettable exhibition. (0.2675)

PartiPrompts (right):
A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a lily pad (0.1690)
A rebellious squirrel wearing a studded leather jacket passionately sings into a microphone while standing on a lily pad. (0.1813)
A punk-rockin' squirrel stands atop a lily pad and belts into a microphone, wearing a studded leather jacket that oozes attitude. (0.1859)
A punk rockin' squirrel, wearing a leather jacket adorned with metal studs, belts into a microphone with conviction, positioned atop a lily pad. (0.1939)
A rebellious squirrel, clad in a black leather jacket adorned with metal studs, passionately sings into a microphone while standing on a lily pad, exuding punk rock spirit. (0.1988)
A rebellious squirrel, adorned in a black leather jacket featuring metal studs, passionately sings into a microphone while standing on a lily pad, exuding punk rock spirit. (0.2017)
A bold squirrel, adorned in a black leather jacket featuring metal studs, sings into a microphone with intensity, standing on a lily pad, embracing the punk rock style. (0.2053)
A rebellious squirrel, attired in a black leather jacket featuring metal studs, forcefully belts out a tune into a microphone on a sturdy lily pad, channeling the raw energy of punk rock. (0.2158)

Figure 12 Selected examples of initial prompts from MSCOCO (left) and PartiPrompts (right) and revised prompts
across the optimization, along with the generated images. The optimizer refines prompts for LDM-2.1 using Llama-2
as LLM and dCS as scorer. We report dCS score averaged across images.
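The selection rule used for these examples (keep a revised prompt only when its score is strictly higher than the best score found so far) can be sketched as below. The function name is hypothetical, and the history mixes three scores from the toilet example with two made-up non-improving candidates to show what gets skipped.

```python
from typing import List, Tuple


def select_leaps(
    initial_score: float, revised: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Keep revised prompts whose score strictly exceeds the best score so far."""
    best = initial_score
    leaps = []
    for prompt, score in revised:
        if score > best:
            leaps.append((prompt, score))
            best = score
    return leaps


# User prompt score 0.1899 (Figure 12); candidates marked hypothetical are invented.
history = [
    ("hypothetical weaker revision", 0.1850),
    ("A clean white toilet adorned with seashells on its lid.", 0.1985),
    ("hypothetical tie with the current best", 0.1985),
    ("A small, pristine toilet crowned with seashells.", 0.2341),
]
leaps = select_leaps(0.1899, history)
```

Ties are dropped because the comparison is strict, so the selected sequence is always strictly increasing in score, as in the listings of Figures 11 and 12.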
