LLM Paper 4
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high-performing
models that generate aesthetically appealing, photorealistic images. Despite the progress,
these models still struggle to produce images that are consistent with the input prompt, oftentimes
failing to capture object quantities, relations and attributes properly. Existing solutions to improve
prompt-image consistency suffer from the following challenges: (1) they oftentimes require model
fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable
trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper,
we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which
leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our
framework starts from a user prompt and iteratively generates revised prompts with the goal of
maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts,
shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while
preserving the FID and increasing the recall between generated and real data. Our work paves the
way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
[Figure 1: OPT2I overview, showing a T2I model, an LLM, and a consistency scorer. User prompt: "A raccoon wearing formal clothes, wearing a tophat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of Vincent Van Gogh." Optimized prompt: "A cultured raccoon, wearing formal attire and a tophat, proudly holds a cane, while a nearby garbage bag is artistically included in the scene. The painting reflects the vibrant colors and textures of Vincent Van Gogh's oil paintings."]
decreases significantly the representation diversity.

At the same time, the natural language processing (NLP) community has explored large language models (LLMs) as prompt optimizers for NLP tasks (Pryzant et al., 2023; Yang et al., 2023; Guo et al., 2023; Fernando et al., 2023), showing that LLMs can relieve humans from the manual and tedious task of prompt-engineering. One particularly interesting approach is in-context learning (ICL) (Dong et al., 2023), where an LLM can learn to solve a new task from just a handful of in-context examples provided in the input prompt. Notably, LLMs can accomplish this without requiring parameter updates – e.g., LLMs can solve simple regression tasks when provided with a few input-output examples (Mirchandani et al., 2023). ICL enables rapid task adaptation by changing the problem definition directly in the prompt. However, ICL commonly relies on predefined datasets of examples (input-output pairs), which might be challenging to obtain. To the best of our knowledge, ICL has not yet been explored in the context of T2I generative models, which pose unique challenges: (1) the in-context examples of prompts and the associated scores are not readily available, and (2) the LLM needs to ground the prompt in the generated images in order to understand how to refine it. Recently, and once again in the NLP domain, optimization-by-prompting (OPRO) (Yang et al., 2023) has extended previous ICL approaches by iteratively optimizing instruction prompts to maximize task accuracy based on a dataset of input-output examples, leveraging feedback from past instructions and their respective task accuracies.

In this work, we propose the first ICL-based method to improve prompt-image consistency in T2I models. In particular, we leverage optimization-by-prompting and introduce a novel inference-time optimization framework for T2I prompts, which constructs a dataset of in-context examples on-the-fly. Our framework, OPtimization for T2I generative models (OPT2I), involves a pre-trained T2I model, an LLM, and an automatic prompt-image consistency score – e.g., CLIPScore (Hessel et al., 2021) or Davidsonian Scene Graph score (DSG) (Cho et al., 2023a). Through ICL, the LLM iteratively improves a user-provided text prompt by suggesting alternative prompts that lead to images that are more aligned with the user's intention, as measured by the consistency score. At each optimization iteration, the in-context examples are updated to include the best solutions found so far. Intuitively, we expect our method to explore the space of possible prompt paraphrases and gradually discover the patterns in its previously suggested prompts that lead to highly-consistent images. Crucially, the optimized prompts will produce more consistent images on expectation, across multiple input noise samples. OPT2I is designed to be a versatile approach that works as a plug-and-play solution with diverse T2I models, LLMs, and scoring functions since it does not require any parameter updates. An overview of our framework is presented in Figure 1.

[Diagram: user prompt and meta-prompt → LLM → revised prompt → T2I model → generated image → consistency metric, whose scores feed back into the meta-prompt.]
sampled images:

$$\hat{p} = \operatorname*{arg\,max}_{p_i \sim P} \; \mathbb{E}_{I \sim g(p_i)}\left[S(p_0, I)\right],$$

where $p_0$ denotes the user prompt. We approach the optimization problem with ICL by iteratively searching for revised prompts generated by the LLM,

$$P_t = f\big(C(\{p_0\} \cup P_1 \cup \cdots \cup P_{t-1})\big),$$

where $P_i$ is the set of prompts generated at iteration $i$, $t$ is the current iteration, and $C$ is a function that defines the context of prompt-score pairs fed to the LLM.
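To make this loop concrete, below is a minimal Python sketch of the OPT2I outer loop. The helper callables llm_propose, t2i_generate and score, the meta-prompt wording, and the top_k default are illustrative assumptions rather than the authors' implementation; the defaults of 30 iterations, 5 paraphrases per iteration and 4 images per prompt follow settings reported later in the paper.

    def build_meta_prompt(user_prompt, history):
        """Assemble the meta-prompt: task instruction plus past prompt-score pairs."""
        lines = [
            'You are an expert prompt optimizer for text-to-image models. '
            f'Optimize this initial prompt written by a human: "{user_prompt}".',
            'Below are previous prompts in ascending order of their scores (0-100):',
        ]
        for i, (score, prompt) in enumerate(history, start=1):
            lines.append(f'{i}. {prompt}\nscore: {round(100 * score)}')
        lines.append('Generate paraphrases of the initial prompt which keep the '
                     'semantic meaning and have higher scores than all prompts above.')
        return '\n'.join(lines)

    def opt2i(user_prompt, llm_propose, t2i_generate, score,
              iters=30, prompts_per_iter=5, images_per_prompt=4, top_k=10):
        """Sketch of the OPT2I loop (hypothetical helper callables).

        llm_propose(meta_prompt) -> list[str]     # revised prompts from the LLM
        t2i_generate(prompt, n)  -> list[object]  # n images from the T2I model
        score(user_prompt, img)  -> float         # prompt-image consistency in [0, 1]
        """
        def avg_score(prompt):
            images = t2i_generate(prompt, images_per_prompt)
            # Consistency is always measured against the *user* prompt p0.
            return sum(score(user_prompt, im) for im in images) / len(images)

        history = [(avg_score(user_prompt), user_prompt)]
        for _ in range(iters):
            # Context C: the top-k prompt-score pairs found so far, sorted ascending.
            context = sorted(history)[-top_k:]
            meta_prompt = build_meta_prompt(user_prompt, context)
            for revised in llm_propose(meta_prompt)[:prompts_per_iter]:
                history.append((avg_score(revised), revised))
        best_score, best_prompt = max(history)
        return best_prompt, best_score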
2.2 Meta-prompt design

We adapt LLMs for T2I prompt optimization via ICL. We denote meta-prompt the prompt which instructs the LLM to optimize prompts for T2I models. Our meta-prompt is composed of a task instruction and a history of past revised prompt-score pairs. The list of meta-prompts used can be found in Appendix A.

The meta-prompt provides context about T2I models, the consistency metric and the optimization problem. Additionally, it contains a history of prompt-score pairs that provides context about which paraphrases worked best in the past for a particular T2I model, encouraging the LLM to build upon the most successful prompts and removing the need for explicitly specifying how to modify the T2I prompt. The consistency score is normalized to an integer between 0 and 100, and we only keep in the history the top-k scoring prompts found so far, sorted by increasing score.
2.3 Optimization objective

A critical part of our framework is feeding visual feedback to the LLM. The visual feedback is captured by the consistency score, which determines how good the candidate prompt is at generating consistent images. Although OPT2I can work with any – even non-differentiable – consistency score, we argue that the consistency score must be detailed enough for the LLM to infer how to improve the candidate prompt. While CLIPScore (Hessel et al., 2021) is arguably the most popular metric for measuring prompt-image consistency, in our initial experiments we found that scoring a prompt with a single scalar is too coarse for our purposes. Thus, we opt for two metrics that provide finer-grained information about the prompt-image consistency: (1) Davidsonian Scene Graph (DSG) score (Cho et al., 2023a), and (2) our proposed decomposed CLIPScore.

DSG assesses prompt-image consistency based on a question generation and answering approach, similar to TIFA (Hu et al., 2023). In particular, DSG generates atomic and unique binary questions from the user prompt that are organized into semantic dependency graphs. For example, "a bike lying on the ground, covered in snow" is decomposed into: (1) "Is there a bike?"; (2) "Is the bike lying on the ground?"; (3) "Is the bike covered in snow?". In this case, questions (2) and (3) depend on (1), and so (1) is used to validate (2) and (3). These questions are then answered by an off-the-shelf VQA model based on the generated image. We include the resulting question-answer pairs in our meta-prompt. A global score per prompt-image pair is computed by averaging across answer scores.
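To make the dependency validation concrete, here is a minimal sketch of the DSG aggregation step, assuming questions have already been generated and answered by the VQA model; an answer to a child question is discarded whenever one of its parent questions failed.

    def dsg_score(answers, parents):
        """Average binary VQA answers, invalidating children of failed parents.

        answers: dict mapping question id -> bool (VQA answer on the image)
        parents: dict mapping question id -> list of prerequisite question ids
        Assumes parent ids precede child ids, as in DSG's dependency graphs.
        """
        valid = {}
        for qid in sorted(answers):
            ok = all(valid.get(p, False) for p in parents.get(qid, []))
            valid[qid] = answers[qid] and ok
        return sum(valid.values()) / len(valid)

    # Bike example: (2) and (3) only count when (1) "Is there a bike?" holds.
    assert dsg_score({1: True, 2: True, 3: False}, {2: [1], 3: [1]}) == 2 / 3
    # If the bike itself is missing, the dependent answers are discarded too.
    assert dsg_score({1: False, 2: True, 3: True}, {2: [1], 3: [1]}) == 0.0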
Decomposed CLIPScore computes a partial consistency score for each noun phrase present in the user prompt. For example, "a bike lying on the ground, covered in snow" is decomposed into "a bike", "the ground" and "snow". Each noun phrase is then scored against the generated image using CLIPScore, resulting in a list of pairs of noun phrases and their associated scores, which are included in our meta-prompt. A global score per prompt-image pair is computed by averaging across subscores. We provide examples of decomposed CLIPScore and DSG outputs in Appendix A.
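A minimal sketch of the decomposition, assuming spaCy for noun-phrase extraction and the Hugging Face CLIP checkpoint for scoring; the authors do not specify their parser or CLIP wrapper, and the official CLIPScore additionally rescales the cosine similarity (w = 2.5 in Hessel et al., 2021), which this sketch omits.

    import spacy
    import torch
    from transformers import CLIPModel, CLIPProcessor

    nlp = spacy.load("en_core_web_sm")  # noun-phrase extraction
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def decomposed_clipscore(prompt, image):
        """Score each noun phrase against the image, then average.

        Returns (global_score, [(noun_phrase, subscore), ...]); the per-phrase
        pairs are what OPT2I feeds back to the LLM in the meta-prompt.
        """
        phrases = [chunk.text for chunk in nlp(prompt).noun_chunks] or [prompt]
        inputs = processor(text=phrases, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        sims = (txt @ img.T).squeeze(-1).tolist()  # cosine similarity per phrase
        subscores = list(zip(phrases, sims))
        return sum(sims) / len(sims), subscores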
2.4 Exploration-exploitation trade-off

During the optimization process, OPT2I requires controllability over the LLM's exploration-exploitation trade-off, as the LLM could either focus on exploring possible revised prompts or on exploiting the context provided in the meta-prompt history. On the one hand, too much exploration would hamper the optimization as it could be hard to find a high-quality solution. On the other hand, too much exploitation would limit the exploration to prompts that are very similar to the ones already presented in the meta-prompt history. We control this balance by adjusting the number of generated revised prompts per iteration and the LLM sampling temperature. Moreover, as our objective is to find prompts that work well across different T2I input noise samples, we generate multiple images per prompt at each iteration.

3 Experiments

We first introduce our experimental setting. Next, we validate the effectiveness of OPT2I in improving prompt-image consistency, compare it to paraphrasing baselines (random paraphrasing and Promptist), and show some qualitative results. Then, we explore the trade-offs with image quality and diversity. And finally, we ablate OPT2I components and touch on post-hoc image selection.

et al., 2022). For the LLM, we experiment with the open-source Llama-2-70B-chat (Llama-2) (Touvron et al., 2023) and with GPT-3.5-Turbo-0613 (GPT-3.5) (Brown et al., 2020).
[Figure 3: four panels, (a) dCS (MSCOCO), (b) dCS (P2), (c) DSG (MSCOCO), (d) DSG (P2), plotting max and mean relative consistency (%) against iteration for Paraphrasing and OPT2I with Llama-2/LDM-2.1, GPT-3.5/LDM-2.1, and Llama-2/CDM-M.]
Figure 3 OPT2I curves with different consistency objectives (dCS vs. DSG), LLMs, and T2I models. Each plot tracks either the max or the mean relative improvement in consistency across revised prompts per iteration.
relative consistency starting from a single in-context example is a very challenging task (Wei et al., 2023), and OPT2I achieves this goal for all configurations except for Llama-2 and GPT-3.5 in the dCS (P2) setting (Figure 3b), where it only gets close to 0. As mentioned in Section 3.1, PartiPrompts is a hard benchmark containing highly detailed and complex prompts, so it is perhaps unsurprising that decomposed CLIPScore falls short (with improvements < 7% in the max case, and < 2% in the mean case), given the imperfect decomposition into noun phrases. We also explored a hierarchical version of decomposed CLIPScore leveraging constituency trees, which did not show any improvement over our noun-phrase based decomposition, further reinforcing the criticism that CLIP behaves as a bag-of-words and is unable to properly capture object attributes and relations (Yuksekgonul et al., 2022; Yamada et al., 2022). Instead, using a more detailed consistency score during the prompt optimization process, such as DSG, results in more significant improvements (< 17% in the max case, and < 5% in the mean case).

Comparison to paraphrasing baselines. Table 1 shows our proposed OPT2I framework is robust to the choice of LLM, T2I model and optimization/evaluation objective. In particular, for dCS, we report relative improvement as score_best/score_init - 1. For DSG score, since the initial score can be zero and it is already a percentage, we instead report score_best - score_init.
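Stated as code, the reported quantity is:

    def relative_improvement(score_best, score_init, objective):
        """Relative consistency improvement as reported in Table 1."""
        if objective == "dCS":
            return score_best / score_init - 1  # ratio-based
        # DSG: the initial score can be zero and is already a percentage,
        # so report the absolute difference instead.
        return score_best - score_init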
We observe that OPT2I consistently outperforms the random paraphrasing baseline across different LLMs and T2I models. Additionally, we can see that both paraphrasing and optimization get around a 10% boost in consistency improvement when using DSG as optimization objective instead of dCS for P2 but not for MSCOCO. This highlights again that more complex prompts, such as those from PartiPrompts, benefit from a more accurate consistency metric. We note that some prompts might already have a fairly high initial consistency score (see App. B.2), so there is little room for improvement. For instance, prompts from MSCOCO evaluated with DSG have an average initial score of 86.54%, which means that the improvement in this setting has an upper bound of 13.46%.

Table 1 Relative improvement in T2I consistency between the user prompt and the best prompt found, averaged across all prompts in the dataset. Every method generates the same total number of prompts (and images).

Objective  LLM      T2I      Method        MSCOCO   P2
dCS        Llama-2  LDM-2.1  Paraphrasing  +9.96    +8.00
                             OPT2I         +11.88   +10.34
DSG        Llama-2  LDM-2.1  Paraphrasing  +10.67   +19.22
                             OPT2I         +11.21   +22.24
dCS        GPT-3.5  LDM-2.1  Paraphrasing  +9.81    +8.06
                             OPT2I         +10.35   +9.33
DSG        GPT-3.5  LDM-2.1  Paraphrasing  +10.53   +19.09
                             OPT2I         +10.85   +19.77
dCS        Llama-2  CDM-M    Paraphrasing  +11.00   +10.29
                             OPT2I         +12.21   +12.13
DSG        Llama-2  CDM-M    Paraphrasing  +10.53   +22.24
                             OPT2I         +11.07   +24.93
In addition to random paraphrasing, we compare OPT2I to Promptist (Hao et al., 2022) on MSCOCO prompts by generating images from initial/best prompts (4 images/prompt) with SD-1.4 (Promptist's reference model) and LDM-2.1, evaluating consistency with DSG score. We observe Promptist decreases the consistency score by -3.56%/-3.29% on SD-1.4/LDM-2.1, while OPT2I (Llama-2) improves consistency by +14.16%/+11.21%. This aligns with the results reported in (Hao et al., 2022), which show that optimizing prompts primarily for aesthetics actually decreases prompt-image consistency.

Qualitative results. In Figure 4, we provide examples of images generated from user and optimized prompts with OPT2I for different LLMs and T2I models. We observe OPT2I is capable of finding paraphrases of the user prompt which considerably improve the consistency between the generated images and the initial, user-provided prompt, as measured by DSG in this case. These examples suggest the optimized prompts are capable of steering the T2I model towards generating visual elements that were ignored with the initial phrasing. From our qualitative analysis, we observed the LLM uses several strategies to emphasize the missing visual elements, such as providing a more detailed description of those elements (e.g., "a flower" → "a vibrant flower arrangement", "a vase filled with fresh blossoms") or placing them at the beginning of the sentence (e.g., "four teacups surrounding a kettle" → "surround a kettle placed at the center with four teacups"). We note a perfect consistency score does not ensure perfectly aligned images (e.g., for the user prompt "four teacups surrounding a kettle", all optimized prompts reach a DSG score of 100% while the cardinality of teacups remains incorrect), which highlights the limitations of current prompt-image consistency scores. We also observe that prompts optimized by Llama-2 tend to be longer than those from GPT-3.5 (see App. B.5), and that images generated by CDM-M from user prompts are generally more consistent than those generated by LDM-2.1, which we attribute to the use of a stronger text encoder (T5-XXL instead of CLIP).

3.3 Trade-offs with image quality and diversity

Table 2 Distributional metrics on the MSCOCO dataset. [Table body not recovered in this extraction.]

Following common practice in the T2I community, we evaluate the quality of OPT2I generations by computing image generation metrics such as FID, precision (P), and recall (R). We use the 2000 prompts from the MSCOCO validation set that are included in the TIFAv1 benchmark (Hu et al., 2023), and generate 4 images for each initial and best prompt. To ensure robust conclusions, we use two feature extractors in our metrics: Inception-v3 (IV3) (Szegedy et al., 2016) and CLIP (Radford et al., 2021). Results in Table 2 show that the FID of prompts optimized with OPT2I is either on-par or better compared to that of initial prompts, validating that our method does not trade image quality for consistency. Hence, we conclude FID is not affected by our optimization strategy. However, in terms of precision and recall, we observe that optimized prompts reach higher recall at the expense of lower precision compared to the user prompt. This is explainable as rephrasing the input prompt allows generating more diverse images (higher recall), which may occasionally fall outside of the manifold of natural images (lower precision). This phenomenon can be observed in Fig. 12 (Appendix B), where optimizing for consistency leads to a change of artistic style.
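As an illustration of the distributional evaluation, the sketch below computes FID over Inception-v3 features with torchmetrics; the authors do not specify their evaluation code or the precision/recall implementation, so the library choice is an assumption.

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    # Placeholder batches; substitute real MSCOCO images and T2I generations,
    # as uint8 tensors of shape (N, 3, H, W).
    real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
    generated_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

    fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pool features
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    print(float(fid.compute()))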
3.4 Ablations

We perform ablations with Llama-2 and LDM-2.1 on PartiPrompts using default parameters unless otherwise specified. Figure 5 illustrates the trade-off between exploration and exploitation, implemented
[Figure 4 examples (average DSG score across seeds in parentheses):
(a) LLM: Llama-2, T2I: LDM-2.1: "A horse and several cows feed on hay." (0.4643); "A bowl full of tomatoes sitting next to a flower." (0.6667) → "A ceramic bowl, overflowing with plump tomatoes, is placed beside a vibrant flower arrangement, creating a visually appealing contrast." (1.0000)
(b) LLM: Llama-2, T2I: CDM-M: "A horse and several cows feed on hay." (0.5000); "A bowl full of tomatoes sitting next to a flower." (0.6667) → "A bowl of tomatoes, topped with a beautiful flower, sitting next to a vase filled with fresh blooms." (1.0000)
(c) LLM: Llama-2, T2I: LDM-2.1: "A raccoon wearing formal clothes, wearing a tophat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of pointilism." (0.4722); "four teacups surounding a kettle" (0.5000) → "Four teacups encircle a kettle, forming a cohesive and picturesque tea setup." (1.0000)
(d) LLM: GPT-3.5, T2I: LDM-2.1: same initial prompts as (c) → "Surround a kettle placed at the center with four teacups." (1.0000)]
Figure 4 Selected qualitative results for prompts from MSCOCO (a-b) and P2 (c-d) datasets, using DSG as consistency metric. For each setup, we display four rows (from the top): initial prompt #1, optimized prompt #1, initial prompt #2, and optimized prompt #2. Each column corresponds to a different T2I model random seed. We report average consistency score across seeds in parentheses.
[Figure 5: cumulative max relative dCS (%) vs. #revised prompts, comparing Paraphr (it=1, p/it=150) with OPT2I at it=1/p/it=150, it=5/p/it=30, it=15/p/it=10, it=30/p/it=5, and it=150/p/it=1.]
Figure 5 Cumulative max relative dCS as a function of #revised prompts = #iterations · #prompts/iter.

[Figure 6: average DSG score vs. k (4 to 600) for Initial, Paraphr., and OPT2I prompts.]
Figure 6 Average DSG score for the top-k most consistent images among 600.
4 Related work

LLMs as prompt optimizers. Several recent works explore the role of LLMs as prompt optimizers for NLP tasks. Some use LLMs to directly optimize the task instruction for ICL (Zhou et al., 2022; Pryzant et al., 2023; Yang et al., 2023). Other studies use LLMs to mutate prompts for evolutionary algorithms (Guo et al., 2023; Fernando et al., 2023). A crucial difference between these works and our method is that they optimize a task instruction prompt by using a training set, which is subsequently applied across test examples, while we perform multimodal inference-time optimization on individual T2I prompts. More similar to our work, other studies rewrite prompts for T2I models using an LLM. Hao et al. (2022) finetune an LLM with reinforcement learning to improve image aesthetics, while Valerio et al. (2023) focus on filtering out non-visual prompt elements. In contrast, OPT2I aims to improve prompt-image consistency via optimization-by-prompting.

Evaluating prompt-image consistency. Several metrics have been proposed to evaluate prompt-image consistency. CLIPScore (Hessel et al., 2021) is the de facto standard for measuring the compatibility of image-caption pairs, used both for image captioning and text-conditioned image generation. However, CLIPScore provides a single global score, which can be too coarse to understand failures in the generated images. Consequently, subsequent metrics such as TIFA (Hu et al., 2023), VQ2 (Yarom et al., 2023) or DSG (Cho et al., 2023a) propose generating pairs of questions and answers from T2I prompts and using off-the-shelf VQA models to evaluate each of them on the generated images, providing a fine-grained score. Other recent studies suggest directly learning a prompt-image consistency metric from human feedback (Xu et al., 2023; Wu et al., 2023b; Kirstain et al., 2023). However, none of these metrics are without flaws and human judgment remains the most reliable way of evaluating prompt-image consistency.

5 Conclusions

In this paper, we introduced the first T2I optimization-by-prompting framework to improve prompt-image consistency. Through extensive evaluations, we showed that OPT2I can be effectively applied to different combinations of LLM, T2I models and consistency metrics, consistently outperforming paraphrasing baselines and yielding prompt-image consistency improvements of up to 24.9% over the user prompt, while maintaining the FID between generated and real images. By contrasting MSCOCO and PartiPrompts results, we highlighted the importance of the choice of consistency score: complex prompts in PartiPrompts appear to significantly benefit from more detailed scores such as DSG. Qualitatively, we observed that optimizing prompts for prompt-image consistency oftentimes translates into emphasizing initially ignored elements in the generated images, by either providing additional details about those or rewording the prompt such that the ignored elements appear at the beginning. Interestingly, such prompt modifications steer the generated images away from the learned modes, resulting in a higher recall w.r.t. the real data distribution.

Limitations. One limitation of our method is that it expects prompt-image consistency scores to work reasonably well. However, this assumption might not hold in some cases. For instance, it has been shown that CLIP (used for CLIPScore) sometimes behaves like a bag-of-words (Yuksekgonul et al., 2022; Yamada et al., 2022). VQA-based prompt-image consistency metrics such as TIFA or DSG also suffer from limitations in generating questions (e.g., the question "Is the horse on the hay?" is generated from the prompt "A horse and several cows feed on hay.") or in answering them with a VQA model (e.g., for the prompt "A bowl full of tomatoes sitting next to a flower.", the VQA model answers that there is a flower when it is in fact a bouquet made of tomatoes). Moreover, using these metrics as optimization objectives might exacerbate their failure modes by finding prompts which generate images that fulfill the requirements for a high score in an adversarial way. This highlights the need for further research in developing more robust prompt-image consistency metrics which can be used as optimization objectives in addition to evaluation.

Another limitation of our approach is its runtime, which is a consequence of performing inference-time optimization. For instance, running the optimization process with Llama-2, LDM-2.1 and DSG score, generating 5 prompt paraphrases per iteration and 4 images per prompt with 50 diffusion steps, takes 7.34/20.27 iterations on average for COCO/PartiPrompts, which translates to ∼10/28 minutes when using NVIDIA V100 GPUs. However, we emphasize that (1) OPT2I is designed to be a versatile approach that works as a plug-and-play solution with diverse T2I models and LLMs since it does not require any parameter updates nor training data, and (2) optimizing T2I prompts with our automatic framework relieves humans from the manual and tedious task of prompt-engineering.
References

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.

Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. arXiv preprint arXiv:2310.18235, 2023a.

Jaemin Cho, Abhay Zala, and Mohit Bansal. Visual programming for text-to-image generation and evaluation. arXiv preprint arXiv:2305.15328, 2023b.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023a.

Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023b.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A survey on in-context learning, 2023.

Dave Epstein, Allan Jabri, Ben Poole, Alexei A Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2022.

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023.

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.

Melissa Hall, Candace Ross, Adina Williams, Nicolas Carion, Michal Drozdzal, and Adriana Romero Soriano. Dig in: Evaluating disparities in image generations with indicators for geographic diversity, 2023.

Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20406–20417, October 2023.

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. If at first you don't succeed, try, try again: Faithful diffusion-based text-to-image generation by selection. arXiv preprint arXiv:2305.13308, 2023.

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.

Suvir Mirchandani, Fei Xia, Pete Florence, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng, et al. Large language models as general pattern machines. In 7th Annual Conference on Robot Learning, 2023.

Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.

Large Model Systems Organization. Chatbot arena leaderboard. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard.

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022a.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022b.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. arXiv preprint arXiv:2311.17946, 2023.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Rodrigo Valerio, Joao Bordalo, Michal Yarom, Yonattan Bitton, Idan Szpektor, and Joao Magalhaes. Transferring visual attributes from natural language to verified image generation. arXiv preprint arXiv:2305.15026, 2023.

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908, 2023.

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.

Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, and Shiyu Chang. Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7766–7776, 2023a.

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023b.

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023c.

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.

Yutaro Yamada, Yingtian Tang, and Ilker Yildirim. When are lemons purple? the concept association bias of clip. arXiv preprint arXiv:2212.12043, 2022.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.

Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, and Idan Szpektor. What you see is what you read? improving text-image alignment evaluation. arXiv preprint arXiv:2305.10400, 2023.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, 2022.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.
A Additional method details
A.1 Meta-prompts
Meta-prompts 1-5 include all the prompts used in our
work. The opening instruction block of each meta-prompt
is fed to the LLM as a system prompt, and the terms in
curly braces (e.g., {user_prompt}, {num_solutions}) are
placeholders to be dynamically filled.
{prompt}
1. <PROMPT>paraphrase 1</PROMPT>
2. <PROMPT>paraphase 2</PROMPT>
...
{num_solutions}. <PROMPT>paraphrase {
num_solutions}</PROMPT>
You are an expert prompt optimizer for text-to-image models. Text-to-image models take a text prompt as
input and generate images depicting the prompt as output. You translate prompts written by humans into
better prompts for the text-to-image models. Your answers should be concise and effective.
Your task is to optimize this initial prompt written by a human: "{user_prompt}". Below are some
previous prompts with a decomposition of their visual elements. Each element is paired with a score
indicating its presence in the generated image. The prompts are arranged in ascending order based on
their scores, which range from 0 to 100. Higher scores indicate higher likelihood of presence.
1. {revised_prompt_1}
score: {avg_score_1}
visual elements:
{subprompt_1_1} {clip_score_1_1}
{subprompt_1_2} {clip_score_1_2}
(... more questions ...)
Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Prioritize optimizing for object with lowest scores. Favor
substitutions and reorderings over additions. Respond with each new prompt in between <PROMPT> and </
PROMPT>, eg:
1. <PROMPT>paraphrase 1</PROMPT>
2. <PROMPT>paraphase 2</PROMPT>
...
{num_solutions}. <PROMPT>paraphrase {num_solutions}</PROMPT>
You are an expert prompt optimizer for text-to-image models. Text-to-image models take a text prompt as
input and generate images depicting the prompt as output. You translate prompts written by humans into
better prompts for the text-to-image models. Your answers should be concise and effective.
Your task is to optimize this initial prompt written by a human: "{user_prompt}". Below are some
previous prompts with the consistency of each prompt’s visual elements in the generated image via a set
of binary questions. The prompts are arranged in ascending order based on their overall consistency
score, which ranges from 0 to 100 (higher is better).
1. {revised_prompt_1}
overall score: {dsg_score_1}
evaluation questions:
{question_1_1} {vqa_score_1_1}
{question_1_2} {vqa_score_1_2}
(... more questions ...)
Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Focus on optimizing for the visual elements that are not
consistent. Favor substitutions and reorderings over additions. Respond with each new prompt in between
<PROMPT> and </PROMPT>, eg:
1. <PROMPT>paraphrase 1</PROMPT>
2. <PROMPT>paraphase 2</PROMPT>
...
{num_solutions}. <PROMPT>paraphrase {num_solutions}</PROMPT>
Conciseness:
Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Favor substitutions and reorderings over additions. Respond
with each new prompt in between <PROMPT> and </PROMPT>, eg:
...
Conciseness + prioritize:
Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. Prioritize optimizing for object with lowest scores. Favor
substitutions and reorderings over additions. Respond with each new prompt in between <PROMPT> and </
PROMPT>, eg:
...
Briefly reason (max two sentences) about the prompts above to understand why certain objects have higher
or lower scores in certain prompts. Then, based on this reasoning, generate {num_solutions} paraphrases
of the initial prompt which keep the semantic meaning and that have higher scores than all the prompts
above. Prioritize optimizing for objects with lowest scores while keeping high scores for the other
objects. Favor substitutions and reorderings over additions. Respond with each new prompt in between <
PROMPT> and </PROMPT>, eg:
...
Generate {num_solutions} paraphrases of the initial prompt which keep the semantic meaning and that have
higher scores than all the prompts above. PRIORITIZE optimizing for objects with lowest scores while
keeping high scores for the other objects. FAVOR substitutions and reorderings over additions. USE
simple words/concepts, understable from a text-to-image model, e.g., distinguish foreground and
background. Respond with each new prompt in between <PROMPT> and </PROMPT>, eg:
...
Prompt 5 Meta-prompt ablations. Modifications w.r.t. the base meta-prompt are denoted with different colors.
Table 4 Prompt decompositions into noun phrases (dCS) and binary questions (DSG). [Table body not recovered in this extraction.]
B Additional results

B.1 1-shot in-context learning as baseline

In this experiment, we compare OPT2I with 1-shot in-context learning (ICL), which we implement by running OPT2I with #iter = 1 and #prompts/iter = 150. Note that, in this setting, the LLM only receives feedback about the performance of the user prompt. We maintain the same experimental setting described in Section 3, except we use 200 prompts for MSCOCO, and report the results in Table 5. First, we notice that 1-shot ICL achieves higher prompt-image consistency than random paraphrasing in all settings except when using GPT-3.5, which performs on-par or slightly worse (marked with * in Table 5, see also the discussion in Section B.5). Second, and more importantly, we observe that OPT2I outperforms the 1-shot ICL baseline regardless of the consistency objective, LLM, or T2I model adopted. These results reinforce our previous claims: (1) the iterative procedure allows OPT2I to keep improving revised prompts over time, and (2) 1-shot ICL is challenging due to the limited feedback provided to the LLM about how to improve the user prompt, and thus only minor improvements in prompt-image consistency can be obtained over random paraphrasing.

Table 5 Comparison with 1-shot in-context learning (ICL). We report relative improvement (%) in prompt-image consistency between the user prompt and the best prompt found, averaged across all prompts in the dataset.

Objective  LLM      T2I      Method        MSCOCO   P2
dCS        Llama-2  LDM-2.1  Paraphrasing  +9.86    +8.00
                             1-shot ICL    +10.67   +8.74
                             OPT2I         +11.68   +10.34
DSG        Llama-2  LDM-2.1  Paraphrasing  +9.92    +19.22
                             1-shot ICL    +10.14   +19.69
                             OPT2I         +11.02   +22.24
dCS        GPT-3.5  LDM-2.1  Paraphrasing  +9.64    +8.06
                             1-shot ICL    +9.21*   +8.72
                             OPT2I         +10.56   +9.33
DSG        GPT-3.5  LDM-2.1  Paraphrasing  +10.21   +19.09
                             1-shot ICL    +10.09*  +18.94*
                             OPT2I         +11.19   +19.77
dCS        Llama-2  CDM-M    Paraphrasing  +11.38   +10.29
                             1-shot ICL    +12.19   +11.34
                             OPT2I         +12.65   +12.13
DSG        Llama-2  CDM-M    Paraphrasing  +9.86    +22.24
                             1-shot ICL    +10.10   +22.25
                             OPT2I         +10.15   +24.93
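In terms of the hypothetical opt2i() sketch from Section 2, this baseline is simply:

    # 1-shot ICL: a single iteration with a large batch of proposals, so the
    # LLM only ever sees the user prompt and its score as in-context feedback.
    best_prompt, best_score = opt2i(user_prompt, llm_propose, t2i_generate, score,
                                    iters=1, prompts_per_iter=150)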
B.2 Filtering out already consistent user prompts

When adopting the DSG objective, user prompts might already achieve a perfect consistency score initially, i.e., S_DSG(p_0, g(p_0)) = 1. We observe this phenomenon happens with higher frequency on the simpler MSCOCO prompts (∼40%), and with less frequency on the more complex PartiPrompts prompts (∼10%). Since S_DSG(p_0, g(p_0)) can be computed beforehand, we can avoid optimizing for those user prompts that already have a perfect initial S_DSG and better showcase the optimization performance of OPT2I. We provide the updated optimization curves in Figure 7, and report the final results in Table 6. In both cases, we highlight results obtained by filtering out "perfect" user prompts with full colors, and contrast them against results obtained with all prompts in faded colors (equivalent to Figure 3).

In Table 6, we observe a higher relative improvement for both MSCOCO and PartiPrompts in all configurations when filtering out "perfect" user prompts, which is more prominent for MSCOCO because the number of excluded prompts is higher. In Figure 7, we observe similar consistent and considerable increases of all optimization curves when considering both mean and max consistency improvement. In the mean case, we remark a reduction in the initial dip in relative consistency, especially in MSCOCO, where OPT2I reaches a positive relative consistency much earlier, i.e., it = [6, 2, 2] vs. it = [23, 8, 5] with Llama-2, GPT-3.5, and CDM-M, respectively.

Table 6 Relative prompt-image consistency improvement between the user prompt and the best prompt found, averaged across prompts.

                                MSCOCO                        P2
Method        LLM      T2I      S_DSG(p0,g(p0))<1   All       S_DSG(p0,g(p0))<1   All
Paraphrasing  Llama-2  LDM-2.1  +17.74              +10.67    +21.95              +19.22
OPT2I         Llama-2  LDM-2.1  +18.63              +11.21    +25.39              +22.24
Paraphrasing  GPT-3.5  LDM-2.1  +17.52              +10.53    +21.80              +19.09
OPT2I         GPT-3.5  LDM-2.1  +18.05              +10.85    +22.58              +19.77
Paraphrasing  Llama-2  CDM-M    +18.65              +10.53    +26.54              +22.24
OPT2I         Llama-2  CDM-M    +19.61              +11.07    +29.19              +24.93
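A one-line sketch of this pre-filtering, assuming a hypothetical initial_dsg_score(p) helper that generates images for p and returns S_DSG(p, g(p)):

    # Optimize only prompts whose generations do not already satisfy every
    # DSG question; initial_dsg_score is a hypothetical helper in [0, 1].
    prompts_to_optimize = [p for p in user_prompts if initial_dsg_score(p) < 1.0]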
[Figure 7: two panels, (a) DSG (MSCOCO) and (b) DSG (P2), tracking max and mean relative consistency (%) vs. iteration for Paraphrasing, Llama-2/SD2.1, GPT-3.5/SD2.1, and Llama-2/IF-M, each with a "**" (filtered) and a faded (all prompts) variant.]
Figure 7 OPT2I optimization curves obtained with prompts having S_DSG(p_0) < 1, marked by "**" and full colors. In contrast, faded-color curves consider all prompts. Each plot tracks either the max or the mean relative improvement in consistency across revised prompts per iteration.
B.3 Impact of seed-fixing and #images/prompt

In this experiment, we ablate the impact of fixing the random seed of the initial noise for the diffusion model throughout the optimization process when optimizing different numbers of images/prompt. We use our default configuration with Llama-2 and LDM-2.1 on MSCOCO. In Figure 8a, we show the optimization curves obtained when optimizing 1, 4 (default), and 10 images/prompt with fixed image seed. As expected, we observe no meaningful differences in mean consistency improvement. In contrast, the max consistency improvement shows a clear distinction between optimizing a single image (single seed) and optimizing 4 or 10, with the former achieving more substantial improvements. We argue that when optimizing a single image seed, OPT2I is more sensitive to changes in the prompts, i.e., there is a higher variance among the scores of revised prompts. We then contrast the optimization curves with fixed seed (8a) against the non-fixed seed ones (8b). Our hypothesis is that, when not fixing the seed, generating too few images/prompt leads to unstable/unreliable feedback for the LLM due to the high variance of the generations. Indeed, looking at the optimization curves, we notice that optimizing a single image without fixing the seed is more difficult for OPT2I, which results in a noisy and less steep trajectory, especially in the mean case. In contrast, when OPT2I optimizes 4 or 10 images/prompt with no fixed seed, both the max and mean curves remain similar w.r.t. using a fixed seed. This supports our choice of generating 4 images/prompt, as it provides enough diversity in the generations while being substantially more computationally efficient than generating 10.
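With a diffusers-style pipeline (an assumption; the text does not name the implementation), fixing the initial-noise seed amounts to reusing the same generator at every scoring call:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

    def generate(prompt, n=4, fixed_seed=True):
        # Reusing the same seed keeps the initial latent noise identical across
        # revised prompts, so score differences reflect the prompt edits alone.
        gen = torch.Generator().manual_seed(0) if fixed_seed else None
        return pipe(prompt, num_images_per_prompt=n, generator=gen).images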
B.4 Stratified PartiPrompts results

Figure 9 shows the relative improvement in consistency score (dCS or DSG) on prompts from PartiPrompts (P2), broken down by challenge aspect. Note that we only sampled prompts from four of the most difficult dimensions in P2: "Complex", "Fine-grained Detail", "Quantity", and "Properties & Positioning". Intuitively, this plot shows what kinds of prompts are easier to optimize for OPT2I when using different LLMs, T2I models and consistency scores.
[Figure 8: optimization curves for OPT2I with 1, 4, or 10 images/prompt, with (a) seed=0 and (b) no fixed seed.]
Figure 8 OPT2I optimization curves with (a) fixed or (b) non-fixed seed. Each curve optimizes a different number of images per prompt. Y-axis is aligned between (a) and (b). Curves obtained with Llama-2 and LDM-2.1 on 200 out of the 2000 prompts from MSCOCO.
ing” when using Llama-2 in conjunction with CDM-M the exact same prompt with GPT-3.5. Hence, one
and dCS. Similarly, the combination of Llama-2, hypothesis for the observed phenomenon is that our
CDM-M, and DSG yields the best results for prompts meta-prompt is better optimized for Llama-2. An-
about “Quantity”. For other challenges, CDM-M con- other hypothesis is that each LLM has a different
tinues to provide the most substantial consistency balance point between exploration and exploitation
improvement, although the margin is narrower com- for the same sampling temperature of 1.0. In partic-
pared to LDM-2.1. Interestingly, GPT-3.5 shows the ular, given the flatter optimization curves drawn by
smallest improvement in consistency for prompts GPT-3.5, we conjecture that it explores less diverse
about “Quantity”, regardless of whether dCS or DGS prompts than Llama-2. To verify this, we analyze
metrics are used. Consistency improvements for some text properties of the revised prompts generated
prompts from the “Complex” and “Fine-grained De- by both LLMs.
tail” challenges are comparable, which is expected
due to their inherent similarities. Figure 10a tracks the length (in number of characters)
of the revised prompts at each iteration, and Fig-
ure 10b tracks CLIP text similarity between revised
B.5 Why is GPT-3.5 not as good as Llama-2?
prompts and the user prompt along the optimization
In Figure 3 and Table 1, we observe that OPT2I process, both averaged over the revised prompts gen-
achieves worse results when using GPT-3.5 as the erated at the same iterations and over all prompts
LLM. Notably, the optimization curves with GPT-3.5 in the dataset. We observe that when using Llama-2
are flatter than when using Llama-2. This result is for OPT2I, the revised prompts generated at each
rather surprising, as current leaderboards ? indicate iteration are longer and more semantically dissimi-
that GPT-3.5 generally outperforms Llama-2 on a lar to the user prompt compared to those generated
wide variety of NLP tasks. So, in this experiment, by GPT-3.5. This means that OPT2I benefits from
we aim to shed light on the possible causes. Given greater prompt diversity to find the best T2I prompts
the closed (and expensive) access to GPT-3.5, our that lead to more consistent images, which is better
initial exploration of the meta-prompt structure and achieved with Llama-2. Additionally, we note that
phrasing was based on Llama-2, and later on we used both the prompt length and the semantic similarity
[Figure 9: grouped bars per challenge aspect for Llama-2/LDM-2.1, GPT-3.5/LDM-2.1, and Llama-2/CDM-M.]
Figure 9 Relative improvement in prompt-image consistency between the user prompt and the best prompt found, averaged across PartiPrompts prompts and broken down by challenge aspect.
Figure 10a tracks the length (in number of characters) of the revised prompts at each iteration, and Figure 10b tracks CLIP text similarity between revised prompts and the user prompt along the optimization process, both averaged over the revised prompts generated at the same iterations and over all prompts in the dataset. We observe that when using Llama-2 for OPT2I, the revised prompts generated at each iteration are longer and more semantically dissimilar to the user prompt compared to those generated by GPT-3.5. This means that OPT2I benefits from greater prompt diversity to find the best T2I prompts that lead to more consistent images, which is better achieved with Llama-2. Additionally, we note that both the prompt length and the semantic similarity with the user prompt start plateauing around the maximum number of iterations we set, which further validates our selected value of 30. We leave as future work ablating for the sampling temperature with both LLMs.

[Figure 10: (a) Prompt length and (b) text similarity with the user prompt vs. iterations, for llama-2 and gpt-3.5.]
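As an illustration of the text-similarity analysis, here is a minimal sketch using the Hugging Face CLIP text encoder; the exact similarity implementation behind Figure 10b is not specified in the text, so treat the model choice as an assumption.

    import torch
    from transformers import CLIPModel, CLIPTokenizer

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    def clip_text_similarity(user_prompt: str, revised_prompt: str) -> float:
        """Cosine similarity between CLIP text embeddings of two prompts."""
        inputs = tokenizer([user_prompt, revised_prompt],
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return float(emb[0] @ emb[1])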
B.6 Additional qualitative examples

Figures 11 and 12 show some additional selected examples of user prompt and revised prompts throughout the optimization process, along with the generated images and consistency scores. In particular, we select revised prompts such that the consistency score of the generated images (w.r.t. the user prompt) is strictly higher than the previous best score found so far, i.e., the leaps in prompt-image consistency.

Figure 11 shows revised prompts generated with DSG as the scorer. Since DSG is computed as an average of binary scores, it is more coarse than CLIPScore and thus there are fewer leaps in consistency. Overall, we observe that the intermediate revised prompt manages to increase the consistency score in some of the generated images but not for all of them. The best prompt, however, usually manages to improve all 4 generated images.

Figure 12 shows revised prompts generated with dCS as the scorer. In this case, we can see a gradual increase in average dCS, which visually translates to generated images which are more consistent with the user prompt on average. The strong effect of the initial latent noise in the image structure is evident, yet substantial modifications in the format of the input prompt used to condition the generation lead to significant changes in how the T2I model interprets
the structure determined by the initial noise (e.g.,
between rows 2-3 and 4-5 in the squirrel example).
We also note that dCS (CLIPScore averaged over sub-
prompts) can occasionally fall short as an evaluation
metric for image-text consistency. This is primarily
because dCS tends to focus on the presence of visual
elements, overlooking other aspects such as spatial
relationships. In the toilet example, for instance, we
observe how the generated images progressively be-
come more consistent up to a certain point (around
row 6). Beyond this point, the revised prompts and
the generated images start degenerating (e.g., by
overemphasizing certain elements), while dCS contin-
ues to improve. This highlights that, while dCS may
serve as a useful fine-grained consistency metric to
provide visual feedback for T2I prompt optimization
with an LLM, it may not be as effective for evaluation
purposes.
[Figure 11 examples (average DSG score across images in parentheses):
MSCOCO: "A cutting board holds a delicious pizza." (0.8333)
PartiPrompts: "a cute wooden owl statue holding a large globe of the Earth above its head" (0.3929) → "An owl statue made of wood, with a charming expression, holds a large Earth globe above its head, boasting a precision-crafted surface." (0.7857)
PartiPrompts: "A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a stump" (0.5000)]
Figure 11 Selected examples of initial prompts from MSCOCO (left) and PartiPrompts (right) and revised prompts across the optimization, along with the generated images. The optimizer refines prompts for LDM-2.1 using Llama-2 as LLM and DSG as scorer. We report DSG score averaged across images.
[Figure 12 examples (average dCS in parentheses):
MSCOCO: "Small white toilet with seashells sitting on top of it." (0.1899) → "A small, pristine toilet crowned with seashells." (0.2341) → "A miniature white toilet graces a seashell-covered pedestal." (0.2480) → "A miniature toilet, flawlessly white, is adorned with an array of vibrant seashells, elevating its charm." (0.2507) → "A small, immaculate white toilet, covered in a diverse array of seashells, perches on a pedestal, boasting an eye-catching exhibition." (0.2605)
PartiPrompts: "A punk rock squirrel in a studded leather jacket shouting into a microphone while standing on a lily pad" (0.1690) → "A punk-rockin' squirrel stands atop a lily pad and belts into a microphone, wearing a studded leather jacket that oozes attitude." (0.1859) → "A punk rockin' squirrel, wearing a leather jacket adorned with metal studs, belts into a microphone with conviction, positioned atop a lily pad." (0.1939) → "A rebellious squirrel, clad in a black leather jacket adorned with metal studs, passionately sings into a microphone while standing on a lily pad, exuding punk rock spirit." (0.1988) → "A bold squirrel, adorned in a black leather jacket featuring metal studs, sings into a microphone with intensity, standing on a lily pad, embracing the punk rock style." (0.2053)]
Figure 12 Selected examples of initial prompts from MSCOCO (left) and PartiPrompts (right) and revised prompts across the optimization, along with the generated images. The optimizer refines prompts for LDM-2.1 using Llama-2 as LLM and dCS as scorer. We report dCS score averaged across images.