Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

ABSTRACT
Large Language Models (LLMs) excel at various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs), as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, without using any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on evolutionary operators, improving the population based on a development set. We optimize prompts for both closed- and open-source LLMs, including GPT-3.5 and Alpaca, on 9 datasets spanning language understanding and generation tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation by up to 25% and 14%, respectively. Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.
1 INTRODUCTION
Large language models (LLMs) show remarkable performance on multiple natural language processing (NLP) tasks (Floridi & Chiriatti, 2020; Touvron et al., 2023; Ouyang et al., 2022; Zhang et al., 2022). The traditional fine-tuning paradigm for adapting to downstream tasks is too costly for LLMs. Continuous prompt tuning methods (Sun et al., 2022b;a; Li & Liang, 2021; Liu et al., 2021b) alleviate this cost by prepending soft trainable prompt embeddings to the input while keeping the parameters of the LLM frozen. However, such approaches still rely on access to the parameters of LLMs, since they require continuous embeddings as input, making them inadequate for LLMs accessed through black-box APIs such as GPT-3 and GPT-4 (Brown et al., 2020; OpenAI, 2023). Instead, simply adding an instruction to the input text, a type of discrete prompt, steers LLMs to carry out the desired task with negligible impact on computational cost while eliminating the need for access to the parameters and gradients of LLMs (Liu et al., 2023).
Despite this convenience, the performance of an LLM on a given task is significantly influenced by the prompt (Liu et al., 2023; Zhu et al., 2023). Accordingly, the key challenge of this approach lies in the design of the prompt, which has emerged as a crucial technique known as
prompt engineering (Zhou et al., 2022). Prompt engineering usually involves elaborate manual design (Mishra et al., 2022a;b). Given the wide variation in prompts across language models and tasks, prompt design typically requires substantial human effort and expertise, with subjective and relatively limited guidelines (Liu et al., 2023; Zamfirescu-Pereira et al., 2023; Prasad et al., 2022).
To alleviate the human effort in discrete prompt design, previous approaches usually rely on access to the token probabilities from the output layer of LLMs, which may not always be available through APIs (Deng et al., 2022; Zhang et al., 2023a). Some recent works consider enumerating diverse prompts and selecting the best ones (Zhou et al., 2022; Jiang et al., 2020), or modifying the current prompts to improve them (Guo et al., 2023; Prasad et al., 2022; Pryzant et al., 2023). Such approaches either emphasize exploring diverse prompts, which may lead to indecisiveness and wasted resources, or focus on exploiting the currently identified good prompts, which may result in stagnation and confine the search to local optima. Several conventional derivative-free algorithms are well designed and strike a good balance between exploration and exploitation (Conn et al., 2009; Rios & Sahinidis, 2013). Among these, evolutionary algorithms (EAs) (Storn & Price, 1997; Brest et al., 2006; Zhang & Sanderson, 2009; Vesterstrom & Thomsen, 2004) stand out, as they are simple, efficient, and well suited for discrete prompt optimization: sequences of phrases in discrete prompts can be regarded as gene sequences in typical EAs, making them compatible with the natural evolutionary process.
In this paper, we borrow the idea of EAs and propose a discrete prompt tuning framework, EvoPrompt. While evolutionary operators in EAs are typically designed for sequences, they tend to alter tokens independently to generate new candidate solutions. Unfortunately, this ignores the connections among tokens, which are crucial for maintaining coherence and readability in discrete prompts, so designing evolutionary operators for discrete prompts is challenging. Taking advantage of LLMs' expertise in natural language processing and the exceptional optimization capabilities of EAs, we synergistically connect these two approaches: LLMs generate new candidate prompts following the evolutionary operators, while EAs guide the optimization process to retain the optimal prompts. Specifically, based on several initial prompts, we utilize LLMs to imitate the evolutionary operators of EAs to generate new prompt candidates, and the prompts with better performance on the development set are preserved. These operations are applied iteratively to the updating population to improve its quality. We optimize the prompts for two different LLMs (i.e., Alpaca (Taori et al., 2023) and GPT-3.5 (Brown et al., 2020)) on a diverse range of natural language understanding and generation tasks, using a total of 9 datasets. EvoPrompt consistently finds better prompts compared with both manually designed ones and previous automatic prompt generation methods.
The main contributions of this paper include:
• We propose EvoPrompt, a novel framework for automatic discrete prompt optimization connecting LLMs and EAs, which enjoys the following advantages: 1) it does not require access to any parameters or gradients of LLMs; 2) it strikes a balance between exploration and exploitation, leading to better results; 3) the generated prompts are human-readable.
• Experiments conducted on 9 datasets demonstrate the effectiveness of EvoPrompt compared with existing methods, with improvements of up to 14%. We release the optimal prompts obtained by EvoPrompt for common tasks such as sentiment classification, topic classification, subjectivity classification, simplification, and summarization.
• To the best of our knowledge, we are the first to demonstrate that LLMs are capable of implementing evolutionary algorithms when provided with appropriate instructions. We hope this work inspires broader applications of combining LLMs with conventional algorithms.
2 RELATED WORKS
Prompting is a highly efficient method for employing LLMs in specialized tasks; however, the performance is heavily influenced by the choice of prompt. Recently, automatic prompt optimization has attracted wide attention. Continuous prompt-based methods, also known as soft prompt tuning, which only tune the parameters of a prefix or of inserted tokens (Li & Liang, 2021; Liu et al., 2021b;a; Zhang et al., 2021), or tune word embeddings (Lester et al., 2021a; Zhong et al., 2021), have been favored for their lower cost compared with the traditional fine-tuning paradigm. In spite of their effective performance, two drawbacks of such paradigms cannot be ignored: 1) the optimization of continuous prompts requires parameters of LLMs that are inaccessible through black-box APIs; 2) soft prompts often fall short in interpretability (Khashabi et al., 2021; Lester et al., 2021b; Hambardzumyan et al., 2021; Mokady et al., 2021). Discrete prompts, which simply add several discrete tokens, such as "It was" (Schick & Schütze, 2021), or task-specific descriptive instructions, such as "Classify the comment into positive or negative.", to the input text, offer an interactive interface to humans with better interpretability and show promising performance on various NLP tasks (Liu et al., 2023).
Various approaches have been proposed for automatic discrete prompt search and generation, which are usually based on gradients (Shin et al., 2020; Shi et al., 2022; Wallace et al., 2019). Discrete prompt tuning approaches based on reinforcement learning (RL) (Deng et al., 2022; Zhang et al., 2023a) design reward functions using the output layer and also introduce training overhead. More recently, considering the high variance of different prompts on downstream tasks, prompt generation methods focus on exploration by enumerating and selecting the best prompt from a number of candidates (mainly augmented by re-sampling (Zhou et al., 2022; Jiang et al., 2020)). Methods based on prompt revision (Pryzant et al., 2023; Guo et al., 2023) collect the cases incorrectly predicted by LLMs and analyze the corresponding root causes to improve the prompt; they prefer exploitation of the current prompt with little exploration. Additionally, such approaches are constrained to tasks with standard answers and cannot be directly applied to generation tasks, whose outputs are flexible and cannot simply be categorized as "correct" or "incorrect". Approaches based on prompt editing (Zhang et al., 2023a; Prasad et al., 2022) also emphasize exploitation, which may lead to local optima. Our proposed EvoPrompt, empowered by evolutionary algorithms, strikes a balance between exploration and exploitation without requiring any parameters or gradients.
3 EVOPROMPT

EAs typically start with an initial population of N solutions (equivalent to prompts in our setting), then iteratively generate new solutions using evolutionary operators (e.g., mutation and crossover) on the current population and update the population based on a score function. Following typical EAs, EvoPrompt mainly contains three steps:
• Initial population: Noting that most existing prompt-based methods neglect human knowledge, which provides an efficient prior initialization, we use several manual prompts as the initial population to leverage human wisdom as prior knowledge. Moreover, EAs typically start from randomly generated solutions, yielding a diverse population and avoiding being trapped in a local optimum. Accordingly, we also introduce some prompts generated by LLMs (Zhou et al., 2022) into the initial population.
• Evolution: In each iteration, EvoPrompt uses LLMs as evolutionary operators to generate a new prompt from several parent prompts selected from the current population. To accomplish this, we carefully design the steps of the mutation and crossover operators for each specific type of EA, along with corresponding instructions that guide the LLMs to generate new prompts based on these steps.
• Update: We evaluate the generated candidate prompts on a development set and retain those with superior performance, akin to survival of the fittest in nature. The specific update strategy may vary depending on the type of EA used.
The algorithm stops when the number of iterations reaches a predefined upper bound. The details of EvoPrompt are outlined in Algorithm 1. When instantiating EvoPrompt with a specific EA, the evolution and update processes need to be adjusted, and the key challenge is to design the evolutionary operators over discrete prompts.
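To make the overall procedure concrete, the following is a minimal Python sketch of this loop. It is not the paper's reference implementation: `score_fn` (development-set evaluation) and `evo_fn` (the LLM-driven operator Evo(·)) are hypothetical helpers, and the GA-style "keep the top N" update described later is shown for the update step.

```python
def evoprompt(init_prompts, score_fn, evo_fn, iterations=10):
    """Sketch of Algorithm 1: iterate LLM-driven evolution and selection.

    init_prompts: N seed prompts (manual + LLM-generated)
    score_fn:     evaluates a prompt on the development set
    evo_fn:       LLM acting as the evolutionary operator Evo(.)
    """
    population = list(init_prompts)
    scores = {p: score_fn(p) for p in population}
    for _ in range(iterations):
        # The LLM generates one new candidate per member of the population.
        candidates = [evo_fn(population, scores) for _ in range(len(population))]
        for c in candidates:
            scores[c] = score_fn(c)
        # GA-style update: merge old and new prompts, keep the N best.
        population = sorted(set(population) | set(candidates),
                            key=lambda p: scores[p], reverse=True)[:len(init_prompts)]
    return max(population, key=lambda p: scores[p])
```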
Selection  In GA, two parent solutions are normally selected via roulette wheel selection according to their fitness values (Lipowski & Lipowska, 2012). Analogously, we use roulette wheel selection to choose two parent prompts from the current population according to their scores on the development set. Specifically, let s_i denote the development-set score of the i-th prompt in a population of N prompts. The probability of selecting the i-th prompt as a parent is p_i = s_i / ∑_{j=1}^{N} s_j.
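A minimal sketch of this selection rule, assuming non-negative development-set scores:

```python
import random

def roulette_select(population, scores, k=2):
    """Roulette wheel (fitness-proportional) selection of k parent prompts.

    scores[p] is the development-set score s_i of prompt p, so prompt i is
    drawn with probability p_i = s_i / sum_j s_j.
    """
    total = sum(scores[p] for p in population)
    weights = [scores[p] / total for p in population]
    # random.choices samples with replacement; an implementation may resample
    # to guarantee two distinct parents.
    return random.choices(population, weights=weights, k=k)
```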
Evolution  Following the evolutionary operators in GA, a new candidate prompt is generated through a two-step process based on the two selected parents: 1) the parent prompts undergo crossover, resulting in a new prompt that selectively combines components from both parents; 2) the newly generated prompt from the first step undergoes mutation, in which random alterations are made to some of its content. Based on this two-step process, we design instructions guiding LLMs to generate a new prompt, performing Evo(·) in Algorithm 1. The process is depicted in Figure 1.
Figure 1: GA process implemented by LLMs for discrete prompt optimization (Evo(·) in Algorithm 1).
In Step 1, LLMs perform crossover on the given two prompts (words in orange and blue are inherited
from Prompt 1 and Prompt 2 respectively). In Step 2, LLMs perform mutation on the prompt.
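The excerpt does not reproduce the exact instruction text sent to the LLM, so the template below is an illustrative approximation of the two GA steps in Figure 1; the `<prompt>...</prompt>` bracketing convention is borrowed from the response example shown later in Figure 2, and `llm` is an assumed callable wrapping the GPT-3.5 API.

```python
GA_TEMPLATE = """Please follow the instruction step-by-step to generate a better prompt.
1. Cross over the following prompts and generate a new prompt:
Prompt 1: {prompt1}
Prompt 2: {prompt2}
2. Mutate the prompt generated in Step 1 and generate a final prompt bracketed
with <prompt> and </prompt>."""

def ga_evo(llm, parent1, parent2):
    """One application of Evo(.) for the GA variant (sketch)."""
    response = llm(GA_TEMPLATE.format(prompt1=parent1, prompt2=parent2))
    # Read the new prompt out of the markers; a robust version would handle
    # responses that omit them.
    return response.split("<prompt>")[1].split("</prompt>")[0].strip()
```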
Update  EvoPrompt iteratively generates new candidate prompts and assesses each one on a development set, denoted as D, to obtain a score that quantifies its quality. We adopt a straightforward selection strategy: at each iteration, EvoPrompt based on GA produces N new prompts, which are combined with the current population of N prompts; the updated population is then formed by retaining the N prompts with the highest scores.
Preliminary Knowledge on DE  In DE, solutions are represented by numerical vectors. Each candidate vector in the population is selected in turn as a basic vector x to undergo mutation and crossover. Mutation generates a mutated solution y from a solution a randomly sampled from the current population: a scaled difference between two further distinct solutions b and c, randomly selected from the population, is added to a, i.e., y = a + F(b − c), where F is the scaling parameter. Crossover generates a trial solution x′ = [x′_1, ..., x′_n] by choosing each parameter of the vector from either the basic solution x or the mutated solution y:

    x′_i = y_i if r_i < CR, and x′_i = x_i otherwise,    (1)

where CR is a pre-defined crossover probability and r_i is a uniformly distributed random number. Then, x is replaced with x′ if x′ is better than x. Through step-by-step evolution, DE ends with a population of high quality. A modified version of DE uses the current best solution as the vector a, to exploit information from the best one.
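For reference, one generation of classic numeric DE can be sketched as below. This is a standard "rand/1/bin" scheme, not part of EvoPrompt itself; it omits the usual safeguard that forces at least one coordinate to come from y, and assumes a population of at least four vectors.

```python
import random

def de_generation(population, fitness, F=0.5, CR=0.9):
    """One generation of classic DE on real-valued vectors (maximization)."""
    new_population = []
    for i, x in enumerate(population):
        others = [v for j, v in enumerate(population) if j != i]
        a, b, c = random.sample(others, 3)
        # Mutation: y = a + F * (b - c).
        y = [ak + F * (bk - ck) for ak, bk, ck in zip(a, b, c)]
        # Binomial crossover (Eq. 1): take y_i with probability CR, else x_i.
        trial = [yk if random.random() < CR else xk for xk, yk in zip(x, y)]
        # One-to-one survivor selection: keep the trial only if it is better.
        new_population.append(trial if fitness(trial) > fitness(x) else x)
    return new_population
```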
Evolution  The evolutionary process of DE can be decoupled into three steps: 1) F(b − c); 2) y = a + F(b − c); 3) crossover of x and y. In EvoPrompt based on DE, we follow these three steps to design the evolutionary process, along with the corresponding instructions for LLMs to generate a new prompt, as illustrated in Figure 2:
• Inspired by the differential vector in DE, we mutate only the different parts of two prompts randomly selected from the current population (Steps 1 and 2 in Figure 2). The prompts in the current population are regarded as the current best ones; accordingly, the shared components of two prompts tend to have a positive impact on performance and should be preserved.
• A variant of DE employs the current best vector during the mutation process, where a mutated vector is generated by adding the scaled differential vector to the current best vector. Building upon this idea, we also leverage the current best prompt: we generate a mutated prompt by selectively replacing parts of the current best prompt with the mutated different parts (Step 3 in Figure 2).
• Crossover is defined as the process of replacing certain components of a basic prompt (i.e., a candidate prompt of the current population) with segments from the mutated prompt. This operation combines the features of two different prompts, potentially creating a new and improved solution (Step 4 in Figure 2). A sketch of the instruction that packages these steps appears after Figure 2 below.

Example response from Figure 2 (the mutated output of Step 2 is not shown in this excerpt):

1. Identify the different parts between Prompt 1 and Prompt 2:
Prompt 1: Categorize the tweet according to if it has a positive or negative sentiment.
Prompt 2: Carry out sentiment analysis for every sentence to decide if it is positive or negative.
Different parts: "tweet" vs. "sentence"; "Categorize" vs. "Carry out sentiment analysis"  [b − c]

3. Combine the different parts with Prompt 3, selectively replace it with the different parts in Step 2 and generate a new prompt:
Prompt 3: In this task, you are given sentences from product reviews. The task is to classify a sentence as positive or as negative.
New Prompt: In this task, you are given reviews about products. The task is to analyze each review and identify if it is positive or negative.  [a + F(b − c)]

4. Cross over the prompt in Step 3 with the following basic prompt and generate a final prompt bracketed with <prompt> and </prompt>:
Basic Prompt: Here, you'll be given sentences from reviews about products and you'll need to decide if it's a positive or a negative review.  [Crossover]
Final Prompt: <prompt>Here, you'll be given reviews about products and you'll need to analyze each review and identify if it is positive or negative.</prompt>

Figure 2: DE process implemented by LLMs for discrete prompt optimization (Evo(·) in Algorithm 1). In Step 1, LLMs find the different parts (highlighted in the original figure) between Prompt 1 and Prompt 2 (b − c in typical DE). In Step 2, LLMs perform mutation on these parts (an imitation of F(b − c)). Next, LLMs incorporate the current best prompt as Prompt 3 together with the mutated results from Step 2 to generate a new prompt (the counterpart of a + F(b − c) in DE). Finally, LLMs perform crossover between the current basic prompt p_i and the prompt generated in Step 3.
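As noted above, the four DE steps can be packaged into a single instruction for the LLM. The sketch below is one Python rendering under stated assumptions: the wording of Steps 1, 3, and 4 follows the response excerpt in Figure 2, the preamble and Step 2 are paraphrased approximations, and `llm` is an assumed callable wrapping the GPT-3.5 API.

```python
DE_TEMPLATE = """Please follow the instruction step-by-step to generate a better prompt.
1. Identify the different parts between Prompt 1 and Prompt 2:
Prompt 1: {prompt1}
Prompt 2: {prompt2}
2. Randomly mutate the different parts.
3. Combine the different parts with Prompt 3, selectively replace it with the
different parts in Step 2 and generate a new prompt:
Prompt 3: {best_prompt}
4. Cross over the prompt in Step 3 with the following basic prompt and generate
a final prompt bracketed with <prompt> and </prompt>:
Basic Prompt: {basic_prompt}"""

def de_evo(llm, prompt1, prompt2, best_prompt, basic_prompt):
    # The LLM executes all four DE-style steps in one call; the new prompt is
    # parsed from the <prompt>...</prompt> markers of its response.
    response = llm(DE_TEMPLATE.format(prompt1=prompt1, prompt2=prompt2,
                                      best_prompt=best_prompt,
                                      basic_prompt=basic_prompt))
    return response.split("<prompt>")[1].split("</prompt>")[0].strip()
```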
Update  Following standard DE, each prompt p_i in the current population is chosen in turn as the basic prompt to generate a corresponding new prompt p′_i using the instruction depicted in Figure 2. Then, whichever of p_i and p′_i has the higher score is retained. Accordingly, the population size remains constant while the overall quality of the population is enhanced.
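Under the same assumptions (the `de_evo` helper sketched above and a `score_fn` for development-set scores, which a real implementation would cache), this one-to-one update can be sketched as:

```python
import random

def de_update(population, score_fn, llm):
    """DE-style update: each prompt serves as the basic prompt in turn and is
    replaced only if its offspring scores higher, so the population size stays
    constant while quality never decreases."""
    best = max(population, key=score_fn)              # current best (Prompt 3)
    new_population = []
    for basic in population:
        p1, p2 = random.sample(population, 2)         # random Prompts 1 and 2
        candidate = de_evo(llm, p1, p2, best, basic)  # Evo(.) as sketched above
        new_population.append(candidate if score_fn(candidate) > score_fn(basic)
                              else basic)
    return new_population
```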
4 EXPERIMENTS
In this section, we evaluate the performance of the proposed EvoPrompt. We first give the implementation details of the experiments and the baselines, then evaluate EvoPrompt on both language understanding and generation tasks.

With GPT-3.5 performing the evolutionary operations, we use EvoPrompt to optimize prompts for the open-source Alpaca-7b (Taori et al., 2023) and the closed-source GPT-3.5 (Brown et al., 2020). We pick the prompt with the highest score on the development set and report its score on the test set.
We compare our method with three types of prompt-based baselines. Manual Instructions (MI) for the language understanding, summarization, and simplification tasks refer to the instructions designed in Zhang et al. (2023b), Sanh et al. (2021), and Zhang et al. (2023c), respectively. PromptSource (Bach et al., 2022) and Natural Instructions (NI) (Mishra et al., 2022b) collect human-written prompts for various datasets; we keep the same verbalizer used in Mishra et al. (2022b) when reproducing the experiments. APE (Zhou et al., 2022) applies iterative Monte Carlo search over initial prompts obtained by instruction induction from several input-output pairs. We reproduce APE using the same resampling template as Zhou et al. (2022), after initializing a population of the same size as EvoPrompt's by instruction induction.
Datasets and Settings  We experiment on language understanding tasks across 7 datasets, including sentiment classification (SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), CR (Hu & Liu, 2004), SST-5 (Socher et al., 2013)), topic classification (AG's News (Zhang et al., 2015) and TREC (Voorhees & Tice, 2000)), and subjectivity classification (Subj (Pang & Lee, 2004)). To constrain the output label space, we prepend a demonstration consisting of one example per class before the test case. See Appendix A for more details.
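As an illustration of this setup, the model input can be assembled roughly as follows; the exact templates are given in Appendix A (Tables 6 and 7), so the format below is hypothetical.

```python
def build_classification_input(instruction, demos, test_case):
    """Assemble the model input: instruction, then one labelled example per
    class, then the test case (illustrative format only)."""
    demo_block = "\n".join(f"{text}\n{label}" for text, label in demos)
    return f"{instruction}\n{demo_block}\n{test_case}\n"

# Example: constraining a sentiment task to the labels 'positive'/'negative'.
demos = [("the film is a delight .", "positive"),
         ("a tedious , clumsy mess .", "negative")]
```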
Main Results  As shown in Table 1, we note that: 1) Compared with previous work on prompt generation (APE) and human-written instructions, EvoPrompt achieves significantly better results. 2) EvoPrompt (GA) is slightly better than EvoPrompt (DE) on the sentiment classification datasets. On the topic classification datasets, EvoPrompt (GA) and EvoPrompt (DE) demonstrate comparable results, while on the subjectivity classification task (Subj), EvoPrompt (DE) is significantly better than EvoPrompt (GA), with a 9.7% accuracy advantage.

Table 2: Main results on the SAMSum dataset (summarization task) for Alpaca-7b and GPT-3.5.

Method                     Alpaca                        GPT-3.5
                           ROUGE-1  ROUGE-2  ROUGE-L     ROUGE-1  ROUGE-2  ROUGE-L
MI (Sanh et al., 2021)     35.92    11.16    31.67       43.95    17.11    39.09
APE (Zhou et al., 2022)    34.92    10.56    31.59       43.43    16.72    38.25
EvoPrompt (GA)             36.61    12.48    32.30       45.22    18.52    41.06
EvoPrompt (DE)             39.86    14.24    36.09       46.49    19.49    41.96
Datasets and Settings  For language generation, we evaluate EvoPrompt on text summarization and simplification tasks. For summarization, we adopt SAMSum (Gliwa et al., 2019), a challenging and intricate dialogue summarization dataset, and report ROUGE-1/2/L scores for Alpaca-7b and GPT-3.5. For the text simplification task, which aims to simplify the text while keeping its original meaning, we adopt the representative ASSET dataset (Alva-Manchego et al., 2020), which has multiple references. We report the SARI score (Xu et al., 2016), an n-gram-based metric widely used in text editing tasks. See Appendix A for more details.

Table 3: Main results (SARI) on the ASSET dataset (simplification task) for Alpaca-7b and text-davinci-003.

Method                     Alpaca   GPT-3.5
MI (Zhang et al., 2023c)   43.03    43.80
APE (Zhou et al., 2022)    46.02    46.71
EvoPrompt (GA)             46.67    47.36
EvoPrompt (DE)             46.58    47.40
Main Results  The results for summarization and simplification are shown in Tables 2 and 3, respectively. The proposed EvoPrompt significantly outperforms both the manually designed prompts and the prompts generated by APE on two models of different scales, Alpaca-7b and GPT-3.5. In addition, EvoPrompt (DE) is significantly better than EvoPrompt (GA) on the summarization task and performs comparably on the simplification task.
5 ANALYSIS
In this section, we conduct analysis experiments to validate the designs in EvoPrompt and to provide insights on how to choose between EvoPrompt (GA) and EvoPrompt (DE).

Since the evolutionary operator design for GA (i.e., crossover and mutation) is straightforward, we focus on studying the design choices for EvoPrompt (DE). There are two key design aspects in EvoPrompt (DE) when adapting the evolutionary operators to discrete prompts: mutating only the different parts, and selecting the current best prompt as Prompt 3 in Figure 2. We investigate these designs, which may affect the effectiveness of EvoPrompt (DE), on an understanding dataset, Subj, where EvoPrompt (DE) performs much better than EvoPrompt (GA), and on a generation dataset, ASSET, where the two have similar performance. We use GPT-3.5 to perform the evolutionary operations and optimize the prompts for Alpaca-7b.
Mutation on Different Parts  To illustrate the benefits of mutating only the different parts, we replace the first two steps in Figure 2 with the instruction "Randomly mutate Prompt 1 and Prompt 2", allowing mutation on all content of Prompts 1 and 2, denoted as "All" in Table 4. The original design in EvoPrompt, which mutates only the different parts, is denoted as "Diff". As shown in Table 4, mutating only the different parts consistently provides improvements.

Figure 3: The best and average accuracy of each iteration on the development set of SST-5 (left) and Subj (right), for EvoPrompt (GA) and EvoPrompt (DE).
We instantiate the proposed EvoPrompt with two specific algorithms, GA and DE. To gain insight into the selection between these two algorithms and understand their respective advantages and limitations, we select two datasets: 1) SST-5, on which EvoPrompt (GA) performs better; 2) Subj, on which EvoPrompt (DE) exhibits superior performance. We then show the average and optimal scores on the development set for each iteration in Figure 3. On SST-5, the average quality of the population with EvoPrompt (GA) consistently exceeds that of EvoPrompt (DE), and the optimal prompts are also better. This is attributable to the selection strategy of GA, in which prompts with higher scores are more likely to be chosen as parents for generating new prompts, whereas in DE, each prompt in the population is sequentially selected as the basic prompt, with Prompts 1 and 2 chosen at random. Accordingly, GA has a higher probability of searching near the current best solutions, which increases the likelihood of achieving better results when the initial manual prompts are of relatively high quality. For example, the manual prompts for SST-5 are already well designed and the improvement from EvoPrompt is not substantial. Conversely, the performance of existing manual prompts on Subj is poor, and EvoPrompt achieves a remarkable 25% improvement over the manual one. On this dataset, EvoPrompt (GA) gets trapped in a local optimum while EvoPrompt (DE) successfully escapes and yields much better results. Benefiting from its selection strategy and well-designed evolutionary operators, EvoPrompt (DE) has a higher likelihood of escaping local optima. In summary, we suggest choosing EvoPrompt (GA) when several high-quality prompts already exist, and choosing EvoPrompt (DE) otherwise.
6 FUTURE WORKS
Firstly, our explorations of EvoPrompt mainly focus on several representative NLP tasks; we expect to investigate more diverse tasks, such as multi-modal tasks, using discrete prompts. Secondly, it would be interesting to investigate whether LLMs can effectively control hyper-parameters, such as the CR parameter in Equation 1, when provided with appropriate instructions. Thirdly, GA and DE are only two examples among the plethora of available mathematical algorithms. Further research could explore the extent to which LLMs are capable of performing a wide range of diverse algorithms by interacting with humans through natural language descriptions. For example, future research could investigate whether LLMs can also generate candidate solutions in other derivative-free algorithms, such as simulated annealing (Van Laarhoven et al., 1987).
7 CONCLUSIONS
To address the challenge that the performance of LLMs depends heavily on well-designed prompts, we propose EvoPrompt, which optimizes discrete prompts from an initial population, with LLMs acting as evolutionary operators to automatically generate and search for optimal prompts. Based on our findings, we believe that LLMs offer an effective and interpretable interface for implementing traditional algorithms, ensuring good alignment with human understanding and communication. Our findings corroborate a recent trend in which LLMs perform "gradient descent" in discrete space by collecting incorrectly predicted samples (Pryzant et al., 2023; Guo et al., 2023). Our work takes a significant step forward by demonstrating the potential of LLMs to participate in a large range of traditional algorithms. We hope that our explorations will inspire further investigations into the combination of LLMs and conventional algorithms, paving the way for new and innovative applications of LLMs.
REFERENCES
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia
Specia. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple
rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 4668–4679, 2020.
Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht
Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. Promptsource: An integrated
development environment and repository for natural language prompts. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.
93–104, 2022.
Janez Brest, Sao Greiner, Borko Boskovic, Marjan Mernik, and Viljem Zumer. Self-adapting control
parameters in differential evolution: A comparative study on numerical benchmark problems.
IEEE transactions on evolutionary computation, 10(6):646–657, 2006.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization.
SIAM, 2009.
Swagatam Das and Ponnuthurai Nagaratnam Suganthan. Differential evolution: A survey of the
state-of-the-art. IEEE transactions on evolutionary computation, 15(1):4–31, 2010.
Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song,
Eric P Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement
learning. arXiv preprint arXiv:2205.12548, 2022.
Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds
and Machines, 30:681–694, 2020.
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-
annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237,
2019.
Yiduo Guo, Yaobo Liang, Chenfei Wu, Wenshan Wu, Dongyan Zhao, and Nan Duan. Learning to
program with natural language. arXiv preprint arXiv:2304.10464, 2023.
Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. Warp: Word-level adversarial
reprogramming. In ACL-IJCNLP, pp. 4921–4933, 2021.
John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann
Arbor, 1975. ISBN 0262581116.
John H Holland. Adaptation in natural and artificial systems: an introductory analysis with applica-
tions to biology, control, and artificial intelligence. MIT press, 1992.
Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD, pp. 168–177, 2004.
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu,
Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. Opt-iml: Scaling language model
instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017,
2022.
Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language
models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.
Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sameer Singh, Sean
Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, et al. Prompt waywardness: The
curious case of discretized interpretation of continuous prompts. arXiv preprint arXiv:2112.08348,
2021.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 3045–3059, 2021a.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In EMNLP, pp. 3045–3059, 2021b.
Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao,
Jiang Bian, and JingBo Zhu. Deliberate then generate: Enhanced prompting framework for text
generation. arXiv preprint arXiv:2305.19835, 2023.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pp. 4582–4597, 2021.
Adam Lipowski and Dorota Lipowska. Roulette-wheel selection via stochastic acceptance. Physica
A: Statistical Mechanics and its Applications, 391(6):2193–2196, 2012.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig.
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language
processing. ACM Computing Surveys, 55(9):1–35, 2023.
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang.
P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks.
arXiv preprint arXiv:2110.07602, 2021a.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt
understands, too. arXiv preprint arXiv:2103.10385, 2021b.
Seyedali Mirjalili, Jin Song Dong, Ali Safa Sadiq, and Hossam Faris. Genetic algorithm: Theory,
literature review, and application in image reconstruction. Nature-Inspired Optimizers: Theories,
Literature Reviews and Applications, pp. 69–85, 2020.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing
instructional prompts to gptk’s language. In Findings of the Association for Computational
Linguistics: ACL 2022, pp. 589–612, 2022a.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization
via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 3470–3487, 2022b.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization
via natural language crowdsourcing instructions. In ACL, 2022c.
Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998.
Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv
preprint arXiv:2111.09734, 2021.
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.
Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005.
Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summariza-
tion based on minimum cuts. arXiv preprint cs/0409058, 2004.
Millie Pant, Hira Zaheer, Laura Garcia-Hernandez, Ajith Abraham, et al. Differential evolution: A
review of more than two decades of research. Engineering Applications of Artificial Intelligence,
90:103479, 2020.
Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. Grips: Gradient-free, edit-based
instruction search for prompting large language models. arXiv preprint arXiv:2203.07281, 2022.
Kenneth V Price. Differential evolution. In Handbook of optimization: From classical to modern
approach, pp. 187–214. Springer, 2013.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi
Yang. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint
arXiv:2302.06476, 2023.
Luis Miguel Rios and Nikolaos V Sahinidis. Derivative-free optimization: a review of algorithms and
comparison of software implementations. Journal of Global Optimization, 56:1247–1293, 2013.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables
zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and
natural language inference. In Proceedings of the 16th Conference of the European Chapter of the
Association for Computational Linguistics: Main Volume, pp. 255–269, 2021.
Weijia Shi, Xiaochuang Han, Hila Gonen, Ari Holtzman, Yulia Tsvetkov, and Luke Zettlemoyer.
Toward human readable prompt tuning: Kubrick’s the shining is a good movie, and a good prompt
too? arXiv preprint arXiv:2212.10539, 2022.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt:
Eliciting knowledge from language models with automatically generated prompts. In Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
4222–4235, 2020.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In EMNLP, pp. 1631–1642, 2013.
Rainer Storn and Kenneth Price. Differential evolution–a simple and efficient heuristic for global
optimization over continuous spaces. Journal of global optimization, 11:341–359, 1997.
Tianxiang Sun, Zhengfu He, Hong Qian, Yunhua Zhou, Xuanjing Huang, and Xipeng Qiu. BBTv2:
Towards a gradient-free future with large language models. In Proceedings of the 2022 Conference
on Empirical Methods in Natural Language Processing, pp. 3916–3930, Abu Dhabi, United
Arab Emirates, December 2022a. Association for Computational Linguistics. URL https://
aclanthology.org/2022.emnlp-main.259.
Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for
language-model-as-a-service. In International Conference on Machine Learning, pp. 20841–20855.
PMLR, 2022b.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy
Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Peter JM van Laarhoven and Emile HL Aarts. Simulated annealing. Springer, 1987.
Jakob Vesterstrom and Rene Thomsen. A comparative study of differential evolution, particle swarm
optimization, and evolutionary algorithms on numerical benchmark problems. In Proceedings
of the 2004 congress on evolutionary computation (IEEE Cat. No. 04TH8753), volume 2, pp.
1980–1987. IEEE, 2004.
Ellen M Voorhees and Dawn M Tice. Building a question answering test collection. In Proceedings of
the 23rd annual international ACM SIGIR conference on Research and development in information
retrieval, pp. 200–207, 2000.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial
triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), pp. 2153–2162, 2019.
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statis-
tical machine translation for text simplification. Transactions of the Association for Computational
Linguistics, 4:401–415, 2016.
JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why johnny can’t
prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI
Conference on Human Factors in Computing Systems, pp. 1–21, 2023.
Jingqiao Zhang and Arthur C. Sanderson. Jade: Adaptive differential evolution with optional
external archive. IEEE Transactions on Evolutionary Computation, 13(5):945–958, 2009. doi:
10.1109/TEVC.2009.2014613.
Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun
Chen. Differentiable prompt makes pre-trained language models better few-shot learners. arXiv
preprint arXiv:2108.13161, 2021.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:2205.01068, 2022.
Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera:
Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on
Learning Representations, 2023a.
Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. Sentiment analysis in the
era of large language models: A reality check. arXiv preprint arXiv:2305.15005, 2023b.
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text
classification. NeurIPS, 28, 2015.
Yue Zhang, Leyang Cui, Deng Cai, Xinting Huang, Tao Fang, and Wei Bi. Multi-task instruction
tuning of llama for specific scenarios: A preliminary study on writing assistance. arXiv preprint
arXiv:2305.13225, 2023c.
Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [mask]: Learning vs. learning to
recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 5017–5033, 2021.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint
arXiv:2211.01910, 2022.
Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei
Ye, Neil Zhenqiang Gong, Yue Zhang, et al. Promptbench: Towards evaluating the robustness of
large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023.
A EXPERIMENTAL SETTINGS
A.1 DATASETS
Table 5 shows the statistics of the text classification and summarization datasets we used.
A.2 TEMPLATES
For different models, we apply the different templates shown in Table 6 and Table 7, following previous works (Iyer et al., 2022; Taori et al., 2023; Zhang et al., 2023b; Li et al., 2023).
Below is an instruction that describes a task, paired with an input that provides further context. Write a response
that appropriately completes the request.
### Instruction:
<PROMPT>
### Input:
<INPUT>
### Response:
<COMPLETE>
Zero-shot Example:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response
that appropriately completes the request.
### Instruction:
Please perform Sentiment Classification task. Given the sentence, assign a sentiment label from [’negative’,
’positive’]. Return label only without any other text.
### Input:
beautifully observed , miraculously unsentimental comedy-drama .
### Response:
<COMPLETE>
<PROMPT>
<INPUT>
The simplification is <COMPLETE>
Zero-shot example:
Simplify the text.
Subsequently, in February 1941, 600 Jews were sent to Buchenwald and Mauthausen concentration camps.
The simplification of the sentence is <COMPLETE>
<PROMPT>
<INPUT>
TL;DR: <COMPLETE>
Zero-shot example:
How would you rephrase that in a few words?
Theresa: have you been at Tom’s new place? Luis: yes, it’s nice Marion: He invited us for a dinner Adam: where
is it? Marion: a bit outside the city Adam: where exactly? Marion: Fiesole Luis: very nice!
TL;DR: <COMPLETE>
Table 7: Templates for summarization (referring to Sanh et al. (2021); Qin et al. (2023)) and simplification (referring to Li et al. (2023)), and the corresponding zero-shot examples used for text-davinci-003.
Table 8: Settings for the experiments. |Shots| refers to the number of examples in the demonstration. For the text classification tasks, we set the value to 1, meaning we prepend one example per category to constrain the output to the label space.
A.3 PARAMETERS
The parameters for the experiments are shown in Table 8. For the evolutionary operations implemented by GPT-3.5, we use top-p decoding (temperature = 0.5, p = 0.95). For task implementation with Alpaca, we use greedy decoding with the default temperature. For the generation tasks implemented by GPT-3.5, the temperature is 0.0.
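As a concrete illustration, these two decoding configurations could be held as plain dictionaries and forwarded to whatever client library is in use; the `generate` wrapper and its `client.complete` call below are hypothetical placeholders, not an actual API.

```python
# Decoding configurations described above (sketch under assumptions).
EVOLUTION_DECODING = {"temperature": 0.5, "top_p": 0.95}  # evolutionary operators
GPT35_TASK_DECODING = {"temperature": 0.0}                # generation tasks on GPT-3.5

def generate(client, model, prompt, config):
    # Hypothetical wrapper: forwards the decoding parameters to the LLM API.
    return client.complete(model=model, prompt=prompt, **config)
```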
Text Classification  The population of prompts is initialized with instructions widely used in previous works (Mishra et al., 2022b; Zhang et al., 2022), which we paraphrase and rewrite to initialize the population. The size of the development set is 200. We report results on the full test set (the same as previous related works (Deng et al., 2022; Zhang et al., 2023a)), as shown in Table 8.
Text Generation  For the initial population, we collect instructions for summarization and simplification from Li et al. (2023), Sanh et al. (2021), and Zhang et al. (2023c), and augment them to the expected size (10 in our setting), either manually written or generated by GPT-3.5.
Table 9: Manual instructions (following Sanh et al. (2021)) as the baseline, and the instructions with the best performance on Alpaca-7b and text-davinci-003 (GPT) generated by EvoPrompt (either DE or GA) on SAMSum.
Table 10: Manual instructions (following Zhang et al. (2023c)) as the baseline, and the instructions with the best performance on Alpaca-7b and text-davinci-003 (GPT) generated by EvoPrompt (either DE or GA) on the ASSET dataset.
B OPTIMAL PROMPTS

We release the optimal prompts generated by EvoPrompt for the different tasks, as shown in Tables 9, 10, and 11.
Table 11: Manual instructions (following Zhang et al. (2023b) and Zhang et al. (2023c)), Natural Instructions (Mishra et al., 2022b), and PromptSource (Bach et al., 2022) as baselines, and the instructions with the best performance on Alpaca-7b generated by EvoPrompt (either DE or GA) on the classification datasets.