
Connecting Large Language Models With Evolutionary Algorithms Yields Powerful Prompt

The document proposes a novel framework called EVOPROMPT that connects large language models (LLMs) with evolutionary algorithms (EAs) to automatically optimize discrete prompts. EVOPROMPT uses LLMs to generate new prompt candidates by imitating evolutionary operators from EAs, and EAs guide the optimization process to retain the optimal prompts. Experiments on 9 datasets show EVOPROMPT outperforms human-engineered prompts and existing methods for automatic prompt generation, improving performance by up to 14%.

Preprint

CONNECTING LARGE LANGUAGE MODELS WITH EVOLUTIONARY ALGORITHMS YIELDS POWERFUL PROMPT OPTIMIZERS

Qingyan Guo1,2,†,∗, Rui Wang2,†, Junliang Guo2, Bei Li2,3, Kaitao Song2, Xu Tan2,‡,
Guoqing Liu2, Jiang Bian2, Yujiu Yang1,‡
1 Tsinghua University, 2 Microsoft Research, 3 Northeastern University
[email protected], [email protected],
{ruiwa,junliangguo,kaitaosong,xuta,guoqingliu,jiabia}@microsoft.com
[email protected]

arXiv:2309.08532v1 [cs.CL] 15 Sep 2023

ABSTRACT

Large Language Models (LLMs) excel in various tasks, but they rely on carefully
crafted prompts that often demand substantial human effort. To automate this process,
in this paper, we propose a novel framework for discrete prompt optimization,
called EVOPROMPT, which borrows the idea of evolutionary algorithms (EAs) as
they exhibit good performance and fast convergence. To enable EAs to work on
discrete prompts, which are natural language expressions that need to be coherent
and human-readable, we connect LLMs with EAs. This approach allows us to
simultaneously leverage the powerful language processing capabilities of LLMs
and the efficient optimization performance of EAs. Specifically, abstaining from
any gradients or parameters, EVOPROMPT starts from a population of prompts
and iteratively generates new prompts with LLMs based on the evolutionary
operators, improving the population based on the development set. We optimize
prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca,
on 9 datasets spanning language understanding and generation tasks. EVOPROMPT
significantly outperforms human-engineered prompts and existing methods for
automatic prompt generation by up to 25% and 14% respectively. Furthermore,
EVOPROMPT demonstrates that connecting LLMs with EAs creates synergies,
which could inspire further research on the combination of LLMs and conventional
algorithms.

1 INTRODUCTION
Large language models (LLMs) show remarkable performance on multiple natural language
processing (NLP) tasks (Floridi & Chiriatti, 2020; Touvron et al., 2023; Ouyang et al., 2022; Zhang
et al., 2022). To adapt to downstream tasks, the traditional fine-tuning paradigm is too costly for
LLMs. Continuous prompt tuning methods (Sun et al., 2022b;a; Li & Liang, 2021; Liu et al., 2021b)
alleviate the huge cost by prepending soft trainable prompt embeddings to the input while keeping
the parameters of LLMs frozen. However, such approaches still rely on access to the parameters of
LLMs, as they require continuous embeddings as input, making them inadequate for LLMs
accessed through black-box APIs such as GPT-3 and GPT-4 (Brown et al., 2020; OpenAI, 2023).
Instead, simply adding an instruction to the input text, also known as a type of discrete prompt, steers
LLMs to carry out the desired task with negligible impact on computational cost while eliminating
the need for access to the parameters and gradients of LLMs (Liu et al., 2023).
Despite the convenience, the performance of LLMs on a given task is significantly
influenced by the prompt (Liu et al., 2023; Zhu et al., 2023). Accordingly, the key challenge of this
approach lies in the design of the prompt, which has emerged as a crucial technique known as
prompt engineering (Zhou et al., 2022). Prompt engineering usually involves elaborate manual
design (Mishra et al., 2022a;b). Given the wide variation in prompts across language models and
tasks, prompt design typically requires substantial human effort and expertise, with subjective and
relatively limited guidelines (Liu et al., 2023; Zamfirescu-Pereira et al., 2023; Prasad et al., 2022).

∗ Work done during an internship at Microsoft Research Asia.
† Equal Contribution.
‡ Corresponding Author.

To alleviate the human effort on discrete prompt design, previous approaches usually rely on access to
the token probabilities from the output layer of LLMs, which may not always be accessible through
APIs (Deng et al., 2022; Zhang et al., 2023a). Some recent works consider enumerating diverse
prompts and selecting the best ones (Zhou et al., 2022; Jiang et al., 2020), or modifying current
prompts to improve them (Guo et al., 2023; Prasad et al., 2022; Pryzant et al., 2023). Such approaches
either emphasize exploring diverse prompts, which may lead to indecisiveness and wasted resources,
or focus on exploiting the currently identified good prompts, which may result in stagnation and
confine the search to local optima. Several conventional derivative-free algorithms are well-designed
and strike a good balance between exploration and exploitation (Conn et al., 2009; Rios & Sahinidis,
2013). Among these, evolutionary algorithms (EAs) (Storn & Price, 1997; Brest et al., 2006; Zhang
& Sanderson, 2009; Vesterstrom & Thomsen, 2004) stand out as they are simple and efficient, as
well as suitable for discrete prompt optimization. Sequences of phrases in discrete prompts can be
regarded as gene sequences in typical EAs, making them compatible with the natural evolutionary
process.
In this paper, we borrow the idea of EAs and propose a discrete prompt tuning framework,
EVOPROMPT. While evolutionary operators in EAs are typically designed for sequences, they tend
to independently alter tokens in order to generate new candidate solutions. Unfortunately, this
approach ignores the connections among the tokens, which are crucial for maintaining coherence
and readability in discrete prompts. Consequently, designing evolutionary operators for discrete
prompts is challenging. Taking advantage of LLMs’ expertise in natural language processing and
the exceptional optimization capabilities of EAs, we synergistically connect these two approaches,
where LLMs generate new candidate prompts following evolutionary operators and EAs guide the
optimization process to retain the optimal prompts. Specifically, based on several initial prompts, we
utilize LLMs to imitate evolutionary operators in EAs to generate new prompt candidates, and the
prompt with better performance on the development set is preserved. These operations are applied
iteratively to the updating population to improve its quality. We optimize the prompts for two
different LLMs (i.e., Alpaca (Taori et al., 2023), and GPT-3.5 (Brown et al., 2020)) on a diverse range
of natural language understanding and generation tasks, using a total of 9 datasets. EVOPROMPT
consistently gets better prompts compared to both manually designed ones and previous automatic
prompt generation methods.
The main contributions of this paper include:

• We propose a novel framework for automatic discrete prompt optimization connecting LLMs
and EAs, called EVOPROMPT, which enjoys the following advantages: 1) It does not require
access to any parameters or gradients of LLMs; 2) It strikes a balance between exploration
and exploitation leading to better results; 3) The generated prompts are human-readable.
• Experiments conducted over 9 datasets demonstrate the effectiveness of EVOPROMPT
compared with existing methods, improving up to 14%. We release the optimal prompts
obtained by EVOPROMPT for common tasks such as sentiment classification, topic
classification, subjectivity classification, simplification and summarization.
• To the best of our knowledge, we are the first to demonstrate that LLMs are capable of
implementing the evolutionary algorithm provided with appropriate instructions. We hope
this work will inspire broader applications of combining LLMs and conventional algorithms.

2 RELATED WORKS

2.1 PROMPTS IN LLMS

Prompting is a highly efficient method for employing LLMs in specialized tasks; however, the
performance is heavily influenced by the choice of the prompt. Recently, automatic prompt optimization
has attracted wide attention. Continuous prompt-based methods, also known as soft prompt tuning,
which only tune parameters of the prefix or inserted token (Li & Liang, 2021; Liu et al., 2021b;a; Zhang
et al., 2021), or tune word embeddings (Lester et al., 2021a; Zhong et al., 2021), have been the
favored approaches with lower cost, compared with traditional fine-tuning paradigms. In spite of
their effective performance, two drawbacks of such paradigms cannot be ignored: 1) The
optimization of continuous prompts requires parameters of LLMs that are inaccessible for black-box
APIs. 2) Soft prompts often fall short of interpretability (Khashabi et al., 2021; Lester et al., 2021b;
Hambardzumyan et al., 2021; Mokady et al., 2021). Discrete prompts, simply adding several discrete
tokens, such as “It was” (Schick & Schütze, 2021), or task-specific descriptive instructions, such as
“Classify the comment into positive or negative.”, to the input text, offer an interactive interface to
humans with better interpretability and show promising performance in various NLP tasks (Liu et al.,
2023).

2.2 DISCRETE PROMPTS

Various approaches have been proposed for automatic discrete prompt searching and generation,
which are usually based on the gradients (Shin et al., 2020; Shi et al., 2022; Wallace et al., 2019).
Discrete prompt tuning approaches based on reinforcement learning (RL) (Deng et al., 2022; Zhang
et al., 2023a) design reward functions using the output layer and also bring training overhead.
More recently, considering the high variance of different prompts for downstream tasks, methods of
prompt generation focus on exploration by enumerating and selecting the best prompt from a number
of candidates (mainly augmented by re-sampling (Zhou et al., 2022; Jiang et al., 2020)). Methods
based on prompt revision (Pryzant et al., 2023; Guo et al., 2023) collect the cases incorrectly predicted
by LLMs and analyze the corresponding root cause to improve the prompt, which favors exploitation
of the current prompt with little exploration. Additionally, such approaches are constrained to
tasks with standard answers and cannot be directly applied to generation tasks, as the outputs for
such tasks are flexible and cannot be simply categorized as “correct” or “incorrect”. Approaches
based on prompt editing (Zhang et al., 2023a; Prasad et al., 2022) also emphasize exploitation, which
may potentially lead to local optima. Our proposed EVOPROMPT, empowered with evolutionary
algorithms, strikes a balance between exploration and exploitation without requiring any parameters
or gradients.

3 AUTOMATIC DISCRETE PROMPT OPTIMIZATION

Current advanced LLMs are typically accessed via black-box APIs, while the gradients and
parameters are inaccessible. Evolutionary algorithms (EAs) are derivative-free algorithms with exceptional
accuracy and rapid convergence, which benefit from a delicate balance between exploration and
exploitation. Accordingly, we consider introducing EAs into discrete prompt optimization. However,
to generate new candidate solutions, evolutionary operators typically edit the elements in current
solutions independently, without considering the connections between them. This makes it challenging to
apply evolutionary operators on discrete prompts, which require coherence and readability. To address
this challenge, we propose a synergistic approach that connects the natural language processing
expertise of LLMs with the optimization capabilities of EAs, called EVOPROMPT. Specifically, LLMs
generate new candidate prompts based on evolutionary operators, while EAs guide the optimization
process to find the optimal prompts.
In order to implement EVOPROMPT in practice, it is necessary to instantiate it with a specific
algorithm of EAs. There are various types of EAs, and in this paper, we consider two widely used
algorithms (i.e., Genetic Algorithm (Holland, 1975) and Differential Evolution (Storn & Price, 1997)).
GA is among the most highly regarded evolutionary algorithms (Holland, 1975; 1992; Mitchell, 1998;
Mirjalili et al., 2020) and DE has emerged as one of the most widely utilized algorithms for complex
optimization challenges since its inception (Storn & Price, 1997; Price, 2013; Das & Suganthan,
2010; Pant et al., 2020). Both algorithms have gained significant recognition and adoption. In the
following, we will first outline the proposed EVOPROMPT, and then instantiate EVOPROMPT with
GA and DE respectively.

3.1 FRAMEWORK OF EVOPROMPT

Algorithm 1 Discrete prompt optimization: EVOPROMPT

Require: Initial prompts P_0 = {p_1, p_2, ..., p_N}, size of population N, a dev set D, f_D(·) denotes
the score of a prompt on the desired LLM evaluated on D, a pre-defined number of iterations T,
carefully designed evolutionary operators to generate a new prompt Evo(·)
1: Initial evaluation scores: S_0 ← {s_i = f_D(p_i) | i ∈ [1, N]}
2: for t = 1 to T do
3:    Selection: select a certain number of prompts from current population as parent prompts
      p_r1, ..., p_rk ∼ P_{t−1}
4:    Evolution: generate a new prompt based on the selected parent prompts by leveraging LLM
      to perform evolutionary operators p′_i ← Evo(p_r1, ..., p_rk)
5:    Evaluation: s′_i ← f_D(p′_i)
6:    Update: P_t ← {P_{t−1}, p′_i} and S_t ← {S_{t−1}, s′_i} based on the evaluation scores
7: end for
8: Return the best prompt, p∗, among the final population P_T: p∗ ← argmax_{p ∈ P_T} f_D(p)

EAs typically start with an initial population of N solutions (equivalent to prompts in our setting),
then iteratively generate new solutions using evolutionary operators (e.g., mutation and crossover) on
the current population and update the population based on a score function. Following the typical
EAs, EVOPROMPT mainly contains three steps:

• Initial population: Based on our observation that most existing prompt-based methods neglect
human knowledge, which provides an efficient prior initialization, we apply several manual prompts
as the initial population to leverage the wisdom of humans as prior knowledge. Besides,
EAs typically start from randomly generated solutions, resulting in a diverse population and
avoiding being trapped in a local optimum. Accordingly, we also introduce some prompts
generated by LLMs (Zhou et al., 2022) into the initial population.
• Evolution: In each iteration, EVOPROMPT uses LLMs as evolutionary operators to generate
a new prompt based on several parent prompts selected from the current population. To
accomplish this, we carefully design steps of the mutation and crossover operators for each
specific type of EAs, along with corresponding instructions to guide the LLMs in generating
new prompts based on these steps.
• Update: We evaluate the generated candidate prompts on a development set and retain
those with superior performance, similar to the survival of the fittest in nature. The specific
updating strategy may vary depending on the type of EAs used.

The algorithm stops when the number of iterations reaches a predefined upper bound. The details of
EVOPROMPT are outlined in Algorithm 1. When instantiating EVOPROMPT with a specific algorithm
of EAs, the evolutionary and update processes need to be adjusted, and the key challenge is to design
the evolutionary operators on discrete prompts.
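
The loop in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `evoprompt`, `score` (standing in for f_D) and `evolve` (standing in for Evo, which would normally wrap an LLM call) are hypothetical, parents are drawn uniformly at random, and the update keeps the N best prompts. Both of these choices are assumptions made for illustration, since the selection and update rules depend on the EA used.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def evoprompt(
    initial_prompts: Sequence[str],
    score: Callable[[str], float],        # hypothetical stand-in for f_D(.)
    evolve: Callable[[List[str]], str],   # hypothetical stand-in for Evo(.), e.g. an LLM call
    iterations: int,
    num_parents: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, float]:
    """Return the best (prompt, score) pair after a fixed number of iterations."""
    rng = rng or random.Random(0)
    # Line 1 of Algorithm 1: score the initial population.
    population = [(p, score(p)) for p in initial_prompts]
    size = len(population)
    for _ in range(iterations):
        # Selection (assumption: uniform sampling without replacement).
        parents = [p for p, _ in rng.sample(population, num_parents)]
        # Evolution: delegate to the operator supplied by the caller.
        child = evolve(parents)
        # Evaluation on the development set.
        population.append((child, score(child)))
        # Update (assumption: keep the `size` highest-scoring prompts).
        population.sort(key=lambda item: item[1], reverse=True)
        population = population[:size]
    return max(population, key=lambda item: item[1])


# Toy usage with a dummy score and a dummy operator (no LLM involved).
toy_score = lambda p: float(len(set(p.split())))
toy_evolve = lambda parents: " ".join(parents)
best_prompt, best_score = evoprompt(["a b", "c d", "e f"], toy_score, toy_evolve, iterations=5)
```

Because the update in this sketch is elitist, the best score in the population never decreases from one iteration to the next.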

3.2 INSTANTIATION WITH GENETIC ALGORITHM

Selection In GA, two parent solutions are normally selected based on the roulette wheel selection
method according to the fitness value (Lipowski & Lipowska, 2012). Similar to this, we utilize the
roulette wheel selection method to select two parent prompts in the current population according
to the scores evaluated on development sets. Specifically, let $s_i$ denote the performance score on
the development set of the i-th prompt in the population, which contains a total of N prompts. The
probability of selecting the i-th prompt as a parent can be expressed as $p_i = s_i / \sum_{j=1}^{N} s_j$.
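
As a small illustration of the selection rule above, the following Python sketch computes the selection probabilities and draws two parents. The function names are hypothetical, and drawing with replacement through `random.choices` (so that the same prompt may be picked twice) is an assumption; the text does not state whether the two parents must be distinct.

```python
import random
from typing import List, Optional, Sequence


def selection_probabilities(scores: Sequence[float]) -> List[float]:
    """p_i = s_i / sum_j s_j, defined only for non-negative scores with a positive sum."""
    total = sum(scores)
    if total <= 0 or any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative with a positive sum")
    return [s / total for s in scores]


def roulette_select(
    prompts: Sequence[str],
    scores: Sequence[float],
    k: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw k parents with probability proportional to their scores (with replacement)."""
    rng = rng or random.Random()
    return rng.choices(list(prompts), weights=selection_probabilities(scores), k=k)


probs = selection_probabilities([0.9, 0.6, 0.5])
parents = roulette_select(["p1", "p2", "p3"], [0.9, 0.6, 0.5], rng=random.Random(0))
```

With development-set accuracies of 0.9, 0.6 and 0.5, the three prompts are selected with probabilities 0.45, 0.30 and 0.25.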

Evolution Following the evolutionary operators in GA, a new candidate prompt is generated
through a two-step process based on the selected two parents: 1) The parent prompts undergo
crossover, resulting in a new prompt that selectively combines components from both parents; 2)
The newly generated prompt from the first step undergoes mutation, in which random alterations are
made to some of its content. Based on this two-step process, we design instructions, guiding LLMs
to generate a new prompt based on these steps to perform Evo(·) in Algorithm 1. The process is
depicted in Figure 1.


Genetic Algorithm (GA) Implemented by LLMs

Query:
Please follow the instruction step-by-step to generate a better prompt.
1. Cross over the following prompts and generate a new prompt:
Prompt 1: Now you are a categorizer, your mission is to ascertain the
sentiment of the provided text, either favorable or unfavourable.
Prompt 2: Assign a sentiment label to the given sentence from ['negative',
'positive'] and return only the label without any other text.
2. Mutate the prompt generated in Step 1 and generate a final prompt bracketed with
<prompt> and </prompt>.

Response:
1. [Crossover] Crossover Prompt: Your mission is to ascertain the sentiment of the
provided text and assign a sentiment label from ['negative', 'positive'].
2. [Mutate] <prompt>Determine the sentiment of the given sentence and assign a label
from ['negative', 'positive'].</prompt>

Figure 1: GA process implemented by LLMs for discrete prompt optimization (Evo(·) in Algorithm 1).
In Step 1, LLMs perform crossover on the given two prompts (words in orange and blue are inherited
from Prompt 1 and Prompt 2 respectively). In Step 2, LLMs perform mutation on the prompt.
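
The query in Figure 1 can be assembled and its response parsed with a few lines of Python. The instruction wording below is copied from Figure 1; the helper names `build_ga_query` and `extract_prompt`, the `{p1}`/`{p2}` placeholders, and the choice to take the last `<prompt>...</prompt>` pair in the response are assumptions made for this sketch, and the LLM call itself is left out.

```python
import re
from typing import Optional

# Instruction text taken from Figure 1; the placeholders are illustrative.
GA_TEMPLATE = (
    "Please follow the instruction step-by-step to generate a better prompt.\n"
    "1. Cross over the following prompts and generate a new prompt:\n"
    "Prompt 1: {p1}\n"
    "Prompt 2: {p2}\n"
    "2. Mutate the prompt generated in Step 1 and generate a final prompt "
    "bracketed with <prompt> and </prompt>."
)

_PROMPT_RE = re.compile(r"<prompt>(.*?)</prompt>", re.DOTALL)


def build_ga_query(parent1: str, parent2: str) -> str:
    """Fill the two selected parent prompts into the GA instruction."""
    return GA_TEMPLATE.format(p1=parent1, p2=parent2)


def extract_prompt(response: str) -> Optional[str]:
    """Return the text of the last <prompt>...</prompt> pair, or None if absent."""
    matches = _PROMPT_RE.findall(response)
    return matches[-1].strip() if matches else None


query = build_ga_query("Classify the sentiment.", "Label the review as positive or negative.")
example_response = (
    "1. Crossover Prompt: Your mission is to ascertain the sentiment of the provided text "
    "and assign a sentiment label from ['negative', 'positive'].\n"
    "2. <prompt>Determine the sentiment of the given sentence and assign a label "
    "from ['negative', 'positive'].</prompt>"
)
new_prompt = extract_prompt(example_response)
```

Returning `None` when no tagged prompt is found lets the caller decide whether to retry the query or skip the candidate.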

Update EVOPROMPT iteratively generates new candidate prompts and assesses each prompt using
a development set, denoted as D, to obtain a score that quantifies the quality of the prompt. We
consider a straightforward selection strategy. Specifically, at each iteration, EVOPROMPT based on
GA produces N new prompts, which are combined with the current population of N prompts. The
updated population is then selected by retaining the N prompts with the highest scores.
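
The GA update described above amounts to merging the N parents with the N children and keeping the N best. A minimal sketch, with the hypothetical name `ga_update` and (prompt, score) pairs as the assumed representation:

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (prompt, development-set score)


def ga_update(population: List[Scored], children: List[Scored]) -> List[Scored]:
    """Merge current prompts with new ones and retain the top-N by score."""
    n = len(population)
    merged = population + children
    # sorted() is stable, so on ties existing prompts stay ahead of new ones.
    return sorted(merged, key=lambda item: item[1], reverse=True)[:n]


current = [("p1", 0.70), ("p2", 0.65), ("p3", 0.60)]
offspring = [("c1", 0.72), ("c2", 0.50), ("c3", 0.65)]
updated = ga_update(current, offspring)
```

How ties are broken is not specified in the text; favoring the existing prompt is an arbitrary choice of this sketch.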

3.3 INSTANTIATION WITH DIFFERENTIAL EVOLUTION

Preliminary Knowledge on DE In DE, the solutions are represented by numerical vectors. Each
candidate vector in the population is selected as a basic vector x in turn to perform mutation and
crossover. Mutation generates a mutated solution y based on a solution randomly sampled from
the current population, denoted as a. A scaled difference between two distinct solutions, b and c,
randomly selected from the population, is added to a: $y = a + F(b - c)$, where F is the scaling
factor. Crossover generates a trial solution $x' = [x'_1, \ldots, x'_n]$ by choosing each parameter in
the vector from either the basic solution x or the mutated solution y, as

$$x'_i = \begin{cases} y_i, & \text{if } r_i < CR, \\ x_i, & \text{otherwise,} \end{cases} \qquad (1)$$

where CR is a pre-defined crossover probability and $r_i$ is a uniformly distributed random number.
Then, x is replaced with x′ if x′ is better than x. With step-by-step evolution, DE ends with a
population of high quality. A modified version of DE uses the current best solution as vector a to
exploit information from the best one.
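
For readers less familiar with numerical DE, the mutation rule y = a + F(b − c) and the crossover rule in Eq. (1) can be written out in plain Python. This is a sketch of the classical operators as described above, not of EVOPROMPT itself; the function names are hypothetical, and the common refinement of forcing at least one component to come from y is omitted because Eq. (1) does not include it.

```python
import random
from typing import List, Sequence


def de_mutate(a: Sequence[float], b: Sequence[float], c: Sequence[float], F: float) -> List[float]:
    """Mutation: y = a + F * (b - c), applied component-wise."""
    return [ai + F * (bi - ci) for ai, bi, ci in zip(a, b, c)]


def de_crossover(x: Sequence[float], y: Sequence[float], CR: float, rng: random.Random) -> List[float]:
    """Crossover as in Eq. (1): take y_i when r_i < CR, otherwise keep x_i."""
    return [yi if rng.random() < CR else xi for xi, yi in zip(x, y)]


rng = random.Random(0)
y = de_mutate(a=[1.0, 1.0], b=[3.0, 5.0], c=[1.0, 2.0], F=0.5)   # [2.0, 2.5]
trial = de_crossover(x=[0.0, 0.0], y=y, CR=0.9, rng=rng)
```

Since `random.random()` returns values in [0, 1), setting CR = 1 copies every component from y and CR = 0 leaves x unchanged.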

Evolution The evolutionary process of DE can be decoupled into three steps: 1) F(b − c); 2)
y = a + F(b − c); 3) Crossover of x and y. In EVOPROMPT based on DE, we follow the three steps
to design the evolutionary process, as well as the corresponding instructions for LLMs to generate a
new prompt based on these steps as illustrated in Figure 2:

• Inspired by the differential vector in DE, we consider mutating only the different parts of two
randomly selected prompts in the current population (Step 1 and Step 2 in Figure 2). The prompts
in the current population are considered the current best ones. Accordingly, the shared components
of two prompts tend to have a positive impact on the performance, and thus need to be preserved.
• A variant of DE employs the current best vector during the mutation process, where a mutated
vector is generated by adding the scaled differential vector to the current best vector. Building
upon this idea, we also leverage the current best prompt. Specifically, we generate a mutated
prompt by selectively replacing parts of the current best one with the mutated different parts for
combination (Step 3 in Figure 2).
• Crossover is defined as the process of replacing certain components of a basic prompt (i.e.,
a candidate prompt of the current population) with segments from the mutated prompt. This
operation combines the features of two different prompts, potentially creating a new and improved
solution (Step 4 in Figure 2).

Differential Evolution (DE) Algorithm Implemented by LLMs

Query:
Please follow the instruction step-by-step to generate a better prompt.
1. Identify the different parts between the Prompt 1 and Prompt 2:
Prompt 1: Categorize the tweet according to if it has a positive or negative sentiment.
Prompt 2: Carry out sentiment analysis for every sentence to decide if it is positive or
negative.
2. Randomly mutate the different parts
3. Combine the different parts with Prompt 3, selectively replace it with the different parts
in Step 2 and generate a new prompt.
Prompt 3: In this task, you are given sentences from product reviews. The task is to classify
a sentence as positive or as negative.
4. Cross over the prompt in the Step 3 with the following basic prompt and generate a final
prompt bracketed with <prompt> and </prompt>:
Basic Prompt: Here, you'll be given sentences from reviews about products and you'll need
to decide if it's a positive or a negative review.

Response:
1. [b − c] Identifying the different parts between Prompt 1 and Prompt 2:
Prompt 1: Categorize the tweet according to if it has a positive or negative sentiment.
Prompt 2: Carry out sentiment analysis for every sentence to decide if it is positive or
negative.
Different parts:
"tweet" vs "sentence"
"Categorize" vs "Carry out sentiment analysis"

2. [F(b − c)] Randomly mutate the different parts:
"tweet" -> "review"
"Categorize" -> "Analyze"
"Sentiment analysis" -> "Sentiment identification"

3. [a + F(b − c)] Combine the different parts with Prompt 3, selectively replace it with the different
parts in Step 2 and generate a new prompt:
Prompt 3: In this task, you are given sentences from product reviews. The task is to
classify a sentence as positive or as negative.
New Prompt: In this task, you are given reviews about products. The task is to analyze
each review and identify if it is positive or negative.

4. [Crossover] Cross over the prompt in Step 3 with the following basic prompt and generate a final
prompt bracketed with <prompt> and </prompt>:
Basic Prompt: Here, you'll be given sentences from reviews about products and you'll
need to decide if it's a positive or a negative review.
Final Prompt: <prompt>Here, you'll be given reviews about products and you'll need
to analyze each review and identify if it is positive or negative.</prompt>

Figure 2: DE process implemented by LLMs for discrete prompt optimization (Evo(·) in Algorithm 1).
In Step 1, LLMs find the different parts (color-highlighted in the original figure) between Prompt 1
and Prompt 2 (b − c in typical DE). In Step 2, LLMs perform mutation on them (imitation of F(b − c)).
Next, LLMs incorporate the current best prompt as Prompt 3 with the mutated results in Step 2, to
generate a new prompt (counterpart of a + F(b − c) in DE). Finally, LLMs perform crossover upon
the current basic prompt p_i and the generated prompt in Step 3.

Method                             SST-2   CR      MR      SST-5   AG’s News   TREC    Subj    Avg.
MI (Zhang et al., 2023b)           93.68   91.40   88.75   42.90   70.63       50.60   49.75   71.07
NI (Mishra et al., 2022c)          92.86   90.90   89.60   48.64   48.89       55.00   52.55   68.21
PromptSource (Bach et al., 2022)   93.03   -       -       -       45.43       36.20   -       -
APE (Zhou et al., 2022)            94.01   90.50   90.90   46.97   71.18       59.60   63.25   73.77
EVOPROMPT (GA)                     94.84   91.20   90.40   49.37   73.42       63.80   67.90   75.85
EVOPROMPT (DE)                     94.84   91.35   90.15   48.19   73.33       64.40   77.60   77.12

Table 1: Main results on language understanding (accuracy) on Alpaca-7b.

Update Following the standard DE, each prompt pi in the current population is chosen as a basic
prompt in turn to generate a corresponding new prompt p′i using the instruction depicted in Figure 2.
Then, the prompt with a higher score, either pi or p′i , is retained. Accordingly, the population size
remains constant while the overall quality of the population is enhanced.
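
One generation of the DE-based variant can be sketched as follows. The instruction wording is copied from Figure 2; everything else is an assumption made for illustration: the names `DE_TEMPLATE` and `de_generation`, the `evolve` callable (which would wrap the LLM call and the extraction of the text between `<prompt>` tags), sampling Prompts 1 and 2 from the population excluding the basic prompt, and fixing the best prompt once at the start of the generation. The sketch needs a population of at least three prompts.

```python
import random
from typing import Callable, List, Tuple

# Instruction text taken from Figure 2; the placeholders are illustrative.
DE_TEMPLATE = (
    "Please follow the instruction step-by-step to generate a better prompt.\n"
    "1. Identify the different parts between the Prompt 1 and Prompt 2:\n"
    "Prompt 1: {p1}\n"
    "Prompt 2: {p2}\n"
    "2. Randomly mutate the different parts\n"
    "3. Combine the different parts with Prompt 3, selectively replace it with "
    "the different parts in Step 2 and generate a new prompt.\n"
    "Prompt 3: {best}\n"
    "4. Cross over the prompt in the Step 3 with the following basic prompt and "
    "generate a final prompt bracketed with <prompt> and </prompt>:\n"
    "Basic Prompt: {base}"
)


def de_generation(
    population: List[str],
    scores: List[float],
    score: Callable[[str], float],    # hypothetical stand-in for f_D(.)
    evolve: Callable[[str], str],     # hypothetical: query in, new prompt out
    rng: random.Random,
) -> Tuple[List[str], List[float]]:
    """Each prompt serves once as the basic prompt; the better of parent and child is kept."""
    best = population[scores.index(max(scores))]   # Prompt 3 (assumption: fixed per generation)
    new_population, new_scores = [], []
    for i, base in enumerate(population):
        others = [p for j, p in enumerate(population) if j != i]
        p1, p2 = rng.sample(others, 2)             # Prompts 1 and 2, chosen at random
        child = evolve(DE_TEMPLATE.format(p1=p1, p2=p2, best=best, base=base))
        child_score = score(child)
        if child_score > scores[i]:                # one-to-one replacement
            new_population.append(child)
            new_scores.append(child_score)
        else:
            new_population.append(base)
            new_scores.append(scores[i])
    return new_population, new_scores


# Toy usage with a dummy operator and score (no LLM involved).
seen_queries: List[str] = []


def toy_evolve(query: str) -> str:
    seen_queries.append(query)
    return "ddd"


pop, sc = de_generation(["aa", "b", "cccc"], [2.0, 1.0, 4.0],
                        score=lambda p: float(len(p)), evolve=toy_evolve,
                        rng=random.Random(0))
```

Because each child competes only with its own basic prompt, the population size stays at N and no individual score can decrease, which matches the update described above.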

4 EXPERIMENTS

In this section, we will evaluate the performance of the proposed EVOPROMPT. We first show the
implementation details of the experiments and the performance baselines, then evaluate EVOPROMPT
on both language understanding and generation tasks.

4.1 IMPLEMENTATION DETAILS AND BASELINES

With GPT-3.5 performing evolutionary operators, we optimize prompts using EVOPROMPT for the
open-source Alpaca-7b (Taori et al., 2023) and closed-source GPT-3.5 (Brown et al., 2020). We pick
the prompt with the highest score on the development set and report its score on the test set.
We compare our methods with the following three types of prompt-based methods. Manual
Instructions (MI) of language understanding, summarization and simplification tasks refer to the
instructions designed in Zhang et al. (2023b); Sanh et al. (2021); Zhang et al. (2023c), respectively.
PromptSource (Bach et al., 2022) and Natural Instructions (NI) (Mishra et al., 2022b) collect
human-written prompts for various datasets. We keep the same verbalizer used in Mishra et al. (2022b)
when reproducing the experiment. APE (Zhou et al., 2022) applies iterative Monte Carlo Search
upon the initial prompts by instruction induction given several input-output pairs. We reproduce APE
by using the same resample template as in Zhou et al. (2022) after initializing the population of the
same size as EVOPROMPT by instruction induction.

4.2 LANGUAGE UNDERSTANDING

Datasets and Settings We experiment on language understanding tasks across 7 datasets to validate
our methods, including sentiment classification (SST-2 (Socher et al., 2013), MR (PANG, 2005),
CR (Hu & Liu, 2004), SST-5 (Socher et al., 2013)), topic classification (AG’s News (Zhang et al.,
2015) and TREC (Voorhees & Tice, 2000)) and subjectivity classification (Subj (Pang & Lee, 2004)).
To constrain the output label space, we prepend the demonstration consisting of one example per
class before the test case. See Appendix A for more details.

Main Results As shown in Table 1, we note that: 1) Compared with previous works on prompt
generation (APE) and human written instructions, EVOPROMPT achieves significantly better results.
2) EVOPROMPT (GA) is slightly better than EVOPROMPT (DE) on sentiment classification datasets.
When it comes to topic classification datasets, both EVOPROMPT (GA) and EVOPROMPT (DE)
demonstrate comparable results. Then, on the subjectivity classification task (Subj), EVOPROMPT
(DE) is significantly better than EVOPROMPT (GA), with a 9.7-point accuracy advantage (77.60 vs. 67.90).

                            Alpaca                          GPT-3.5
Method                      ROUGE-1   ROUGE-2   ROUGE-L     ROUGE-1   ROUGE-2   ROUGE-L
MI (Sanh et al., 2021)      35.92     11.16     31.67       43.95     17.11     39.09
APE (Zhou et al., 2022)     34.92     10.56     31.59       43.43     16.72     38.25
EVOPROMPT (GA)              36.61     12.48     32.30       45.22     18.52     41.06
EVOPROMPT (DE)              39.86     14.24     36.09       46.49     19.49     41.96

Table 2: Main results on SAMSum dataset (summarization task) for Alpaca-7b and GPT-3.5.

4.3 LANGUAGE GENERATION

Datasets and Settings For language generation, we evaluate our EVOPROMPT on text
summarization and simplification tasks. For summarization, we adopt SAMSum (Gliwa et al., 2019),
a challenging and intricate dialogue summarization dataset, and report ROUGE-1/2/L scores on
Alpaca-7b and GPT-3.5.
For the text simplification task, which aims to simplify the text while keeping the original meaning,
we adopt the representative ASSET (Alva-Manchego et al., 2020) dataset with multiple references.
We report the SARI score (Xu et al., 2016), an n-gram-based metric widely used in text
editing tasks. See Appendix A for more details.

Method                      Alpaca    GPT-3.5
MI (Zhang et al., 2023c)    43.03     43.80
APE (Zhou et al., 2022)     46.02     46.71
EVOPROMPT (GA)              46.67     47.36
EVOPROMPT (DE)              46.58     47.40

Table 3: Main results (SARI) on ASSET dataset (simplification task) for Alpaca-7b and text-davinci-003.

Main Results The results of summarization and simplification are shown in Tables 2 and 3,
respectively. We can see that the proposed EVOPROMPT significantly outperforms both the manually
designed prompt and the prompt generated by APE on two different scales of models, Alpaca-7b
and GPT-3.5. In addition, EVOPROMPT (DE) is significantly better than EVOPROMPT (GA) on the
summarization task and performs comparably in the simplification task.

5 ANALYSIS

In this section, we conduct analysis experiments to validate the designs in EVOPROMPT, as well as
provide insights on how to choose between EVOPROMPT (GA) and EVOPROMPT (DE).

5.1 EVOLUTIONARY OPERATORS DESIGN

Since the evolutionary operator design for GA (i.e., crossover and mutation) is straightforward, we
will only focus on studying the design for EVOPROMPT (DE). There are two key design aspects in
EVOPROMPT (DE) when adapting the evolutionary operators to discrete prompts: mutating
only the different parts, and selecting the current best prompt as Prompt 3 in Figure 2. We investigate
how these designs affect the effectiveness of EVOPROMPT (DE) on an understanding dataset,
Subj, where EVOPROMPT (DE) performs much better than EVOPROMPT (GA), and a generation
dataset, ASSET, where EVOPROMPT (DE) and EVOPROMPT (GA) have similar performance. We use
GPT-3.5 as evolutionary operators and optimize the prompts for Alpaca-7b.

Mutation on Different Parts To illustrate the benefits of mutating only the different parts, we
replace the first two steps in Figure 2 with the instruction “Randomly mutate Prompt 1 and Prompt
2” to allow mutation on all contents in Prompts 1 and 2, denoted as “All” in Table 4. Meanwhile,
the original design in EVOPROMPT, which mutates only the different parts, is denoted as “Diff”.
As shown in Table 4, the design of mutation on only the different parts consistently provides
improvements.

[Figure 3 plot: score versus iteration for GA-best, GA-avg, DE-best and DE-avg; left panel: score on SST-5, right panel: score on Subj.]

Figure 3: The best and average accuracy of each iteration on the development set of SST-5 (left) and
Subj (right).

Selection of Prompt 3 Applying one of the variants of the DE algorithm, in EvoPrompt (DE), we
pick the best prompt in the current population as Prompt 3 in Figure 2. We validate this design
via the following settings: 1) Prompt 3 is randomly sampled from the current population, denoted
as “random” in Table 4; 2) the use of Prompt 3 is eliminated by letting the Basic Prompt directly
cross over with the mutated different parts (i.e., Step 3 in Figure 2 is removed), denoted as
“eliminate” in Table 4. Table 4 clearly demonstrates the importance of introducing Prompt 3.
Moreover, choosing the best prompt as Prompt 3 is more effective than random sampling.

Mutation  Prompt 3   Subj   ASSET
Diff      best       77.60  46.58
All       best       68.85  45.79
Diff      random     73.30  46.41
Diff      eliminate  64.10  46.06

Table 4: Designs in EvoPrompt (DE).
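
The ablation settings in Table 4 can be sketched as a small routine that picks Prompt 3 and
assembles the instruction sent to the operator LLM. This is a minimal illustration under stated
assumptions: the helper names `pick_prompt3` and `build_de_instruction`, their arguments, and all
step wording other than the quoted “Randomly mutate Prompt 1 and Prompt 2” are hypothetical; the
actual step templates are those of Figure 2.

```python
import random


def pick_prompt3(population, scores, mode, rng=random):
    """Choose Prompt 3 for one DE step (hypothetical helper).

    mode: "best" (original design), "random", or "eliminate".
    """
    if mode == "best":
        best_index = max(range(len(population)), key=lambda i: scores[i])
        return population[best_index]
    if mode == "random":
        return rng.choice(population)
    if mode == "eliminate":
        return None  # Step 3 is skipped, so no Prompt 3 is needed
    raise ValueError(f"unknown mode: {mode}")


def build_de_instruction(basic, prompt1, prompt2, prompt3, mutation="diff"):
    """Assemble the operator instruction; the step wording is illustrative only."""
    if mutation == "diff":
        steps = [
            "Identify the different parts between Prompt 1 and Prompt 2.",
            "Randomly mutate the different parts.",
        ]
    elif mutation == "all":
        # The "All" setting replaces the first two steps with one instruction.
        steps = ["Randomly mutate Prompt 1 and Prompt 2."]
    else:
        raise ValueError(f"unknown mutation setting: {mutation}")

    lines = [f"Basic Prompt: {basic}", f"Prompt 1: {prompt1}", f"Prompt 2: {prompt2}"]
    if prompt3 is not None:
        lines.append(f"Prompt 3: {prompt3}")
        steps.append("Combine the mutated parts with Prompt 3 to generate a new prompt.")
        steps.append("Cross over the new prompt with the Basic Prompt.")
    else:
        # "eliminate": the Basic Prompt crosses over with the mutated parts directly.
        steps.append("Cross over the mutated parts with the Basic Prompt.")

    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines + numbered)
```

Calling `build_de_instruction(..., mutation="diff")` with `pick_prompt3(..., mode="best")`
corresponds to the first row of Table 4; the other rows change one argument at a time.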

5.2 EVOPROMPT (GA) VS. EVOPROMPT (DE)

We instantiate the proposed EvoPrompt with two specific algorithms, GA and DE. We would like to
gain insights into how to choose between these two algorithms and to understand their respective
advantages and limitations. To this end, we select two datasets: 1) SST-5, on which EvoPrompt (GA)
performs better; 2) Subj, on which EvoPrompt (DE) exhibits superior performance. We then show the
average and optimal scores on the development set for each iteration in Figure 3.

On SST-5, the average quality of the population with EvoPrompt (GA) is consistently higher than
that of EvoPrompt (DE), and the optimal prompts are also better. This is attributable to the
selection strategy of GA, in which prompts with higher scores are more likely to be chosen as
parents for generating new prompts. In DE, by contrast, each prompt in the population is selected
in turn as the basic prompt, with Prompts 1 and 2 chosen at random. Accordingly, GA has a higher
probability of searching near the current best solutions, which increases the likelihood of
achieving better results when the initial manual prompts are of relatively high quality. For
example, the manual prompts for SST-5 are already well designed, and the improvement of EvoPrompt
is not substantial. Conversely, the performance of existing manual prompts on Subj is poor, with
EvoPrompt achieving a remarkable 25% improvement over the manual one. On this dataset,
EvoPrompt (GA) becomes trapped in a local optimum, while EvoPrompt (DE) successfully escapes and
yields much better results. Benefiting from its selection strategy and the well-designed
evolutionary operators, EvoPrompt (DE) has a higher likelihood of escaping local optima. In
summary, we suggest choosing EvoPrompt (GA) when several high-quality prompts already exist, and
choosing EvoPrompt (DE) otherwise.
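
The contrast between the two selection strategies can be sketched as follows. This is an
illustration under assumptions, not the implementation used in the experiments: GA parent
selection is modeled as fitness-proportional (roulette-wheel) sampling with replacement, DE donors
are drawn uniformly from the prompts other than the basic prompt, and the function names are
hypothetical.

```python
import random


def ga_select_parents(population, scores, rng):
    """GA: sample two parents with probability proportional to their dev-set score.

    Sampling is with replacement, so the same prompt may be drawn twice.
    Scores must be non-negative and not all zero.
    """
    return rng.choices(population, weights=scores, k=2)


def de_select_donors(population, base_index, rng):
    """DE: the basic prompt is fixed by its index; Prompts 1 and 2 are drawn
    uniformly at random from the remaining prompts, ignoring their scores."""
    others = [p for i, p in enumerate(population) if i != base_index]
    return rng.sample(others, 2)


def de_generation(population, rng):
    """Visit every prompt in turn as the basic prompt, as DE does in one iteration."""
    return [
        (population[i], de_select_donors(population, i, rng))
        for i in range(len(population))
    ]
```

Because `ga_select_parents` weights by score, high-scoring prompts dominate the parent pool and the
search stays close to the current best prompts. `de_generation` gives every prompt, including weak
ones, a turn as the basic prompt, which is one way to read DE's better chance of leaving a local
optimum.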

6 FUTURE WORKS

Firstly, our explorations of EvoPrompt mainly focus on several representative NLP tasks, and we
expect to investigate more diverse tasks, such as multi-modal tasks that use discrete prompts.
Secondly, it would be interesting to investigate whether LLMs can effectively control
hyper-parameters, such as the CR parameter in Equation 1, when given appropriate instructions.
Thirdly, GA and DE represent two examples among the plethora of mathematical algorithms available.
Further research can be conducted to explore the extent to which LLMs are capable of performing a
wide range of diverse algorithms by interacting with humans through natural language descriptions.
For example, future research could investigate whether LLMs can also generate candidate solutions
in other derivative-free algorithms such as Simulated Annealing (Van Laarhoven et al., 1987).
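
For reference, the role that a crossover rate CR plays in classic DE over real-valued vectors
(Storn & Price, 1997) can be sketched as below. This is the standard numerical formulation,
written independently of the notation of Equation 1; the function name, argument names, and the
default values of F and CR are illustrative assumptions.

```python
import random


def de_trial_vector(x, a, b, c, F=0.5, CR=0.9, rng=random):
    """Build one classic DE trial vector from base vector x and donors a, b, c.

    Each coordinate takes the mutant value a[j] + F * (b[j] - c[j]) with
    probability CR; otherwise it keeps the base value x[j].
    """
    n = len(x)
    j_rand = rng.randrange(n)  # this coordinate always comes from the mutant
    trial = []
    for j in range(n):
        mutant_j = a[j] + F * (b[j] - c[j])
        if rng.random() < CR or j == j_rand:
            trial.append(mutant_j)
        else:
            trial.append(x[j])
    return trial
```

A larger CR copies more coordinates from the mutant and so moves further from the base vector; in
the prompt setting, the analogous control would have to be expressed through the instruction given
to the LLM.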

7 CONCLUSIONS

To address the challenge that the performance of LLMs is highly dependent on well-designed
prompts, we propose EvoPrompt to optimize discrete prompts from an initial population, with LLMs
as evolutionary operators to automatically generate and search for optimal prompts. Moreover, based
on our findings, we believe that LLMs offer an effective and interpretable interface for implementing
traditional algorithms, ensuring good alignment with human understanding and communication.
Our findings corroborate a recent trend where LLMs perform “Gradient Descent” (Pryzant et al.,
2023; Guo et al., 2023) in discrete space by collecting incorrectly predicted samples. Our work
takes a significant step forward by demonstrating the potential of LLMs to participate in a large
range of traditional algorithms. We hope that our explorations will inspire further investigations
on the combination of LLMs and conventional algorithms, paving the way for new and innovative
applications of LLMs.

REFERENCES
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia
Specia. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple
rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 4668–4679, 2020.

Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht
Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. Promptsource: An integrated
development environment and repository for natural language prompts. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.
93–104, 2022.

Janez Brest, Sao Greiner, Borko Boskovic, Marjan Mernik, and Viljem Zumer. Self-adapting control
parameters in differential evolution: A comparative study on numerical benchmark problems.
IEEE transactions on evolutionary computation, 10(6):646–657, 2006.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization.
SIAM, 2009.

Swagatam Das and Ponnuthurai Nagaratnam Suganthan. Differential evolution: A survey of the
state-of-the-art. IEEE transactions on evolutionary computation, 15(1):4–31, 2010.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song,
Eric P Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement
learning. arXiv preprint arXiv:2205.12548, 2022.

Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds
and Machines, 30:681–694, 2020.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-
annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237,
2019.
Yiduo Guo, Yaobo Liang, Chenfei Wu, Wenshan Wu, Dongyan Zhao, and Nan Duan. Learning to
program with natural language. arXiv preprint arXiv:2304.10464, 2023.
Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. Warp: Word-level adversarial
reprogramming. In ACL-IJCNLP, pp. 4921–4933, 2021.
John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann
Arbor, 1975. ISBN 0262581116.
John H Holland. Adaptation in natural and artificial systems: an introductory analysis with applica-
tions to biology, control, and artificial intelligence. MIT press, 1992.
Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD, pp. 168–177, 2004.
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu,
Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. Opt-iml: Scaling language model
instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017,
2022.
Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language
models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.
Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sameer Singh, Sean
Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, et al. Prompt waywardness: The
curious case of discretized interpretation of continuous prompts. arXiv preprint arXiv:2112.08348,
2021.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 3045–3059, 2021a.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In EMNLP, pp. 3045–3059, 2021b.
Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao,
Jiang Bian, and JingBo Zhu. Deliberate then generate: Enhanced prompting framework for text
generation. arXiv preprint arXiv:2305.19835, 2023.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pp. 4582–4597, 2021.
Adam Lipowski and Dorota Lipowska. Roulette-wheel selection via stochastic acceptance. Physica
A: Statistical Mechanics and its Applications, 391(6):2193–2196, 2012.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig.
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language
processing. ACM Computing Surveys, 55(9):1–35, 2023.
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang.
P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks.
arXiv preprint arXiv:2110.07602, 2021a.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt
understands, too. arXiv preprint arXiv:2103.10385, 2021b.
Seyedali Mirjalili, Jin Song Dong, Ali Safa Sadiq, and Hossam Faris. Genetic algorithm: Theory,
literature review, and application in image reconstruction. Nature-Inspired Optimizers: Theories,
Literature Reviews and Applications, pp. 69–85, 2020.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing
instructional prompts to gptk’s language. In Findings of the Association for Computational
Linguistics: ACL 2022, pp. 589–612, 2022a.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization
via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 3470–3487, 2022b.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization
via natural language crowdsourcing instructions. In ACL, 2022c.
Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998.
Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv
preprint arXiv:2111.09734, 2021.
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.
Bo Pang. Seeing stars: Exploiting class relationships for sentiment categorization with respect to
rating scales. In ACL, 2005.
Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summariza-
tion based on minimum cuts. arXiv preprint cs/0409058, 2004.
Millie Pant, Hira Zaheer, Laura Garcia-Hernandez, Ajith Abraham, et al. Differential evolution: A
review of more than two decades of research. Engineering Applications of Artificial Intelligence,
90:103479, 2020.
Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. Grips: Gradient-free, edit-based
instruction search for prompting large language models. arXiv preprint arXiv:2203.07281, 2022.
Kenneth V Price. Differential evolution. In Handbook of optimization: From classical to modern
approach, pp. 187–214. Springer, 2013.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt
optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi
Yang. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint
arXiv:2302.06476, 2023.
Luis Miguel Rios and Nikolaos V Sahinidis. Derivative-free optimization: a review of algorithms and
comparison of software implementations. Journal of Global Optimization, 56:1247–1293, 2013.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables
zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and
natural language inference. In Proceedings of the 16th Conference of the European Chapter of the
Association for Computational Linguistics: Main Volume, pp. 255–269, 2021.
Weijia Shi, Xiaochuang Han, Hila Gonen, Ari Holtzman, Yulia Tsvetkov, and Luke Zettlemoyer.
Toward human readable prompt tuning: Kubrick’s the shining is a good movie, and a good prompt
too? arXiv preprint arXiv:2212.10539, 2022.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt:
Eliciting knowledge from language models with automatically generated prompts. In Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
4222–4235, 2020.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In EMNLP, pp. 1631–1642, 2013.

Rainer Storn and Kenneth Price. Differential evolution–a simple and efficient heuristic for global
optimization over continuous spaces. Journal of global optimization, 11:341–359, 1997.

Tianxiang Sun, Zhengfu He, Hong Qian, Yunhua Zhou, Xuanjing Huang, and Xipeng Qiu. BBTv2:
Towards a gradient-free future with large language models. In Proceedings of the 2022 Conference
on Empirical Methods in Natural Language Processing, pp. 3916–3930, Abu Dhabi, United
Arab Emirates, December 2022a. Association for Computational Linguistics. URL
https://aclanthology.org/2022.emnlp-main.259.

Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for
language-model-as-a-service. In International Conference on Machine Learning, pp. 20841–20855.
PMLR, 2022b.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy
Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Peter JM Van Laarhoven and Emile HL Aarts. Simulated annealing. Springer, 1987.

Jakob Vesterstrom and Rene Thomsen. A comparative study of differential evolution, particle swarm
optimization, and evolutionary algorithms on numerical benchmark problems. In Proceedings
of the 2004 congress on evolutionary computation (IEEE Cat. No. 04TH8753), volume 2, pp.
1980–1987. IEEE, 2004.

Ellen M Voorhees and Dawn M Tice. Building a question answering test collection. In Proceedings of
the 23rd annual international ACM SIGIR conference on Research and development in information
retrieval, pp. 200–207, 2000.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial
triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), pp. 2153–2162, 2019.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statis-
tical machine translation for text simplification. Transactions of the Association for Computational
Linguistics, 4:401–415, 2016.

JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why johnny can’t
prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI
Conference on Human Factors in Computing Systems, pp. 1–21, 2023.

Jingqiao Zhang and Arthur C. Sanderson. Jade: Adaptive differential evolution with optional
external archive. IEEE Transactions on Evolutionary Computation, 13(5):945–958, 2009. doi:
10.1109/TEVC.2009.2014613.

Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun
Chen. Differentiable prompt makes pre-trained language models better few-shot learners. arXiv
preprint arXiv:2108.13161, 2021.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language
models. arXiv preprint arXiv:2205.01068, 2022.

Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera:
Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on
Learning Representations, 2023a.
Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. Sentiment analysis in the
era of large language models: A reality check. arXiv preprint arXiv:2305.15005, 2023b.
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text
classification. NeurIPS, 28, 2015.
Yue Zhang, Leyang Cui, Deng Cai, Xinting Huang, Tao Fang, and Wei Bi. Multi-task instruction
tuning of llama for specific scenarios: A preliminary study on writing assistance. arXiv preprint
arXiv:2305.13225, 2023c.
Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [mask]: Learning vs. learning to
recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 5017–5033, 2021.
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan,
and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint
arXiv:2211.01910, 2022.
Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei
Ye, Neil Zhenqiang Gong, Yue Zhang, et al. Promptbench: Towards evaluating the robustness of
large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023.

Dataset    Type            Label space                                                  |Test|
SST-2      Sentiment       {positive, negative}                                         1,821
CR         Sentiment       {positive, negative}                                         2,000
MR         Sentiment       {positive, negative}                                         2,000
SST-5      Sentiment       {terrible, bad, okay, good, great}                           2,210
AG’s News  News topic      {World, Sports, Business, Tech}                              7,600
TREC       Question topic  {Description, Entity, Expression, Human, Location, Number}   500
Subj       Subjectivity    {subjective, objective}                                      2,000
SAMSum     Summarization   -                                                            819
ASSET      Simplification  -                                                            359

Table 5: Statistics for datasets used in this work.

A EXPERIMENTAL SETTINGS

A.1 DATASETS

Table 5 shows the statistics of the text classification, summarization, and simplification datasets
we used.

A.2 TEMPLATES

For different models, we apply different templates shown in Table 6 and Table 7, referring to the
previous works (Iyer et al., 2022; Taori et al., 2023; Zhang et al., 2023b; Li et al., 2023).

============================== INSTRUCTIONAL PROMPTS ==============================

Below is an instruction that describes a task, paired with an input that provides further context. Write a response
that appropriately completes the request.

### Instruction:
<PROMPT>

### Input:
<INPUT>

### Response:
<COMPLETE>

Zero-shot Example:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response
that appropriately completes the request.

### Instruction:
Please perform Sentiment Classification task. Given the sentence, assign a sentiment label from [’negative’,
’positive’]. Return label only without any other text.

### Input:
beautifully observed , miraculously unsentimental comedy-drama .

### Response:
<COMPLETE>

Table 6: Template used for Alpaca (referring to Taori et al. (2023)).
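
Filling the template in Table 6 amounts to plain string substitution, sketched below. The template
text is copied from Table 6; the constant and function names are hypothetical, and the model's
completion is assumed to be generated directly after the final "### Response:" line.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{prompt}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(prompt: str, input_text: str) -> str:
    """Insert a candidate prompt and one test input into the Alpaca template."""
    return ALPACA_TEMPLATE.format(prompt=prompt, input=input_text)


example = build_alpaca_prompt(
    "Please perform Sentiment Classification task. Given the sentence, assign a "
    "sentiment label from ['negative', 'positive']. Return label only without "
    "any other text.",
    "beautifully observed , miraculously unsentimental comedy-drama .",
)
```

During optimization only the `prompt` argument changes between candidates; the surrounding
template stays fixed.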

=========================== TEMPLATE FOR SIMPLIFICATION ===========================

<PROMPT>
<INPUT>
The simplification is <COMPLETE>

Zero-shot example:
Simplify the text.
Subsequently, in February 1941, 600 Jews were sent to Buchenwald and Mauthausen concentration camps.
The simplification of the sentence is <COMPLETE>

=========================== TEMPLATE FOR SUMMARIZATION ===========================

<PROMPT>
<INPUT>
TL;DR: <COMPLETE>

Zero-shot example:
How would you rephrase that in a few words?
Theresa: have you been at Tom’s new place? Luis: yes, it’s nice Marion: He invited us for a dinner Adam: where
is it? Marion: a bit outside the city Adam: where exactly? Marion: Fiesole Luis: very nice!
TL;DR: <COMPLETE>

Table 7: Templates of summarization (referring to Sanh et al. (2021); Qin et al. (2023)), simplification
(referring to Li et al. (2023)) and the corresponding zero-shot examples used for text-davinci-003.

Task                 LM                |Population|  |Steps|  |Dev|  |Shots|
Text classification  Alpaca-7b         10            10       200    1
Text Generation      Alpaca-7b         10            10       100    0
Text Generation      text-davinci-003  10            10       100    0

Table 8: Settings for experiments. |Shots| refers to the number of examples in the demonstration. For
the text classification task, we set the value as 1, which means we prepend with 1 sample of each
category, to constrain the output in the label space.
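
The one-shot setting described in the caption of Table 8, one sample of each category prepended as
a demonstration, can be sketched as follows. The function name and the "Input:"/"Output:" layout
are assumptions for illustration; the source does not specify how the demonstrations are formatted.

```python
def build_demonstrations(labeled_examples, label_space):
    """Format one demonstration per label, in the order given by label_space.

    labeled_examples: iterable of (text, label) pairs; the first example seen
    for each label is used.
    """
    first_by_label = {}
    for text, label in labeled_examples:
        first_by_label.setdefault(label, text)

    missing = [label for label in label_space if label not in first_by_label]
    if missing:
        raise ValueError(f"no example for labels: {missing}")

    blocks = [
        f"Input: {first_by_label[label]}\nOutput: {label}" for label in label_space
    ]
    return "\n\n".join(blocks)
```

Showing every label once is what constrains the model's output to the label space.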

A.3 PARAMETERS

The parameters for the experiments are shown in Table 8. For evolutionary algorithms implemented
by GPT-3.5, we use Top-P decoding (temperature=0.5, P=0.95). For the task implementation, we use
greedy decoding and the default temperature for Alpaca. For the generation tasks implemented by
GPT-3.5, the temperature is 0.0.
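
The three decoding settings above can be collected in one place, as in the sketch below. The
numeric values come from this section; the `DecodingConfig` class, its field names, and the
constant names are hypothetical and do not correspond to any particular library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecodingConfig:
    temperature: Optional[float]   # None means "leave the model default untouched"
    top_p: Optional[float] = None  # nucleus (Top-P) threshold, if sampling
    greedy: bool = False           # True means always pick the most likely token


# GPT-3.5 acting as the evolutionary operator: Top-P sampling.
EVOLUTION_GPT35 = DecodingConfig(temperature=0.5, top_p=0.95)

# Alpaca solving the task: greedy decoding with the default temperature.
TASK_ALPACA = DecodingConfig(temperature=None, greedy=True)

# GPT-3.5 solving the generation tasks: temperature 0.0.
TASK_GPT35 = DecodingConfig(temperature=0.0)
```

Sampling is used only where diversity among new candidate prompts is wanted; the task models are
decoded deterministically so that prompt scores are comparable across candidates.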

Text Classification The population of prompts is initialized with widely used instructions in the
previous works (Mishra et al., 2022b; Zhang et al., 2022). We paraphrase and rewrite them to initialize
the population. The size of the development set is 200. We report the results on the full test set (the
same as the previous related works (Deng et al., 2022; Zhang et al., 2023a)), as shown in Table 8.

Text Generation For the initial population, we collect instructions for summarization and sim-
plification from Li et al. (2023); Sanh et al. (2021); Zhang et al. (2023c) and augment them to the
expected size (10 in our setting), either written manually or generated by GPT-3.5.

Method              Model      ROUGE-1/2/L        Content
Manual Instruction  Alpaca-7b  35.92/11.16/31.67  How would you rephrase that in a few words?
Manual Instruction  GPT        43.95/17.11/39.09  How would you rephrase that in a few words?
EvoPrompt           Alpaca-7b  39.86/14.24/36.09  Carefully examine the text or listen to the conversation to identify the key ideas,
                                                  comprehend the main idea, and summarize the critical facts and ideas in concise
                                                  language without any unnecessary details or duplication.
EvoPrompt           GPT        46.49/19.49/41.96  Reduce the core by reading or listening carefully to identify the main ideas and key
                                                  points, so readers can comprehend the important concepts and essential information.

Table 9: Manual Instructions (following Sanh et al. (2021)) as the baseline and instructions with best
performance on Alpaca-7b and text-davinci-003 (GPT) generated by EvoPrompt (either DE or
GA) on SAMSum.

Method              Model      SARI   Content
Manual Instruction  Alpaca-7b  43.03  Simplify the text.
Manual Instruction  GPT        43.80  Simplify the text.
EvoPrompt           Alpaca-7b  46.67  Rewrite the input text into simple English to make it easier to comprehend for
                                      non-native English speakers.
EvoPrompt           GPT        47.40  Rewrite the given sentence to make it more accessible and understandable for both
                                      native and non-native English speakers.

Table 10: Manual Instructions (following Zhang et al. (2023c)) as the baseline and instructions with
best performance on Alpaca-7b and text-davinci-003 (GPT) generated by EvoPrompt (either DE
or GA) on ASSET dataset.

B OPTIMAL PROMPTS
We release the optimal prompts generated by EvoPrompt for different tasks, as shown in Tables 9,
10 and 11.

Dataset    Method               Score  Content
SST-2      Manual Instruction   93.68  Please perform Sentiment Classification task. Given the sentence, assign a sentiment label from
                                       [’negative’, ’positive’]. Return label only without any other text.
SST-2      Natural Instruction  92.86  In this task, you are given sentences from movie reviews. The task is to classify a sentence as
                                       "great" if the sentiment of the sentence is positive or as "terrible" if the sentiment of the
                                       sentence is negative.
SST-2      PromptSource         93.03  Does the following sentence have a positive or negative sentiment?
SST-2      EvoPrompt            94.84  Understand the context and message of the reviews of movies, examine the words used and identify
                                       the sentiment of the text, then assign a sentiment classification from [’negative’, ’positive’]
                                       to act as a sentiment classifier, and only provide the sentiment label.
CR         Manual Instruction   91.40  Please perform Sentiment Classification task. Given the sentence, assign a sentiment label from
                                       [’negative’, ’positive’]. Return label only without any other text.
CR         Natural Instruction  90.90  In this task, you are given sentences from movie reviews. The task is to classify a sentence as
                                       "great" if the sentiment of the sentence is positive or as "terrible" if the sentiment of the
                                       sentence is negative.
CR         EvoPrompt            91.35  In this task, you are given sentences from product reviews. The task is to classify a sentence
                                       as positive or as negative.
MR         Manual Instruction   88.75  Please perform Sentiment Classification task. Given the sentence, assign a sentiment label from
                                       [’negative’, ’positive’]. Return label only without any other text.
MR         Natural Instruction  89.60  In this task, you are given sentences from movie reviews. The task is to classify a sentence as
                                       "great" if the sentiment of the sentence is positive or as "terrible" if the sentiment of the
                                       sentence is negative.
MR         EvoPrompt            90.40  Determine if each input is classified as either positive or negative.
SST-5      Manual Instruction   42.90  Please perform Sentiment Classification task. Given the sentence, assign a sentiment label from
                                       [’terrible’, ’bad’, ’okay’, ’good’, ’great’]. Return label only without any other text.
SST-5      Natural Instruction  48.64  In this task, you are given sentences from movie reviews. Based on the given review, classify it
                                       to one of the five classes: (1) terrible, (2) bad, (3) okay, (4) good, and (5) great.
SST-5      EvoPrompt            49.37  Evaluate the movie your friend has watched according to the plot summary they have been given,
                                       using words such as "okay", "great", "bad" or "terrible".
AG’s News  Manual Instruction   70.63  Please perform News Classification task. Given the news item, assign a label from [’World’,
                                       ’Sports’, ’Business’, ’Tech’]. Return label only without any other text.
AG’s News  Natural Instruction  48.89  In this task, you are given a news article. Your task is to classify the article to one out of
                                       the four topics "World", "Sports", "Business", "Tech" if the article"s main topic is relevant to
                                       the world, sports, business, and technology, correspondingly. If you are not sure about the
                                       topic, choose the closest option.
AG’s News  PromptSource         45.43  What label best describes this news article?
AG’s News  EvoPrompt            73.42  Scrutinize the article and classify it as World, Sports, Tech, or Business.
TREC       Manual Instruction   50.60  Please perform Question Classification task. Given the question, assign a label from
                                       [’Description’, ’Entity’, ’Expression’, ’Human’, ’Location’, ’Number’]. Return label only
                                       without any other text.
TREC       Natural Instruction  55.00  You are given a question. You need to detect which category better describes the question.
                                       Answer with "Description", "Entity", "Expression", "Human", "Location", and "Number".
TREC       PromptSource         36.20  Which category best describes the following question? Choose from the following list:
                                       Description, Entity, Abbreviation, Person, Quantity, Location.
TREC       EvoPrompt            64.40  Identify the correct response type (number, entity, description) for all the queries so the
                                       correct answer can be provided.
Subj       Manual Instruction   49.75  Please perform Subjectivity Classification task. Given the sentence, assign a label from
                                       [’subjective’, ’objective’]. Return label only without any other text.
Subj       Natural Instruction  52.55  In this task, you are given sentences from reviews. The task is to classify a sentence as
                                       "subjective" if the opinion of the sentence is subjective or as "objective" if the opinion of
                                       the sentence is objective.
Subj       EvoPrompt            77.60  Construct input-output pairs to demonstrate the subjectivity of reviews and opinions,
                                       distinguishing between objective and subjective input while producing examples of personal
                                       opinions and illustrations of subjective views, so it can illustrate the subjectivity of
                                       judgments and perspectives.

Table 11: Manual Instructions (following Zhang et al. (2023b) and Zhang et al. (2023c)), Natural
Instructions (Mishra et al., 2022b), PromptSource (Bach et al., 2022) as baselines and instructions
with best performance on Alpaca-7b generated by EvoPrompt (either DE or GA) on classification
datasets.
