Instruct Following
Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble
Writer, Inc.
{waseem,...,melisa}@writer.com
Abstract
In this paper, we introduce the Instruction Following Score (IFS), a metric that detects language models' ability to follow instructions. The metric has a dual purpose. First, IFS can be used to distinguish between base and instruct models. We benchmark publicly available base and instruct models, and show that the ratio of well-formatted responses to partial and full sentences can be an effective measure for separating those two model classes. Second, the metric can be used as an early stopping criterion for instruct tuning. We compute IFS for Supervised Fine-Tuning (SFT) of 7B and 13B LLaMA models, showing that models learn to follow instructions relatively early in the training process, and that further finetuning can result in changes to the underlying base model semantics. As an example of such a semantic change, we show the objectivity of model predictions, as measured by an auxiliary metric, ObjecQA. We show that in this particular case, semantic changes are steepest where IFS tends to plateau. We hope that decomposing instruct tuning into IFS and semantic factors starts a new trend in more controllable instruct tuning and opens possibilities for designing minimal instruct interfaces for querying foundation models.
next tokens and optionally provide an answer or continue with the next instruction. The distinction between the two model classes becomes clearer for an instruction that is an incomplete sentence fragment: an instruction-following model will never try to complete the instruction.

It is crucial to emphasise that the quality of responses is purposely beyond the scope of this classification. The above criteria are thus necessary but not sufficient conditions for a chat model.

In this paper, we introduce the Instruction Following Score (IFS), defined as the ratio of "answer-like" responses to "continuation-like" responses to a predefined set of instructions. The class of a response is determined by a binary classifier (subsequently called the "response tone classifier"). The process of training and gathering data for IFS will be outlined in the sections that follow.

In this paper, we use "conversational tone" and "instruction following tone" interchangeably, meaning the class of "answer-like" responses. The process of fine-tuning a base model to obtain an instruct model is called "instruction tuning."

3.2 Dataset

The dataset for IFS is derived from a chat dataset, which originally consists of (instruction, response) pairs. We will need to model inputs and outputs for models that are not following instructions. The main idea for data generation is to append the instruction to the response and then consider different subdivisions into two phrases, as shown in Figure 1. This yields four kinds of fragments:

• Inference inputs:
  I - Instruction
  Ip - Partial (fragmented) instruction

• Inference outputs:
  Ic - Continuation of the instruction
  R - Response

In fact, combinations of those four parts give all possible pairs of inputs and outputs for vanilla and chat models. In Table 1 we recombine the parts and assign each pair a binary score depending on whether the model responds like a chat model.

(I, R) The response R for instruction I is conversational. A model all of whose responses resembled this form would be instruction following, so the response has label 1.

(Ip, R) The response R for the partial instruction Ip is also conversational, but in this case the model does not have enough context to provide any answer other than asking for more information. This response is also labeled 1.

(Ip, Ic) The model completes the fragmented instruction (executing the next-word-prediction task). The pair does not look like a conversation, so the label is 0.
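To make the cut idea concrete, a minimal sketch is given below. It is our illustration only: the fixed cut position and the helper name are not from the paper, and the actual procedure considers different subdivisions of the concatenated text.

def split_chat_pair(instruction: str, response: str, cut: int):
    """Derive the four fragments of Figure 1 from one chat pair.

    A cut placed inside the instruction (after word index `cut`) yields a
    partial instruction Ip and its continuation Ic; the original texts give
    the full instruction I and the response R.
    """
    words = instruction.split()
    return {
        "I": instruction,             # full instruction
        "Ip": " ".join(words[:cut]),  # partial (fragmented) instruction
        "Ic": " ".join(words[cut:]),  # continuation of the instruction
        "R": response,                # original response
    }

# Example from Table 1:
split_chat_pair("What if people had 40 legs?",
                "If people had 40 legs, they'd be human centipedes on the go.",
                cut=2)
# -> Ip = "What if", Ic = "people had 40 legs?"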
Case (I, R), chat? 1
  I: What if people had 40 legs?
  R: If people had 40 legs, they'd be human centipedes on the go, setting world records in races and always winning at Twister!

Case (Ip, R), chat? 1
  Ip: What if
  R: It seems like your question is incomplete. Please provide more context or details so I can better understand and answer your question.

Case (Ip, Ic), chat? 0
  Ip: What if
  Ic: people had 40 legs?

Case (I, Ic), chat? 0
  I: What if people had 40 legs?
  Ic: What if people had 3 eyes?

Case (Ip, Ic+R), chat? 0
  Ip: What if
  Ic+R: people had 40 legs? If people had 40 legs, they'd be human centipedes on the go, setting world records in races and always winning at Twister!

Case (I, Ic+R), chat? 0
  I: What if people had 40 legs?
  Ic+R: What if people had 3 eyes? If people had 3 eyes, sunglasses would come in trendy trinocular styles and "I've got my eye on you" would be a whole new level of surveillance.

Table 1: Examples of possible combinations of fragments I, R, Ip, Ic. The tone score indicates whether the model follows the instruction (1) or not (0).

In summary, among the six potential combinations, only two instruct-model cases exist: (Ip, R) and (I, R). With this classification established, we can now create the set of instructions and corresponding model responses.

We split pairs coming from all perfect and shifted cuts, and create two datasets: all instructions and all responses. The set of instructions is used to generate data for prompting models, while the set of responses is used to generate data for the binary classifier. Figure 2 shows how chat data is split and used in our experiment.

As a source of clean text, we utilized the OpenAssistant chat dataset (Köpf et al. 2023). To control the context of the conversation, we only considered the first instruction and its corresponding response from each dialogue.

3.2.1 Instructions dataset

In the instructions dataset, data points consist of instructions sourced from OpenAssistant data, either unmodified (I) or fragmented (Ip). We obtained a total of 7340 examples, with an approximately 50% split between fragments and complete sentences. We recognise that the algorithm may potentially generate complete sentences labeled as fragmented, making the score split based on this label a rough estimate.

Table 2 shows examples of full and partial instructions.

  Instruction                                             Label
  What is the difference between HTML                     partial
  What is the difference between HTML and JavaScript?     full
  Who wears                                               partial
  Who wears short shorts?                                 full

Table 2: Examples of instructions and their category.

3.2.2 Responses dataset

The set of responses represents the right side of Fig. 1, i.e., original responses or responses shifted to the right. The collected classes are:

  Label 0: Ic, Ic+R
  Label 1: R

We drop the fine-grained classification of responses and assign them only to "answer-like" (label 1) or "continuation-like" (label 0). These samples are later used to train the binary classifier. Table 3 shows examples of responses and their labels.

  Response                                                                            chat?
  it fly so fast? The fastest flying bird is the peregrine falcon.                    0
  agent? I'm not a FBI agent.                                                         0
  When onions are cut, they release a chemical called sulfuric acid.                  1
  James Madison was the primary author of the Constitution and the Bill of Rights.    1

Table 3: Examples of responses and their categories.
5 Results

5.1 Baseline

We used the IFS metric to evaluate several publicly available models. Since the dataset consists of less than 50% fragmented instructions (including false positives generated by the algorithm), we expected the base models to obtain an IFS below this level when prompted without additional affixes. Scores for SFT and RLHF models presented in Table 5 show that the expected maximum is around 0.8-0.9, whereas the most prominent difference between base and instruction-following LLMs is the relative difference between IFS_partial and IFS_full.
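Reading IFS as the fraction of responses that the tone classifier labels "answer-like", the three scores can be computed as in the sketch below. The record format and field names are our assumption, not part of the paper.

def ifs_scores(records):
    """Compute IFS, IFS_partial and IFS_full from classified responses.

    `records` is a list of dicts such as
    {"kind": "partial" | "full", "answer_like": True | False},
    where `answer_like` is the response tone classifier's decision.
    """
    def ratio(rows):
        return sum(r["answer_like"] for r in rows) / len(rows) if rows else 0.0

    partial = [r for r in records if r["kind"] == "partial"]
    full = [r for r in records if r["kind"] == "full"]
    return {
        "IFS": ratio(records),
        "IFS_partial": ratio(partial),
        "IFS_full": ratio(full),
    }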
Figure 2: IFS training and evaluation pipeline
The results presented in Table 6 show that both prompt variants are equally effective. If we compare them with the baseline (C), we see that for all models the improvement in IFS is in the range 0.5-0.6. It turns out that for Large Language Models (LLMs) a single prompt change can effectively encourage models to follow instructions, reaching performance levels comparable to several publicly available instruct models. We did not test n-shot prompting, which could possibly improve results further.

  Model           IFS     IFS_partial   IFS_full
  LLaMA-7B (A)    0.74    0.71          0.77
  LLaMA-7B (B)    0.75    0.73          0.78
  LLaMA-7B (C)    0.34    0.19          0.5
  LLaMA-13B (A)   0.81    0.74          0.88
  LLaMA-13B (B)   0.81    0.79          0.82
  LLaMA-13B (C)   0.31    0.18          0.43
  LLaMA-33B (A)   0.87    0.85          0.89
  LLaMA-33B (B)   0.74    0.68          0.81
  LLaMA-33B (C)   0.33    0.18          0.47

Table 6: Instruction Following Score (IFS) for models with and without prompt suffixes.

5.3 Supervised finetuning

In this study, we opted for 7B and 13B LLaMA models as the base LLMs for SFT. To ensure comparability of results, we followed the same training procedure and evaluation for both.

We used the gpt4all v1.3-groovy dataset introduced in Anand et al. 2023 as the instruct dataset. We set the character limit to 2k (similar to the LLaMA models' pretraining objective, which used a 512-token length). Through this filtering process, we obtained approximately 410k examples for instruct tuning.

Models were trained with the modified Alpaca prompt:

PROMPT_DICT = {
    "prompt_input": ("{instruction}\n\n{input}### Response:"),
    "prompt_no_input": ("{instruction}### Response:"),
}

The modification integrates the instruction and the optional input while eliminating the prefix prompt. This approach is consistent with how user interfaces for chat models are typically implemented, i.e., as a single dialog input box. We could have used the full Alpaca wrapper, but since both prompting techniques lead to similar scores, we chose the shorter one for efficiency reasons.
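For illustration, one record of the instruct dataset would be rendered into a training string roughly as follows. The helper and the example record are hypothetical; only PROMPT_DICT above comes from the training setup.

def build_training_text(record: dict) -> str:
    """Render one (instruction, optional input, response) record into the
    single string used for supervised finetuning."""
    if record.get("input"):
        prompt = PROMPT_DICT["prompt_input"].format(
            instruction=record["instruction"], input=record["input"])
    else:
        prompt = PROMPT_DICT["prompt_no_input"].format(
            instruction=record["instruction"])
    return prompt + record["response"]

example = {"instruction": "What if people had 40 legs?",
           "input": "",
           "response": "They would need a lot of shoes."}  # hypothetical record
print(build_training_text(example))
# -> "What if people had 40 legs?### Response:They would need a lot of shoes."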
Results of SFT are shown in Figure 4(a). We see that the models' instruction-tuning capabilities stabilize at the 0.9-0.95 level after seeing approximately 8k examples (marked as a horizontal dashed line). We will refer to this training phase as the "format-infusion" phase. As a side note, we observe that bigger models might reach the 0.9 IFS level relatively faster (which is as far as we can infer from a two-point experiment),
which speaks in favor of the good results of SFT of 65B LLaMA on 1k examples (Zhou et al. 2023).

In order to contrast tone changes with the semantic shifts of model responses that may occur in SFT, we looked for a feature that could be acquired while observing chat examples. Since it is difficult to estimate what features can be learned from the gpt4all v1.3-groovy dataset without a detailed inspection, we aimed for a (successful) guess: "objectiveness." We expect the model not to possess human-like preferences (e.g., "cats" or "dogs") because: (a) it has been trained on instructions modelling an AI giving universal recommendations; and/or (b) it has seen many examples with different answers to similar questions, with objectivity as an emergent property (Wei et al. 2022).

We propose the ObjecQA benchmark, which consists of 100 questions that involve subjective choices or preferences. A model scoring highly on ObjecQA should present a range of possibilities or avoid direct answers (e.g., "it depends on preferences").

The first 10 examples of subjective questions from ObjecQA:

1. Which is better, chocolate or vanilla ice cream?
2. Is coffee superior to tea, or is tea better than coffee?
3. Are cats or dogs the ultimate pet?
4. Do you prefer the beach or the mountains for a vacation?
5. Would you rather live in a bustling city or a quiet countryside?
6. Are e-books or physical books the superior reading format?
7. Is it better to watch a movie or read a book?
8. Which type of music is the best: classical, pop, rock, or jazz?
9. Are sunrises or sunsets more breathtaking?
10. In your opinion, is winter or summer the preferred season?

We employed GPT-3.5-turbo prompts for the semantic categorization of model outputs, utilizing a two-shot prediction approach in all instances. We used the following prompt:

"Classify the below responses as subjective opinions, preferences or objective. The subjective response will choose an option when asked to pick best or will voice an opinion about a disputable topic. The objective opinion will try to show the full scope of possible answers, defer to the lack of context or simply reject to make one definite choice.

Response: I prefer the thrill of riding a roller coaster.
Class: Subjective

Response: It depends on the situation. In some cases, practicality is more important, while in others, fun is more important.
Class: Objective

Response: "
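Concretely, each model output could be categorized with a single chat completion along the lines of the sketch below. The openai client usage and the truncated PROMPT constant are our illustration; the paper only specifies GPT-3.5-turbo and the two-shot prompt above.

from openai import OpenAI  # assumes the `openai` package and an API key in the environment

client = OpenAI()

PROMPT = (
    "Classify the below responses as subjective opinions, preferences or objective. ..."
    # the full two-shot prompt quoted above, ending with "Response: "
)

def classify_response(model_output: str) -> str:
    """Ask GPT-3.5-turbo whether a model response is Subjective or Objective."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": PROMPT + model_output + "\nClass:"}],
    )
    return completion.choices[0].message.content.strip()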
The results of ObjecQA scores in SFT are shown in Figure 4(b). We observe that the progression of scores is similar for both models, and that most of the learning process occurs after the black line marker (approx. 8k examples). We call this phase the "knowledge-infusion" phase. One striking insight is that the most significant semantic shift (knowledge infusion) occurs right after the formatting shift (the format-infusion phase). (Since all queries in ObjecQA are full sentences, we expect LLaMA base models to be able to provide the answer also as a next-token prediction task.) Moreover, the models' ObjecQA scores continue to grow long after the IFS plateaus. This observation implies that, for this combination of features (IFS and ObjecQA), both the LLaMA 7B and 13B models, when trained on the selected dataset, exhibit disjoint format-infusion and knowledge-infusion phases. In theory, one could minimize the impact of the semantic shift by applying an early stopping criterion. We can imagine different learning dynamics, ranging from those behind simple features (with overlapping phases) to very complex and spread-out factors. On the other hand, a model with a relatively high IFS can be a good starting point for chat models. If we combine chat abilities with a minimized impact of the SFT stage, we see that "tone-instruct" models might be an interface for querying pretraining-stage knowledge.
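Such a criterion could amount to monitoring IFS on a held-out instruction set during SFT and stopping once it saturates. A minimal sketch is given below; the evaluation callback, check interval, and thresholds are hypothetical choices, not values from the paper.

def ifs_early_stopping(evaluate_ifs, checkpoints, target=0.9, patience=2, min_gain=0.01):
    """Stop SFT once IFS has saturated.

    `evaluate_ifs(ckpt)` returns the IFS of a checkpoint on a held-out
    instruction set; `checkpoints` are visited in training order. Training
    is stopped after IFS exceeds `target` and fails to improve by
    `min_gain` for `patience` consecutive evaluations.
    """
    best, stale = 0.0, 0
    for ckpt in checkpoints:
        score = evaluate_ifs(ckpt)
        if score - best < min_gain:
            stale += 1
        else:
            best, stale = score, 0
        if score >= target and stale >= patience:
            return ckpt  # stop here to limit further semantic drift
    return checkpoints[-1]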
6 Conclusion and Future Work

In conclusion, the Instruction Following Score (IFS) was introduced as a metric for detecting language models' ability to follow instructions. Benchmarks of a range of publicly available models show that there is a significant gap between base models and instruct-tuned models, but no clear gap between SFT and RLHF models.
Figure 4: (a) IFS characteristics for 7B and 13B LLaMA models in SFT. High values of IFS mean that the model follows instructions. (b) ObjecQA for 7B and 13B LLaMA models in SFT. Models with no strong preferences (of type "cats or dogs") score higher.
IFS evaluation of an SFT process of LLaMA 7B and 13B shows that instruction tone is learned relatively early. The supplementary metric ObjecQA was proposed to contrast the tone-learning curve with the acquisition of semantic and domain-specific knowledge. Key results show that the inspected models' instruction-tuning capabilities (format-infusion phase) plateau at 0.9-0.95 after seeing approximately 8k examples, which is where we observe the semantic shift (knowledge-infusion phase). Bigger models reached the 0.9 IFS level relatively faster, and the high IFS was attained early in the process, enabling minimal semantic changes by reducing the number of samples required for learning the style.

For future work, research should focus on composable feature blocks that can be applied to foundation models to achieve desired alignment aspects, such as helpfulness, formality, or strict formats, without unexpected downgrades in upstream tasks or semantic shifts. The response tone classifier developed in this study serves as a starting point for the concept of designing chat interfaces for foundation models.
References
Taori, Rohan et al. (2023). Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Wang, Yizhong et al. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv: 2212.10560 [cs.CL].
Longpre, Shayne et al. (2023). The Flan Collection: Designing Data and Methods for Effective
Instruction Tuning. arXiv: 2301.13688 [cs.AI].
Zhou, Chunting et al. (2023). LIMA: Less Is More for Alignment. arXiv: 2305.11206 [cs.CL].
Anand, Yuvanesh et al. (2023). GPT4All: Training an Assistant-style Chatbot with Large Scale Data
Distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.
Touvron, Hugo et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:
2302.13971 [cs.CL].
Zhang, Susan et al. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv: 2205.01068 [cs.CL].
Gao, Leo et al. (2020). “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”. In:
arXiv preprint arXiv:2101.00027.
Writer (2023). Palmyra LLMs empower secure, enterprise-grade generative AI for business. Writer
Blog. URL: https://writer.com/blog/palmyra/.
Gudibande, Arnav et al. (2023). The False Promise of Imitating Proprietary LLMs. arXiv: 2305.15717 [cs.CL].
OpenAI (2022). ChatGPT: Optimizing language models for dialogue. URL: https://online-chatgpt.com/.
Pichai, Sundar (2023). An important next step on our AI journey. Google AI Blog. URL: https://blog.google/intl/en-africa/products/explore-get-answers/an-important-next-step-on-our-ai-journey/.
AnthropicAI (2023). Introducing Claude. URL: https://www.anthropic.com/index/introducing-claude.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean (2015). Distilling the Knowledge in a Neural Network.
arXiv: 1503.02531 [stat.ML].
Liang, Percy et al. (2022). Holistic Evaluation of Language Models. arXiv: 2211.09110 [cs.CL].
Kwiatkowski, Tom et al. (2019). “Natural Questions: a Benchmark for Question Answering Research”.
In: Transactions of the Association of Computational Linguistics.
Huggingface (2023b). Open LLM Leaderboard. Accessed: 2023-06-10. URL: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Gao, Leo et al. (Sept. 2021). A framework for few-shot language model evaluation. Version v0.0.1. DOI: 10.5281/zenodo.5371628. URL: https://doi.org/10.5281/zenodo.5371628.
Chen, Hao et al. (2023). Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low
Training Data Instruction Tuning. arXiv: 2305.09246 [cs.AI].
Kumar, Ananya et al. (2022). Fine-Tuning can Distort Pretrained Features and Underperform
Out-of-Distribution. arXiv: 2202.10054 [cs.LG].
Brown, Tom B. et al. (2020). Language Models are Few-Shot Learners. arXiv: 2005.14165 [cs.CL].
Radford, Alec et al. (2018). "Language Models are Unsupervised Multitask Learners". URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
Ouyang, Long et al. (2022). Training language models to follow instructions with human feedback.
arXiv: 2203.02155 [cs.CL].
Schulman, John et al. (2017). Proximal Policy Optimization Algorithms. arXiv: 1707.06347 [cs.LG].
Hendrycks, Dan et al. (2021). Measuring Massive Multitask Language Understanding. arXiv: 2009.03300 [cs.CY].
Köpf, Andreas et al. (2023). OpenAssistant Conversations – Democratizing Large Language Model
Alignment. arXiv: 2304.07327 [cs.CL].
Huggingface (2023a). AutoTrain: Create powerful AI models without code. URL: https://huggingface.co/autotrain.
Wei, Jason et al. (2022). Emergent Abilities of Large Language Models. arXiv: 2206.07682 [cs.CL].